Python – A better way to load MongoDB data into a DataFrame using Pandas and PyMongo

pandas, pymongo, python

I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a DataFrame. However, I get an error.

MemoryError:    

My code looks like this:

from pandas import DataFrame

cursor = tweets.find()  # tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

I've tried the methods from the following answers, all of which at some point build a list of every element in the database before loading it.

However, another answer that discusses list() points out that it is only suitable for small data sets, because everything is loaded into memory at once.

In my case, I think that is the source of the error: there is simply too much data to load into memory. What other method can I use?
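One option is to consume the cursor in fixed-size batches instead of materializing it all at once with list(), so only a bounded number of raw documents is ever held in memory. Here is a minimal sketch of that approach; the connection URI and the twitter_db/tweets names are assumptions for illustration:

from itertools import islice

import pandas as pd
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # assumed local instance
tweets = client['twitter_db']['tweets']            # assumed db/collection names

def frames(cursor, fields, batch_size=10000):
    # Yield one small DataFrame per slice of the cursor, so at most
    # batch_size raw documents are held in memory at any time.
    while True:
        batch = list(islice(cursor, batch_size))
        if not batch:
            break
        yield pd.DataFrame(batch, columns=fields)

cursor = tweets.find({}, ['id'])  # project only the 'id' field (plus _id)
result = pd.concat(frames(cursor, ['id']), ignore_index=True)

If even the projected data is too large for a single concatenated frame, each yielded chunk can be processed and discarded in turn instead of being concatenated.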

Best Solution

I've modified my code to the following:

cursor = tweets.find(fields=['id'])  # restrict the query to the 'id' field
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

By adding the fields parameter to the find() call, I restricted the output: instead of every field of each document, only the selected fields are loaded into the DataFrame. Everything works fine now.
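Worth noting: the fields keyword is PyMongo 2.x syntax; in PyMongo 3 and later the same restriction is expressed with the projection argument (the second positional argument of find()). A minimal sketch of the equivalent call, assuming the same tweets collection as above:

import pandas as pd

cursor = tweets.find({}, {'_id': 0, 'id': 1})  # PyMongo 3+: projection replaces fields
result = pd.DataFrame(list(cursor), columns=['id'])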