Python – Efficiently writing large Pandas data frames to disk


I am trying to find the most efficient way to read and write large data frames (250MB+) from and to disk using Python/Pandas. I've tried all of the methods in Python for Data Analysis, but the performance has been very disappointing.

This is part of a larger project exploring a migration of our current analytics/data management environment from Stata to Python. In my tests, reads and writes with Python and Pandas typically take more than 20 times as long as the same operations in Stata.

I strongly suspect that I am the problem, not Python or Pandas.

Any suggestions?

Best Solution

HDFStore is your best bet (it is not covered very much in the book, and it has changed quite a lot). You will find its performance is MUCH better than that of any other serialization method.
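
For illustration, here is a minimal sketch of a round trip through HDFStore. It assumes the PyTables package (`tables`) is installed, which HDFStore needs. The helper names `save_frame` and `load_frame`, the key `"data"`, and the file name used in the usage note are made up for this example and are not part of pandas.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_frame(df, path, key="data"):
    """Write df to an HDF5 file at path under key (hypothetical helper)."""
    # mode="w" creates the file, replacing any existing file at that path.
    with pd.HDFStore(path, mode="w") as store:
        # put() uses the default "fixed" format, which is not appendable
        # or queryable afterwards.
        store.put(key, df)


def load_frame(path, key="data"):
    """Read the frame stored under key back from path (hypothetical helper)."""
    with pd.HDFStore(path, mode="r") as store:
        return store[key]


def demo(n_rows=100_000):
    """Write a random frame to a temporary file and read it back."""
    df = pd.DataFrame(np.random.rand(n_rows, 4), columns=list("abcd"))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frame.h5")  # assumed file name
        save_frame(df, path)
        return df, load_frame(path)
```

Calling `demo()` returns the original frame and the copy read back from disk, so you can wrap the two helper calls in your own timing code and compare against whatever you are using now. If you need to append rows or query on disk, `store.put(key, df, format="table")` is the other option, though it is usually slower to write than the fixed format.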