I am trying to insert a pandas DataFrame into a Postgresql DB (9.1) in the most efficient way (using Python 2.7).
Using "cursor.execute_many" is really slow, so is "DataFrame.to_csv(buffer,…)" together with "copy_from".
I found an already much! faster solution on the web (http://eatthedots.blogspot.de/2008/08/faking-read-support-for-psycopgs.html) which I adapted to work with pandas.
My code can be found below.
My question is whether the method of this related question (using "copy from stdin with binary") can be easily transferred to work with DataFrames and if this would be much faster.
Use binary COPY table FROM with psycopg2
Unfortunately my Python skills aren't sufficient to understand the implementation of this approach.
This is my approach:
import psycopg2 import connectDB # this is simply a module that returns a connection to the db from datetime import datetime class ReadFaker: """ This could be extended to include the index column optionally. Right now the index is not inserted """ def __init__(self, data): self.iter = data.itertuples() def readline(self, size=None): try: line = self.iter.next()[1:] # element 0 is the index row = '\t'.join(x.encode('utf8') if isinstance(x, unicode) else str(x) for x in line) + '\n' # in my case all strings in line are unicode objects. except StopIteration: return '' else: return row read = readline def insert(df, table, con=None, columns = None): time1 = datetime.now() close_con = False if not con: try: con = connectDB.getCon() ###dbLoader returns a connection with my settings close_con = True except psycopg2.Error, e: print e.pgerror print e.pgcode return "failed" inserted_rows = df.shape data = ReadFaker(df) try: curs = con.cursor() print 'inserting %s entries into %s ...' % (inserted_rows, table) if columns is not None: curs.copy_from(data, table, null='nan', columns=[col for col in columns]) else: curs.copy_from(data, table, null='nan') con.commit() curs.close() if close_con: con.close() except psycopg2.Error, e: print e.pgerror print e.pgcode con.rollback() if close_con: con.close() return "failed" time2 = datetime.now() print time2 - time1 return inserted_rows