Python – Reading Files in HDFS (Hadoop filesystem) directories into a Pandas dataframe


I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas dataframe in order to apply standard non-distributed algorithms.

At some level a workable solution is trivial: run `hadoop dfs -copyToLocal` and then use local file system operations. However, I am looking for a particularly elegant way to load the data that I can incorporate into my standard practice.
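For reference, the trivial workaround might look like the sketch below. The `hadoop` invocation and the paths are hypothetical (and assume the CLI is on your PATH); only the local-loading half is shown as runnable code:

```python
import glob
import subprocess

import pandas as pd

# Hypothetical: pull the query output down to a local directory first.
# subprocess.check_call(
#     ["hadoop", "dfs", "-copyToLocal", "/user/me/hive_output", "/tmp/hive_output"]
# )

def load_local_parts(pattern, sep="\t"):
    """Read every delimited part file matching `pattern` into one dataframe."""
    frames = [pd.read_csv(path, sep=sep) for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)

# df = load_local_parts("/tmp/hive_output/part-*")
```

This is exactly the pattern I want to avoid: it leaves a local copy behind and needs an external system call.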

Some characteristics of an ideal solution:

  1. No need to create a local copy (who likes clean up?)
  2. Minimal number of system calls
  3. Few lines of Python code

Best Solution

It looks like the pydoop.hdfs module solves this problem while meeting most of the goals above.
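A sketch of how that could look, assuming a working pydoop install and a hypothetical output directory. The HDFS-specific part is kept to the commented lines, so the concatenation helper itself is plain pandas:

```python
import pandas as pd

def read_delimited_parts(open_files, sep="\t"):
    """Concatenate delimited file-like objects into a single dataframe.

    `open_files` can be any iterable of file-like objects, e.g. handles
    returned by pydoop's hdfs.open(), so nothing is copied to local disk.
    """
    frames = [pd.read_csv(f, sep=sep) for f in open_files]
    return pd.concat(frames, ignore_index=True)

# Hypothetical pydoop usage (paths are placeholders):
# import pydoop.hdfs as hdfs
# paths = [p for p in hdfs.ls("/user/me/hive_output") if hdfs.path.isfile(p)]
# df = read_delimited_parts(hdfs.open(p) for p in paths)
```

This keeps everything in a few lines of Python, with no local copy and no shelling out to the `hadoop` CLI.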

I was not able to evaluate this myself, as pydoop has fairly strict build requirements and my Hadoop version is a bit dated.