Python – Using foreach with a Spark RDD in Python

apache-spark, foreach, python, rdd

I'm trying to take a very large RDD running on a cluster and write it to a .csv file. It's so large that .collect() breaks, so instead I'd like to save the RDD in pieces on each node and then somehow bring them together, as order doesn't matter. My thought is to use foreach with the CSV printer function, so that each part writes its values, and then I can gather the parts together manually, perhaps by FTP.

I'm a Spark user with some experience, but to date I've never been able to make the RDD's foreach method do anything useful. When I try to run the example given in the documentation,

>>> def f(x): print x
>>> sc.parallelize([1, 2, 3, 4, 5]).foreach(f)

I get nothing back on my console. I believe this is because the 'print' is being performed on the separate nodes rather than the namenode where the console is located. In that case though, I really can't see much point in the foreach function at all!

How can I get the results of my foreach back to the namenode without first calling the collect() function?

NB. I'm also open to using the saveAsTextFile() RDD function, but again I've not been able to make it work! It seems to create a folder instead of a text file, although perhaps this is because the pieces are written on each node rather than centrally?

Best Solution

The answer here: View RDD contents in Python Spark?

The reason given there is that print is a statement, not a function, in Python 2.6.
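
For the CSV goal itself, saveAsTextFile() may be the simpler route: the folder it creates is expected, since it holds one part-NNNNN file per partition, and those parts can be concatenated afterwards. Below is a minimal sketch of that idea, not a definitive implementation. It assumes Python 3, that each record is a list or tuple with no embedded newlines in its fields, and that the output folder has already been copied to a local filesystem. The helper names to_csv_line and merge_part_files, and the paths in the comments, are made up for illustration.

```python
import csv
import glob
import io
import os
import shutil


def to_csv_line(row):
    """Format one record (list or tuple) as a single CSV line without a newline."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    # csv.writer appends "\r\n"; saveAsTextFile adds its own line ending.
    return buf.getvalue().rstrip("\r\n")


def merge_part_files(parts_dir, merged_path):
    """Concatenate the part-* files in parts_dir into one file; return the part count."""
    # Marker files such as _SUCCESS do not match the "part-*" pattern.
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, merged)
    return len(part_paths)


# Hypothetical usage (assumes an existing RDD named `rdd`; paths are placeholders):
#   rdd.map(to_csv_line).saveAsTextFile("hdfs:///tmp/out_csv")
# After copying the folder to the local machine:
#   merge_part_files("out_csv", "out.csv")
```

Because the formatting runs inside map(), each executor writes only its own partitions and nothing has to pass through collect(). The merge step runs once, on whichever machine ends up holding the copied folder.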