sc.textFile(path) allows to read an HDFS file but it does not accept parameters (like skip a number of rows, has_headers,…).
in the "Learning Spark" O'Reilly e-book, it's suggested to use the following function to read a CSV (Example 5-12. Python load CSV example)
import csv import StringIO def loadRecord(line): """Parse a CSV line""" input = StringIO.StringIO(line) reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"]) return reader.next() input = sc.textFile(inputFile).map(loadRecord)
My question is about how to be selective about the rows "taken":
- How to avoid loading the first row (the headers)
- How to remove a specific row (for instance, row 5)
I see some decent solutions here: select range of elements but I'd like to see if there is anything more simple.