Scala – Spark: Repartition strategy after reading text file

apache-spark, partition, scala

I have launched my cluster this way:

/usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster --num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar

The first thing I do is read a big text file, and count it:

val file = sc.textFile("/path/to/file.txt.gz")
println(file.count())

When doing this, I see that only one of my nodes is actually reading the file and executing the count (because I only see one task). Is that expected? Should I repartition my RDD afterwards, or will Spark do that for me when I use map/reduce functions?
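For reference, a quick way to confirm how many read tasks Spark will use is to print the RDD's partition count (a minimal check, reusing the same sc and path as above):

val file = sc.textFile("/path/to/file.txt.gz")
println(file.partitions.length)  // 1 partition, hence a single read task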

Best Answer

It looks like you're working with a gzipped file.

Quoting from my answer here:

I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.

You need to explicitly repartition the RDD after loading it so that more tasks can run on it in parallel.

For example:

val file = sc.textFile("/path/to/file.txt.gz").repartition(sc.defaultParallelism * 3)
println(file.count())
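If you want to see the effect, print the partition count before and after the repartition (a quick sanity check with the same hypothetical path; note that on YARN, sc.defaultParallelism is roughly the total number of executor cores):

val gz = sc.textFile("/path/to/file.txt.gz")
println(gz.partitions.length)                      // 1: a single gzip file is read by one task
val spread = gz.repartition(sc.defaultParallelism * 3)
println(spread.partitions.length)                  // defaultParallelism * 3, so later stages run that many tasks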

Regarding the comments on your question, the reason setting minPartitions doesn't help here is that a gzipped file is not splittable, so Spark will always use 1 task to read the file.
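You can convince yourself of this with a quick check (a sketch, using the hypothetical gzip path from the question and an arbitrary hint of 8):

val hinted = sc.textFile("/path/to/file.txt.gz", minPartitions = 8)
println(hinted.partitions.length)  // still 1: the hint is ignored for non-splittable input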

If you set minPartitions when reading a regular text file, or a file compressed with a splittable compression format like bzip2, you'll see that Spark will actually deploy that number of tasks in parallel (up to the number of cores available in your cluster) to read the file.
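For comparison, a splittable input does honour the hint (a sketch; the uncompressed path and the value 12 are just illustrative):

val plain = sc.textFile("/path/to/file.txt", minPartitions = 12)
println(plain.partitions.length)  // roughly 12 or more, depending on file and block size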
