I have launched my cluster this way:
/usr/lib/spark/bin/spark-submit --class MyClass --master yarn-cluster --num-executors 3 --driver-memory 10g --executor-memory 10g --executor-cores 4 /path/to/jar.jar
The first thing I do is read a big text file, and count it:
val file = sc.textFile("/path/to/file.txt.gz")
println(file.count())
When doing this, I see that only one of my nodes is actually reading the file and executing the count (because I only see one task). Is that expected? Should I repartition my RDD afterwards, or when I use map reduce functions, will Spark do it for me?
Best Answer
It looks like you're working with a gzipped file.
Quoting from my answer here:
You need to explicitly repartition the RDD after loading it so that more tasks can run on it in parallel.
For example:
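A minimal sketch of what that could look like; the partition count of 12 is an assumption (3 executors × 4 cores from the spark-submit command above), not a value from the original answer:

```scala
// Reading a gzipped file yields an RDD with a single partition,
// because gzip is not a splittable format.
// repartition(12) shuffles the data into 12 partitions (assumed count),
// so everything downstream runs with 12 tasks in parallel.
val file = sc.textFile("/path/to/file.txt.gz").repartition(12)

// The initial read is still one task, but the count now runs
// across 12 partitions.
println(file.count())
```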
Regarding the comments on your question: the reason setting `minPartitions` doesn't help here is that a gzipped file is not splittable, so Spark will always use a single task to read it. If you set `minPartitions` when reading a regular text file, or a file compressed with a splittable format such as bzip2, you'll see that Spark actually deploys that number of tasks in parallel (up to the number of cores available in your cluster) to read the file.
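A sketch of that call; the path and the partition count of 8 are placeholders, not values from the question:

```scala
// For an uncompressed (or bzip2-compressed) file, minPartitions is honored:
// Spark splits the input and reads it with parallel tasks.
val lines = sc.textFile("/path/to/file.txt", minPartitions = 8)

// minPartitions is a lower bound, so the actual partition count
// is typically at least 8 for a splittable input.
println(lines.getNumPartitions)
```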