Python – Why is this simple Spark program not utilizing multiple cores


So, I'm running this simple program on a 16 core multicore system. I run it
by issuing the following.

spark-submit --master local[*]

And the code of that program is the following.

from pyspark import SparkContext
import random

N = 12500000

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, N)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / N)

When I use top to see CPU
consumption, only one core is being utilized. Why is that? Secondly, the Spark
documentation says that the default parallelism is contained in the property
spark.default.parallelism. How can I read this property from within my
Python program?
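(For what it's worth, the sampling logic itself is sound and can be sanity-checked without Spark in plain Python; the sketch below assumes the NUM_SAMPLES name in the program above was meant to be N:)

```python
import random

N = 100000
random.seed(0)  # fixed seed so the run is reproducible

def sample(_):
    # draw a uniform point in the unit square; count it if it lands
    # inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sum(sample(i) for i in range(N))
print("Pi is roughly %f" % (4.0 * count / N))
```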

Best Solution

As none of the above really worked for me (maybe because I didn't really understand it), here are my two cents.

I was starting my job with spark-submit, and inside the file I had sc = SparkContext("local", "Test"). I tried to verify the number of cores Spark sees with sc.defaultParallelism: it turned out to be 1. When I changed the context initialization to sc = SparkContext("local[*]", "Test"), it became 16 (the number of cores on my system) and my program used all of them.

I am quite new to Spark, but my understanding is that local by itself means a single worker thread, and because the master is set inside the program, it overrides the other settings (in my case it certainly overrode those from the configuration files and environment variables).