I have a Spark job which reads a source table, does a number of map / flatten / reduce operations and then stores the results into a separate table we use for reporting. Currently this job is run manually using the spark-submit script. I want to schedule it to run every night so the results are pre-populated for the start of the day. Do I:

  1. Set up a cron job to call the spark-submit script?
  2. Add scheduling into my job class, so that it is submitted once but performs the actions every night?
  3. Is there a built-in mechanism in Spark or a separate script that will help me do this?

We are running Spark in Standalone mode.

Any suggestions appreciated!

You can use a crontab, but as you start having Spark jobs that depend on other Spark jobs, I would recommend Pinball for coordination: https://github.com/pinterest/pinball

To get a simple crontab working, I would create a wrapper script such as:

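#!/bin/bash
cd /locm/spark_jobs

export SPARK_HOME=/usr/hdp/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=hdfs
export HADOOP_GROUP=hdfs

#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/locm/spark_jobs/configs/*

# Positional arguments supplied by the crontab entry
CLASS=$1       # main class to run
MASTER=$2      # master, e.g. yarn-client
ARGS=$3        # extra spark-submit options (optional)
CLASS_ARGS=$4  # arguments passed to the job itself (optional)

echo "Running $CLASS With Master: $MASTER With Args: $ARGS And Class Args: $CLASS_ARGS"

$SPARK_HOME/bin/spark-submit --class $CLASS --master $MASTER --num-executors 4 --executor-cores 4 $ARGS spark-jobs-assembly*.jar $CLASS_ARGS >> /locm/spark_jobs/logs/$CLASS.log 2>&1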
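
Since the question mentions Standalone mode while the script here is invoked with yarn-client, here is a rough sketch of how the submit line might be built for a standalone master. It only prints the command rather than running it. The master URL spark://master-host:7077, the class name and the jar name are hypothetical placeholders. As far as I know --num-executors is a YARN option and --total-executor-cores is the standalone way to cap cores, but check spark-submit --help for your version.

```shell
#!/bin/bash
# Sketch only: prints a spark-submit command for a standalone master
# instead of executing it. All values passed in are placeholders.

build_submit_cmd() {
  # $1 = fully qualified main class
  # $2 = standalone master URL, e.g. spark://master-host:7077 (hypothetical host)
  # $3 = cap on the total number of cores across all executors
  # $4 = application jar
  printf '%s/bin/spark-submit --class %s --master %s --total-executor-cores %s %s\n' \
    "$SPARK_HOME" "$1" "$2" "$3" "$4"
}
```

Calling build_submit_cmd com.example.NightlyReport spark://master-host:7077 8 app.jar prints the command so it can be checked by eye before it replaces the spark-submit line in the wrapper.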

Then create the crontab entry:

  1. Run crontab -e
  2. Insert 30 1 * * * /PATH/TO/SCRIPT.sh $CLASS "yarn-client", replacing $CLASS with the fully qualified main class of your job
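
The five leading fields of a crontab line are minute, hour, day of month, month and day of week, so 30 1 * * * fires at 01:30 every day. If you would rather not open an editor, something like the following sketch can build the line and append it to the current user's crontab. The script path, class name and master in the usage note are placeholders, and the function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: build a daily crontab line and append it non-interactively.
# Nothing is installed until install_entry is called explicitly.

cron_entry() {
  # $1 = minute, $2 = hour, $3 = wrapper script, $4 = main class, $5 = master
  printf '%s %s * * * %s %s "%s"\n' "$1" "$2" "$3" "$4" "$5"
}

install_entry() {
  # Keep the existing entries, append the new one, and load the result into cron.
  { crontab -l 2>/dev/null; cron_entry "$@"; } | crontab -
}
```

For example, install_entry 30 1 /PATH/TO/SCRIPT.sh com.example.NightlyReport yarn-client would schedule the wrapper for 01:30 each night. Running it twice adds the line twice, so check crontab -l afterwards.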
