According to Learning Spark
Keep in mind that repartitioning your data is a fairly expensive operation.
Spark also has an optimized version of
coalesce()that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
One difference I get is that with
repartition() the number of partitions can be increased/decreased, but with
coalesce() the number of partitions can only be decreased.
If the partitions are spread across multiple machines and
coalesce() is run, how can it avoid data movement?