Spark best practices on EMR

The default settings for Spark on EMR are… sub-optimal to say the least. This is meant to be a quick guide into some things I’ve found to be useful with provisioning a Spark cluster on EMR.

Background Info

Not all users have the same requirements or setup which can change how you configure and launch clusters. YMMV. Here are the things you should know about the clusters I generally work with.

  • Data stored on S3 in parquet
  • Only on-demand clusters used (no spot or instance fleets)
  • One user/application per cluster
  • Only Spark and Hadoop applications running on the cluster

Set Explicit Executor Instances/Cores/Memory/Parallelism/etc

From a capacity planning and application use case perspective, I think its best if you set these configuration options yourself. Your application might benefit from having more, but smaller executors or vice versa. There’s a great spreadsheet out there to help you calculate these settings depending on the size of your cluster.

EMR also has a setting for Spark, [maximizeResourceAllocation]( which is suppose to automatically configure everything based on the size of your cluster. Last time I tried this, it was still under utilizing much of my clusters resources.

Change Yarn’s Resource Allocator

By default, Yarn will assign an application’s resources by the bigger request for resource (I think this is how it works?). In a Spark application on Yarn, that is usually memory which leaves many cpus unallocated to containers.

What worked well for me is changing yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator.

EMR has a capacity-scheduler to make setting this easy on cluster bootstrap.

  'Classification': 'capacity-scheduler',
  'Properties': {
    'yarn.scheduler.capacity.resource-calculator': 'org.apache.hadoop.yarn.util.resource.DominantResourceCalculator'

Use EMRFS If Applicable

From talking to some people at Amazon, EMRFS is suppose to give a nice performance boost to read and writing data on s3. Unfortunately given how things are set up, we’re unable to use EMRFS at my job so I haven’t been able to test this myself.