
Node Decommission

This section shows how to use an Apache Spark feature that migrates the shuffle data and cached RDD blocks present on terminating executors to peer executors before a Spot node gets decommissioned. Consequently, your job does not need to recompute the shuffle and RDD blocks of the terminating executor that would otherwise be lost, allowing the job to complete with minimal delay.

This feature is supported on EMR releases 6.3.0 and later.

How does it work?

When `spark.decommission.enabled` is true, Spark will try its best to shut down the executor gracefully. Setting `spark.storage.decommission.enabled` enables migrating the data stored on that executor: Spark will try to move all the cached RDD blocks (controlled by `spark.storage.decommission.rddBlocks.enabled`) and shuffle blocks (controlled by `spark.storage.decommission.shuffleBlocks.enabled`) from the decommissioning executor to remote executors. The relevant Spark configurations for using node decommissioning in jobs are:

| Configuration | Description | Default Value |
| --- | --- | --- |
| spark.decommission.enabled | Whether to enable decommissioning | false |
| spark.storage.decommission.enabled | Whether to decommission the block manager when decommissioning an executor | false |
| spark.storage.decommission.rddBlocks.enabled | Whether to transfer RDD blocks during block manager decommissioning | false |
| spark.storage.decommission.shuffleBlocks.enabled | Whether to transfer shuffle blocks during block manager decommissioning. Requires a migratable shuffle resolver (like sort-based shuffle) | false |
| spark.storage.decommission.maxReplicationFailures | Maximum number of failures which can be handled for migrating shuffle blocks when the block manager is decommissioning and trying to move its existing blocks | 3 |
| spark.storage.decommission.shuffleBlocks.maxThreads | Maximum number of threads to use in migrating shuffle files | 8 |
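As an illustration, these properties can be passed directly on a `spark-submit` command line; the values below are examples rather than recommendations, and `my_app.py` is a placeholder application:

```shell
spark-submit \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.maxThreads=8 \
  --conf spark.storage.decommission.maxReplicationFailures=3 \
  my_app.py
```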

This feature can currently be enabled on EMR 6.3.0+ releases through a temporary workaround: Spark's file permissions must be modified using a custom image. Once the code is fixed, this page will be updated.

Dockerfile for custom image:

```dockerfile
FROM <release account id>.dkr.ecr.<aws region><release>
USER root
WORKDIR /home/hadoop
RUN chown hadoop:hadoop /usr/bin/
```
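The custom image then has to be built and pushed to a repository the job can pull from. A sketch using standard Docker and ECR commands, assuming an ECR repository named `<custom_image_repo>` already exists (all bracketed values are placeholders):

```shell
# Authenticate Docker to the target ECR registry
aws ecr get-login-password --region <region> | \
  docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com

# Build the custom image from the Dockerfile above, then tag and push it
docker build -t <custom_image_repo> .
docker tag <custom_image_repo>:latest <account_id>.dkr.ecr.<region>.amazonaws.com/<custom_image_repo>:latest
docker push <account_id>.dkr.ecr.<region>.amazonaws.com/<custom_image_repo>:latest
```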

Setting decommission timeout:

Each executor has to be decommissioned within a certain time limit, controlled by the pod’s terminationGracePeriodSeconds configuration. The default value is 30 seconds but can be modified using a custom pod template. The pod template for this modification would look like:

```yaml
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: <seconds>
```

Note: the terminationGracePeriodSeconds timeout should be less than the Spot instance termination notice period, with around a 5-second buffer kept aside for triggering the node termination.
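The pod template then has to be referenced from the job configuration. A sketch using Spark's standard `spark.kubernetes.executor.podTemplateFile` property, assuming the template has been uploaded to an S3 location the job can read (the path is a placeholder):

```
--conf spark.kubernetes.executor.podTemplateFile=s3://<bucket>/executor-pod-template.yaml
```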


A sample StartJobRun request with node decommissioning enabled:

```shell
cat >spark-python-with-node-decommissioning.json << EOF
{
  "name": "my-job-run-with-node-decommissioning",
  "virtualClusterId": "<virtual-cluster-id>",
  "executionRoleArn": "<execution-role-arn>",
  "releaseLabel": "emr-6.3.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<s3 prefix>/",
      "sparkSubmitParameters": "--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6"
    }
  },
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.container.image": "<account_id>.dkr.ecr.<region><custom_image_repo>",
          "spark.executor.instances": "5",
          "spark.decommission.enabled": "true",
          "spark.storage.decommission.enabled": "true",
          "spark.storage.decommission.rddBlocks.enabled": "true",
          "spark.storage.decommission.shuffleBlocks.enabled": "true"
        }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "<log group>",
        "logStreamNamePrefix": "<log-group-prefix>"
      },
      "s3MonitoringConfiguration": {
        "logUri": "<S3 URI>"
      }
    }
  }
}
EOF
```
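The job can then be submitted with the file above using the EMR on EKS CLI (the virtual cluster and role placeholders in the JSON must be filled in first):

```shell
aws emr-containers start-job-run --cli-input-json file://spark-python-with-node-decommissioning.json
```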
Observed Behavior:

When an executor begins decommissioning, its shuffle data gets migrated to peer executors instead of the shuffle blocks being recalculated. If sending shuffle blocks to an executor fails, `spark.storage.decommission.maxReplicationFailures` gives the number of retries for migration. The driver log’s stderr will contain lines like `Updating map output for <shuffle_id> to BlockManagerId(<executor_id>, <ip_address>, <port>, <topology_info>)`, denoting details about a shuffle block’s migration. This feature does not emit any other metrics for validation yet.
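One way to spot these migrations is to count the matching lines in the driver's stderr log once it has been downloaded from CloudWatch or S3. A sketch: the here-doc below stands in for a real log file, and the sample line mirrors the format quoted above.

```shell
# Stand-in for a driver stderr log downloaded from CloudWatch or S3;
# replace /tmp/driver-stderr.log with the real file in practice.
cat > /tmp/driver-stderr.log <<'EOF'
INFO MapOutputTracker: Updating map output for 7 to BlockManagerId(2, 10.0.1.12, 7337, None)
EOF

# Count shuffle-block migration entries recorded by the driver
grep -c "Updating map output for" /tmp/driver-stderr.log
```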