Instance Store Volumes

When working with Spark workloads, it can be useful to choose instances backed by SSD instance store volumes to improve the performance of your jobs. This storage is located on disks that are physically attached to the host computer and can deliver better performance than traditional EBS volumes. In the context of Spark, this is particularly beneficial for wide transformations (e.g. JOIN, GROUP BY) that generate a significant amount of shuffle data, which Spark persists on the local filesystem of the instances where the executors are running.

In this document, we highlight two approaches to leverage NVMe disks in your workloads when using EMR on EKS. For a list of instances supporting NVMe disks, see Instance store volumes in the Amazon EC2 documentation.
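
If you want to check which instance types offer instance store volumes, and how much storage they provide, one option is to query the EC2 API. The following AWS CLI command is a sketch that assumes the CLI is installed and configured for your Region:

# List instance types that offer instance store volumes, with their total storage size (GB)
aws ec2 describe-instance-types \
  --filters "Name=instance-storage-supported,Values=true" \
  --query "InstanceTypes[].[InstanceType, InstanceStorageInfo.TotalSizeInGB]" \
  --output table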

Mount kubelet pod directory on NVMe disks

The kubelet service manages the lifecycle of pod containers created through Kubernetes. When a pod is launched on an instance, an ephemeral volume is automatically created for the pod, and this volume is mapped to a subdirectory within the path /var/lib/kubelet of the host node. This volume folder exists for the lifetime of the K8s pod and is automatically deleted once the pod ceases to exist.
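
For example, on a worker node you can see how these ephemeral volumes are laid out; the pod UID and volume name below are placeholders:

# On the node: each pod gets a directory named after its UID under /var/lib/kubelet/pods,
# and emptyDir (ephemeral) volumes are created underneath it
ls /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/<volume-name>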

In order to leverage the NVMe disks attached to an EC2 node in our Spark application, we should perform the following actions during node bootstrap:

  • Prepare the NVMe disks attached to the instance (format disks and create a partition)
  • Mount the /var/lib/kubelet/pods path on the NVMe disk(s)

By doing this, all local files generated by your Spark job (block manager data, shuffle data, etc.) are automatically written to the NVMe disks. This way, you don't have to configure a Spark volume path when launching the pod (driver or executor). This approach is easier to adopt because it doesn't require any additional configuration in your job. In addition, once the job is completed, all data stored in the ephemeral volumes is automatically deleted when the EC2 instance is terminated.

However, if you have multiple NVMe disks attached to the instance, you need to create a RAID0 configuration across all the disks before mounting the /var/lib/kubelet/pods directory on the RAID partition. Without a RAID setup, it is not possible to leverage the full disk capacity available on the node.

The following example shows how to create a node group in your cluster using this approach. In order to prepare our NVMe disks, we can use the eksctl preBootstrapCommands definition while creating the node group. The script will perform the following actions:

  • For instances with a single NVMe disk, format the disk and create a Linux filesystem on it (e.g. ext4, xfs)
  • For instances with multiple NVMe disks, create a RAID 0 configuration across all available volumes

Once the disks are formatted and ready to use, we mount the folder /var/lib/kubelet/pods on the new filesystem and set up the correct permissions. Below, you can find an example of an eksctl configuration to create a managed node group using this approach.

Example

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: YOUR_CLUSTER_NAME
  region: YOUR_REGION

managedNodeGroups:
  - name: ng-c5d-9xlarge
    instanceType: c5d.9xlarge
    desiredCapacity: 1
    privateNetworking: true
    subnets:
      - YOUR_NG_SUBNET
    preBootstrapCommands: # commands executed as root
      - yum install -y mdadm nvme-cli
      - nvme_disks=($(nvme list | grep "Amazon EC2 NVMe Instance Storage" | awk -F'[[:space:]][[:space:]]+' '{print $1}')) && [[ ${#nvme_disks[@]} -eq 1 ]] && mkfs.ext4 -F ${nvme_disks[*]} && systemctl stop docker && mkdir -p /var/lib/kubelet/pods && mount ${nvme_disks[*]} /var/lib/kubelet/pods && chmod 750 /var/lib/docker && systemctl start docker
      - nvme_disks=($(nvme list | grep "Amazon EC2 NVMe Instance Storage" | awk -F'[[:space:]][[:space:]]+' '{print $1}')) && [[ ${#nvme_disks[@]} -ge 2 ]] && mdadm --create --verbose /dev/md0 --level=0 --raid-devices=${#nvme_disks[@]} ${nvme_disks[*]} && mkfs.ext4 -F /dev/md0 && systemctl stop docker && mkdir -p /var/lib/kubelet/pods && mount /dev/md0 /var/lib/kubelet/pods && chmod 750 /var/lib/docker && systemctl start docker
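
Assuming the configuration above is saved as nodegroup.yaml (a placeholder file name) and the cluster already exists, the node group can then be created with eksctl:

# Create the managed node group defined in the configuration file
eksctl create nodegroup --config-file=nodegroup.yaml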

Benefits

  • No need to mount the disk using Spark configurations or pod templates
  • Data generated by the application is deleted immediately when the pod terminates. Data is also purged in the event of pod failures.
  • One-time configuration for the node group

Cons

  • If multiple jobs are scheduled on the same EC2 instance, they will contend for disk resources, because instance store volumes cannot be dedicated to individual jobs with this approach

Mount NVMe disks as data volumes

In this section, we explicitly mount the instance store volumes and reference them as mount paths in the Spark configuration for drivers and executors.

As in the previous example, the bootstrap script automatically formats the instance store volumes, creating an xfs filesystem on each disk. The disks are then mounted on local folders named /spark_data_IDX, where IDX is an integer that corresponds to the disk being mounted.

Example

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: YOUR_CLUSTER_NAME
  region: YOUR_REGION

managedNodeGroups:
  - name: ng-m5d-4xlarge
    instanceType: m5d.4xlarge
    desiredCapacity: 1
    privateNetworking: true
    subnets:
      - YOUR_NG_SUBNET
    preBootstrapCommands: # commands executed as root
      - "IDX=1;for DEV in /dev/nvme[1-9]n1;do mkfs.xfs ${DEV}; mkdir -p /spark_data_${IDX}; echo ${DEV} /spark_data_${IDX} xfs defaults,noatime 1 2 >> /etc/fstab; IDX=$((${IDX} + 1)); done"
      - "mount -a"
      - "chown 999:1000 /spark_data_*"

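Once the node group is up, you can connect to a node (for example via SSM or SSH) and verify that the volumes were formatted and mounted as expected; the commands below are just one way to check:

# List block devices and their filesystems
lsblk -f
# Confirm the /spark_data_* mount points are present
df -h | grep spark_data
# Confirm the entries added to /etc/fstab during bootstrap
cat /etc/fstab
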
In order to successfully use ephemeral volumes within Spark, you need to specify additional Spark configurations. In particular, the mounted volume name must start with spark-local-dir- so that Spark uses it as local storage.

Below is an example configuration, provided during the EMR on EKS job submission, that shows how to configure Spark to use two volumes as local storage for the job.

Spark Configurations

{
  "name": ....,
  "virtualClusterId": ....,
  "executionRoleArn": ....,
  "releaseLabel": ....,
  "jobDriver": ....,
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/spark_data_1",
          "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false",
          "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/spark_data_1",
          "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path": "/spark_data_2",
          "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.readOnly": "false",
          "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path": "/spark_data_2"
        }
      }
    ]
  }
}

Please note that this approach requires the following configurations to be specified for each volume that you want to use (IDX is a label that identifies the mounted volume):

# Mount path on the host node
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.options.path

# Mount path on the k8s pod
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.mount.path

# (boolean) Should be defined as false to allow Spark to write in the path
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-IDX.mount.readOnly
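
The properties above configure the executors. Spark also exposes the equivalent spark.kubernetes.driver.volumes.* properties, so if you want the driver to use an instance store volume as well, you can add the corresponding entries. A minimal sketch for the first volume:

# Equivalent settings for the driver (shown here only for spark-local-dir-1)
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path      /spark_data_1
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly  false
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path    /spark_data_1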

Benefits

  • You can allocate dedicated instance store volumes to each of your Spark jobs. (For example, let's take a scenario where an EC2 instance has two instance store volumes: if you run two Spark jobs on this node, you can dedicate one volume per Spark job, as sketched below.)
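
As a sketch of that scenario (the paths follow the /spark_data_IDX convention used above), each job simply points its spark-local-dir-1 volume at a different host path:

# spark-defaults for job A: use only the first instance store volume
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path   /spark_data_1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path     /spark_data_1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly false

# spark-defaults for job B: use only the second instance store volume
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path   /spark_data_2
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path     /spark_data_2
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly false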

Cons

  • Additional configurations are required for Spark jobs to use instance store volumes. This approach can be error-prone if you don't control the instance types being used (for example, multiple node groups with different instance types). You can mitigate this issue by using k8s node selectors to pin your job to a specific instance type via the Spark configuration spark.kubernetes.node.selector.node.kubernetes.io/instance-type (see the example after this list)
  • Data created on the volumes is automatically deleted once the job is completed and the instance is terminated. However, you need to take extra measures to delete the data on instance store volumes if the EC2 instance is re-used or is not terminated.
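
For example, assuming your job should only run on the m5d.4xlarge node group created above (the instance type here is just an illustration; adjust it to your own node groups), you can add a node selector to the spark-defaults properties of your job submission:

# Pin the driver and executors to a specific instance type so that the
# /spark_data_* host paths created during bootstrap are guaranteed to exist
spark.kubernetes.node.selector.node.kubernetes.io/instance-type  m5d.4xlarge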