EMR Containers integration with FSx for Lustre¶
Amazon EKS clusters provide the compute and ephemeral storage for Spark workloads. Ephemeral storage is allocated from the EKS worker node's disk, and its lifecycle is bound to the lifecycle of the driver and executor pods.
Need for durable storage:
When multiple Spark applications are executed as part of a data pipeline, there are scenarios where data from one Spark application must be passed to subsequent Spark applications. In this case the data can be persisted in S3. Alternatively, it can be persisted in FSx for Lustre, which provides a fully managed, scalable, POSIX-compliant native file system interface for data in S3. With FSx for Lustre, your storage is decoupled from your compute and has its own lifecycle.
FSx for Lustre volumes can be mounted on Spark driver and executor pods through static and dynamic provisioning.
Data used in the examples below is from the AWS Open Data Registry.
FSx for Lustre POSIX permissions¶
When a Lustre file system is mounted to the driver and executor pods and the S3 objects do not have the required metadata, the mounted volume defaults ownership of the file system to root. EMR on EKS runs the driver and executor pods with UID 999, GID 1000, and groups 1000 and 65534. In this scenario, the Spark application has read-only access to the mounted Lustre file system. Below are a few approaches that can be considered:
Tag Metadata to S3 object¶
Applications writing to S3 can tag the S3 objects with the metadata that FSx for Lustre requires. Walkthrough: Attaching POSIX permissions when uploading objects into an S3 bucket provides a guided tutorial. FSx for Lustre converts this tagged metadata into the corresponding POSIX permissions when mounting the Lustre file system to the driver and executor pods.
EMR on EKS spawns the driver and executor pods as a non-root user (UID 999, GID 1000, groups 1000 and 65534). To enable the Spark application to write to the mounted file system, UID 999 can be made the file owner and supplemental group 65534 the file group.
For S3 objects that already exist with no metadata tagging, a process can recursively tag all the S3 objects with the required metadata. Below is an example:
1. Create an FSx for Lustre file system linked to the S3 prefix.
2. Create a Persistent Volume and Persistent Volume Claim for the created FSx for Lustre file system (a minimal sketch is shown after this list).
3. Run a pod as the root user with the FSx for Lustre file system mounted through the PVC created in Step 2, as in the pod spec below.
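For Step 2, a minimal PV/PVC sketch that the pod below can bind to. The PV name `fsx-static-root-pv` is illustrative; the file system ID, DNS name, and mount name are placeholders for the values returned when the file system was created, and the claim name `fsx-static-root-claim` matches the pod spec that follows.
```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-static-root-pv
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: <filesystem id>
    volumeAttributes:
      dnsname: <filesystem id>.fsx.<region>.amazonaws.com
      mountname: <mount name>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-static-root-claim
  namespace: test-demo
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1200Gi
  volumeName: fsx-static-root-pv
```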
```
apiVersion: v1
kind: Pod
metadata:
  name: chmod-fsx-pod
  namespace: test-demo
spec:
  containers:
    - name: ownership-change
      image: amazonlinux:2
      command: ["sh", "-c", "chown -hR +999:+65534 /data"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: fsx-static-root-claim
```
Run a data repository task with the import path and export path pointing to the same S3 prefix. This exports the POSIX permissions from the FSx for Lustre file system as metadata that is tagged on the S3 objects.
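For example, the export task can be started with the AWS CLI. This is a sketch: the file system ID is a placeholder, and an optional `--paths` argument can limit the export to specific directories.
```
aws fsx create-data-repository-task \
    --file-system-id <filesystem id> \
    --type EXPORT_TO_REPOSITORY \
    --report Enabled=false
```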
Now that the S3 objects are tagged with metadata, a Spark application with the FSx for Lustre file system mounted will have write access.
Static Provisioning¶
Provision a FSx for Lustre cluster¶
FSx for Lustre can also be provisioned through the AWS CLI.
How to decide what type of FSx for Lustre file system you need?
Create a Security Group to attach to the FSx for Lustre file system as below.
Points to note:
- The security group attached to the EKS worker nodes must allow inbound access on ports 988 and 1021-1023.
- The security group specified when creating the FSx for Lustre file system must allow inbound access on ports 988 and 1021-1023.
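A sketch of creating such a security group with the AWS CLI; the group name `fsx-lustre-sg`, VPC ID, and source security group are placeholders.
```
aws ec2 create-security-group \
    --group-name fsx-lustre-sg \
    --description "FSx for Lustre access from EKS worker nodes" \
    --vpc-id <vpc id>

aws ec2 authorize-security-group-ingress \
    --group-id <securitygroup-id> \
    --protocol tcp \
    --port 988 \
    --source-group <eks worker node securitygroup-id>

aws ec2 authorize-security-group-ingress \
    --group-id <securitygroup-id> \
    --protocol tcp \
    --port 1021-1023 \
    --source-group <eks worker node securitygroup-id>
```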
FSx for Lustre provisioning through the AWS CLI:
```
cat > fsxLustreConfig.json << EOF
{
    "ClientRequestToken": "EMRContainers-fsxLustre-demo",
    "FileSystemType": "LUSTRE",
    "StorageCapacity": 1200,
    "StorageType": "SSD",
    "SubnetIds": [
        "<subnet-id>"
    ],
    "SecurityGroupIds": [
        "<securitygroup-id>"
    ],
    "LustreConfiguration": {
        "ImportPath": "s3://<s3 prefix>/",
        "ExportPath": "s3://<s3 prefix>/",
        "DeploymentType": "PERSISTENT_1",
        "AutoImportPolicy": "NEW_CHANGED",
        "PerUnitStorageThroughput": 200
    }
}
EOF
```
Run the AWS CLI command below to create the FSx for Lustre file system.
```
aws fsx create-file-system --cli-input-json file://fsxLustreConfig.json
```
The response is as below:
```
{
    "FileSystem": {
        "VpcId": "<vpc id>",
        "Tags": [],
        "StorageType": "SSD",
        "SubnetIds": [
            "<subnet-id>"
        ],
        "FileSystemType": "LUSTRE",
        "CreationTime": 1603752401.183,
        "ResourceARN": "<fsx resource arn>",
        "StorageCapacity": 1200,
        "LustreConfiguration": {
            "CopyTagsToBackups": false,
            "WeeklyMaintenanceStartTime": "7:11:30",
            "DataRepositoryConfiguration": {
                "ImportPath": "s3://<s3 prefix>",
                "AutoImportPolicy": "NEW_CHANGED",
                "ImportedFileChunkSize": 1024,
                "Lifecycle": "CREATING",
                "ExportPath": "s3://<s3 prefix>/"
            },
            "DeploymentType": "PERSISTENT_1",
            "PerUnitStorageThroughput": 200,
            "MountName": "mvmxtbmv"
        },
        "FileSystemId": "<filesystem id>",
        "DNSName": "<filesystem id>.fsx.<region>.amazonaws.com",
        "KmsKeyId": "arn:aws:kms:<region>:<account>:key/<key id>",
        "OwnerId": "<account>",
        "Lifecycle": "CREATING"
    }
}
```
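The file system takes a few minutes to become usable. Its state can be polled with the file system ID from the response above (a sketch); wait until the lifecycle reports `AVAILABLE`.
```
aws fsx describe-file-systems \
    --file-system-ids <filesystem id> \
    --query 'FileSystems[0].Lifecycle'
```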
EKS Admin Tasks¶
- Attach an IAM policy to the EKS worker node IAM role to enable access to FSx for Lustre (see Mount FSx for Lustre on EKS), and create a Security Group for FSx for Lustre
- Install the FSx CSI Driver in EKS
- Configure Storage Class for FSx for Lustre
- Configure Persistent Volume and Persistent Volume Claim for FSx for Lustre
The FSx for Lustre file system is created as described above in Provision a FSx for Lustre cluster.
Once provisioned, a Persistent Volume, as specified below, is created with a direct (hard-coded) reference to the created Lustre file system. A Persistent Volume Claim for this Persistent Volume will always use the same file system.
```
cat >fsxLustre-static-pv.yaml <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  mountOptions:
    - flock
  persistentVolumeReclaimPolicy: Recycle
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: <filesystem id>
    volumeAttributes:
      dnsname: <filesystem id>.fsx.<region>.amazonaws.com
      mountname: mvmxtbmv
EOF

kubectl apply -f fsxLustre-static-pv.yaml
```
Now, a Persistent Volume Claim (PVC) needs to be created that references the PV created above.
```
cat >fsxLustre-static-pvc.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1200Gi
  volumeName: fsx-pv
EOF

kubectl apply -f fsxLustre-static-pvc.yaml -n <namespace registered with EMR on EKS Virtual Cluster>
```
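Optionally, confirm the claim bound to the static volume before submitting jobs; the STATUS column should show Bound.
```
kubectl get pvc fsx-claim -n <namespace registered with EMR on EKS Virtual Cluster>
```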
Spark Developer Tasks¶
Now Spark applications can reference fsx-claim in their application configuration to mount the FSx for Lustre file system to the driver and executor container volumes.
```
cat >spark-python-in-s3-fsx.json <<EOF
{
    "name": "spark-python-in-s3-fsx",
    "virtualClusterId": "<virtual-cluster-id>",
    "executionRoleArn": "<execution-role-arn>",
    "releaseLabel": "emr-6.2.0-latest",
    "jobDriver": {
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://<s3 prefix>/trip-count-repartition-fsx.py",
            "sparkSubmitParameters": "--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6"
        }
    },
    "configurationOverrides": {
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName": "fsx-claim",
                    "spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path": "/var/data/",
                    "spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly": "false",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName": "fsx-claim",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path": "/var/data/",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly": "false"
                }
            }
        ],
        "monitoringConfiguration": {
            "cloudWatchMonitoringConfiguration": {
                "logGroupName": "/emr-containers/jobs",
                "logStreamNamePrefix": "demo"
            },
            "s3MonitoringConfiguration": {
                "logUri": "s3://joblogs"
            }
        }
    }
}
EOF
```
```
aws emr-containers start-job-run --cli-input-json file://spark-python-in-s3-fsx.json
```
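The job's progress can then be tracked with the job run ID returned by start-job-run (a sketch):
```
aws emr-containers describe-job-run \
    --virtual-cluster-id <virtual-cluster-id> \
    --id <job-run-id>
```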
Expected Behavior:
All Spark jobs run with fsx-claim as their persistent volume claim will mount the statically created FSx for Lustre file system.
Use case:
- A data pipeline consisting of 10 Spark applications can mount the statically created FSx for Lustre file system in every application and write intermediate output to a particular folder. The next Spark job in the pipeline that depends on this data can read it from FSx for Lustre. Data that needs to be persisted beyond the scope of the pipeline can be exported to S3 by creating data repository tasks.
- Data that is used often by multiple Spark applications can also be stored in FSx for Lustre for improved performance.
Dynamic Provisioning¶
An FSx for Lustre file system can be provisioned on demand. A Storage Class resource is created that provisions FSx for Lustre file systems dynamically, and a PVC is created that refers to this Storage Class. Whenever a pod refers to the PVC, the Storage Class invokes the FSx for Lustre Container Storage Interface (CSI) driver to provision a Lustre file system on the fly. In this model, an FSx for Lustre file system of the Scratch deployment type is provisioned.
EKS Admin Tasks¶
- Attach an IAM policy to the EKS worker node IAM role to enable access to FSx for Lustre (see Mount FSx for Lustre on EKS), and create a Security Group for FSx for Lustre
- Install the FSx CSI Driver in EKS
- Configure Storage Class for FSx for Lustre
- Configure Persistent Volume Claim (fsx-dynamic-claim) for FSx for Lustre, as below
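For reference, a minimal sketch of what the fsx-sc Storage Class could look like, following the aws-fsx-csi-driver conventions; the subnet and security group IDs are placeholders, and SCRATCH_1 matches the deployment type described in the expected result below.
```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: <subnet-id>
  securityGroupIds: <securitygroup-id>
  deploymentType: SCRATCH_1
```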
Create a PVC for dynamic provisioning with the fsx-sc storage class.
```
cat >fsx-dynamic-claim.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-dynamic-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 3600Gi
EOF

kubectl apply -f fsx-dynamic-claim.yaml -n <namespace registered with EMR on EKS Virtual Cluster>
```
Spark Developer Tasks¶
```
cat >spark-python-in-s3-fsx-dynamic.json << EOF
{
    "name": "spark-python-in-s3-fsx-dynamic",
    "virtualClusterId": "<virtual-cluster-id>",
    "executionRoleArn": "<execution-role-arn>",
    "releaseLabel": "emr-6.2.0-latest",
    "jobDriver": {
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://<s3 prefix>/trip-count-repartition-fsx.py",
            "sparkSubmitParameters": "--conf spark.driver.cores=5 --conf spark.kubernetes.pyspark.pythonVersion=3 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6 --conf spark.sql.shuffle.partitions=1000"
        }
    },
    "configurationOverrides": {
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.local.dir": "/var/spark/spill/",
                    "spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.options.claimName": "fsx-claim",
                    "spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.path": "/var/data/",
                    "spark.kubernetes.driver.volumes.persistentVolumeClaim.sparkdata.mount.readOnly": "false",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.options.claimName": "fsx-claim",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.path": "/var/data/",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.sparkdata.mount.readOnly": "false",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.options.claimName": "fsx-dynamic-claim",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.path": "/var/spark/spill/",
                    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill.mount.readOnly": "false"
                }
            }
        ],
        "monitoringConfiguration": {
            "cloudWatchMonitoringConfiguration": {
                "logGroupName": "/emr-containers/jobs",
                "logStreamNamePrefix": "demo"
            },
            "s3MonitoringConfiguration": {
                "logUri": "s3://joblogs"
            }
        }
    }
}
EOF
```
```
aws emr-containers start-job-run --cli-input-json file://spark-python-in-s3-fsx-dynamic.json
```
Expected Result:
The statically provisioned FSx for Lustre file system is mounted to /var/data/ on the driver pod, as before. For the executors, a SCRATCH_1 deployment type FSx for Lustre file system is provisioned on the fly by the Storage Class that was created. There will be some latency before the first executor can start running, because the Lustre file system has to be created first; once created, the same file system is mounted to all the executors.
Also note that "spark.local.dir": "/var/spark/spill/" forces the executors to use this Lustre-mounted folder for all spill and shuffle data. Once the Spark job completes, the Lustre file system is deleted or retained based on the PVC configuration.
This dynamically created Lustre file system is mapped to an S3 path like the statically created file system.
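The S3 linkage for dynamically provisioned file systems is configured on the Storage Class. A sketch of the relevant parameters, assuming the aws-fsx-csi-driver's s3ImportPath/s3ExportPath options (the S3 prefixes are placeholders):
```
parameters:
  subnetId: <subnet-id>
  securityGroupIds: <securitygroup-id>
  s3ImportPath: s3://<s3 prefix>
  s3ExportPath: s3://<s3 prefix>/export
```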
FSx-csi user guide