
Distributed Training on Amazon EKS with AWS Deep Learning Containers

In this post, we show how to configure and verify a distributed training cluster using AWS Deep Learning Containers on Amazon Elastic Kubernetes Service (Amazon EKS). We build a cost-effective, enterprise-scale distributed training environment for large language models using P4d instances, FSx for Lustre storage, and PyTorch Fully Sharded Data Parallel (FSDP), while making sure the infrastructure meets production standards. To demonstrate, we set up a distributed training system that fine-tunes Meta Llama 2 7B, launching and verifying each required component in a systematic way.

This sample consists of the following components:

Infrastructure Setup – Deploy EKS cluster with GPU-optimized P4d instances and EFA networking for high-performance distributed training.

Container Building – Create custom Docker images based on AWS Deep Learning Containers with additional dependencies for training workloads.

Plugin Installation – Configure NVIDIA GPU plugins, EFA networking, distributed training frameworks (etcd, Kubeflow Training Operator), and persistent storage drivers.

Storage Configuration – Set up FSx for Lustre high-performance parallel filesystem for training data and model checkpoints.

Validation & Testing – Run comprehensive health checks including GPU validation, NCCL communication tests, and sample training workloads.

Training Orchestration – Launch distributed PyTorch jobs using FSDP with proper worker coordination and fault handling.

This repository is explained in detail in the AWS blog post "Configuring and Verifying a Distributed Training Cluster with AWS Deep Learning Containers on Amazon Elastic Kubernetes Service".

What the Code Contains

The repository includes practical scripts and configurations demonstrating:

  • Building custom Docker images from AWS Deep Learning Containers with PyTorch 2.7.1
  • Deploying EKS clusters with GPU node groups and EFA-enabled networking using eksctl
  • Installing and configuring NVIDIA device plugins, EFA plugins, and distributed training operators
  • Setting up FSx for Lustre filesystem for high-throughput storage
  • Running NCCL tests to validate multi-node GPU communication
  • Launching distributed PyTorch training jobs with FSDP using Kubeflow Training Operator

EKS Distributed Training Architecture (architecture diagram)

Prerequisites

  • An AWS account with billing enabled
  • AWS CLI configured with appropriate permissions
  • Docker installed in the build environment
  • At least 100 GiB of storage for building containers
  • Hugging Face token for Llama 2 model access (gated model)
  • Optional: EC2 Capacity Reservation for P4d instances
  • Deep Learning AMI for container building

Execution

Step 1: Environment Setup

Launch an EC2 instance with Deep Learning AMI and install dependencies:

# Clone this repository
git clone <repository-url>
cd <repository-name>

# Install AWS CLI, kubectl, and eksctl
source ./setup_ec2.sh
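
The setup script installs the standard tooling; if you prefer to install it by hand, the following is a minimal sketch of the equivalent commands on a Linux x86_64 host (versions and exact steps are assumptions, not taken from setup_ec2.sh):

# Install kubectl (pick a version compatible with your cluster)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -m 0755 kubectl /usr/local/bin/kubectl

# Install eksctl
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_Linux_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin/

# Verify the tools are on the PATH
kubectl version --client
eksctl version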

Step 2: Build Custom Training Container

# Build and push custom Docker image with training dependencies
bash ./build.sh
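
build.sh wraps the usual Amazon ECR workflow. The sketch below shows the general shape of that flow; the account ID, Region, and repository name are placeholders, and 763104351884 is the Deep Learning Containers registry account used in most commercial Regions:

# Log in to the AWS Deep Learning Containers registry (to pull the base image) and to your own registry
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com

# Build from the custom Dockerfile and push the image for the EKS nodes to pull
docker build -f Dockerfile.llama2-efa-dlc -t <account-id>.dkr.ecr.us-east-1.amazonaws.com/<repository>:latest .
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/<repository>:latest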

Step 3: Deploy EKS Cluster

# Create EKS cluster with GPU nodes and required add-ons
eksctl create cluster -f ./eks-p4d-odcr.yaml
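
Cluster creation takes a while; eksctl writes the kubeconfig for the new cluster automatically when it finishes. A quick sanity check that the GPU node group registered (cluster name and Region are placeholders):

# List node groups and confirm the p4d.24xlarge nodes joined the cluster
eksctl get nodegroup --cluster <cluster-name> --region <region>
kubectl get nodes -L node.kubernetes.io/instance-type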

Step 4: Install Training Plugins

# Deploy etcd for worker coordination
kubectl apply -f etcd.yaml

# Install Kubeflow Training Operator
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.3"

# Create FSx filesystem and storage
bash ./fsx_create.sh
kubectl apply -f ./fsx-pvc-static.yaml
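
For reference, static provisioning with the FSx for Lustre CSI driver pairs a PersistentVolume that points at the filesystem with a PersistentVolumeClaim that the training pods mount. The sketch below is illustrative only — the filesystem ID, DNS name, mount name, and capacity are placeholders; fsx_create.sh and fsx-pvc-static.yaml contain the real values:

# Illustrative equivalent of fsx-pvc-static.yaml (placeholder values)
cat <<'EOF' > fsx-pvc-static-example.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0                            # FSx filesystem ID (placeholder)
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com   # filesystem DNS name (placeholder)
      mountname: abcdefgh                                         # Lustre mount name (placeholder)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1200Gi
  volumeName: fsx-pv
EOF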

Step 5: Validate Environment

# Verify GPU availability
kubectl apply -f nvidia_smi.yaml
kubectl logs nvidia-smi

# Test NCCL communication
kubectl apply -f nccl-tests.yaml
kubectl get pods | grep nccl
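
The NCCL test pods report bus bandwidth (busbw) per message size; once they complete, inspect the logs and look for bandwidth that scales up with message size and no reported errors (the exact pod names depend on nccl-tests.yaml):

# Follow the output of the first NCCL test pod
kubectl logs -f $(kubectl get pods -o name | grep nccl | head -n 1)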

Step 6: Run Distributed Training

# Configure Hugging Face token in fsdp.conf
# Then launch training job
bash ./fsdp.sh
kubectl apply -f ./fsdp.yaml

# Monitor training progress
kubectl get pods | grep fsdp
kubectl logs -f fsdp-worker-0
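
Because the Kubeflow Training Operator runs the workload as a PyTorchJob custom resource, you can also monitor the job itself rather than the individual pods (the job name is whatever fsdp.yaml defines):

# Check overall job status and events
kubectl get pytorchjobs
kubectl describe pytorchjob <job-name>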

Project Structure

├── setup_ec2.sh              # EC2 environment setup
├── build.sh                  # Docker image build script
├── Dockerfile.llama2-efa-dlc # Custom training container
├── .env                      # Environment variables
├── eks-p4d-odcr.yaml         # EKS cluster configuration
├── eks-p4d.yaml              # Alternative cluster config
├── eks-vpc-odcr-p4d.yaml     # VPC-specific cluster config
├── fsx_create.sh             # FSx filesystem creation
├── fsx_deploy.sh             # FSx deployment to EKS
├── fsx_delete.sh             # FSx cleanup
├── fsx-pvc-static.yaml       # FSx persistent volume claim
├── fsx.conf                  # FSx configuration
├── fsdp.yaml                 # PyTorch distributed training job
├── fsdp.yaml-template        # Training job template
├── fsdp.sh                   # Training job launcher
├── fsdp.conf                 # Training configuration
├── etcd.yaml                 # Worker coordination service
├── nccl-tests.yaml           # Network performance validation
└── nvidia_smi.yaml           # GPU validation job

Configuration Details

All of the sample scripts and manifests can be adjusted to the needs of a specific workload.

Cluster Configuration

  • System nodes: c5.2xlarge for cluster management
  • GPU nodes: p4d.24xlarge with EFA networking (8 A100 GPUs per node)
  • Storage: 500 GiB EBS volumes + FSx for Lustre
  • Networking: EFA-enabled for high-performance communication
  • Kubernetes: Version 1.33 with managed node groups

Training Configuration

  • Model: Meta Llama 2 7B (gated model - requires HF token)
  • Framework: PyTorch with FSDP (Fully Sharded Data Parallel)
  • Communication: NCCL with AWS OFI backend for EFA
  • Storage: FSx for Lustre for dataset and checkpoints
  • Orchestration: Kubeflow Training Operator with etcd coordination (sketched below)
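
These pieces come together in the PyTorchJob manifest that fsdp.sh renders from fsdp.yaml-template. The sketch below illustrates that general shape and is not the repository's actual manifest — the image, training script path, replica counts, and FSx claim name are placeholders, and it assumes the etcd Service from Step 4 is named etcd:

# Illustrative PyTorchJob (placeholder values; see fsdp.yaml-template for the real manifest)
cat <<'EOF' > fsdp-example.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: fsdp
spec:
  elasticPolicy:
    rdzvBackend: etcd        # torchrun rendezvous via the etcd service deployed in Step 4
    rdzvHost: etcd
    rdzvPort: 2379
    minReplicas: 2
    maxReplicas: 2
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:latest
              command:
                - torchrun
                - --nnodes=2
                - --nproc_per_node=8   # one process per GPU on p4d.24xlarge
                - /workspace/train.py  # placeholder training script
              resources:
                limits:
                  nvidia.com/gpu: 8          # all 8 GPUs per node
                  vpc.amazonaws.com/efa: 4   # all 4 EFA interfaces on p4d.24xlarge
              volumeMounts:
                - name: fsx
                  mountPath: /fsx
          volumes:
            - name: fsx
              persistentVolumeClaim:
                claimName: fsx-pvc
EOF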

Validation and Testing

The setup includes comprehensive validation steps:

  • GPU Validation: Verify NVIDIA drivers and GPU visibility
  • Network Testing: NCCL all-reduce and bandwidth tests
  • Storage Verification: FSx mount and throughput validation (example below)
  • Training Validation: Sample FSDP job with Llama 2 7B
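
A quick way to spot-check the storage items above from inside a running pod — the pod name and the /fsx mount path are assumptions that depend on your job manifest:

# Confirm FSx is mounted and measure rough write throughput from a worker pod
kubectl exec -it fsdp-worker-0 -- df -h /fsx
kubectl exec -it fsdp-worker-0 -- dd if=/dev/zero of=/fsx/throughput-test bs=1M count=1024 oflag=direct
kubectl exec -it fsdp-worker-0 -- rm /fsx/throughput-test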

Troubleshooting

Common Issues

  • GPU not visible: Check NVIDIA device plugin installation (see the DaemonSet check below)
  • EFA not working: Verify EFA plugin and instance type support
  • Training fails: Ensure etcd is running and accessible
  • Storage issues: Verify FSx filesystem is mounted correctly
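
The device plugins from the first two items run as DaemonSets; a loose check (DaemonSet names and namespaces vary with how the plugins were installed, so the grep pattern is intentionally broad):

# Look for the NVIDIA and EFA device plugin DaemonSets and confirm they are scheduled on every GPU node
kubectl get daemonsets -A | grep -Ei 'nvidia|efa'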

Debugging Commands

# Check node status and GPU resources
kubectl get nodes -o wide
kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'

# Check EFA availability
kubectl get nodes -o=custom-columns=NAME:.metadata.name,EFA:.status.allocatable.vpc\\.amazonaws\\.com/efa

# Monitor training jobs
kubectl describe -f ./fsdp.yaml
kubectl logs <pod-name> -f

Cleanup

# Stop training job and coordination services
kubectl delete -f ./fsdp.yaml
kubectl delete -f ./etcd.yaml

# Delete FSx filesystem
bash ./fsx_delete.sh

# Delete EKS cluster
eksctl delete cluster -f ./eks-p4d-odcr.yaml

Cost Optimization

  • Use EC2 Capacity Reservations for predictable P4d availability
  • Configure cluster autoscaling to scale down when not training
  • Monitor FSx usage and adjust throughput based on workload needs
  • Consider Spot instances for development/testing workloads

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Additional Resources