EC2 Deployment
The PyTorch Deep Learning Container (DLC) is a training image; it does not serve a model out of the box. Launch it on EC2 (or ECS/EKS) with your own training script.
Single-GPU Training
docker run --rm -it --gpus all --shm-size=16g --ipc=host \
-v "$(pwd)":/workspace -w /workspace \
public.ecr.aws/deep-learning-containers/pytorch:2.11-cu130-amzn2023 \
python train.py
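PyTorch DataLoader workers share tensors through shared memory, and Docker's default /dev/shm (64 MB) is too small for that. --ipc=host gives the container the host's IPC namespace, including the host's /dev/shm. --shm-size=16g sizes a private /dev/shm and only takes effect when the container is not using the host IPC namespace.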
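To confirm how much shared memory the container actually sees, a quick check from a shell inside it can help. The helper below is a minimal sketch, not something the image provides: shm_size_mib is a hypothetical name, and it assumes df and awk are available in the container.

```shell
# Hypothetical helper: report the total size of a mount in MiB.
# Defaults to /dev/shm, the mount DataLoader workers rely on.
shm_size_mib() (
  mount_point="${1:-/dev/shm}"
  # -P forces one line per filesystem; -k reports 1024-byte blocks.
  df -Pk "$mount_point" | awk 'NR == 2 { printf "%d\n", $2 / 1024 }'
)
```

Running shm_size_mib with no arguments inside the container prints the /dev/shm size. With --ipc=host this is the host's /dev/shm rather than the value passed to --shm-size.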
Multi-GPU Training (single node)
Use torchrun to spawn one process per GPU. Set --nproc_per_node to the number of GPUs on the instance (8 in this example):
docker run --rm -it --gpus all --shm-size=16g --ipc=host \
-v "$(pwd)":/workspace -w /workspace \
public.ecr.aws/deep-learning-containers/pytorch:2.11-cu130-amzn2023 \
torchrun --standalone --nproc_per_node=8 train.py
NCCL is pre-configured for multi-GPU collectives, so no extra flags are required.
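The process count does not have to be hardcoded. As a sketch, the hypothetical helper below counts devices from the output of nvidia-smi -L, which prints one "GPU N: ..." line per GPU. It assumes nvidia-smi is on the PATH inside the container.

```shell
# Hypothetical helper: count GPUs from `nvidia-smi -L` output on stdin.
# Each physical GPU appears as a line starting with "GPU <index>:".
count_gpus() {
  grep -c '^GPU [0-9]'
}

# Example use inside the container (commented out so the block has no side effects):
#   NGPU="$(nvidia-smi -L | count_gpus)"
#   torchrun --standalone --nproc_per_node="$NGPU" train.py
```

torchrun also accepts --nproc_per_node=gpu, which asks it to detect the GPU count itself. The explicit count is mainly useful when the same number is needed elsewhere in a launch script.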
Multi-Node Training (EFA)
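For multi-node training on EFA-capable instances (e.g., p5.48xlarge, p4d.24xlarge), the image ships the EFA software and the NCCL OFI plugin, so collectives flow over EFA automatically.
Run the container with --privileged, or pass the EFA devices through individually with --device (on the host they typically appear as /dev/infiniband/uverbs*), and use --network host. Then launch via MPI or torchrun. With torchrun, run the same command on every node, replacing <head_node> with an address of the first node that all nodes can reach: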
docker run --rm -it --gpus all --privileged --network host \
--shm-size=16g --ipc=host \
-v "$(pwd)":/workspace -w /workspace \
public.ecr.aws/deep-learning-containers/pytorch:2.11-cu130-amzn2023 \
torchrun --nnodes=2 --nproc_per_node=8 \
--rdzv_id=demo --rdzv_backend=c10d --rdzv_endpoint=<head_node>:29500 \
train.py
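With the c10d rendezvous backend, every node runs the same torchrun command and ranks are assigned at rendezvous time. The sketch below makes that explicit by generating the command from three inputs. torchrun_cmd and its argument order are hypothetical, and the rendezvous ID, port, and script name are simply copied from the example above.

```shell
# Hypothetical helper: print the torchrun command that every node should run.
# Arguments: <head_node> <number_of_nodes> <gpus_per_node>
torchrun_cmd() (
  head_node="$1"
  nnodes="$2"
  gpus_per_node="$3"
  printf 'torchrun --nnodes=%s --nproc_per_node=%s --rdzv_id=demo --rdzv_backend=c10d --rdzv_endpoint=%s:29500 train.py\n' \
    "$nnodes" "$gpus_per_node" "$head_node"
)

# Example: print the command for two 8-GPU nodes with 10.0.0.1 as the head node.
#   torchrun_cmd 10.0.0.1 2 8
```

Generating the command once and reusing it on each node avoids the most common launch failure, which is a mismatch in --nnodes, --rdzv_id, or --rdzv_endpoint between nodes.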
Verify EFA Connectivity Before Training
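The image includes the all_reduce_perf binary from nccl-tests at /usr/local/bin/all_reduce_perf. Run it across nodes to confirm that EFA and NCCL are wired up correctly before spending GPU-hours on a real job. The example below assumes two nodes with eight GPUs each (16 ranks, 8 per node), listed in hosts.txt: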
mpirun -np 16 -N 8 -hostfile hosts.txt \
-x NCCL_DEBUG=INFO -x FI_PROVIDER=efa \
/usr/local/bin/all_reduce_perf -b 8 -e 1G -f 2 -g 1
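The mpirun command above expects a hosts.txt file and a rank count that matches the cluster. As a sketch, the hypothetical helpers below write a hostfile in Open MPI's "host slots=N" format and compute the -np value. They assume the mpirun in the image is Open MPI (the -x flag suggests so) and that every node has the same number of GPUs.

```shell
# Hypothetical helper: emit an Open MPI hostfile with a fixed slot count per node.
# Usage: write_hostfile <slots_per_node> <host>... > hosts.txt
write_hostfile() (
  slots="$1"
  shift
  for host in "$@"; do
    printf '%s slots=%s\n' "$host" "$slots"
  done
)

# Hypothetical helper: total ranks for `mpirun -np` (nodes x GPUs per node).
total_ranks() {
  echo $(( $1 * $2 ))
}

# Example for two 8-GPU nodes:
#   write_hostfile 8 10.0.0.1 10.0.0.2 > hosts.txt
#   mpirun -np "$(total_ranks 2 8)" -N 8 -hostfile hosts.txt ...
```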
SSH Between Nodes
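Multi-node MPI launches require SSH between containers. The image ships a pre-configured OpenSSH server on port 22 that runs as root. This is useful for test clusters, but you should harden or replace it for production deployments. Expose port 22 with -p 22:22 (or --network host) and add your public key to /root/.ssh/authorized_keys. Both options bind port 22 on the host, so they conflict with a host SSH daemon that already listens there; in that case, move one of the two to a different port.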
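Adding the key involves a few permission details that sshd is strict about. The following is a minimal sketch under stated assumptions: install_pubkey is a hypothetical helper, the directory defaults to the /root/.ssh location named above, and the key string is passed in as a single argument.

```shell
# Hypothetical helper: add a public key to <ssh_dir>/authorized_keys.
# Usage: install_pubkey "<public key line>" [ssh_dir]
install_pubkey() (
  pubkey="$1"
  ssh_dir="${2:-/root/.ssh}"
  auth="$ssh_dir/authorized_keys"

  # sshd ignores authorized_keys when the directory or file is too permissive.
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || exit 1
  touch "$auth" && chmod 600 "$auth" || exit 1

  # Append only if the exact line is not already present (idempotent).
  grep -qxF -- "$pubkey" "$auth" || printf '%s\n' "$pubkey" >> "$auth"
)

# Example, run inside the container with the contents of your own .pub file:
#   install_pubkey "$(cat /workspace/id_ed25519.pub)"
```

The path in the example assumes the public key file was placed in the mounted /workspace directory; any other way of getting the key into the container works equally well.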
Building on the Image
The image includes gcc, gcc-c++, make, cuda-nvcc, and cuda-cudart-devel, so you can build CUDA extensions in place. PyTorch headers and libraries are located under /opt/venv/lib/python3.12/site-packages/torch/.
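Before starting a long extension build, it can be worth confirming that the toolchain is actually reachable from the shell. The sketch below uses a hypothetical helper, missing_tools, and assumes the compilers are on the PATH (nvcc in particular is sometimes installed under a CUDA directory that is not on the PATH by default).

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Example, run inside the container (gcc-c++ provides the g++ binary):
#   missing_tools gcc g++ make nvcc
```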
Troubleshooting EFA Throughput
If all_reduce_perf runs but throughput is much lower than expected, check that FI_PROVIDER=efa is exported (otherwise NCCL can fall back to TCP sockets) and that the EFA devices are passed through to the container. When EFA is plumbed correctly, lspci | grep -i efa inside the container lists the Elastic Fabric Adapter (an Amazon PCI device), and fi_info -p efa, if fi_info is on the PATH, reports the efa provider.
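The NCCL_DEBUG=INFO output from the all_reduce_perf run also indicates which transport NCCL selected. The helper below is a rough sketch for scanning a saved log. nccl_transport is a hypothetical name, and the matched strings ("NET/OFI" together with "efa", and "NET/Socket") are assumptions about log wording that varies between NCCL and plugin versions, so treat an "unknown" result as a prompt to read the log by hand.

```shell
# Hypothetical helper: guess the NCCL transport from NCCL_DEBUG=INFO output on stdin.
# Prints "efa", "socket", or "unknown". The patterns are assumptions about log wording.
nccl_transport() (
  log="$(cat)"
  if printf '%s\n' "$log" | grep -qi 'NET/OFI.*efa'; then
    echo efa
  elif printf '%s\n' "$log" | grep -q 'NET/Socket'; then
    echo socket
  else
    echo unknown
  fi
)

# Example: save the mpirun output and classify it.
#   mpirun ... /usr/local/bin/all_reduce_perf ... 2>&1 | tee nccl.log
#   nccl_transport < nccl.log
```

A "socket" result alongside low bandwidth points to the fallback path described above rather than to a hardware problem.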