vLLM Inference¶
Pre-built Docker images for running vLLM inference workloads on AWS. Built on Amazon Linux 2023 with CUDA 12.9 and Python 3.12.
Latest Announcements¶
April 25, 2026 — vLLM 1.0.0 initial release on Amazon Linux 2023 with the simplified server-cuda tag family. The source is pinned to upstream vLLM commit 6ef1efd5; the bundled wheel reports its version as 0.19.1+amzn2023.6ef1efd5.
Pull Commands¶
EC2:

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1

SageMaker:

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1
See Available Images for all image URIs and Getting Started for authentication instructions.
How We Build¶
The vLLM DLC images are curated builds, not simple repackages of upstream releases:
- Built from a chosen base reference — a specific commit, release candidate, or point in vLLM's history — with targeted patches applied from upstream PRs, forks, and community contributions for new-model support, bug fixes, and performance improvements.
- Opinionated testing — validated against a selected suite of model-serving use cases relevant to AWS customers.
- Faster access with higher confidence — delivers the latest advancements while maintaining reliability for real-world workloads.
Each image ships with vLLM (OpenAI-compatible API server), PyTorch, CUDA, NCCL (multi-GPU), and security patches from AWS.
For package versions included in each release, see the Release Notes.
EC2 Deployment¶
The container runs the vLLM OpenAI-compatible API server on port 8000. Any `vllm serve` flag may be appended to the `docker run` command.
Single GPU¶
docker run --gpus all -p 8000:8000 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--host 0.0.0.0 --port 8000
For gated models (Llama, etc.), pass `-e HF_TOKEN=<your_hf_token>`.
Send a request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256
}'
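Because the server speaks the OpenAI API, the official openai Python client can be pointed at it as well. A minimal sketch, assuming the openai package (v1+) is installed; the api_key value is a placeholder, since the server above was started without --api-key:

from openai import OpenAI

# Point the client at the local vLLM server; api_key is a placeholder
# because no --api-key was configured on the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)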
Multi-GPU (Tensor Parallelism)¶
For models that require multiple GPUs (e.g., 70B+):
docker run --gpus all --ipc=host -p 8000:8000 \
-e HF_TOKEN=<your_hf_token> \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--host 0.0.0.0 --port 8000
Use --ipc=host for multi-GPU to enable shared memory between processes.
Recommended Instance Types¶
| Instance Type | GPUs | GPU Memory | Use Case |
|---|---|---|---|
| `g5.xlarge` | 1× A10G (24 GB) | 24 GB | Small models (≤ 8B) |
| `g5.12xlarge` | 4× A10G (24 GB) | 96 GB | Medium models (8B–30B) |
| `p4d.24xlarge` | 8× A100 (40 GB) | 320 GB | Large models (30B–70B) |
| `p5.48xlarge` | 8× H100 (80 GB) | 640 GB | Very large models (70B+) |
ECS / EKS¶
The container works as-is with ECS task definitions and Kubernetes pod specs. Key requirements:
- Request GPU resources (`resourceRequirements: [{type: GPU, value: "1"}]` for ECS, `resources.limits.nvidia.com/gpu: "1"` for EKS)
- Pass `--model`, `--host 0.0.0.0`, and `--port 8000` as container args
- Provide `HF_TOKEN` as an environment variable (or Kubernetes secret) for gated models
- Use `/health` on port 8000 for health checks
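For ECS, these requirements map directly onto a task definition. A minimal boto3 sketch, assuming an EC2-backed GPU cluster; the family name, memory size, and token value are placeholders, not recommendations:

import boto3

ecs = boto3.client("ecs")

# Hypothetical task definition illustrating the requirements above.
ecs.register_task_definition(
    family="vllm-server",
    requiresCompatibilities=["EC2"],  # GPU tasks need EC2, not Fargate
    containerDefinitions=[{
        "name": "vllm",
        "image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1",
        # Appended as args to the server entrypoint, as with docker run.
        "command": ["--model", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "--host", "0.0.0.0", "--port", "8000"],
        "portMappings": [{"containerPort": 8000}],
        "resourceRequirements": [{"type": "GPU", "value": "1"}],
        "environment": [{"name": "HF_TOKEN", "value": "<your_hf_token>"}],
        # Container-level health check against the server's /health route.
        "healthCheck": {"command": ["CMD-SHELL",
                        "curl -f http://localhost:8000/health || exit 1"]},
        "memory": 16384,  # placeholder size
    }],
)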
SageMaker Deployment¶
SageMaker Python SDK v2¶
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
model = Model(
image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
role="arn:aws:iam::<account_id>:role/<role_name>",
predictor_cls=Predictor,
env={"SM_VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"},
)
predictor = model.deploy(
instance_type="ml.g5.2xlarge",
initial_instance_count=1,
inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
serializer=JSONSerializer(),
)
response = predictor.predict({
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256,
})
print(response)
# Cleanup
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)
GPU deployments require `inference_ami_version`: the default SageMaker host AMI ships NVIDIA drivers that are incompatible with CUDA 12.9 images. See the ProductionVariant API reference for valid values.
SageMaker Python SDK v3¶
import json
import boto3
from sagemaker.core.resources import Endpoint, EndpointConfig, Model
from sagemaker.core.shapes import ContainerDefinition, ProductionVariant
model = Model.create(
model_name="vllm-model",
primary_container=ContainerDefinition(
image="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
environment={"SM_VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"},
),
execution_role_arn="arn:aws:iam::<account_id>:role/<role_name>",
)
ep_cfg = EndpointConfig.create(
endpoint_config_name="vllm-config",
production_variants=[
ProductionVariant(
variant_name="default",
model_name="vllm-model",
instance_type="ml.g5.2xlarge",
initial_instance_count=1,
inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
),
],
)
endpoint = Endpoint.create(endpoint_name="vllm-endpoint", endpoint_config_name="vllm-config")
endpoint.wait_for_status("InService")
smrt = boto3.client("sagemaker-runtime")
resp = smrt.invoke_endpoint(
EndpointName="vllm-endpoint",
ContentType="application/json",
Body=json.dumps({
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256,
}),
)
print(json.loads(resp["Body"].read()))
# Cleanup
endpoint.delete()
ep_cfg.delete()
model.delete()
Boto3¶
import json
import boto3
sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")
sm.create_model(
ModelName="vllm-model",
PrimaryContainer={
"Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
"Environment": {"SM_VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"},
},
ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)
sm.create_endpoint_config(
EndpointConfigName="vllm-config",
ProductionVariants=[{
"VariantName": "default",
"ModelName": "vllm-model",
"InstanceType": "ml.g5.2xlarge",
"InitialInstanceCount": 1,
"InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
}],
)
sm.create_endpoint(EndpointName="vllm-endpoint", EndpointConfigName="vllm-config")
sm.get_waiter("endpoint_in_service").wait(EndpointName="vllm-endpoint")
resp = smrt.invoke_endpoint(
EndpointName="vllm-endpoint",
ContentType="application/json",
Body=json.dumps({
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256,
}),
)
print(json.loads(resp["Body"].read()))
# Cleanup
sm.delete_endpoint(EndpointName="vllm-endpoint")
sm.delete_endpoint_config(EndpointConfigName="vllm-config")
sm.delete_model(ModelName="vllm-model")
Configuration¶
SageMaker Environment Variables¶
Any `SM_VLLM_*` env var is converted to a `--<name>` vLLM server argument (e.g., `SM_VLLM_MAX_MODEL_LEN=4096` → `--max-model-len 4096`); a sketch of the mapping follows the table.
| Variable | Description | Required |
|---|---|---|
| `SM_VLLM_MODEL` | Model ID from Hugging Face Hub or S3 path | Yes |
| `HF_TOKEN` | Hugging Face token for gated models | For gated models |
| `SM_VLLM_TENSOR_PARALLEL_SIZE` | Number of GPUs for tensor parallelism | No |
| `SM_VLLM_MAX_MODEL_LEN` | Maximum sequence length | No |
| `SM_VLLM_ENFORCE_EAGER` | Set to `true` to disable CUDA graphs | No |
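The conversion is mechanical. A minimal sketch of the mapping as described above, illustrative only and not the container's actual startup code:

import os

def sm_vllm_args(environ=os.environ):
    """Translate SM_VLLM_* variables into vLLM CLI flags."""
    args = []
    for key, value in environ.items():
        if key.startswith("SM_VLLM_"):
            # SM_VLLM_MAX_MODEL_LEN=4096 -> --max-model-len 4096
            flag = "--" + key[len("SM_VLLM_"):].lower().replace("_", "-")
            args.extend([flag, value])
    return args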
Common Server Arguments (EC2)¶
| Argument | Description | Default |
|---|---|---|
| `--model` | Model ID or path | Required |
| `--host` | Bind address | `localhost` |
| `--port` | Server port | `8000` |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | `1` |
| `--max-model-len` | Maximum sequence length | Model default |
| `--gpu-memory-utilization` | Fraction of GPU memory to use | `0.9` |
| `--enforce-eager` | Disable CUDA graphs for debugging | `false` |
| `--quantization` | Quantization method (`awq`, `gptq`, `fp8`, …) | None |
| `--dtype` | Model data type (`auto`, `float16`, `bfloat16`) | `auto` |
For the complete list, see the vLLM engine arguments documentation and vLLM environment variables.
Supported Models¶
The following models have been validated on the bundled vLLM version:
| Model Family | Example Models | Tested Instance |
|---|---|---|
| Qwen3 | `Qwen/Qwen3-32B` | `p4d.24xlarge` |
| Llama 3.x | `meta-llama/Llama-3.3-70B-Instruct` | `p4d.24xlarge` |
| Gemma 4 | `google/gemma-4-e4b-it`, `google/gemma-4-31b-it` | `g6e.xlarge`, `g6e.12xlarge` |
| Qwen3-ASR (speech recognition) | `Qwen/Qwen3-ASR-1.7B` | `g6e.12xlarge` |
| GPT-OSS | `openai/gpt-oss-20b` | `g6e.xlarge` |
| Qwen3.5 (MoE, FP8) | `Qwen/Qwen3.5-35B-A3B-FP8` | `g6e.12xlarge` |
| Gemma 4 (MoE) | `google/gemma-4-26b-a4b-it` | `g6e.12xlarge` |
Using Custom Models¶
From S3:
docker run --gpus all -p 8000:8000 \
-e AWS_DEFAULT_REGION=us-west-2 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model s3://<bucket>/<prefix>/my-model/
From a local path:
docker run --gpus all -p 8000:8000 \
-v /local/models/my-model:/model \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model /model
Release Notes¶
See vLLM Release Notes for version history and changelogs.