LLM Serving using SGLang DLC¶

Production-ready Docker images for serving large language models with SGLang on AWS. Built on Amazon Linux 2023 with ongoing security patching.

Images¶

Platform	Image	Default Port
EC2 / EKS	`public.ecr.aws/deep-learning-containers/sglang:server-cuda`	30000
Amazon SageMaker AI	`public.ecr.aws/deep-learning-containers/sglang:server-sagemaker-cuda`	8080

All images are also available on the ECR Public Gallery. For private ECR URIs, see Image Access.

What's Included¶

In addition to SGLang and its core stack (PyTorch 2.11, CUDA 13.0, NCCL, Python 3.12), the images bundle:

FlashInfer — fused attention kernels with precompiled cubins for fast cold start
DeepEP — expert-parallel kernels for large MoE models (DeepSeek, Qwen MoE)
Mooncake — KV-cache transfer engine for disaggregated prefill/decode
NIXL — KV connector for prefill/decode (PD) disaggregation KV transfer
runai-model-streamer — fast weight streaming from object storage, with the s3, gcs, and azure extras enabled
sgl-kernel — SGLang's custom CUDA kernels, built from source for the bundled CUDA arch list
EFA and OpenMPI — high-throughput multi-node networking on supported instances

The images are built from SGLang source against the H100 (sm_90) and Blackwell (sm_100, sm_103) CUDA architectures.

The runtime also bundles decord, lmdb, and peft to support multimodal vision-grounding models such as NVIDIA LocateAnything-3B, which return bounding boxes for objects matching a text prompt. See Supported Models for the tested set.

API Endpoints¶

The container runs SGLang's OpenAI-compatible API server. Common endpoints:

Endpoint	Purpose
`POST /v1/chat/completions`	Chat-style generation
`POST /v1/completions`	Legacy text completion
`POST /v1/embeddings`	Generate embeddings (embedding models)
`POST /generate`	SGLang-native generation API
`GET /v1/models`	List loaded model(s)
`GET /get_model_info`	Model metadata
`GET /health`, `/health_generate`	Liveness probe
`POST /flush_cache`	Flush the radix attention cache
`GET /metrics`	Prometheus metrics

Refer to SGLang's API documentation for request/response schemas and the full endpoint list.

How We Build¶

These images are curated builds, not simple repackages of upstream releases:

Built from upstream source — images build SGLang from a pinned upstream commit, each gated by our regression test suite before publication.
Regression-tested — validated against a suite of models on every release. See Supported Models for the full list.
Security-patched — continuously maintained with security patches from AWS on an Amazon Linux 2023 base.