EC2 Deployment¶
The container runs the SGLang OpenAI-compatible API server on port 30000. Any sglang.launch_server flag may be appended to docker run. See
Configuration for the full list of server arguments.
Single GPU¶
docker run --gpus all -p 30000:30000 \
public.ecr.aws/deep-learning-containers/sglang:server-cuda \
--model-path openai/gpt-oss-20b \
--host 0.0.0.0 --port 30000
For gated models (Llama, Gemma, etc.), pass -e HF_TOKEN=<your_hf_token>.
Send a request:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-20b",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256
}'
Multi-GPU (Tensor Parallelism)¶
For models that require multiple GPUs (e.g., 70B+):
docker run --gpus all --ipc=host -p 30000:30000 \
-e HF_TOKEN=<your_hf_token> \
public.ecr.aws/deep-learning-containers/sglang:server-cuda \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--tp 8 \
--host 0.0.0.0 --port 30000
--ipc=host enables shared memory between GPU processes.
Model-Specific Tuning¶
For recommended serving flags, hardware configurations, and quantization options per model, see the SGLang hyperparameter tuning guide. A few notes from our regression suite:
- FP8 MoE models (e.g., Qwen3 Coder, Qwen3.5-35B-A3B): block-quantized weights require each tensor-parallel shard's gate/up
output_sizeto be a multiple of 128. On an 8-GPU host, use--tp 4rather than--tp 8. - Large dense models (e.g., Llama 3.3 70B): pass
--mem-fraction-static 0.88and--disable-piecewise-cuda-graphto reduce compile-time host memory during warmup.