Configuration¶
EC2 / EKS (server-cuda)¶
Pass SGLang server arguments directly to docker run. The entrypoint forwards them to python3 -m sglang.launch_server:
docker run --gpus all -p 30000:30000 \
public.ecr.aws/deep-learning-containers/sglang:server-cuda \
--model-path openai/gpt-oss-20b \
--tp 4 \
--context-length 4096
| Argument | Description | Default |
|---|---|---|
--model-path |
Model ID or path (required) | — |
--host |
Bind address | 127.0.0.1 |
--port |
Server port | 30000 |
--tp |
Tensor-parallel size (number of GPUs) | 1 |
--context-length |
Maximum sequence length | Model default |
--mem-fraction-static |
Fraction of GPU memory for the KV cache pool | Auto |
--dtype |
Data type (auto, bfloat16, float16) | auto |
--quantization |
Quantization method (fp8, awq, gptq, …) | None |
--trust-remote-code |
Allow custom model code from the Hub | false |
--disable-piecewise-cuda-graph |
Disable experimental piecewise CUDA graph capture | false |
For gated models (Llama, Gemma, etc.), pass -e HF_TOKEN=<your_hf_token>.
Amazon SageMaker AI (server-sagemaker-cuda)¶
The SageMaker image serves on port 8080 and accepts SGLang flags via SM_SGLANG_* environment variables. Each variable is converted to the
corresponding SGLang flag — the name is lowercased and underscores become hyphens (e.g., SM_SGLANG_CONTEXT_LENGTH=4096 → --context-length 4096).
Boolean values follow shell convention: true becomes a bare flag (SM_SGLANG_TRUST_REMOTE_CODE=true → --trust-remote-code), and false omits the
flag entirely.
| Variable | Description | Default |
|---|---|---|
SM_SGLANG_MODEL_PATH |
Model ID or path (defaults to /opt/ml/model when SageMaker mounts artifacts) |
/opt/ml/model |
SM_SGLANG_TP |
Tensor-parallel size (number of GPUs) | 1 |
SM_SGLANG_CONTEXT_LENGTH |
Maximum sequence length | Model default |
SM_SGLANG_MEM_FRACTION_STATIC |
Fraction of GPU memory for the KV cache pool | Auto |
SM_SGLANG_DTYPE |
Data type (auto, bfloat16, float16) | auto |
SM_SGLANG_QUANTIZATION |
Quantization method (fp8, awq, gptq, …) | None |
SM_SGLANG_TRUST_REMOTE_CODE |
Allow custom model code from the Hub | false |
HF_TOKEN |
Hugging Face token for gated models | — |
The entrypoint defaults to --port 8080 and --host 0.0.0.0, and you should leave them there: SageMaker forwards /ping and /invocations to port
8080, so changing the port or host breaks endpoint routing. The --model-path defaults to /opt/ml/model (where SageMaker mounts model artifacts)
unless you set SM_SGLANG_MODEL_PATH.