Configuration¶

EC2 / EKS (`server-cuda`)¶

Pass vLLM server arguments directly to docker run:

docker run --gpus all -p 8000:8000 \
  public.ecr.aws/deep-learning-containers/vllm:server-cuda \
  --model openai/gpt-oss-20b \
  --tensor-parallel-size 4 \
  --max-model-len 4096

Argument	Description	Default
`--model`	Model ID or path (required)	—
`--host`	Bind address	`localhost`
`--port`	Server port	`8000`
`--tensor-parallel-size`	Number of GPUs	`1`
`--max-model-len`	Maximum sequence length	Model default
`--gpu-memory-utilization`	Fraction of GPU memory to use	`0.9`
`--enforce-eager`	Disable CUDA graphs	`false`
`--quantization`	Quantization method (awq, gptq, fp8, …)	None
`--dtype`	Data type (auto, float16, bfloat16)	`auto`

For gated models (Llama, Gemma, etc.), pass -e HF_TOKEN=<your_hf_token>.

Amazon SageMaker AI (`server-sagemaker-cuda`)¶

The SageMaker image serves on port 8080 and accepts vLLM flags via SM_VLLM_* environment variables. Each variable is converted to the corresponding vLLM flag (e.g., SM_VLLM_MAX_MODEL_LEN=4096 → --max-model-len 4096). Boolean values follow shell convention: true becomes a bare flag (SM_VLLM_ENFORCE_EAGER=true → --enforce-eager), and false omits the flag entirely.

Variable	Description	Default
`SM_VLLM_MODEL`	Model ID or path (auto-detected from `/opt/ml/model` or `HF_MODEL_ID` if unset)	—
`SM_VLLM_TENSOR_PARALLEL_SIZE`	Number of GPUs	`1`
`SM_VLLM_MAX_MODEL_LEN`	Maximum sequence length	Model default
`SM_VLLM_GPU_MEMORY_UTILIZATION`	Fraction of GPU memory to use	`0.9`
`SM_VLLM_ENFORCE_EAGER`	Disable CUDA graphs	`false`
`SM_VLLM_QUANTIZATION`	Quantization method (awq, gptq, fp8, …)	None
`SM_VLLM_DTYPE`	Data type (auto, float16, bfloat16)	`auto`
`HF_MODEL_ID`	Hugging Face model ID (fallback when `SM_VLLM_MODEL` is unset and `/opt/ml/model` is empty)	—
`HF_TOKEN`	Hugging Face token for gated models	—

Standard-Supervisor Settings¶

The SageMaker image includes standard-supervisor for process management and platform integrations:

Variable	Description	Default
`PROCESS_AUTO_RECOVERY`	Auto-restart vLLM on crash	`true`
`PROCESS_MAX_START_RETRIES`	Max restart attempts before giving up	`3`
`STANDARD_AUTO_INSTALL_REQ`	Auto-install requirements.txt from model artifacts	`true`
`STANDARD_PIP_ARGS`	Custom pip arguments for dependency installation	—

Configuration¶

EC2 / EKS (server-cuda)¶

Amazon SageMaker AI (server-sagemaker-cuda)¶

Standard-Supervisor Settings¶

Full Reference¶

EC2 / EKS (`server-cuda`)¶

Amazon SageMaker AI (`server-sagemaker-cuda`)¶