vLLM Inference¶
Pre-built Docker images for running vLLM inference workloads on AWS. Built on Amazon Linux 2023 with CUDA 12.9 and Python 3.12.
Latest Announcements¶
April 25, 2026 — vLLM 1.0.0 initial release on Amazon Linux 2023 with the simplified server-cuda tag family. The source is pinned to upstream vLLM commit 6ef1efd5; the bundled wheel reports its version as 0.19.1+amzn2023.6ef1efd5.
Pull Commands¶
EC2:

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1

SageMaker:

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1
See Available Images for all image URIs and Getting Started for authentication instructions.
How We Build¶
The vLLM DLC images are curated builds, not simple repackages of upstream releases:
- Built from a chosen base reference — a specific commit, release candidate, or point in vLLM's history — with targeted patches applied from upstream PRs, forks, and community contributions for new-model support, bug fixes, and performance improvements.
- Opinionated testing — validated against a selected suite of model-serving use cases relevant to AWS customers.
- Faster access with higher confidence — delivers the latest advancements while maintaining reliability for real-world workloads.
Each image ships with vLLM (OpenAI-compatible API server), PyTorch, CUDA, NCCL (multi-GPU), and security patches from AWS.
For package versions included in each release, see the Release Notes.
EC2 Deployment¶
The container runs the vLLM OpenAI-compatible API server on port 8000. Any `vllm serve` flag may be appended to the `docker run` command.
Single GPU¶
docker run --gpus all -p 8000:8000 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--host 0.0.0.0 --port 8000
For gated models (Llama, etc.), pass `-e HF_TOKEN=<your_hf_token>`.
Send a request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256
}'
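Because the server speaks the OpenAI API, the official openai Python client can be pointed at it as well. A minimal sketch, assuming the openai package (v1+) is installed; the api_key value is a placeholder, since the server above was started without --api-key:

from openai import OpenAI

# Point the client at the local vLLM server; api_key is a placeholder
# because no --api-key was configured on the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)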
Multi-GPU (Tensor Parallelism)¶
For models that require multiple GPUs (e.g., 70B+):
docker run --gpus all --ipc=host -p 8000:8000 \
-e HF_TOKEN=<your_hf_token> \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--host 0.0.0.0 --port 8000
Use --ipc=host for multi-GPU to enable shared memory between processes.
Recommended Instance Types¶
| Instance Type | GPUs | GPU Memory | Use Case |
|---|---|---|---|
| `g5.xlarge` | 1× A10G (24 GB) | 24 GB | Small models (≤ 8B) |
| `g5.12xlarge` | 4× A10G (24 GB) | 96 GB | Medium models (8B–30B) |
| `p4d.24xlarge` | 8× A100 (40 GB) | 320 GB | Large models (30B–70B) |
| `p5.48xlarge` | 8× H100 (80 GB) | 640 GB | Very large models (70B+) |
ECS / EKS¶
The container works as-is with ECS task definitions and Kubernetes pod specs. Key requirements:
- Request GPU resources (`resourceRequirements: [{type: GPU, value: "1"}]` for ECS, `resources.limits.nvidia.com/gpu: "1"` for EKS)
- Pass `--model`, `--host 0.0.0.0`, and `--port 8000` as container args
- Provide `HF_TOKEN` as an environment variable (or Kubernetes secret) for gated models
- Use `/health` on port 8000 for health checks
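For ECS, these requirements map directly onto a task definition. A minimal boto3 sketch, assuming an EC2-backed GPU cluster; the family name, memory size, and token value are placeholders, not recommendations:

import boto3

ecs = boto3.client("ecs")

# Hypothetical task definition illustrating the requirements above.
ecs.register_task_definition(
    family="vllm-server",
    requiresCompatibilities=["EC2"],  # GPU tasks need EC2, not Fargate
    containerDefinitions=[{
        "name": "vllm",
        "image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1",
        # Appended as args to the server entrypoint, as with docker run.
        "command": ["--model", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "--host", "0.0.0.0", "--port", "8000"],
        "portMappings": [{"containerPort": 8000}],
        "resourceRequirements": [{"type": "GPU", "value": "1"}],
        "environment": [{"name": "HF_TOKEN", "value": "<your_hf_token>"}],
        # Container-level health check against the server's /health route.
        "healthCheck": {"command": ["CMD-SHELL",
                        "curl -f http://localhost:8000/health || exit 1"]},
        "memory": 16384,  # placeholder size
    }],
)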
SageMaker Deployment¶
SageMaker Python SDK v2¶
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
model = Model(
image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
role="arn:aws:iam::<account_id>:role/<role_name>",
predictor_cls=Predictor,
env={"SM_VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"},
)
predictor = model.deploy(
instance_type="ml.g5.2xlarge",
initial_instance_count=1,
inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
serializer=JSONSerializer(),
)
response = predictor.predict({
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256,
})
print(response)
# Cleanup
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)
GPU deployments require `inference_ami_version`: the default SageMaker host AMI ships NVIDIA drivers that are incompatible with CUDA 12.9 images. See the ProductionVariant API reference for valid values.
SageMaker Python SDK v3¶
import json
import boto3
from sagemaker.core.resources import Endpoint, EndpointConfig, Model
from sagemaker.core.shapes import ContainerDefinition, ProductionVariant
model = Model.create(
model_name="vllm-model",
primary_container=ContainerDefinition(
image="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
environment={"SM_VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"},
),
execution_role_arn="arn:aws:iam::<account_id>:role/<role_name>",
)
ep_cfg = EndpointConfig.create(
endpoint_config_name="vllm-config",
production_variants=[
ProductionVariant(
variant_name="default",
model_name="vllm-model",
instance_type="ml.g5.2xlarge",
initial_instance_count=1,
inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
),
],
)
endpoint = Endpoint.create(endpoint_name="vllm-endpoint", endpoint_config_name="vllm-config")
endpoint.wait_for_status("InService")
smrt = boto3.client("sagemaker-runtime")
resp = smrt.invoke_endpoint(
EndpointName="vllm-endpoint",
ContentType="application/json",
Body=json.dumps({
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256,
}),
)
print(json.loads(resp["Body"].read()))
# Cleanup
endpoint.delete()
ep_cfg.delete()
model.delete()
Boto3¶
import json
import boto3
sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")
sm.create_model(
ModelName="vllm-model",
PrimaryContainer={
"Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
"Environment": {"SM_VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"},
},
ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)
sm.create_endpoint_config(
EndpointConfigName="vllm-config",
ProductionVariants=[{
"VariantName": "default",
"ModelName": "vllm-model",
"InstanceType": "ml.g5.2xlarge",
"InitialInstanceCount": 1,
"InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
}],
)
sm.create_endpoint(EndpointName="vllm-endpoint", EndpointConfigName="vllm-config")
sm.get_waiter("endpoint_in_service").wait(EndpointName="vllm-endpoint")
resp = smrt.invoke_endpoint(
EndpointName="vllm-endpoint",
ContentType="application/json",
Body=json.dumps({
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [{"role": "user", "content": "What is deep learning?"}],
"max_tokens": 256,
}),
)
print(json.loads(resp["Body"].read()))
# Cleanup
sm.delete_endpoint(EndpointName="vllm-endpoint")
sm.delete_endpoint_config(EndpointConfigName="vllm-config")
sm.delete_model(ModelName="vllm-model")
Configuration¶
SageMaker Environment Variables¶
Any `SM_VLLM_*` env var is converted to a `--<name>` vLLM server argument (e.g., `SM_VLLM_MAX_MODEL_LEN=4096` → `--max-model-len 4096`); a sketch of the mapping follows the table.
| Variable | Description | Required |
|---|---|---|
| `SM_VLLM_MODEL` | Model ID from Hugging Face Hub or S3 path | Yes |
| `HF_TOKEN` | Hugging Face token for gated models | For gated models |
| `SM_VLLM_TENSOR_PARALLEL_SIZE` | Number of GPUs for tensor parallelism | No |
| `SM_VLLM_MAX_MODEL_LEN` | Maximum sequence length | No |
| `SM_VLLM_ENFORCE_EAGER` | Set to `true` to disable CUDA graphs | No |
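The conversion is mechanical. A minimal sketch of the mapping as described above, illustrative only and not the container's actual startup code:

import os

def sm_vllm_args(environ=os.environ):
    """Translate SM_VLLM_* variables into vLLM CLI flags."""
    args = []
    for key, value in environ.items():
        if key.startswith("SM_VLLM_"):
            # SM_VLLM_MAX_MODEL_LEN=4096 -> --max-model-len 4096
            flag = "--" + key[len("SM_VLLM_"):].lower().replace("_", "-")
            args.extend([flag, value])
    return args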
Common Server Arguments (EC2)¶
| Argument | Description | Default |
|---|---|---|
| `--model` | Model ID or path | Required |
| `--host` | Bind address | `localhost` |
| `--port` | Server port | `8000` |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | `1` |
| `--max-model-len` | Maximum sequence length | Model default |
| `--gpu-memory-utilization` | Fraction of GPU memory to use | `0.9` |
| `--enforce-eager` | Disable CUDA graphs for debugging | `false` |
| `--quantization` | Quantization method (`awq`, `gptq`, `fp8`, …) | None |
| `--dtype` | Model data type (`auto`, `float16`, `bfloat16`) | `auto` |
For the complete list, see the vLLM engine arguments documentation and vLLM environment variables.
Supported Models¶
The following models have been validated on the bundled vLLM version:
| Model Family | Example Models | Tested Instance |
|---|---|---|
| Qwen3 | `Qwen/Qwen3-32B` | `p4d.24xlarge` |
| Llama 3.x | `meta-llama/Llama-3.3-70B-Instruct` | `p4d.24xlarge` |
| Gemma 4 | `google/gemma-4-e4b-it`, `google/gemma-4-31b-it` | `g6e.xlarge`, `g6e.12xlarge` |
| Qwen3-ASR (speech recognition) | `Qwen/Qwen3-ASR-1.7B` | `g6e.12xlarge` |
| GPT-OSS | `openai/gpt-oss-20b` | `g6e.xlarge` |
| Qwen3.5 (MoE, FP8) | `Qwen/Qwen3.5-35B-A3B-FP8` | `g6e.12xlarge` |
| Gemma 4 (MoE) | `google/gemma-4-26b-a4b-it` | `g6e.12xlarge` |
Using Custom Models¶
From S3:
docker run --gpus all -p 8000:8000 \
-e AWS_DEFAULT_REGION=us-west-2 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model s3://<bucket>/<prefix>/my-model/
From a local path:
docker run --gpus all -p 8000:8000 \
-v /local/models/my-model:/model \
763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-cuda-v1 \
--model /model
Release Notes¶
See vLLM Release Notes for version history and changelogs.