Amazon SageMaker AI Deployment¶

Deploy vLLM on Amazon SageMaker AI endpoints. The SageMaker image variant accepts model configuration via environment variables and serves on port 8080.

Specifying the Model¶

The SageMaker image resolves the model in this order:

SM_VLLM_MODEL environment variable — explicit Hugging Face ID or path
/opt/ml/model — when SageMaker mounts model artifacts via ModelDataUrl or ModelDataSource, the entrypoint auto-detects them
HF_MODEL_ID environment variable — fallback Hugging Face ID

Set exactly one. For gated models, also pass HF_TOKEN.

SageMaker Python SDK v2¶

from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

model = Model(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
    role="arn:aws:iam::<account_id>:role/<role_name>",
    predictor_cls=Predictor,
    env={"SM_VLLM_MODEL": "openai/gpt-oss-20b"},
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
    serializer=JSONSerializer(),
)

response = predictor.predict({
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "max_tokens": 256,
})
print(response)

# Cleanup
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

SageMaker Python SDK v3¶

import json
import boto3
from sagemaker.core.resources import Endpoint, EndpointConfig, Model
from sagemaker.core.shapes import ContainerDefinition, ProductionVariant

model = Model.create(
    model_name="vllm-model",
    primary_container=ContainerDefinition(
        image="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
        environment={"SM_VLLM_MODEL": "openai/gpt-oss-20b"},
    ),
    execution_role_arn="arn:aws:iam::<account_id>:role/<role_name>",
)

ep_cfg = EndpointConfig.create(
    endpoint_config_name="vllm-config",
    production_variants=[
        ProductionVariant(
            variant_name="default",
            model_name="vllm-model",
            instance_type="ml.g5.2xlarge",
            initial_instance_count=1,
            inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
        ),
    ],
)

endpoint = Endpoint.create(endpoint_name="vllm-endpoint", endpoint_config_name="vllm-config")
endpoint.wait_for_status("InService")

smrt = boto3.client("sagemaker-runtime")
resp = smrt.invoke_endpoint(
    EndpointName="vllm-endpoint",
    ContentType="application/json",
    Body=json.dumps({
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 256,
    }),
)
print(json.loads(resp["Body"].read()))

# Cleanup
endpoint.delete()
ep_cfg.delete()
model.delete()

Boto3¶

import json
import boto3

sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")

sm.create_model(
    ModelName="vllm-model",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:server-sagemaker-cuda-v1",
        "Environment": {"SM_VLLM_MODEL": "openai/gpt-oss-20b"},
    },
    ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)

sm.create_endpoint_config(
    EndpointConfigName="vllm-config",
    ProductionVariants=[{
        "VariantName": "default",
        "ModelName": "vllm-model",
        "InstanceType": "ml.g5.2xlarge",
        "InitialInstanceCount": 1,
        "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
    }],
)

sm.create_endpoint(EndpointName="vllm-endpoint", EndpointConfigName="vllm-config")
sm.get_waiter("endpoint_in_service").wait(EndpointName="vllm-endpoint")

resp = smrt.invoke_endpoint(
    EndpointName="vllm-endpoint",
    ContentType="application/json",
    Body=json.dumps({
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 256,
    }),
)
print(json.loads(resp["Body"].read()))

# Cleanup
sm.delete_endpoint(EndpointName="vllm-endpoint")
sm.delete_endpoint_config(EndpointConfigName="vllm-config")
sm.delete_model(ModelName="vllm-model")

Model Artifacts¶

When ModelDataUrl (or ModelDataSource) points to a tarball/S3 prefix, SageMaker mounts the contents at /opt/ml/model. The entrypoint auto-detects the model directory, so SM_VLLM_MODEL can be omitted. Optional files extend behavior without rebuilding the container:

model.tar.gz
├── config.json              # standard model files (Hugging Face layout)
├── tokenizer.json
├── *.safetensors
├── requirements.txt         # (optional) extra Python deps installed at startup
└── code/
    └── model.py             # (optional) custom /ping and /invocations handlers

SageMaker Features¶

The SageMaker vLLM image uses standard-supervisor to provide platform-specific integrations:

Process supervision and auto-recovery — if the vLLM process crashes, it is automatically restarted (up to 3 retries by default). This prevents the container from exiting on transient failures.

Dynamic dependency installation — bundle a requirements.txt with your model artifacts and dependencies are installed automatically before server startup. No custom container needed.

Custom handler support — override the default /ping and /invocations endpoints with your own logic by placing a model.py in your model artifacts:

import model_hosting_container_standards.sagemaker as sagemaker_standards
from fastapi import Request, Response

@sagemaker_standards.custom_invocation_handler
async def custom_invoke(request: Request) -> Response:
    body = await request.json()
    # your custom logic here
    return Response(content=json.dumps(result), media_type="application/json")

LoRA adapter support — dynamically load and route to LoRA adapters via request headers without restarting the server.

Notes¶

GPU deployments require inference_ami_version — the default SageMaker host AMI has incompatible NVIDIA drivers for CUDA 12.9 images. See ProductionVariant API reference for valid values.
Any SM_VLLM_* environment variable is converted to a --<name> vLLM server argument (e.g., SM_VLLM_MAX_MODEL_LEN=4096 → --max-model-len 4096).

For all configuration options including server arguments, SageMaker environment variables, and standard-supervisor settings, see Configuration.