Skip to content

Using Deep Learning Containers

This page shows common deployment patterns across frameworks. For framework-specific deep dives, see the dedicated guides: vLLM, vLLM-Omni, Ray.

Additional Resources

Running on Amazon SageMaker AI

Using SageMaker Python SDK

Deploy an SGLang inference endpoint:

from sagemaker.model import Model

model = Model(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:0.5.12-gpu-py312-cu130-ubuntu24.04-sagemaker",
    role="arn:aws:iam::<account_id>:role/<role_name>",
    env={
        "SM_SGLANG_MODEL_PATH": "meta-llama/Llama-3.1-8B-Instruct",
        "HF_TOKEN": "<your_hf_token>",
    },
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
)

Deploy a vLLM inference endpoint:

from sagemaker.model import Model

model = Model(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:0.21.0-gpu-py312-cu130-ubuntu22.04-sagemaker",
    role="arn:aws:iam::<account_id>:role/<role_name>",
    env={
        "SM_VLLM_MODEL": "meta-llama/Llama-3.1-8B-Instruct",
        "HF_TOKEN": "<your_hf_token>",
    },
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
)

Using Boto3

Deploy an SGLang inference endpoint:

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_model(
    ModelName="sglang-model",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:0.5.12-gpu-py312-cu130-ubuntu24.04-sagemaker",
        "Environment": {
            "SM_SGLANG_MODEL_PATH": "meta-llama/Llama-3.1-8B-Instruct",
            "HF_TOKEN": "<your_hf_token>",
        },
    },
    ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)

sagemaker.create_endpoint_config(
    EndpointConfigName="sglang-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "default",
            "ModelName": "sglang-model",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
        }
    ],
)

sagemaker.create_endpoint(
    EndpointName="sglang-endpoint",
    EndpointConfigName="sglang-endpoint-config",
)

Deploy a vLLM inference endpoint:

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_model(
    ModelName="vllm-model",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:0.21.0-gpu-py312-cu130-ubuntu22.04-sagemaker",
        "Environment": {
            "SM_VLLM_MODEL": "meta-llama/Llama-3.1-8B-Instruct",
            "HF_TOKEN": "<your_hf_token>",
        },
    },
    ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)

sagemaker.create_endpoint_config(
    EndpointConfigName="vllm-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "default",
            "ModelName": "vllm-model",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
        }
    ],
)

sagemaker.create_endpoint(
    EndpointName="vllm-endpoint",
    EndpointConfigName="vllm-endpoint-config",
)

Running on Amazon EC2

Running PyTorch Training Container on an EC2 Instance

# Run interactively
docker run -it --gpus all <account_id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag> bash

# Example: Run PyTorch container
docker run -it --gpus all 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.10.0-cpu-py313-ubuntu22.04-ec2 bash

# Mount local directories to persist data
docker run -it --gpus all -v /local/data:/data 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.10.0-cpu-py313-ubuntu22.04-ec2 bash