
AWS SageMaker vLLM Inference

Deploy and run inference on vLLM models using AWS SageMaker and vLLM DLC.

Files

  • deploy_and_test_sm_endpoint.py - Complete workflow: deployment, inference, and cleanup
  • testNixlConnector.sh - Multi-GPU NixlConnector test script

Prerequisites

  • AWS CLI configured with appropriate permissions
  • HuggingFace token for model access (if required)

Setup

Create IAM Role

# Create role with a trust policy that allows SageMaker to assume it
aws iam create-role --role-name SageMakerExecutionRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"sagemaker.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# Attach policies
aws iam attach-role-policy --role-name SageMakerExecutionRole --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam attach-role-policy --role-name SageMakerExecutionRole --policy-arn arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicFullAccess
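
If you need the role's full ARN (for example, if your tooling expects an ARN rather than the role name used for IAM_ROLE below), you can look it up with the AWS CLI:

# Print the ARN of the execution role created above
aws iam get-role --role-name SageMakerExecutionRole \
  --query Role.Arn --output text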

Quick Start

1. Set Environment Variables

# Note: Using a Public Gallery image to create an SM endpoint is currently not supported
export CONTAINER_URI="763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.11.2-gpu-py312"
export IAM_ROLE="SageMakerExecutionRole"
export HF_TOKEN="your-huggingface-token" 

2. Run Complete Workflow

# Deploy, run inference, and cleanup automatically
python deploy_and_test_sm_endpoint.py --endpoint-name vllm-test-$(date +%s) --prompt "Write a Python function to calculate fibonacci numbers"

# Alternatively, run with custom parameters
python deploy_and_test_sm_endpoint.py \
  --endpoint-name my-vllm-endpoint \
  --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --instance-type ml.g5.12xlarge \
  --prompt "Explain machine learning" \
  --max-tokens 1000 \
  --temperature 0.7
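
You can also check the endpoint and query it directly once it is InService. The request body below is an assumption (an OpenAI-style completions payload); check deploy_and_test_sm_endpoint.py for the exact schema the script sends:

# Check endpoint status (should report InService)
aws sagemaker describe-endpoint --endpoint-name my-vllm-endpoint \
  --query EndpointStatus --output text

# Send a raw request to the endpoint (payload schema is illustrative only)
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name my-vllm-endpoint \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"prompt": "Explain machine learning", "max_tokens": 256, "temperature": 0.01}' \
  response.json
cat response.json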

Command Line Options

  • --endpoint-name - SageMaker endpoint name (required)
  • --container-uri - DLC image URI (default from env)
  • --iam-role - IAM role ARN (default from env)
  • --instance-type - Instance type (default: ml.g5.12xlarge)
  • --model-id - HuggingFace model ID (default: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
  • --hf-token - HuggingFace token (default from env)
  • --prompt - Inference prompt (default: code generation example)
  • --max-tokens - Maximum number of tokens to generate (default: 2400)
  • --temperature - Sampling temperature, 0-1 (default: 0.01)

Instance Types

Recommended GPU instances:

  • ml.g5.12xlarge - 4 A10G GPUs, 48 vCPUs, 192 GB RAM
  • ml.g5.24xlarge - 4 A10G GPUs, 96 vCPUs, 384 GB RAM
  • ml.p4d.24xlarge - 8 A100 GPUs, 96 vCPUs, 1152 GB RAM
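
The deployment script configures the container for the chosen instance. If you launch vLLM yourself inside the DLC (as in the NixlConnector test below), the tensor-parallel degree is typically matched to the GPU count; a minimal sketch (the 1.5B default model fits on a single GPU, so this mainly matters for larger models):

# Sketch only: shard a model across the 4 GPUs of a g5.12xlarge
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --tensor-parallel-size 4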

Test NixlConnector

Test NixlConnector locally using the vLLM DLC; see the NixlConnector Documentation for background on the connector.

# Log in to the AWS DLC ECR registry (the region must match the registry, us-east-1)
aws ecr get-login-password --region us-east-1 | docker login \
  --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Pull latest vLLM DLC for EC2
# Note: Using a Public Gallery image to create an SM endpoint is currently not supported
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.11.2-gpu-py312

# Run container with GPU access
docker run -it --entrypoint=/bin/bash --gpus=all \
  -v $(pwd):/workspace \
  763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.11.2-gpu-py312

# Inside container, run the NixlConnector test
export HF_TOKEN="<TOKEN>"
./testNixlConnector.sh
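
For context, NixlConnector is vLLM's NIXL-based KV-cache transfer connector, used for disaggregated prefill/decode across processes or nodes. The actual contents of testNixlConnector.sh may differ; the sketch below only illustrates the kind of multi-GPU setup such a test exercises, and the ports and flags are assumptions:

# Sketch only: two vLLM servers on separate GPUs exchanging KV cache via NIXL
CUDA_VISIBLE_DEVICES=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &

CUDA_VISIBLE_DEVICES=1 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &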

Notes

  • The script automatically cleans up resources after inference to avoid ongoing costs (see the manual cleanup sketch below if it is interrupted)
  • Deployment waits for the endpoint to be ready before running inference
  • All parameters can be set via environment variables or command line arguments
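
If the script exits before its cleanup step (for example, on an error or Ctrl+C), the endpoint keeps incurring charges until it is deleted. A manual cleanup sketch using the AWS CLI; the endpoint-config and model names are created by the script, so list them first rather than guessing:

# Delete the endpoint
aws sagemaker delete-endpoint --endpoint-name my-vllm-endpoint

# Find and delete the endpoint config and model the script created
aws sagemaker list-endpoint-configs --name-contains my-vllm-endpoint
aws sagemaker delete-endpoint-config --endpoint-config-name <name-from-listing>

aws sagemaker list-models --name-contains my-vllm-endpoint
aws sagemaker delete-model --model-name <name-from-listing>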