AWS SageMaker vLLM Inference¶
Deploy and run inference on vLLM models using AWS SageMaker and the vLLM Deep Learning Container (DLC).
Files¶
- deploy_and_test_sm_endpoint.py - Complete workflow: deploy, inference, and cleanup
- testNixlConnector.sh - Multi-GPU NixlConnector test script
Prerequisites¶
- AWS CLI configured with appropriate permissions
- HuggingFace token for model access (if required)
Setup¶
Create IAM Role¶
# Create role with a trust policy that lets SageMaker assume it
aws iam create-role --role-name SageMakerExecutionRole \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"sagemaker.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
# Attach policies
aws iam attach-role-policy --role-name SageMakerExecutionRole --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam attach-role-policy --role-name SageMakerExecutionRole --policy-arn arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicFullAccess
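If your setup expects the full role ARN rather than the role name (the examples below export only the name), you can look it up once the role exists; a minimal sketch:
# Retrieve the role ARN for use as IAM_ROLE / --iam-role
aws iam get-role --role-name SageMakerExecutionRole --query 'Role.Arn' --output text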
Quick Start¶
1. Set Environment Variables¶
# Note: Using a Public Gallery image to create an SM endpoint is currently not supported
export CONTAINER_URI="763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.11.2-gpu-py312"
export IAM_ROLE="SageMakerExecutionRole"
export HF_TOKEN="your-huggingface-token"
2. Run Complete Workflow¶
# Deploy, run inference, and cleanup automatically
python deploy_and_test_sm_endpoint.py --endpoint-name vllm-test-$(date +%s) --prompt "Write a Python function to calculate fibonacci numbers"
# Alternate with custom parameters
python deploy_and_test_sm_endpoint.py \
--endpoint-name my-vllm-endpoint \
--model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--instance-type ml.g5.12xlarge \
--prompt "Explain machine learning" \
--max-tokens 1000 \
--temperature 0.7
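Once the endpoint is InService, you can also invoke it directly from the AWS CLI instead of going through the script. The request body below is an assumption about the container's JSON schema (prompt/max_tokens/temperature, mirroring the script's parameters); adjust it to whatever schema your container actually expects:
# Sketch: direct invocation of a deployed endpoint (payload schema is an assumption)
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name my-vllm-endpoint \
    --content-type application/json \
    --cli-binary-format raw-in-base64-out \
    --body '{"prompt": "Explain machine learning", "max_tokens": 256, "temperature": 0.7}' \
    response.json
cat response.json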
Command Line Options¶
- --endpoint-name - SageMaker endpoint name (required)
- --container-uri - DLC image URI (default from env)
- --iam-role - IAM role ARN (default from env)
- --instance-type - Instance type (default: ml.g5.12xlarge)
- --model-id - HuggingFace model ID (default: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- --hf-token - HuggingFace token (default from env)
- --prompt - Inference prompt (default: code generation example)
- --max-tokens - Maximum response length (default: 2400)
- --temperature - Sampling randomness 0-1 (default: 0.01)
Instance Types¶
Recommended GPU instances:
- ml.g5.12xlarge - 4 A10G GPUs, 48 vCPUs, 192 GB RAM
- ml.g5.24xlarge - 4 A10G GPUs, 96 vCPUs, 384 GB RAM
- ml.p4d.24xlarge - 8 A100 GPUs, 96 vCPUs, 1152 GB RAM
Test NixlConnector¶
Test NixlConnector locally - NixlConnector Documentation
# Log in to AWS ECR (the region must match the registry's region)
aws ecr get-login-password --region us-east-1 | docker login \
    --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# Pull latest vLLM DLC for EC2
# Note: Using a Public Gallery image to create an SM endpoint is currently not supported
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.11.2-gpu-py312
# Run container with GPU access
docker run -it --entrypoint=/bin/bash --gpus=all \
-v $(pwd):/workspace \
763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.11.2-gpu-py312
# Inside container, run the NixlConnector test
export HF_TOKEN="<TOKEN>"
./testNixlConnector.sh
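If you prefer to start a server with the NixlConnector enabled yourself rather than running the test script, vLLM exposes the connector through its KV-transfer configuration. The command below is a sketch; the model and port are placeholders, and the exact flags supported may vary by vLLM version, so verify them against your DLC:
# Sketch: serve a model with NixlConnector as the KV-transfer backend
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --port 8000 \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'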
Notes¶
- The script automatically cleans up resources after inference to avoid ongoing costs; manual cleanup commands are shown after this list in case a run is interrupted
- Deployment waits for endpoint to be ready before running inference
- All parameters can be set via environment variables or command line arguments
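If the script exits before cleanup completes, the endpoint, endpoint config, and model keep accruing cost until deleted. The commands below use placeholder names; the endpoint-config and model names depend on how the script named them, so list them first if you are unsure:
# List leftover resources (adjust the name filter to your endpoint name)
aws sagemaker list-endpoint-configs --name-contains my-vllm-endpoint
aws sagemaker list-models --name-contains my-vllm-endpoint
# Delete them manually
aws sagemaker delete-endpoint --endpoint-name my-vllm-endpoint
aws sagemaker delete-endpoint-config --endpoint-config-name <endpoint-config-name>
aws sagemaker delete-model --model-name <model-name>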