EC2 Deployment¶

The container runs the vLLM OpenAI-compatible API server on port 8000. Any vllm serve flag may be appended to docker run. See Configuration for the full list of server arguments.

Single GPU¶

docker run --gpus all -p 8000:8000 \
  public.ecr.aws/deep-learning-containers/vllm:server-cuda \
  --model openai/gpt-oss-20b \
  --host 0.0.0.0 --port 8000

For gated models (Llama, Gemma, etc.), pass -e HF_TOKEN=<your_hf_token>.

Send a request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "max_tokens": 256
  }'

Multi-GPU (Tensor Parallelism)¶

For models that require multiple GPUs (e.g., 70B+):

docker run --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=<your_hf_token> \
  public.ecr.aws/deep-learning-containers/vllm:server-cuda \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 --port 8000

--ipc=host enables shared memory between GPU processes.

Loading Models from S3¶

The image bundles Run:ai Model Streamer — pass an s3:// URI as the model and add --load-format runai_streamer to stream weights directly from S3 to GPU memory:

docker run --gpus all -p 8000:8000 \
  -e AWS_REGION=us-west-2 \
  -e AWS_ACCESS_KEY_ID=<key> \
  -e AWS_SECRET_ACCESS_KEY=<secret> \
  public.ecr.aws/deep-learning-containers/vllm:server-cuda \
  --model s3://<bucket>/<prefix>/ \
  --load-format runai_streamer \
  --host 0.0.0.0 --port 8000

The S3 prefix should contain a Hugging Face model layout (config.json, tokenizer.json, and one or more *.safetensors files). On EC2 with an attached instance role, the AWS credentials may be omitted — the container will pick them up from IMDS. See the Run:ai streamer docs for sharded loading and tuning options.

Model-Specific Tuning¶

For recommended serving flags, hardware configurations, and quantization options per model, see recipes.vllm.ai.