Supported Models

All models listed below are regression-tested on every DLC vLLM release and work with the images listed on the Overview page.

The Coverage column indicates test depth. Smoke tests run on every PR; Benchmark tests measure throughput and latency against pass/fail thresholds before each release. A Smoke + Benchmark tag means both apply.

Tested Models

Family   Model                              Coverage
Llama    meta-llama/Llama-3.3-70B-Instruct  Benchmark
Qwen     Qwen/Qwen3-32B                     Benchmark
Qwen     Qwen/Qwen3.5-0.8B                  Smoke + Benchmark
Qwen     Qwen/Qwen3.5-2B                    Benchmark
Qwen     Qwen/Qwen3.5-9B                    Benchmark
Qwen     Qwen/Qwen3.5-27B-FP8               Benchmark
Qwen     Qwen/Qwen3.5-35B-A3B-FP8           Benchmark
Qwen     Qwen/Qwen3.6-27B                   Benchmark
Qwen     Qwen/Qwen3.6-35B-A3B               Benchmark
Qwen     Qwen/Qwen3-Coder-Next-FP8          Benchmark
Qwen     Qwen/Qwen3-Embedding-0.6B          Smoke + Benchmark
Qwen     Qwen/Qwen3-VL-Embedding-2B         Smoke + Benchmark
Qwen     Qwen/Qwen3-ASR-1.7B                Benchmark
Gemma    google/gemma-4-26B-A4B-it          Benchmark
Gemma    google/gemma-4-31B-it              Benchmark
Gemma    google/gemma-4-E4B-it              Benchmark
GPT-OSS  openai/gpt-oss-20b                 Benchmark

Model-Specific Tuning

For recommended serving flags, hardware configurations, and quantization options per model, see recipes.vllm.ai.

Custom Models

Any model supported by upstream vLLM should work, although only the models listed above are regression-tested. To serve a model not listed above, pass its identifier with --model:

docker run --gpus all -p 8000:8000 \
  public.ecr.aws/deep-learning-containers/vllm:server-cuda \
  --model <org>/<model-name>
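Once the container is up, it can help to confirm that the server has finished loading before sending traffic. The following is a minimal sketch, assuming the image starts upstream vLLM's OpenAI-compatible server on the published port 8000, which exposes `/health` and `/v1/models`. The helper name `wait_for_vllm`, the `BASE_URL` variable, and the retry defaults are illustrative and not part of the image.

```shell
# Assumption: the container from the command above is listening on localhost:8000.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Hypothetical helper: poll /health until the server answers with a success status.
#   $1 = base URL, $2 = max attempts (default 60), $3 = seconds between attempts (default 5)
# Returns 0 once the server responds, 1 if every attempt fails.
wait_for_vllm() {
  url="$1"
  attempts="${2:-60}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; --max-time bounds each probe.
    if curl -fsS --max-time 5 -o /dev/null "$url/health" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A typical use would be `wait_for_vllm "$BASE_URL" && curl -s "$BASE_URL/v1/models"`, which lists the served model once loading has finished. Large models can take several minutes to load, so the attempt count may need raising.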

Models can also be loaded from a local path by mounting the directory into the container (add -v /path:/model to the docker run options and pass --model /model), or streamed from S3; see Loading Models from S3.
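For the local-path case, the placement of the two flags matters: `-v` is a docker option and goes before the image name, while `--model` is passed to the server and goes after it. A sketch of the full command, wrapped in a hypothetical helper named `serve_local_model` for illustration (the image and flags are taken from the commands above; the host directory is whatever holds your model files):

```shell
# Hypothetical helper: serve a model from a host directory.
#   $1 = host directory containing the model files
serve_local_model() {
  model_dir="$1"
  # The host directory is mounted at /model inside the container,
  # and the server is pointed at that in-container path.
  docker run --gpus all -p 8000:8000 \
    -v "${model_dir}:/model" \
    public.ecr.aws/deep-learning-containers/vllm:server-cuda \
    --model /model
}
```

For example, `serve_local_model /path` reproduces the mount shown in the text. Quoting the `-v` argument keeps host paths that contain spaces intact.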