Supported Models¶
All models listed below are regression-tested on every DLC vLLM release and work with the images listed on the Overview page.
The Coverage column indicates test depth: Smoke runs on every PR; Benchmark runs throughput and latency tests with pass/fail thresholds before release. A Smoke + Benchmark tag means both apply.
Tested Models¶
| Family | Model | Coverage |
|---|---|---|
| Llama | meta-llama/Llama-3.3-70B-Instruct | Benchmark |
| Qwen | Qwen/Qwen3-32B | Benchmark |
| Qwen/Qwen3.5-0.8B | Smoke + Benchmark | |
| Qwen/Qwen3.5-2B | Benchmark | |
| Qwen/Qwen3.5-9B | Benchmark | |
| Qwen/Qwen3.5-27B-FP8 | Benchmark | |
| Qwen/Qwen3.5-35B-A3B-FP8 | Benchmark | |
| Qwen/Qwen3.6-27B | Benchmark | |
| Qwen/Qwen3.6-35B-A3B | Benchmark | |
| Qwen/Qwen3-Coder-Next-FP8 | Benchmark | |
| Qwen/Qwen3-Embedding-0.6B | Smoke + Benchmark | |
| Qwen/Qwen3-VL-Embedding-2B | Smoke + Benchmark | |
| Qwen/Qwen3-ASR-1.7B | Benchmark | |
| Gemma | google/gemma-4-26B-A4B-it | Benchmark |
| google/gemma-4-31B-it | Benchmark | |
| google/gemma-4-E4B-it | Benchmark | |
| GPT-OSS | openai/gpt-oss-20b | Benchmark |
Model-Specific Tuning¶
For recommended serving flags, hardware configurations, and quantization options per model, see recipes.vllm.ai.
Custom Models¶
Any model supported by upstream vLLM should work. To serve a model not listed above:
docker run --gpus all -p 8000:8000 \
public.ecr.aws/deep-learning-containers/vllm:server-cuda \
--model <org>/<model-name>
Models can also be loaded from a local path (-v /path:/model --model /model) or streamed from S3 — see
Loading Models from S3.