Multimodal Serving using vLLM-Omni DLC
Production-ready Docker images for serving multimodal models with vLLM-Omni on AWS. Built on Amazon Linux 2023 with ongoing security patching.
Supports text-to-speech, audio generation, image generation, video generation, and multimodal chat through OpenAI-compatible APIs.
Images
| Platform | Image | Default Port |
|---|---|---|
| EC2 / EKS | public.ecr.aws/deep-learning-containers/vllm:omni-cuda | 8080 |
| Amazon SageMaker AI | public.ecr.aws/deep-learning-containers/vllm:omni-sagemaker-cuda | 8080 |
All images are also available on the ECR Public Gallery. For private ECR URIs, see Image Access.
Supported Modalities
| Modality | Route | Example Models |
|---|---|---|
| Text-to-Speech | /v1/audio/speech | Qwen3-TTS-1.7B, CosyVoice3-0.5B |
| Audio Generation | /v1/audio/generate | Stable-Audio-Open-1.0 |
| Image Generation | /v1/images/generations | FLUX.2-klein-4B, ERNIE-Image-Turbo |
| Video Generation (async) | /v1/videos | Wan2.1-T2V-1.3B, Wan2.1-VACE-1.3B |
| Video Generation (sync) | /v1/videos/sync | Wan2.1-T2V-1.3B, Wan2.1-VACE-1.3B |
| Multimodal Chat | /v1/chat/completions | Qwen2.5-Omni-3B |
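As a rough illustration of how one of these routes might be called, the sketch below builds a text-to-speech request for /v1/audio/speech using only the Python standard library. It assumes the container is reachable at http://localhost:8080 (the default port from the Images table) and that the request body uses the OpenAI speech field names (model, input, voice). The helper names and the model identifier are illustrative; check the vLLM-Omni documentation for the fields your model actually accepts.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the container runs locally and listens on the default port 8080.
BASE_URL = "http://localhost:8080"


def build_speech_request(model: str, text: str,
                         voice: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/audio/speech route."""
    # Field names follow the OpenAI speech API; they are assumptions here,
    # not confirmed by this page.
    payload = {"model": model, "input": text}
    if voice is not None:
        payload["voice"] = voice
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(model: str, text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(model, text)
    with urllib.request.urlopen(request, timeout=300) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

With a server running, a call such as synthesize("Qwen3-TTS-1.7B", "Hello from vLLM-Omni", "hello.wav") would write the response body to disk; the model name must match whatever identifier the server was started with.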
What's Included
In addition to vLLM-Omni and its core stack (PyTorch, CUDA 13.0, NCCL, Python 3.12), the images bundle:
- FlashInfer — fused attention kernels with precompiled cubins for fast cold start
- DeepEP — expert-parallel kernels for large MoE models
- LMCache + NIXL — KV-cache offloading and disaggregated prefill/decode
- runai-model-streamer — stream model weights directly from S3 or GCS
- EFA and OpenMPI — high-throughput multi-node networking on supported instances
- espeak-ng and ffmpeg — system-level dependencies for TTS phonemizer and audio/video encoding
The SageMaker image additionally includes a routing middleware that dispatches /invocations to omni-specific routes (TTS, image, video, etc.) via the CustomAttributes header. See Amazon SageMaker AI Deployment.
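A minimal sketch of what such a call could look like from the client side is shown below. It assumes a boto3 "sagemaker-runtime" client is passed in, and it deliberately leaves the CustomAttributes value as a caller-supplied argument, because this page does not specify the format the routing middleware expects. The helper name invoke_omni is hypothetical.

```python
import json
from typing import Any


def invoke_omni(client: Any, endpoint_name: str, payload: dict,
                custom_attributes: str) -> bytes:
    """Call a SageMaker endpoint and pass routing info via CustomAttributes.

    `client` is expected to be a boto3 "sagemaker-runtime" client.
    """
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload).encode("utf-8"),
        # Assumption: the routing value format is not given on this page;
        # take it from the Amazon SageMaker AI Deployment guide.
        CustomAttributes=custom_attributes,
    )
    # The response body is a stream; read it fully into bytes.
    return response["Body"].read()
```

In practice the client would be created with boto3.client("sagemaker-runtime"), and the returned bytes would be audio, image, or JSON data depending on the route selected.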
CUDA Forward Compatibility
The images use CUDA 13.0, which requires NVIDIA driver 580+ on the host. On hosts with older datacenter drivers, set the environment variable VLLM_ENABLE_CUDA_COMPATIBILITY=1 (for example, -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 with docker run) to enable the bundled CUDA 13 forward-compat layer.
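The driver check can be sketched as a small helper that decides whether to add the flag when assembling a docker run command. This is an illustration under stated assumptions: the helper names are hypothetical, the --gpus all and -p options are standard Docker flags rather than something this page prescribes, and any model or server arguments that follow the image name are omitted because they are not described here.

```python
from typing import List

IMAGE = "public.ecr.aws/deep-learning-containers/vllm:omni-cuda"
MIN_DRIVER_MAJOR = 580  # CUDA 13.0 requires NVIDIA driver 580+ on the host


def needs_forward_compat(driver_version: str) -> bool:
    """Return True when the host driver's major version is below 580."""
    major = int(driver_version.split(".")[0])
    return major < MIN_DRIVER_MAJOR


def docker_run_args(driver_version: str, port: int = 8080) -> List[str]:
    """Assemble a docker run argument list, adding the compat flag if needed."""
    args = ["docker", "run", "--gpus", "all", "-p", f"{port}:{port}"]
    if needs_forward_compat(driver_version):
        # Older driver: enable the bundled CUDA 13 forward-compat layer.
        args += ["-e", "VLLM_ENABLE_CUDA_COMPATIBILITY=1"]
    args.append(IMAGE)
    return args
```

The driver version string can be read on the host with nvidia-smi --query-gpu=driver_version --format=csv,noheader and passed to docker_run_args.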
How We Build
These images are curated builds tracking the vLLM-Omni project:
- Built from upstream releases — images track vLLM-Omni releases, each gated by our regression test suite before publication.
- Regression-tested — validated against a suite of multimodal models on every release. See Supported Models for the full list.
- Security-patched — continuously maintained with security patches from AWS on an Amazon Linux 2023 base.