Skip to content

Multimodal Serving using vLLM-Omni DLC

Production-ready Docker images for serving multimodal models with vLLM-Omni on AWS. Built on Amazon Linux 2023 with ongoing security patching.

Supports text-to-speech, audio generation, image generation, video generation, and multimodal chat through OpenAI-compatible APIs.

Images

Platform Image Default Port
EC2 / EKS public.ecr.aws/deep-learning-containers/vllm:omni-cuda 8080
Amazon SageMaker AI public.ecr.aws/deep-learning-containers/vllm:omni-sagemaker-cuda 8080

All images are also available on the ECR Public Gallery. For private ECR URIs, see Image Access.

Supported Modalities

Modality Route Example Models
Text-to-Speech /v1/audio/speech Qwen3-TTS-1.7B, CosyVoice3-0.5B
Audio Generation /v1/audio/generate Stable-Audio-Open-1.0
Image Generation /v1/images/generations FLUX.2-klein-4B, ERNIE-Image-Turbo
Video Generation (async) /v1/videos Wan2.1-T2V-1.3B, Wan2.1-VACE-1.3B
Video Generation (sync) /v1/videos/sync Wan2.1-T2V-1.3B, Wan2.1-VACE-1.3B
Multimodal Chat /v1/chat/completions Qwen2.5-Omni-3B

What's Included

In addition to vLLM-Omni and its core stack (PyTorch, CUDA 13.0, NCCL, Python 3.12), the images bundle:

  • FlashInfer — fused attention kernels with precompiled cubins for fast cold start
  • DeepEP — expert-parallel kernels for large MoE models
  • LMCache + NIXL — KV-cache offloading and disaggregated prefill/decode
  • runai-model-streamer — stream model weights directly from S3 or GCS
  • EFA and OpenMPI — high-throughput multi-node networking on supported instances
  • espeak-ng and ffmpeg — system-level dependencies for TTS phonemizer and audio/video encoding

The SageMaker image additionally includes a routing middleware that dispatches /invocations to omni-specific routes (TTS, image, video, etc.) via the CustomAttributes header. See Amazon SageMaker AI Deployment.

CUDA Forward Compatibility

The images use CUDA 13.0, which requires NVIDIA driver 580+ on the host. On hosts with older datacenter drivers, set -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 to enable the bundled CUDA 13 forward-compat layer.

How We Build

These images are curated builds tracking the vLLM-Omni project:

  • Built from upstream releases — images track vLLM-Omni releases, each gated by our regression test suite before publication.
  • Regression-tested — validated against a suite of multimodal models on every release. See Supported Models for the full list.
  • Security-patched — continuously maintained with security patches from AWS on an Amazon Linux 2023 base.