vLLM-Omni Inference

Pre-built Docker images for serving omni-modality models (text-to-speech, image generation, video generation, and multimodal chat) with vLLM-Omni. Built on Amazon Linux 2023 with CUDA 12.9 and Python 3.12.

Latest Announcements

April 24, 2026 — vLLM-Omni 0.18.0 initial release. Serves TTS, image, video, and omni-chat models through OpenAI-compatible APIs. Includes a SageMaker routing middleware for dispatching /invocations to any omni endpoint via CustomAttributes.

Pull Commands

EC2:

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-cuda-v1

SageMaker:

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-sagemaker-cuda-v1

See Available Images for all image URIs and Getting Started for authentication instructions.

Packages

For package versions included in each release, see the Release Notes.

Supported Modalities

Modality           Route                    Example Model
Text-to-Speech     /v1/audio/speech         Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Image Generation   /v1/images/generations   black-forest-labs/FLUX.2-klein-4B
Video Generation   /v1/videos               Wan-AI/Wan2.1-T2V-1.3B-Diffusers
Multimodal Chat    /v1/chat/completions     bytedance-research/BAGEL-7B-MoT, Qwen/Qwen2.5-Omni-3B

Model Compatibility

  • Models must have a standard HuggingFace config.json with a recognized model_type, or be diffusers pipeline models with a model_index.json.
  • Some HuggingFace repos ship a config.json without a model_type field, and vllm-omni's config resolver rejects these. A common workaround is to patch the local snapshot with a minimal config.json ({"model_type": "...", "architectures": ["..."]}); see the sketch after this list. The container's pinned transformers version must also register that model type, so models newer than the pin still fail at engine startup. Upgrading transformers in place risks breaking the supported models; wait for a future vllm-omni release with an updated pin.
  • Multi-stage omni models (thinker + talker + decoder) such as Qwen2.5-Omni need significantly more VRAM than the model size suggests. Refer to the individual model cards for minimum GPU requirements.
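
A minimal patching sketch in Python. The snapshot path, model_type, and architectures values below are placeholders; substitute the values documented for your model, and note the type must be one the container's pinned transformers registers.

import json
from pathlib import Path

# Placeholder path: locate the actual snapshot directory under your HF cache.
snapshot = Path.home() / "hf-cache" / "models--<org>--<model>" / "snapshots" / "<revision>"
cfg_path = snapshot / "config.json"

cfg = json.loads(cfg_path.read_text()) if cfg_path.exists() else {}
cfg.setdefault("model_type", "qwen2")                  # placeholder value
cfg.setdefault("architectures", ["Qwen2ForCausalLM"])  # placeholder value
cfg_path.write_text(json.dumps(cfg, indent=2))
print(f"patched {cfg_path}")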

EC2 Deployment

The container runs vllm serve --omni and serves the OpenAI-compatible API on port 8000 inside the container; the examples below publish it on host port 8080. Each example is a self-contained shell script that starts the container, waits for readiness, submits a request, and writes the output to disk. Any vllm serve flag may be appended to the docker run command (e.g., --tensor-parallel-size 2, --max-model-len 2048, --enforce-eager).

Text-to-Speech

Model: Qwen3-TTS-12Hz-1.7B-CustomVoice, a 1.7B-parameter Qwen3 text-to-speech model supporting multiple voices and languages. It runs on a single 24 GB GPU (A10G / L4).

#!/usr/bin/env bash
# End-to-end TTS example: start server, wait for ready, synthesize speech.
# Requires: docker (with NVIDIA runtime), curl, an authenticated ECR pull.
set -euo pipefail

IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-cuda-v1}"
MODEL="${MODEL:-Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice}"
NAME="${NAME:-omni-tts}"

docker run -d --name "${NAME}" --gpus all --shm-size=2g -p 8080:8000 \
  -v "${HOME}/hf-cache:/root/.cache/huggingface" \
  "${IMAGE}" --model "${MODEL}"

until curl -sf http://localhost:8080/health >/dev/null; do sleep 5; done

curl -sf -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello from vLLM-Omni.", "voice": "vivian", "language": "English"}' \
  --output speech.wav

echo "wrote speech.wav ($(stat -f%z speech.wav 2>/dev/null || stat -c%s speech.wav) bytes)"
# Cleanup:  docker stop "${NAME}" && docker rm "${NAME}"

Image Generation

Model: FLUX.2-klein-4B, a 4B-parameter rectified-flow transformer from Black Forest Labs that produces high-quality 512×512 images from text prompts. It runs on a single 24 GB GPU.

#!/usr/bin/env bash
# End-to-end image-generation example: start server, wait for ready, generate.
set -euo pipefail

IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-cuda-v1}"
MODEL="${MODEL:-black-forest-labs/FLUX.2-klein-4B}"
NAME="${NAME:-omni-image}"

docker run -d --name "${NAME}" --gpus all --shm-size=2g -p 8080:8000 \
  -v "${HOME}/hf-cache:/root/.cache/huggingface" \
  "${IMAGE}" --model "${MODEL}"

until curl -sf http://localhost:8080/health >/dev/null; do sleep 5; done

# Response JSON has data[0].b64_json — decode to PNG.
curl -sf -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a red apple on a white table, studio lighting", "size": "512x512", "n": 1}' \
  | python3 -c "import base64,json,sys;open('image.png','wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"

echo "wrote image.png ($(stat -f%z image.png 2>/dev/null || stat -c%s image.png) bytes)"
# Cleanup:  docker stop "${NAME}" && docker rm "${NAME}"

Video Generation

Model: Wan2.1-T2V-1.3B, a 1.3B-parameter text-to-video diffusion model from the Wan team that generates short clips at up to 480×832 resolution. It needs a 48 GB GPU (L40S) or two 24 GB GPUs with --tensor-parallel-size 2.

The /v1/videos endpoint is asynchronous — it returns a job ID immediately and generates the video in the background. The script below submits the job, polls until it completes, then downloads the MP4.

#!/usr/bin/env bash
# End-to-end video-generation example: start server, submit job, poll, download.
# /v1/videos is async — it returns a job ID; the MP4 is produced in the background.
set -euo pipefail

IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-cuda-v1}"
MODEL="${MODEL:-Wan-AI/Wan2.1-T2V-1.3B-Diffusers}"
NAME="${NAME:-omni-video}"

docker run -d --name "${NAME}" --gpus all --shm-size=8g -p 8080:8000 \
  -v "${HOME}/hf-cache:/root/.cache/huggingface" \
  "${IMAGE}" --model "${MODEL}" --tensor-parallel-size 2

until curl -sf http://localhost:8080/health >/dev/null; do sleep 5; done

# /v1/videos requires multipart/form-data.
JOB_ID=$(curl -sf -X POST http://localhost:8080/v1/videos \
  -F "prompt=a dog running on a beach at sunset" \
  -F "num_frames=17" -F "num_inference_steps=30" \
  -F "size=480x320" -F "seed=42" \
  | python3 -c "import json,sys;print(json.load(sys.stdin)['id'])")

echo "submitted job ${JOB_ID}"

# Poll until completed (5s interval, 10 min timeout).
for _ in $(seq 1 120); do
  STATUS=$(curl -sf "http://localhost:8080/v1/videos/${JOB_ID}" \
    | python3 -c "import json,sys;print(json.load(sys.stdin)['status'])")
  [ "${STATUS}" = "completed" ] && break
  [ "${STATUS}" = "failed" ] && { echo "job failed"; exit 1; }
  sleep 5
done
[ "${STATUS:-}" = "completed" ] || { echo "timed out waiting for job ${JOB_ID}"; exit 1; }

curl -sf "http://localhost:8080/v1/videos/${JOB_ID}/content" --output video.mp4
echo "wrote video.mp4 ($(stat -f%z video.mp4 2>/dev/null || stat -c%s video.mp4) bytes)"
# Cleanup:  docker stop "${NAME}" && docker rm "${NAME}"

Multimodal Chat

Use the standard OpenAI chat-completions API. Multimodal inputs (images, audio) are supplied as URL or base64 content parts in the message list.
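
A minimal sketch of an image content part, assuming the EC2 container setup from the examples above is listening on localhost:8080 with a vision-capable model from the table loaded (the model name, file name, and prompt are illustrative):

"""Send one image + text turn to /v1/chat/completions."""
import base64

import requests  # pip install requests

with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "bytedance-research/BAGEL-7B-MoT",  # must match the served --model
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])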

Example model: Qwen2.5-Omni-3B — a 3B-parameter omni model accepting text, image, and audio inputs and generating text or speech outputs. Multi-stage architecture (thinker + talker + code2wav) requires ≥ 4 GPUs: g5.12xlarge / g6.12xlarge (4× A10G) or g6e.12xlarge (4× L40S).

Start the server, then submit a request. Three things are required on /v1/chat/completions to produce clean audio from Qwen2.5-Omni:

  1. "modalities": ["audio"] — not ["text","audio"] (that returns empty audio).
  2. "sampling_params_list" — a 3-element list (thinker, talker, code2wav). The image's built-in per-stage defaults produce noise; use the values from the official Qwen docs.
  3. The exact Qwen system prompt.

Omitting sampling_params_list returns a 200 response with valid WAV bytes that sound like noise; this is the single most common footgun.

#!/usr/bin/env bash
# End-to-end Qwen2.5-Omni-3B example: start server, wait for ready,
# generate speech via /v1/chat/completions.
# Requires: docker (with NVIDIA runtime), curl, jq, an authenticated ECR pull.
#
# REQUIRES ≥ 4 GPUs (e.g., g5.12xlarge / g6.12xlarge / g6e.12xlarge).
# On single-GPU hosts the model's talker stage fails to load on GPU 1.
set -euo pipefail

IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-cuda-v1}"
MODEL="${MODEL:-Qwen/Qwen2.5-Omni-3B}"
NAME="${NAME:-omni3b}"

docker run -d --name "${NAME}" --gpus all --shm-size=16g -p 8080:8080 \
  -v "${HOME}/hf-cache:/root/.cache/huggingface" \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  "${IMAGE}" --model "${MODEL}" \
  --host 0.0.0.0 --port 8080 \
  --max-model-len 16384 --dtype bfloat16

# First start takes ~8 min (weight download + 3-stage load).
until curl -sf http://localhost:8080/health >/dev/null; do sleep 10; done

# Three things are REQUIRED for clean audio:
#   1. "modalities": ["audio"]  (NOT ["text","audio"] — returns empty audio)
#   2. "sampling_params_list"   (3-element list: thinker, talker, code2wav;
#                                built-in defaults produce noise)
#   3. The exact Qwen system prompt below.
# Omitting #2 returns 200 OK with valid WAV bytes that sound like noise.
curl -sf -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-Omni-3B",
    "modalities": ["audio"],
    "sampling_params_list": [
      {"temperature":0.0,"top_p":1.0,"top_k":-1,"max_tokens":2048,"seed":42,"detokenize":true,"repetition_penalty":1.1},
      {"temperature":0.9,"top_p":0.8,"top_k":40,"max_tokens":2048,"seed":42,"detokenize":true,"repetition_penalty":1.05,"stop_token_ids":[8294]},
      {"temperature":0.0,"top_p":1.0,"top_k":-1,"max_tokens":2048,"seed":42,"detokenize":true,"repetition_penalty":1.1}
    ],
    "messages": [
      {"role":"system","content":[{"type":"text","text":"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
      {"role":"user","content":[{"type":"text","text":"Tell me a short, calming bedtime lullaby story for a 6-year-old girl."}]}
    ]
  }' | jq -r '.choices[0].message.audio.data' | base64 -d > lullaby.wav

echo "wrote lullaby.wav ($(stat -f%z lullaby.wav 2>/dev/null || stat -c%s lullaby.wav) bytes)"
# Cleanup:  docker stop "${NAME}" && docker rm "${NAME}"

The /v1/audio/speech shortcut (voices: Chelsie, Ethan) bypasses the thinker and does not apply the correct sampling params in 0.18.0, so it produces noisy output for Qwen2.5-Omni. Prefer /v1/chat/completions for this model.

SageMaker Deployment

Prerequisites

  • AWS CLI configured with appropriate permissions
  • An IAM execution role with SageMaker and ECR permissions (see Ray tutorial for an example setup)
  • SageMaker Python SDK v2:
pip install 'sagemaker>=2,<3'

Routing Middleware

The SageMaker image includes an ASGI middleware that dispatches /invocations to the correct vllm-omni endpoint based on the CustomAttributes header:

CustomAttributes                Dispatched to
route=/v1/audio/speech          TTS
route=/v1/images/generations    Image generation
route=/v1/videos                Video generation (JSON auto-converted to form-data); returns only the job-ID JSON in 0.18.0, so the MP4 is not retrievable via SageMaker
route=/v1/chat/completions      Multimodal chat
(no route)                      vLLM default /invocations (chat/completion/embed)
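
For example, a sketch of routing an image-generation request through a deployed endpoint (the endpoint name is a placeholder; the response shape, data[0].b64_json, follows the EC2 image example above):

import base64
import json

import boto3

sm_runtime = boto3.client("sagemaker-runtime")
response = sm_runtime.invoke_endpoint(
    EndpointName="vllm-omni-image",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"prompt": "a red apple on a white table", "size": "512x512", "n": 1}),
    CustomAttributes="route=/v1/images/generations",
)
payload = json.loads(response["Body"].read())
with open("image.png", "wb") as f:
    f.write(base64.b64decode(payload["data"][0]["b64_json"]))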

Environment Variables

Any SM_VLLM_* env var is converted to a --<name> CLI argument (e.g., SM_VLLM_MAX_MODEL_LEN=2048 becomes --max-model-len 2048); a sketch of the convention follows the table.

Variable                      Description                            Example
SM_VLLM_MODEL                 Model ID (HuggingFace or local path)   Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
SM_VLLM_MAX_MODEL_LEN         Max sequence length                    2048
SM_VLLM_ENFORCE_EAGER         Disable CUDA graphs                    true
SM_VLLM_TENSOR_PARALLEL_SIZE  Number of GPUs for tensor parallelism  2
HF_TOKEN                      HuggingFace token for gated models     hf_...
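
An illustrative reimplementation of the naming convention (not the container's actual code; the bare-flag handling for boolean values is an assumption):

def sm_env_to_cli_args(env: dict[str, str]) -> list[str]:
    """Convert SM_VLLM_* env vars to vllm serve CLI arguments: strip the
    prefix, lowercase the name, and replace underscores with hyphens."""
    args: list[str] = []
    for key, value in env.items():
        if not key.startswith("SM_VLLM_"):
            continue
        flag = "--" + key[len("SM_VLLM_"):].lower().replace("_", "-")
        if value.lower() == "true":   # assumption: booleans become bare flags
            args.append(flag)
        else:
            args += [flag, value]
    return args

# sm_env_to_cli_args({"SM_VLLM_MAX_MODEL_LEN": "2048", "SM_VLLM_ENFORCE_EAGER": "true"})
# -> ["--max-model-len", "2048", "--enforce-eager"]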

Deploy a TTS Endpoint

SageMaker endpoint deployment takes several minutes and incurs costs. Remember to delete endpoints when done.

"""Deploy a vLLM-Omni TTS model to a real-time SageMaker endpoint."""

from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

model = Model(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-sagemaker-cuda-v1",
    role="arn:aws:iam::<ACCOUNT>:role/SageMakerExecutionRole",
    env={"SM_VLLM_MODEL": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"},
    predictor_cls=Predictor,
)

predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1,
    endpoint_name="vllm-omni-tts",
    inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
    serializer=JSONSerializer(),
    wait=True,
)

# Invoke — route /invocations to /v1/audio/speech via CustomAttributes
sm_runtime = predictor.sagemaker_session.sagemaker_runtime_client
response = sm_runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body='{"input": "Hello world", "voice": "vivian", "language": "English"}',
    CustomAttributes="route=/v1/audio/speech",
)
with open("speech.wav", "wb") as f:
    f.write(response["Body"].read())

GPU deployments require inference_ami_version: the default SageMaker host AMI ships NVIDIA drivers that are incompatible with CUDA 12.9 images. See the ProductionVariant API reference for valid values.

When done, delete the endpoint:

predictor.delete_endpoint()

Async Inference for Long-Running TTS Generation

SageMaker real-time inference has a 60-second invoke timeout. The first request to a TTS model may exceed it because of torch.compile warmup (~67 s); use async inference to avoid the limit, or retry after warmup completes.

Video generation is not supported on SageMaker in 0.18.0 — see Known Limitations below. Use EC2 for video.

"""Deploy a vLLM-Omni TTS model to a SageMaker async inference endpoint.

Async inference avoids the 60-second real-time invoke timeout, which the first
TTS request can exceed due to torch.compile warmup (~67s). The /v1/audio/speech
endpoint returns raw WAV bytes, so the async output written to S3 is the usable
audio file — no polling or extra retrieval step needed.
"""

from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

model = Model(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-sagemaker-cuda-v1",
    role="arn:aws:iam::<ACCOUNT>:role/SageMakerExecutionRole",
    env={"SM_VLLM_MODEL": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"},
    predictor_cls=Predictor,
)

predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1,
    endpoint_name="vllm-omni-tts-async",
    inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
    serializer=JSONSerializer(),
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://<BUCKET>/vllm-omni-async-output/",
        max_concurrent_invocations_per_instance=1,
    ),
    wait=True,
)

# Invoke async: upload the JSON input to S3, then call invoke_endpoint_async
# with CustomAttributes routing /invocations to /v1/audio/speech (sketch below).
# The resulting .out object in S3 is the raw WAV bytes (content-type audio/wav).

For async inference, upload the JSON input payload to S3 first, then call invoke_endpoint_async with InputLocation=<s3-uri> and CustomAttributes="route=/v1/audio/speech". The resulting .out object in the configured S3 output path is the raw WAV audio; no polling or additional retrieval step is required.
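
A minimal invoke sketch, assuming boto3 credentials are configured and <BUCKET> is the bucket from output_path above (the input key name is arbitrary):

import json

import boto3

bucket = "<BUCKET>"  # same bucket as output_path above
input_key = "vllm-omni-async-input/request.json"

s3 = boto3.client("s3")
s3.put_object(
    Bucket=bucket,
    Key=input_key,
    Body=json.dumps({"input": "Hello world", "voice": "vivian", "language": "English"}),
    ContentType="application/json",
)

sm_runtime = boto3.client("sagemaker-runtime")
response = sm_runtime.invoke_endpoint_async(
    EndpointName="vllm-omni-tts-async",
    InputLocation=f"s3://{bucket}/{input_key}",
    ContentType="application/json",
    CustomAttributes="route=/v1/audio/speech",
)
# The WAV bytes land at this URI (a .out object) once generation completes.
print("output:", response["OutputLocation"])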

Known Limitations

  • Video generation is not supported on SageMaker in 0.18.0. The /v1/videos endpoint is async by design — it returns a job-ID JSON immediately and generates the MP4 in the background. Through SageMaker async inference, only that job-ID JSON is written to S3; the MP4 itself never lands in S3 and cannot be retrieved through invoke_endpoint or invoke_endpoint_async. Use EC2 for video generation — direct container access supports the full workflow (create job, poll status, download MP4). SageMaker support is expected once POST /v1/videos/sync (which blocks and returns raw MP4 bytes) is available in a future vllm-omni release.
  • First-request latency on SageMaker real-time endpoints. TTS models can exceed the 60s invoke timeout on the first request due to torch.compile warmup. Use async inference or retry after warmup.

Release Notes

See vLLM-Omni Release Notes for version history and changelogs.

Resources