vLLM-Omni Inference¶
Pre-built Docker images for serving omni-modality models (text-to-speech, image generation, video generation, and multimodal chat) with vLLM-Omni. Built on Amazon Linux 2023 with CUDA 12.9 and Python 3.12.
Latest Announcements¶
April 24, 2026 — vLLM-Omni 0.18.0 initial release. Serves TTS, image, video, and omni-chat models through OpenAI-compatible APIs. Includes a
SageMaker routing middleware for dispatching /invocations to any omni endpoint via CustomAttributes.
Pull Commands¶
EC2:
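A sketch using the EC2 image URI referenced in the examples on this page; authenticate to ECR first (see Getting Started).
docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm-omni:omni-cuda-v1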
SageMaker:
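A sketch using the SageMaker image URI referenced in the SageMaker deployment examples on this page.
docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-sagemaker-cuda-v1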
See Available Images for all image URIs and Getting Started for authentication instructions.
Packages¶
For package versions included in each release, see the Release Notes.
Supported Modalities¶
| Modality | Route | Example Model |
|---|---|---|
| Text-to-Speech | /v1/audio/speech | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| Image Generation | /v1/images/generations | black-forest-labs/FLUX.2-klein-4B |
| Video Generation | /v1/videos | Wan-AI/Wan2.1-T2V-1.3B-Diffusers |
| Multimodal Chat | /v1/chat/completions | bytedance-research/BAGEL-7B-MoT, Qwen/Qwen2.5-Omni-3B |
Model Compatibility¶
- Models must have a standard HuggingFace config.json with a recognized model_type, or be diffusers pipeline models with model_index.json.
- Some HuggingFace repos ship a config.json without a model_type field; vllm-omni's config resolver will reject these. Patching the local snapshot with a minimal config.json ({"model_type": "...", "architectures": ["..."]}) is a common workaround (see the sketch after this list), but the container's pinned transformers version must also register the model type; models newer than that pin will fail at engine startup. Upgrading transformers in-place risks breaking the supported models; wait for a future vllm-omni release with an updated pin.
- Multi-stage omni models (thinker + talker + decoder) like Qwen2.5-Omni need significantly more VRAM than the model size suggests. Refer to the individual model cards for minimum GPU requirements.
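A minimal sketch of the config.json workaround, assuming the model is cached under the ~/hf-cache mount used in the EC2 examples below; the ORG/MODEL path segment and the model_type / architectures values are illustrative and must match the actual checkpoint:
# Sketch only: add a missing model_type to a cached snapshot's config.json.
# The path layout follows the HuggingFace hub cache; adjust ORG, MODEL, and the values below.
SNAPSHOT=$(ls -d "${HOME}"/hf-cache/hub/models--ORG--MODEL/snapshots/*/ | head -n 1)
python3 - "${SNAPSHOT}config.json" <<'EOF'
import json, sys
path = sys.argv[1]
cfg = json.load(open(path))
cfg.setdefault("model_type", "qwen3")                  # illustrative value
cfg.setdefault("architectures", ["Qwen3ForCausalLM"])  # illustrative value
json.dump(cfg, open(path, "w"), indent=2)
EOF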
EC2 Deployment¶
The container runs vllm serve --omni and serves the OpenAI-compatible API on vLLM's default port 8000; the examples below publish it on host port 8080. Each example is a self-contained shell script that starts the container, waits for readiness, submits a request, and writes the output to disk. Any vllm serve flag may be appended after the image name in the docker run command (e.g., --tensor-parallel-size 2, --max-model-len 2048, --enforce-eager).
Text-to-Speech¶
Model: Qwen3-TTS-12Hz-1.7B-CustomVoice — a 1.7B-parameter Qwen3 text-to-speech model supporting multiple voices and languages; it runs on a single 24 GB GPU (A10G / L4).
#!/usr/bin/env bash
# End-to-end TTS example: start server, wait for ready, synthesize speech.
# Requires: docker (with NVIDIA runtime), curl, an authenticated ECR pull.
set -euo pipefail
IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm-omni:omni-cuda-v1}"
MODEL="${MODEL:-Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice}"
NAME="${NAME:-omni-tts}"
docker run -d --name "${NAME}" --gpus all --shm-size=2g -p 8080:8000 \
-v "${HOME}/hf-cache:/root/.cache/huggingface" \
"${IMAGE}" --model "${MODEL}"
until curl -sf http://localhost:8080/health >/dev/null; do sleep 5; done
curl -sf -X POST http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello from vLLM-Omni.", "voice": "vivian", "language": "English"}' \
--output speech.wav
echo "wrote speech.wav ($(stat -f%z speech.wav 2>/dev/null || stat -c%s speech.wav) bytes)"
# Cleanup: docker stop "${NAME}" && docker rm "${NAME}"
Image Generation¶
Model: FLUX.2-klein-4B — a 4B-parameter rectified-flow transformer from Black Forest Labs that produces high-quality 512×512 images from text prompts; it runs on a single 24 GB GPU.
#!/usr/bin/env bash
# End-to-end image-generation example: start server, wait for ready, generate.
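# Requires: docker (with NVIDIA runtime), curl, python3, an authenticated ECR pull.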
set -euo pipefail
IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm-omni:omni-cuda-v1}"
MODEL="${MODEL:-black-forest-labs/FLUX.2-klein-4B}"
NAME="${NAME:-omni-image}"
docker run -d --name "${NAME}" --gpus all --shm-size=2g -p 8080:8000 \
-v "${HOME}/hf-cache:/root/.cache/huggingface" \
"${IMAGE}" --model "${MODEL}"
until curl -sf http://localhost:8080/health >/dev/null; do sleep 5; done
# Response JSON has data[0].b64_json — decode to PNG.
curl -sf -X POST http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "a red apple on a white table, studio lighting", "size": "512x512", "n": 1}' \
| python3 -c "import base64,json,sys;open('image.png','wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"
echo "wrote image.png ($(stat -f%z image.png 2>/dev/null || stat -c%s image.png) bytes)"
# Cleanup: docker stop "${NAME}" && docker rm "${NAME}"
Video Generation¶
Model: Wan2.1-T2V-1.3B — a 1.3B-parameter text-to-video diffusion model from the Wan team that generates short clips at up to 480×832 resolution. It needs a 48 GB GPU (L40S) or 2× 24 GB GPUs with --tensor-parallel-size 2 (as used in the script below).
The /v1/videos endpoint is asynchronous — it returns a job ID immediately and generates the video in the background. The script below submits the
job, polls until it completes, then downloads the MP4.
#!/usr/bin/env bash
# End-to-end video-generation example: start server, submit job, poll, download.
# /v1/videos is async — it returns a job ID; the MP4 is produced in the background.
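# Requires: docker (with NVIDIA runtime), curl, python3, an authenticated ECR pull.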
set -euo pipefail
IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm-omni:omni-cuda-v1}"
MODEL="${MODEL:-Wan-AI/Wan2.1-T2V-1.3B-Diffusers}"
NAME="${NAME:-omni-video}"
docker run -d --name "${NAME}" --gpus all --shm-size=8g -p 8080:8000 \
-v "${HOME}/hf-cache:/root/.cache/huggingface" \
"${IMAGE}" --model "${MODEL}" --tensor-parallel-size 2
until curl -sf http://localhost:8080/health >/dev/null; do sleep 5; done
# /v1/videos requires multipart/form-data.
JOB_ID=$(curl -sf -X POST http://localhost:8080/v1/videos \
-F "prompt=a dog running on a beach at sunset" \
-F "num_frames=17" -F "num_inference_steps=30" \
-F "size=480x320" -F "seed=42" \
| python3 -c "import json,sys;print(json.load(sys.stdin)['id'])")
echo "submitted job ${JOB_ID}"
# Poll until completed (5s interval, 10 min timeout).
for _ in $(seq 1 120); do
STATUS=$(curl -sf "http://localhost:8080/v1/videos/${JOB_ID}" \
| python3 -c "import json,sys;print(json.load(sys.stdin)['status'])")
[ "${STATUS}" = "completed" ] && break
[ "${STATUS}" = "failed" ] && { echo "job failed"; exit 1; }
sleep 5
done
curl -sf "http://localhost:8080/v1/videos/${JOB_ID}/content" --output video.mp4
echo "wrote video.mp4 ($(stat -f%z video.mp4 2>/dev/null || stat -c%s video.mp4) bytes)"
# Cleanup: docker stop "${NAME}" && docker rm "${NAME}"
Multimodal Chat¶
Use the standard OpenAI chat-completions API. Multimodal inputs (images, audio) are supplied as URL or base64 content parts in the message list.
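For reference, a sketch of a user message carrying an image by URL in the standard OpenAI content-parts format (the URL is a placeholder):
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this picture."},
    {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}}
  ]
}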
Example model: Qwen2.5-Omni-3B — a 3B-parameter omni model accepting text, image, and audio inputs
and generating text or speech outputs. Multi-stage architecture (thinker + talker + code2wav) requires ≥ 4 GPUs: g5.12xlarge / g6.12xlarge (4×
A10G) or g6e.12xlarge (4× L40S).
Start the server, then submit a request. Three things are required on /v1/chat/completions to produce clean audio from Qwen2.5-Omni:
"modalities": ["audio"]— not["text","audio"](that returns empty audio)."sampling_params_list"— a 3-element list (thinker, talker, code2wav). The image's built-in per-stage defaults produce noise; use the values from the official Qwen docs.- The exact Qwen system prompt.
Omitting sampling_params_list returns 200 with valid WAV bytes that sound like noise — the single most common footgun.
#!/usr/bin/env bash
# End-to-end Qwen2.5-Omni-3B example: start server, wait for ready,
# generate speech via /v1/chat/completions.
#
# REQUIRES ≥ 4 GPUs (e.g., g5.12xlarge / g6.12xlarge / g6e.12xlarge).
# On single-GPU hosts the model's talker stage fails to load on GPU 1.
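# Requires: docker (with NVIDIA runtime), curl, jq, an authenticated ECR pull.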
set -euo pipefail
IMAGE="${IMAGE:-763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm-omni:omni-cuda-v1}"
MODEL="${MODEL:-Qwen/Qwen2.5-Omni-3B}"
NAME="${NAME:-omni3b}"
docker run -d --name "${NAME}" --gpus all --shm-size=16g -p 8080:8080 \
-v "${HOME}/hf-cache:/root/.cache/huggingface" \
-e HF_HUB_ENABLE_HF_TRANSFER=1 \
"${IMAGE}" --model "${MODEL}" \
--host 0.0.0.0 --port 8080 \
--max-model-len 16384 --dtype bfloat16
# First start takes ~8 min (weight download + 3-stage load).
until curl -sf http://localhost:8080/health >/dev/null; do sleep 10; done
# Three things are REQUIRED for clean audio:
# 1. "modalities": ["audio"] (NOT ["text","audio"] — returns empty audio)
# 2. "sampling_params_list" (3-element list: thinker, talker, code2wav;
# built-in defaults produce noise)
# 3. The exact Qwen system prompt below.
# Omitting #2 returns 200 OK with valid WAV bytes that sound like noise.
curl -sf -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen2.5-Omni-3B",
"modalities": ["audio"],
"sampling_params_list": [
{"temperature":0.0,"top_p":1.0,"top_k":-1,"max_tokens":2048,"seed":42,"detokenize":true,"repetition_penalty":1.1},
{"temperature":0.9,"top_p":0.8,"top_k":40,"max_tokens":2048,"seed":42,"detokenize":true,"repetition_penalty":1.05,"stop_token_ids":[8294]},
{"temperature":0.0,"top_p":1.0,"top_k":-1,"max_tokens":2048,"seed":42,"detokenize":true,"repetition_penalty":1.1}
],
"messages": [
{"role":"system","content":[{"type":"text","text":"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
{"role":"user","content":[{"type":"text","text":"Tell me a short, calming bedtime lullaby story for a 6-year-old girl."}]}
]
}' | jq -r '.choices[0].message.audio.data' | base64 -d > lullaby.wav
echo "wrote lullaby.wav ($(stat -f%z lullaby.wav 2>/dev/null || stat -c%s lullaby.wav) bytes)"
# Cleanup: docker stop "${NAME}" && docker rm "${NAME}"
The /v1/audio/speech shortcut (voices: Chelsie, Ethan) bypasses the thinker and does not apply the correct sampling params in 0.18.0, so it
produces noisy output for Qwen2.5-Omni. Prefer /v1/chat/completions for this model.
SageMaker Deployment¶
Prerequisites¶
- AWS CLI configured with appropriate permissions
- An IAM execution role with SageMaker and ECR permissions (see Ray tutorial for an example setup)
- SageMaker Python SDK v2:
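(A sketch; the deploy examples below use inference_ami_version, which needs a reasonably recent SDK release.)
pip install --upgrade sagemaker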
Routing Middleware¶
The SageMaker image includes an ASGI middleware that dispatches /invocations to the correct vllm-omni endpoint based on the CustomAttributes
header:
| CustomAttributes | Dispatched to |
|---|---|
| route=/v1/audio/speech | TTS |
| route=/v1/images/generations | Image generation |
| route=/v1/videos | Video generation (JSON auto-converted to form-data); returns job-ID only in 0.18.0, MP4 not retrievable via SageMaker |
| route=/v1/chat/completions | Multimodal chat |
| (no route) | vLLM default /invocations (chat/completion/embed) |
Environment Variables¶
Any SM_VLLM_* env var is converted to a --<name> CLI argument (e.g., SM_VLLM_MAX_MODEL_LEN=2048 → --max-model-len 2048).
| Variable | Description | Example |
|---|---|---|
| SM_VLLM_MODEL | Model ID (HuggingFace or local path) | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| SM_VLLM_MAX_MODEL_LEN | Max sequence length | 2048 |
| SM_VLLM_ENFORCE_EAGER | Disable CUDA graphs | true |
| SM_VLLM_TENSOR_PARALLEL_SIZE | Number of GPUs for tensor parallelism | 2 |
| HF_TOKEN | HuggingFace token for gated models | hf_... |
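For example, following the conversion rule above, a sketch of an env block that combines several settings (values illustrative):
env = {
    "SM_VLLM_MODEL": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "SM_VLLM_MAX_MODEL_LEN": "2048",
    "SM_VLLM_TENSOR_PARALLEL_SIZE": "2",
}
# Expands to: --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --max-model-len 2048 --tensor-parallel-size 2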
Deploy a TTS Endpoint¶
SageMaker endpoint deployment takes several minutes and incurs costs. Remember to delete endpoints when done.
"""Deploy a vLLM-Omni TTS model to a real-time SageMaker endpoint."""
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
model = Model(
image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-sagemaker-cuda-v1",
role="arn:aws:iam::<ACCOUNT>:role/SageMakerExecutionRole",
env={"SM_VLLM_MODEL": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"},
predictor_cls=Predictor,
)
predictor = model.deploy(
instance_type="ml.g5.xlarge",
initial_instance_count=1,
endpoint_name="vllm-omni-tts",
inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
serializer=JSONSerializer(),
wait=True,
)
# Invoke — route /invocations to /v1/audio/speech via CustomAttributes
sm_runtime = predictor.sagemaker_session.sagemaker_runtime_client
response = sm_runtime.invoke_endpoint(
EndpointName=predictor.endpoint_name,
ContentType="application/json",
Body='{"input": "Hello world", "voice": "vivian", "language": "English"}',
CustomAttributes="route=/v1/audio/speech",
)
with open("speech.wav", "wb") as f:
f.write(response["Body"].read())
GPU deploys require inference_ami_version — the default SageMaker host AMI has incompatible NVIDIA drivers for CUDA 12.9 images. See
ProductionVariant API reference for valid values.
When done, delete the endpoint:
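# Standard SageMaker Python SDK v2 cleanup: remove the model, then the endpoint (and its config).
predictor.delete_model()
predictor.delete_endpoint()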
Async Inference for Long-Running TTS Generation¶
SageMaker real-time inference has a 60-second timeout. First requests to TTS models may exceed this due to torch.compile warmup (~67s); async
inference avoids the limit, as does retrying after warmup completes.
Video generation is not supported on SageMaker in 0.18.0 — see Known Limitations below. Use EC2 for video.
"""Deploy a vLLM-Omni TTS model to a SageMaker async inference endpoint.
Async inference avoids the 60-second real-time invoke timeout, which the first
TTS request can exceed due to torch.compile warmup (~67s). The /v1/audio/speech
endpoint returns raw WAV bytes, so the async output written to S3 is the usable
audio file — no polling or extra retrieval step needed.
"""
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
model = Model(
image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm:omni-sagemaker-cuda-v1",
role="arn:aws:iam::<ACCOUNT>:role/SageMakerExecutionRole",
env={"SM_VLLM_MODEL": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"},
predictor_cls=Predictor,
)
predictor = model.deploy(
instance_type="ml.g5.xlarge",
initial_instance_count=1,
endpoint_name="vllm-omni-tts-async",
inference_ami_version="al2-ami-sagemaker-inference-gpu-3-1",
serializer=JSONSerializer(),
async_inference_config=AsyncInferenceConfig(
output_path="s3://<BUCKET>/vllm-omni-async-output/",
max_concurrent_invocations_per_instance=1,
),
wait=True,
)
# Invoke async — upload the JSON input to S3, then call invoke_endpoint_async.
# The resulting .out object in S3 is the raw WAV audio bytes (content-type audio/wav).
# Use CustomAttributes to route /invocations → /v1/audio/speech.
For async inference, upload the JSON input payload to S3 first, then call invoke_endpoint_async with InputLocation=<s3-uri> and
CustomAttributes="route=/v1/audio/speech". The resulting .out object in the configured S3 output path is the raw WAV audio — no polling or
additional retrieval step required.
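A sketch of that flow with boto3; the bucket name and object key below are placeholders:
"""Sketch: upload the JSON payload to S3, then invoke the async endpoint."""
import json

import boto3

bucket = "<BUCKET>"                                    # same bucket as output_path above
input_key = "vllm-omni-async-input/tts-request.json"   # placeholder object key
payload = {"input": "Hello world", "voice": "vivian", "language": "English"}

s3 = boto3.client("s3")
s3.put_object(Bucket=bucket, Key=input_key, Body=json.dumps(payload))

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint_async(
    EndpointName="vllm-omni-tts-async",
    ContentType="application/json",
    InputLocation=f"s3://{bucket}/{input_key}",
    CustomAttributes="route=/v1/audio/speech",
)
# The .out object at OutputLocation holds the raw WAV bytes once generation finishes.
print("result will be written to:", response["OutputLocation"])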
Known Limitations¶
- Video generation is not supported on SageMaker in 0.18.0. The /v1/videos endpoint is async by design — it returns a job-ID JSON immediately and generates the MP4 in the background. Through SageMaker async inference, only that job-ID JSON is written to S3; the MP4 itself never lands in S3 and cannot be retrieved through invoke_endpoint or invoke_endpoint_async. Use EC2 for video generation — direct container access supports the full workflow (create job, poll status, download MP4). SageMaker support is expected once POST /v1/videos/sync (which blocks and returns raw MP4 bytes) is available in a future vllm-omni release.
- First-request latency on SageMaker real-time endpoints. TTS models can exceed the 60s invoke timeout on the first request due to torch.compile warmup. Use async inference or retry after warmup.
Release Notes¶
See vLLM-Omni Release Notes for version history and changelogs.