Supported Models
All models listed below are regression-tested on every DLC vLLM-Omni release and work with the images listed on the Overview page.
The Coverage column indicates test depth: Smoke runs on every PR; Benchmark runs throughput and latency tests with pass/fail thresholds before release. A Smoke + Benchmark tag means both apply.
Tested Models
| Modality | Model | Coverage |
|---|---|---|
| TTS (preset voice) | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | Smoke + Benchmark |
| TTS (voice clone) | Qwen/Qwen3-TTS-12Hz-1.7B-Base | Smoke + Benchmark |
| TTS (voice clone) | FunAudioLLM/CosyVoice3-0.5B | Smoke + Benchmark |
| Image generation | black-forest-labs/FLUX.2-klein-4B | Smoke + Benchmark |
| Image generation | baidu/ERNIE-Image-Turbo | Smoke + Benchmark |
| Video — text-to-video | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | Smoke + Benchmark |
| Video — unified create/edit | Wan-AI/Wan2.1-VACE-1.3B-Diffusers | Smoke + Benchmark |
| Audio generation | stabilityai/stable-audio-open-1.0 | Smoke + Benchmark |
| Omni chat | Qwen/Qwen2.5-Omni-3B | Benchmark |
The Wan2.1-VACE model accepts text plus optional video, mask, or reference image inputs for unified video creation and editing, which distinguishes it from the text-only Wan2.1-T2V pipeline.
Model Compatibility
Any model supported by upstream vLLM-Omni should work. Requirements:
- Models must have a standard HuggingFace `config.json` with a recognized `model_type`, or be diffusers pipeline models with `model_index.json`.
- Multi-stage omni models (thinker + talker + decoder) like Qwen2.5-Omni need significantly more VRAM than the model size suggests. Refer to individual model cards for minimum GPU requirements.
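The file-layout requirement above can be checked locally before attempting to serve a model. The following is a minimal sketch, assuming the model has already been downloaded to a local directory. The function name `classify_model_dir` and its return labels are hypothetical and not part of vLLM-Omni. It only confirms that the expected files exist and that `model_type` is set; it cannot tell whether upstream vLLM-Omni actually recognizes that `model_type`.

```python
import json
from pathlib import Path


def classify_model_dir(model_dir):
    """Return 'diffusers', 'transformers', or None for a local model directory.

    Hypothetical helper for illustration; not a vLLM-Omni API.
    """
    root = Path(model_dir)

    # Diffusers pipelines are identified by a top-level model_index.json.
    if (root / "model_index.json").is_file():
        return "diffusers"

    # Otherwise expect a HuggingFace config.json that declares a model_type.
    config_path = root / "config.json"
    if config_path.is_file():
        try:
            config = json.loads(config_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return None
        if isinstance(config, dict) and config.get("model_type"):
            return "transformers"

    # Neither layout was found, so the directory does not meet the requirement.
    return None
```

A return value of `None` means the directory matches neither layout. A non-`None` result is a necessary condition only, since the serving stack still has to support the specific architecture.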