Supported Models

All models listed below are regression-tested on every DLC vLLM-Omni release and work with the images listed on the Overview page.

The Coverage column indicates test depth. Smoke tests run on every PR. Benchmark tests measure throughput and latency against pass/fail thresholds before each release. A Smoke + Benchmark tag means both apply.

Tested Models

Modality                     Model                                 Coverage
TTS (preset voice)           Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice  Smoke + Benchmark
TTS (voice clone)            Qwen/Qwen3-TTS-12Hz-1.7B-Base         Smoke + Benchmark
TTS (voice clone)            FunAudioLLM/CosyVoice3-0.5B           Smoke + Benchmark
Image generation             black-forest-labs/FLUX.2-klein-4B     Smoke + Benchmark
Image generation             baidu/ERNIE-Image-Turbo               Smoke + Benchmark
Video (text-to-video)        Wan-AI/Wan2.1-T2V-1.3B-Diffusers      Smoke + Benchmark
Video (unified create/edit)  Wan-AI/Wan2.1-VACE-1.3B-Diffusers     Smoke + Benchmark
Audio generation             stabilityai/stable-audio-open-1.0     Smoke + Benchmark
Omni chat                    Qwen/Qwen2.5-Omni-3B                  Benchmark

The Wan2.1-VACE model accepts text plus optional video, mask, or reference image inputs for unified video creation and editing. This distinguishes it from the Wan2.1-T2V pipeline, which takes text only.

Model Compatibility

Any model supported by upstream vLLM-Omni should work, although only the models listed under Tested Models are regression-tested. Requirements:

  • Models must either have a standard Hugging Face config.json with a recognized model_type, or be diffusers pipeline models with a model_index.json.
  • Multi-stage omni models (thinker + talker + decoder) such as Qwen2.5-Omni need significantly more VRAM than the model size suggests. Refer to the individual model cards for minimum GPU requirements.
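
The file-layout requirement in the first bullet can be checked locally before deploying. The following is a minimal sketch, not part of vLLM-Omni: the function name check_model_dir is hypothetical, and it assumes the model has already been downloaded to a local directory. It only confirms that model_index.json exists, or that config.json exists and carries a model_type field. It cannot tell whether upstream vLLM-Omni actually recognizes that model_type.

```python
import json
from pathlib import Path


def check_model_dir(model_dir):
    """Return (ok, reason) for a local model directory.

    Hypothetical helper for illustration only; it checks file layout,
    not whether vLLM-Omni supports the model.
    """
    root = Path(model_dir)

    # Diffusers pipelines are identified by a top-level model_index.json.
    if (root / "model_index.json").is_file():
        return True, "diffusers pipeline (model_index.json found)"

    config_path = root / "config.json"
    if not config_path.is_file():
        return False, "neither config.json nor model_index.json found"

    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False, "config.json is not valid JSON"

    # A standard Hugging Face config is a JSON object with a model_type key.
    model_type = config.get("model_type") if isinstance(config, dict) else None
    if not model_type:
        return False, "config.json has no model_type"

    return True, f"config.json with model_type={model_type!r}"
```

Calling check_model_dir on a local model directory returns a boolean and a short reason string. A True result means only that the expected files are present, so the model may still fail to load if its model_type is not supported upstream.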