Changelog

Changelog for the Amazon Linux 2023-based vLLM-Omni images (omni-cuda, omni-sagemaker-cuda).

v1.4.0 - 2026-07-02

Tags: omni-cuda-v1.4 · omni-sagemaker-cuda-v1.4

vLLM-Omni source: v0.21.0rc1 (unchanged from v1.3)

DLC PR: #6298

SageMaker /v1/videos and /v1/videos/sync now accept application/json again. The routing middleware restores JSON→multipart conversion for the form-data video routes, so clients can send a plain JSON body instead of hand-building multipart. The existing multipart/form-data path is unchanged (byte-for-byte passthrough), so callers already sending multipart need no changes.
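As a rough illustration of the JSON path, the sketch below builds the arguments for a boto3 `sagemaker-runtime` `invoke_endpoint` call with a plain JSON body. The endpoint name, the payload fields, and especially the `route=...` layout of `CustomAttributes` are assumptions made for the example; the changelog does not spell out the routing syntax, so check the examples directory for the format the middleware expects.

```python
import json


def build_json_video_request(endpoint_name, payload, route="/v1/videos/sync"):
    """Return keyword arguments for a sagemaker-runtime invoke_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
        # ASSUMPTION: this key=value layout is illustrative only; the real
        # CustomAttributes routing syntax is not given in the changelog.
        "CustomAttributes": f"route={route}",
    }


def invoke_video_sync(client, endpoint_name, payload):
    """Send the JSON request and return the raw response bytes.

    `client` is expected to behave like boto3.client("sagemaker-runtime").
    """
    kwargs = build_json_video_request(endpoint_name, payload)
    response = client.invoke_endpoint(**kwargs)
    return response["Body"].read()
```

With a real client this would be called as `invoke_video_sync(boto3.client("sagemaker-runtime"), "my-omni-endpoint", {"prompt": "a red kite over dunes"})`, where both the endpoint name and the `prompt` field are hypothetical.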

No framework bump — still tracks vLLM-Omni 0.21.0rc1 (upstream vLLM v0.21.0). This is a DLC-minor release (v1.3 → v1.4) scoped to the SageMaker video-route change above.

Tags: omni-cuda-v1.3 · omni-sagemaker-cuda-v1.3

vLLM-Omni source: v0.21.0rc1 (pre-release, tracking upstream vLLM v0.21.0)

DLC PR: #6110

Upgraded to vLLM-Omni 0.21.0rc1, aligned with upstream vLLM v0.21.0
Cherry-picked upstream Dockerfile fixes for cublas headers (JIT), flashinfer cubin layering, and the nixl-cu13 install ordering for matching nixl_ep_cpp.so

Voice-clone TTS (Qwen3-TTS-Base) throughput restored — the upstream Code2Wav decode-chunk un-batching regression flagged in v1.1 is resolved in vllm-omni 0.21.0rc1.

Transformers pinned to <5.9.0. Transformers 5.9.0 removed the deprecated input_embeds alias and the cache_position kwarg from create_causal_mask / create_sliding_window_causal_mask, which breaks Qwen3-TTS decode in vllm-omni 0.21.0rc1. Upstream fix: vllm-project/vllm-omni#3786. The pin will be dropped once a vllm-omni release containing that fix ships.
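For derived images that install extra packages, a small guard like the one below can catch an accidental upgrade past the pin. It is a sketch, not part of the image: the helper names are hypothetical, and the version parsing only looks at the leading numeric release segment, so any 5.9.0 pre-release is treated as outside the pin (which matches how `<5.9.0` is interpreted under PEP 440).

```python
import re
from importlib import metadata

# Exclusive upper bound taken from the changelog entry: transformers < 5.9.0
PIN_UPPER_BOUND = (5, 9, 0)


def release_tuple(version):
    """Extract up to three leading numeric components, e.g. '5.8.1' -> (5, 8, 1)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def satisfies_pin(version):
    """True if the version's release segment is strictly below the pin."""
    return release_tuple(version) < PIN_UPPER_BOUND


def installed_transformers_ok():
    """Check the installed transformers; None means it is not installed."""
    try:
        return satisfies_pin(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None
```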

Tags: omni-cuda-v1.2 · omni-sagemaker-cuda-v1.2

vLLM-Omni source: v0.20.0 (unchanged from v1.1)

DLC PR: #6101

SageMaker /v1/videos and /v1/videos/sync now require multipart/form-data directly. The routing middleware no longer auto-converts JSON request bodies to multipart. Clients must build the multipart body locally and pass ContentType="multipart/form-data; boundary=..." to InvokeEndpoint; SageMaker forwards the body and ContentType through to the model server unchanged.
See examples/vllm-omni/sagemaker/deploy_video_sync.py for the updated invocation pattern.

Clients that previously sent JSON to /v1/videos* via SageMaker CustomAttributes routing must switch to a pre-built multipart body. JSON requests to these routes will now reach the model server unconverted and fail.
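A minimal sketch of building the multipart body locally, using only the standard library. It handles plain text fields only, and the field names in the usage note are hypothetical since the changelog does not list the form fields the video routes accept. The returned content type string is what would be passed as ContentType to InvokeEndpoint.

```python
import uuid


def encode_multipart(fields, boundary=None):
    """Encode text form fields as multipart/form-data.

    Returns (body_bytes, content_type), where content_type carries the boundary.
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}".encode("ascii"))
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"))
        lines.append(b"")  # blank line separates part headers from the value
        lines.append(str(value).encode("utf-8"))
    lines.append(f"--{boundary}--".encode("ascii"))  # closing delimiter
    lines.append(b"")
    body = b"\r\n".join(lines)
    return body, f"multipart/form-data; boundary={boundary}"
```

For example, `body, content_type = encode_multipart({"prompt": "a red kite over dunes"})` yields a body and a matching `multipart/form-data; boundary=...` header value; the `prompt` field is an assumption for illustration.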

Tags: omni-cuda-v1.1 · omni-sagemaker-cuda-v1.1

vLLM-Omni source: v0.20.0

Upgraded to vLLM-Omni 0.20.0, aligned with upstream vLLM v0.20.0
CUDA bumped from 12.9 to 13.0
New /v1/audio/generate endpoint for diffusion-based audio generation
New /v1/videos/sync endpoint — blocks until complete and returns raw MP4, enabling video generation on SageMaker
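Because /v1/videos/sync returns the raw MP4 bytes rather than a JSON envelope, a client can write the response straight to disk. The sketch below adds a light sanity check first; the helper names are hypothetical, and the check relies only on the fact that MP4 files normally start with a box whose type field (bytes 4 to 8) is `ftyp`, so an error payload is not silently saved as a video.

```python
from pathlib import Path


def looks_like_mp4(data):
    """Heuristic: MP4 files normally begin with an 'ftyp' box at bytes 4..8."""
    return len(data) >= 8 and data[4:8] == b"ftyp"


def save_video(data, path):
    """Write response bytes to `path` after a basic MP4 signature check."""
    if not looks_like_mp4(data):
        raise ValueError("response does not look like an MP4 file")
    out = Path(path)
    out.write_bytes(data)
    return out
```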

Added numactl for fastsafetensors compatibility with CUDA 13
Added the VLLM_ENABLE_CUDA_COMPATIBILITY environment variable, defaulting to 0 (set it to 1 on hosts with older NVIDIA drivers)
Removed sox system dependency (no longer needed by vllm-omni)
Expanded smoke-test matrix from 6 to 9 models with performance benchmarks

Known issue: voice-clone TTS (Qwen3-TTS-Base) throughput is lower than in v1.0 because of upstream Code2Wav decode-chunk un-batching. A fix is merged upstream and will land with the next vllm-omni release.

Tags: omni-cuda-v1.0 · omni-sagemaker-cuda-v1.0

vLLM-Omni source: v0.18.0

Initial release of vLLM-Omni containers on Amazon Linux 2023
Serves TTS, image generation, video generation, and multimodal chat through OpenAI-compatible APIs
SageMaker routing middleware for dispatching /invocations to any omni endpoint via CustomAttributes
Built on CUDA 12.9 with Python 3.12