
Changelog

Changelog for the Amazon Linux 2023-based vLLM images (server-cuda, server-sagemaker-cuda).


v2.0.0 — 2026-06-05

Tags: server-cuda-v2.0 · server-sagemaker-cuda-v2.0

vLLM source: 6aabe22 (0.22.1rc0+amzn2023.6aabe221)

Bundled versions: CUDA 13.0.2 · Python 3.12 · FlashInfer 0.6.11.post2 · DeepEP commit 73b6ea4

Highlights

  • vLLM 0.22.1rc0 — upgraded from 0.20.0.dev361 (v1.4)
  • CUDA 13.0.2 — upgraded from 12.9.1; requires NVIDIA driver 580+
  • FlashInfer 0.6.11.post2 — upgraded from 0.6.8.post1; precompiled cubins now bundled
  • EC2 entrypoint simplified — uses vllm serve CLI instead of python3 -m vllm.entrypoints.openai.api_server
  • nixl-cu13 fix — the NIXL KV connector is now correctly linked against CUDA 13
  • transformers pinned to <5.10 — avoids AttributeError on Voxtral with mistral-common 1.11.2
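
The driver requirement that comes with CUDA 13.0.2 can be checked before a container is started. The following is a minimal sketch, assuming the driver version is already available as a string (for example, as printed by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`). The function name `driver_supports_cuda13` is hypothetical and is not part of the images.

```python
# Minimum driver major version stated in the v2.0.0 highlights.
MIN_DRIVER_MAJOR = 580


def driver_supports_cuda13(driver_version: str) -> bool:
    """Return True if the driver's major version is at least 580.

    `driver_version` is a dotted string such as "580.65.06".
    Hypothetical helper for illustration only.
    """
    major = int(driver_version.strip().split(".")[0])
    return major >= MIN_DRIVER_MAJOR
```

Only the leading component of the version string is compared, since the note above states the requirement as "580+".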

New Model Support

  • Qwen3-Embedding-0.6B and Qwen3-VL-Embedding-2B (embedding)
  • Qwen3-Reranker-4B (reranking)
  • IBM Granite-Speech-4.1-2B (ASR)
  • Gemma 4 family: 26B-A4B-it, 31B-it, E4B-it, E2B-it
  • Qwen3.5 (0.8B, 2B) and Qwen3.6 (27B, 35B-A3B)

Security

  • CVE-2025-33219: explicit cuda-compat-13-0 upgrade in EC2 and SageMaker stages
  • model-hosting-container-standards bumped to ≥0.1.15

v1.4.0 — 2026-05-22

Tags: server-cuda-v1.4 · server-sagemaker-cuda-v1.4

vLLM source: 3f5bd48 (0.20.0.dev361+amzn2023.3f5bd482)

Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.8.post1 · DeepEP commit 73b6ea4

Highlights

  • SageMaker route middleware (#6096) — /invocations requests can now be routed to any vLLM endpoint via X-Amzn-SageMaker-Custom-Attributes: route=<path>
  • libsndfile added to the SageMaker image for audio I/O in network-isolated deployments
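
The route middleware reads the target path from the `X-Amzn-SageMaker-Custom-Attributes` header. The following is a minimal sketch of how the headers and body of such a request could be assembled, assuming the route is given as a path with a leading slash. The helper name `build_routed_invocation` and the example payload are hypothetical.

```python
import json
from typing import Dict, Tuple


def build_routed_invocation(route: str, payload: dict) -> Tuple[Dict[str, str], bytes]:
    """Build headers and a JSON body for a POST to /invocations.

    The header value follows the `route=<path>` form from the v1.4.0 notes.
    Hypothetical helper for illustration only.
    """
    headers = {
        "Content-Type": "application/json",
        "X-Amzn-SageMaker-Custom-Attributes": f"route={route}",
    }
    body = json.dumps(payload).encode("utf-8")
    return headers, body


# Example: ask the middleware to forward the request to the transcription
# endpoint listed under New Model Support (payload is a placeholder).
headers, body = build_routed_invocation("/v1/audio/transcriptions", {"example": True})
```

When the endpoint is called through the SageMaker runtime API instead of directly, the same `route=<path>` string would be passed as the `CustomAttributes` argument of boto3's `invoke_endpoint`, which SageMaker forwards in this header.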

New Model Support

  • Voxtral-Mini-4B-Realtime-2602 — Mistral audio transcription via /v1/audio/transcriptions

v1.3.0 — 2026-05-12

Tags: server-cuda-v1.3 · server-sagemaker-cuda-v1.3

vLLM source: 3f5bd48 (0.20.0.dev361+amzn2023.3f5bd482)

Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.8.post1 · DeepEP commit 73b6ea4

Highlights

  • SageMaker standard-supervisor integration — process auto-recovery on crash, dynamic dependency installation from requirements.txt in model artifacts, and custom handler support via model.py
  • Gemma 4 fixes — pipeline parallelism, MoE weight loading, CUDA graph capture, multimodal memory, tool calling stability, MTP speculative decoding support
  • DeepSeek V4 fixes — numerical correctness for topk, tool calling for non-streaming, disaggregated P/D serving, performance optimizations

SageMaker Features (new)

  • Process supervision with auto-recovery (configurable via PROCESS_AUTO_RECOVERY)
  • Dynamic requirements.txt installation before server startup
  • Custom /ping and /invocations handler support via model.py in model artifacts
  • LoRA adapter routing via request headers

Model Fixes

  • Gemma 4: fix pipeline parallelism (PP), MoE expert weight remapping, activation mismatch, infinite loop in tool parser, chat template sync
  • Gemma 4: add MTP speculative decoding support
  • DeepSeek V4: fix topk numerical issue, repeated RoPE cache initialization, disaggregated serving
  • DeepSeek V4: integrate tile kernel head_compute_mix_kernel for improved performance

v1.2.0 — 2026-04-30

Tags: server-cuda-v1.2 · server-sagemaker-cuda-v1.2

vLLM source: 8a8c9b5 (0.20.0.dev60+amzn2023.8a8c9b56)

Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.8.post1 · DeepEP commit 73b6ea4

Highlights

  • DeepSeek V4 support — full model support including Pro and Flash variants, multi-stream pre-attention GEMM, MLA + group FP8 fusion
  • Qwen3.5 / Qwen3.6 / Qwen3-Coder fixes — LoRA for MoE, double gate call fix, tool calling fix
  • Gemma 4 fixes — multimodal embedder norm order, bidirectional vision attention for sliding layers
  • Removed vLLM RayServe setup (deprecated)
  • Fixed version matching in the telemetry script so that it is PEP 440 compatible
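
The version strings in this changelog, such as 0.20.0.dev60+amzn2023.8a8c9b56, carry a PEP 440 local version segment after the `+`. The telemetry script itself is not shown here, so the following is only a sketch of one way to compare such strings by their public part. The helper names are hypothetical.

```python
from typing import Optional, Tuple


def split_local_version(version: str) -> Tuple[str, Optional[str]]:
    """Split a PEP 440 version into (public, local) at the first '+'.

    Returns None for the local part when the string has no '+'.
    Hypothetical helper; it does not validate the version syntax.
    """
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def same_public_version(a: str, b: str) -> bool:
    """Compare two versions while ignoring their local segments."""
    return split_local_version(a)[0] == split_local_version(b)[0]
```

For full PEP 440 parsing and ordering, the third-party `packaging` library provides `packaging.version.Version`, which exposes the `public` and `local` parts as attributes. The sketch above stays within the standard library.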

New Model Support

  • DeepSeek V4 Pro and Flash
  • DeepSeek V4 base model

Model Fixes

  • DeepSeek V4: token leakage fix, inductor error fix, KV block release for skipped P-ranks with MLA
  • Qwen3.5: LoRA support for MoE, double gate call fix
  • Qwen3: fix tool calling when <tool_call> acts as the implicit end of reasoning
  • Gemma 4: multimodal embedder norm order fix, bidirectional vision attention

v1.1.0 — 2026-04-28

Tags: server-cuda-v1.1 · server-sagemaker-cuda-v1.1

vLLM source: 6ef1efd5 (0.19.1+amzn2023.6ef1efd5)

Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.7 · LMCache 0.4.5.dev0 (custom build)

Highlights

  • LMCache bidirectional NIXL cache probe — enables disaggregated prefill/decode (P/D) deployments with bidirectional cache querying between prefill and decode workers

Changes

  • Override LMCache with source build from commit 7f60057 for bidirectional NIXL feature
  • LMCache version: 0.4.5.dev0+amzn2023.7f60057c

v1.0.0 — 2026-04-25

Tags: server-cuda-v1.0 · server-sagemaker-cuda-v1.0

vLLM source: 6ef1efd5 (0.19.1+amzn2023.6ef1efd5)

Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.7

Highlights

  • Initial release of vLLM Server containers on Amazon Linux 2023
  • Simplified tag format: server-cuda[-vMAJOR[.MINOR[.PATCH]]]
  • OpenAI-compatible API server on port 8000
  • Multi-GPU inference via tensor parallelism with NCCL
  • EFA support for multi-node deployments
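
The tag format server-cuda[-vMAJOR[.MINOR[.PATCH]]] can be validated with a regular expression. The following is a minimal sketch, assuming the SageMaker variant follows the same pattern, as the Tags lines in each release suggest. The names `TAG_RE` and `parse_tag` are hypothetical.

```python
import re
from typing import Optional

# Optional version suffix: -vMAJOR, -vMAJOR.MINOR, or -vMAJOR.MINOR.PATCH.
TAG_RE = re.compile(
    r"^(?P<variant>server-cuda|server-sagemaker-cuda)"
    r"(?:-v(?P<major>\d+)(?:\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?)?)?$"
)


def parse_tag(tag: str) -> Optional[dict]:
    """Parse an image tag into its variant and version parts.

    Returns None if the tag does not match the format. Version parts
    that are absent from the tag are returned as None.
    """
    match = TAG_RE.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "variant": parts["variant"],
        "major": int(parts["major"]) if parts["major"] is not None else None,
        "minor": int(parts["minor"]) if parts["minor"] is not None else None,
        "patch": int(parts["patch"]) if parts["patch"] is not None else None,
    }
```

For example, `parse_tag("server-cuda-v2.0")` yields the variant `server-cuda` with major 2, minor 0, and no patch component, while a bare `server-cuda` tag parses with all version parts set to None.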