Changelog
Changelog for the Amazon Linux 2023-based vLLM images (server-cuda, server-sagemaker-cuda).
v1.3.0 — 2026-05-12
Tags: server-cuda-v1.3 · server-sagemaker-cuda-v1.3
vLLM source: 3f5bd48 (0.20.0.dev361+amzn2023.3f5bd482)
Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.8.post1 · DeepEP commit 73b6ea4
Highlights
- SageMaker standard-supervisor integration — process auto-recovery on crash, dynamic dependency installation from requirements.txt in model artifacts, and custom handler support via model.py
- Gemma 4 fixes — pipeline parallelism, MoE weight loading, CUDA graph capture, multimodal memory, tool calling stability, MTP speculative decoding support
- DeepSeek V4 fixes — numerical correctness for topk, tool calling for non-streaming, disaggregated P/D serving, performance optimizations
SageMaker Features (new)
- Process supervision with auto-recovery (configurable via PROCESS_AUTO_RECOVERY)
- Dynamic requirements.txt installation before server startup
- Custom /ping and /invocations handler support via model.py in model artifacts
- LoRA adapter routing via request headers
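As a rough illustration of the requirements.txt step, the sketch below builds a pip command for a requirements file found in a model directory. It is not the image's actual startup logic. The function name build_install_command is hypothetical, and /opt/ml/model is assumed as the model directory because that is where SageMaker conventionally places model artifacts.

```python
import sys
from pathlib import Path
from typing import List, Optional

# Assumption: SageMaker's conventional model artifact directory.
DEFAULT_MODEL_DIR = "/opt/ml/model"


def build_install_command(model_dir: str = DEFAULT_MODEL_DIR) -> Optional[List[str]]:
    """Return a pip command for model_dir/requirements.txt, or None if absent.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    requirements = Path(model_dir) / "requirements.txt"
    if not requirements.is_file():
        # Nothing to install: the server can start right away.
        return None
    # Use the current interpreter so packages land in the same environment.
    return [sys.executable, "-m", "pip", "install", "-r", str(requirements)]
```

A supervisor could pass the returned list to subprocess.run(..., check=True) before launching the server, so that a failed install stops startup instead of surfacing later as an import error.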
Model Fixes
- Gemma 4: fix PP, MoE expert weight remapping, activation mismatch, infinite loop in tool parser, chat template sync
- Gemma 4: add MTP speculative decoding support
- DeepSeek V4: fix topk numerical issue, repeated RoPE cache initialization, disaggregated serving
- DeepSeek V4: integrate tile kernel head_compute_mix_kernel for improved performance
v1.2.0 — 2026-04-30
Tags: server-cuda-v1.2 · server-sagemaker-cuda-v1.2
vLLM source: 8a8c9b5 (0.20.0.dev60+amzn2023.8a8c9b56)
Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.8.post1 · DeepEP commit 73b6ea4
Highlights
- DeepSeek V4 support — full model support including Pro and Flash variants, multi-stream pre-attention GEMM, MLA + group FP8 fusion
- Qwen3.5 / Qwen3.6 / Qwen3-Coder fixes — LoRA for MoE, double gate call fix, tool calling fix
- Gemma 4 fixes — multimodal embedder norm order, bidirectional vision attention for sliding layers
- Removed vLLM RayServe setup (deprecated)
- Fixed telemetry script version matching with proper PEP 440 compatibility
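The vLLM version strings in this changelog use the PEP 440 local version form, a public version followed by "+" and a local label. The sketch below splits such a string into its parts. It is a simplified pattern covering only the shapes that appear in this changelog, not the full PEP 440 grammar and not the telemetry script's actual code. The name split_version is hypothetical.

```python
import re

# Simplified pattern (assumption): release digits, optional ".devN",
# optional "+local" label. The full PEP 440 grammar allows more forms.
VERSION_RE = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)"
    r"(?:\.dev(?P<dev>\d+))?"
    r"(?:\+(?P<local>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*))?$"
)


def split_version(version: str) -> dict:
    """Split a version string into release tuple, dev number, and local label."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return {
        "release": tuple(int(part) for part in match["release"].split(".")),
        "dev": int(match["dev"]) if match["dev"] is not None else None,
        "local": match["local"],
    }


# Version strings taken from this changelog.
print(split_version("0.20.0.dev60+amzn2023.8a8c9b56"))
print(split_version("0.19.1+amzn2023.6ef1efd5"))
```

Comparing only the release tuple would treat 0.20.0.dev60 and 0.20.0 as equal, which is one reason naive string or tuple matching can go wrong on development builds.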
New Model Support
- DeepSeek V4 Pro and Flash
- DeepSeek V4 base model
Model Fixes
- DeepSeek V4: token leakage fix, inductor error fix, KV block release for skipped P-ranks with MLA
- Qwen3.5: LoRA support for MoE, double gate call fix
- Qwen3: tool calling fix for <tool_call> as implicit reasoning end
- Gemma 4: multimodal embedder norm order fix, bidirectional vision attention
v1.1.0 — 2026-04-28
Tags: server-cuda-v1.1 · server-sagemaker-cuda-v1.1
vLLM source: 6ef1efd5 (0.19.1+amzn2023.6ef1efd5)
Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.7 · LMCache 0.4.5.dev0 (custom build)
Highlights
- LMCache bidirectional NIXL cache probe — enables disaggregated prefill/decode (P/D) deployments with bidirectional cache querying between prefill and decode workers
Changes
- Override LMCache with source build from commit 7f60057 for bidirectional NIXL feature
- LMCache version: 0.4.5.dev0+amzn2023.7f60057c
v1.0.0 — 2026-04-25
Tags: server-cuda-v1.0 · server-sagemaker-cuda-v1.0
vLLM source: 6ef1efd5 (0.19.1+amzn2023.6ef1efd5)
Bundled versions: CUDA 12.9.1 · Python 3.12 · FlashInfer 0.6.7
Highlights
- Initial release of vLLM Server containers on Amazon Linux 2023
- Simplified tag format: server-cuda[-vMAJOR[.MINOR[.PATCH]]]
- OpenAI-compatible API server on port 8000
- Multi-GPU inference via tensor parallelism with NCCL
- EFA support for multi-node deployments
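The tag format above can be checked mechanically. The following is a minimal sketch of a parser for it. It assumes that server-sagemaker-cuda tags follow the same scheme, which matches the Tags lines in this changelog but is not stated explicitly. The names TAG_RE and parse_tag are hypothetical.

```python
import re
from typing import Optional, Tuple

# Mirrors server-cuda[-vMAJOR[.MINOR[.PATCH]]].
# Assumption: server-sagemaker-cuda uses the same version suffix.
TAG_RE = re.compile(
    r"^(?P<image>server-cuda|server-sagemaker-cuda)"
    r"(?:-v(?P<major>\d+)(?:\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?)?)?$"
)

Version = Tuple[Optional[int], Optional[int], Optional[int]]


def parse_tag(tag: str) -> Tuple[str, Version]:
    """Return (image name, (major, minor, patch)); omitted parts are None."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a recognized image tag: {tag!r}")

    def to_int(name: str) -> Optional[int]:
        value = match[name]
        return int(value) if value is not None else None

    return match["image"], (to_int("major"), to_int("minor"), to_int("patch"))


# Tags taken from this changelog, plus the bare image name.
for tag in ("server-cuda", "server-cuda-v1.3", "server-sagemaker-cuda-v1.0"):
    print(tag, "->", parse_tag(tag))
```

Returning None for omitted components keeps the difference between a floating tag such as server-cuda-v1 and a fully pinned one visible to the caller.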