EKS Deployment¶
The vLLM container works directly with Kubernetes manifests on Amazon EKS.
Deployment Example¶
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 1
selector:
matchLabels:
app: vllm-server
template:
metadata:
labels:
app: vllm-server
spec:
containers:
- name: vllm
image: public.ecr.aws/deep-learning-containers/vllm:server-cuda
args:
- "--model"
- "openai/gpt-oss-20b"
- "--host"
- "0.0.0.0"
- "--port"
- "8000"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "1"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
Key Requirements¶
- Request GPU resources via
resources.limits.nvidia.com/gpu - Pass
--host 0.0.0.0so the server binds to all interfaces - Use
/healthon port 8000 for liveness and readiness probes - Set
initialDelaySecondshigh enough for model loading (120s+ for large models) - For gated models, provide
HF_TOKENvia a Kubernetes Secret:
Multi-GPU¶
For tensor parallelism across multiple GPUs on a single node:
Add --tensor-parallel-size 4 to the container args and ensure the node has 4+ GPUs available.
Model-Specific Tuning¶
For recommended serving flags, hardware configurations, and quantization options per model, see recipes.vllm.ai.