EKS Deployment¶
The SGLang container works directly with Kubernetes manifests on Amazon EKS. It serves the OpenAI-compatible API on port 30000, the same as on EC2, and any sglang.launch_server flag can be passed via the container args.
Deployment Example¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang-server
  template:
    metadata:
      labels:
        app: sglang-server
    spec:
      containers:
        - name: sglang
          image: public.ecr.aws/deep-learning-containers/sglang:server-cuda
          args:
            - "--model-path"
            - "openai/gpt-oss-20b"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "30000"
          ports:
            - containerPort: 30000
          resources:
            limits:
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 120
          readinessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 120
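To make the API on port 30000 reachable from other workloads in the cluster, a Service can select the pods by the app: sglang-server label used in the Deployment. The following is a minimal sketch and not part of the original example; the Service name sglang-server and the ClusterIP type are assumptions.

```yaml
# Hypothetical Service in front of the sglang-server Deployment.
# The name and the ClusterIP type are assumptions, not taken from the example.
apiVersion: v1
kind: Service
metadata:
  name: sglang-server
spec:
  type: ClusterIP          # in-cluster access only; adjust for external exposure
  selector:
    app: sglang-server     # matches the pod label set in the Deployment template
  ports:
    - name: http
      port: 30000          # port exposed by the Service
      targetPort: 30000    # containerPort of the sglang container
```

With such a Service in place, pods in the same namespace could send OpenAI-compatible requests to http://sglang-server:30000.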
Key Requirements¶
- Request GPU resources via resources.limits.nvidia.com/gpu
- Pass --host 0.0.0.0 so the server binds to all interfaces
- Use /health on port 30000 for liveness and readiness probes
- Set initialDelaySeconds high enough for model loading (120 seconds or more for large models)
- For gated models, provide HF_TOKEN via a Kubernetes Secret
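The last requirement can be sketched as a fragment of the container spec. The Secret name hf-token and the key token are hypothetical; only the HF_TOKEN variable name comes from the list above. The fragment assumes the Secret already exists in the same namespace as the Deployment.

```yaml
# Fragment of the Deployment's pod spec, not a complete manifest.
# Merge the env block into the existing sglang container definition.
containers:
  - name: sglang
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token   # hypothetical Secret name
            key: token       # hypothetical key inside that Secret
```

One way to create a matching Secret is kubectl create secret generic hf-token --from-literal=token=<your-token>, where the placeholder stands for a real Hugging Face token.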
Multi-GPU¶
For tensor parallelism across multiple GPUs on a single node, add --tp 4 to the container args, raise the nvidia.com/gpu limit to match, and ensure the node has at least 4 GPUs available.
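A sketch of the 4-GPU change, shown as a fragment of the sglang container spec. It reuses the model and port from the Deployment Example and assumes the GPU limit is raised so that the container can see all four devices.

```yaml
# Fragment of the sglang container spec for 4-way tensor parallelism.
# Only args and resources differ from the single-GPU example.
args:
  - "--model-path"
  - "openai/gpt-oss-20b"
  - "--host"
  - "0.0.0.0"
  - "--port"
  - "30000"
  - "--tp"
  - "4"                  # tensor-parallel degree
resources:
  limits:
    nvidia.com/gpu: "4"  # assumption: request as many GPUs as the --tp value
```

The pod will stay unscheduled unless some node can provide all four GPUs to a single container.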
Model-Specific Tuning¶
For recommended serving flags, hardware configurations, and quantization options per model, see the SGLang hyperparameter tuning guide and Supported Models.