EKS Cluster Auto-Scaler¶
Kubernetes provisions nodes using CAS (Cluster Autoscaler). AWS EKS has its own implementation of K8 CAS, and EKS uses Managed-Nodegroups to spuns of Nodes.
Logs of EKS Cluster Auto-scaler.¶
On AWS, Cluster Autoscaler utilizes Amazon EC2 Auto Scaling Groups to provision nodes. This section will help you identify the error message when a AutoScaler fails to provision nodes.
An example scenario, where the NodeGroup would fail due to non-supported nodes in certain AZs.
Could not launch On-Demand Instances. Unsupported - Your requested instance type (g4dn.xlarge) is not supported in your requested Availability Zone (ca-central-1d). Please retry your request by not specifying an Availability Zone or choosing ca-central-1a, ca-central-1b. Launching EC2 instance failed.
The steps to find the logs for AutoScalingGroups are,
Step 1: Login to AWS Console, and select Elastic Kubernetes Service
Step 2: Select Compute
tab, and select the NodeGroup
that fails.
Step 3: Select the Autoscaling group name
from the NodeGroup's section, which will direct you to EC2 --> AutoScaling Group
page.
Step 4: Click the Tab Activity
of the AutoScaling Group
, and the Activity History
would give provide the details of the error.
- Status
- Description
- Cause
- Start Time
- End Time
Alternatively, the activities/logs can be found via CLI as well
aws autoscaling describe-scaling-activities \
--region <region> \
--auto-scaling-group-name <NodeGroup-AutoScaling-Group>
In the above error scenario, the ca-central-1d
availability zone doesn't support g4dn.xlarge
. The solution is
Step 1: Identify the Subnets of the Availability zones that supports the GPU node type. The NodeGroup Section would list all the subnets, and you can click each subnet to see which AZ it is deployed to.
Step 2: Create a NodeGroup only in the Subnets identified in the above step
aws eks create-nodegroup \
--region <region> \
--cluster-name <cluster-name> \
--nodegroup-name <nodegroup-name> \
--scaling-config minSize=10,maxSize=10,desiredSize=10 \
--ami-type AL2_x86_64_GPU \
--node-role <NodeGroupRole> \
--subnets <subnet-1-that-supports-gpu> <subnet-2-that-supports-gpu> \
--instance-types g4dn.xlarge \
--disk-size <disk size>