Kubernetes Data Plane¶
The Kubernetes Data Plane includes EC2 instances, load balancers, storage, and other APIs used by the Kubernetes Control Plane. For organization purposes we grouped cluster services in a separate page and load balancer scaling can be found in the workloads section. This section will focus on scaling compute resources.
Selecting EC2 instance types is possibly one of the hardest decisions customers face because in clusters with multiple workloads. There is no one-size-fits all solution. Here are some tips to help you avoid common pitfalls with scaling compute.
Automatic node autoscaling¶
Managed node groups will give you the flexibility of Amazon EC2 Auto Scaling groups with added benefits for managed upgrades and configuration. It can be scaled with the Kubernetes Cluster Autoscaler and is a common option for clusters that have a variety of compute needs.
Karpenter is an open source, workload-native node autoscaler created by AWS. It scales nodes in a cluster based on the workload requirements for resources (e.g. GPU) and taints and tolerations (e.g. zone spread) without managing node groups. Nodes are created directly from EC2 which avoids default node group quotas—450 nodes per group—and provides greater instance selection flexibility with less operational overhead. We recommend customers use Karpenter when possible.
Use many different EC2 instance types¶
Each AWS region has a limited number of available instances per instance type. If you create a cluster that uses only one instance type and scale the number of nodes beyond the capacity of the region you will receive an error that no instances are available. To avoid this issue you should not arbitrarily limit the type of instances that can be use in your cluster.
Karpenter will use a broad set of compatible instance types by default and will pick an instance at provisioning time based on pending workload requirements, availability, and cost. You can broaden the list of instance types used in the
karpenter.k8s.aws/instance-category key of the provisioner.
The Kubernetes Cluster Autoscaler requires node groups to be similarly sized so they can be consistently scaled. You should create multiple groups based on CPU and memory size and scale them independently. Use the ec2-instance-selector to identify instances that are similarly sized for your node groups.
Prefer larger nodes to reduce API server load¶
When deciding what instance types to use, fewer, large nodes will put less load on the Kubernetes Control Plane because there will be fewer kubelets and DaemonSets running. However, large nodes may not be utilized fully like smaller nodes. Node sizes should be evaluated based on your workload availability and scale requirements.
A cluster with three u-24tb1.metal instances (24 TB memory and 448 cores) has 3 kublets, and would be limited to 110 pods per node by default. If your pods use 4 cores each then this might be expected (4 cores x 110 = 440 cores/node). With a 3 node cluster your ability to handle an instance incident would be low because 1 instance outage could impact 1/3 of the cluster. You should specify node requirements and pod spread in your workloads so the Kubernetes scheduler can place workloads properly.
Workloads should define the resources they need and the availability required via taints, tolerations, and PodTopologySpread. They should prefer the largest nodes that can be fully utilized and meet availability goals to reduce control plane load, lower operations, and reduce cost.
The Kubernetes Scheduler will automatically try to spread workloads across availablility zones and hosts if resources are available. If no capacity is available the Kubernetes Cluster Autoscaler will attempt to add nodes in each Availability Zone evenly. Karpenter will attempt to add nodes as quickly and cheaply as possible unless the workload specifies other requirements.
To force workloads to spread with the scheduler and new nodes to be created across availability zones you should use topologySpreadConstraints:
spec: topologySpreadConstraints: - maxSkew: 3 topologyKey: "topology.kubernetes.io/zone" whenUnsatisfiable: ScheduleAnyway labelSelector: matchLabels: dev: my-deployment - maxSkew: 2 topologyKey: "kubernetes.io/hostname" whenUnsatisfiable: ScheduleAnyway labelSelector: matchLabels: dev: my-deployment
Use similar node sizes for consistent workload performance¶
Workloads should define what size nodes they need to be run on to allow consistent performance and predictable scaling. A workload requesting 500m CPU will perform differently on an instance with 4 cores vs one with 16 cores. Avoid instance types that use burstable CPUs like T series instances.
To make sure your workloads get consistent performance a workload can use the supported Karpenter labels to target specific instances sizes.
Workloads being scheduled in a cluster with the Kubernetes Cluster Autoscaler should match a node selector to node groups based on label matching.
Use compute resources efficiently¶
Compute resources include EC2 instances and availability zones. Using compute resources effectively will increase your scalability, availability, performance, and reduce your total cost. Efficient resource usage is extremely difficult to predict in an autoscaling environment with multiple applications. Karpenter was created to provision instances on-demand based on the workload needs to maximize utilization and flexibility.
Karpenter allows workloads to declare the type of compute resources it needs without first creating node groups or configuring label taints for specific nodes. See the Karpenter best practices for more information. Consider enabling consolidation in your Karpenter provisioner to replace nodes that are under utilized.
Automate Amazon Machine Image (AMI) updates¶
Keeping worker node components up to date will make sure you have the latest security patches and compatible features with the Kubernetes API. Updating the kublet is the most important component for Kubernetes functionality, but automating OS, kernel, and locally installed application patches will reduce maintenance as you scale.
It is recommended that you use the latest Amazon EKS optimized Amazon Linux 2 or Amazon EKS optimized Bottlerocket AMI for your node image. Karpenter will automatically use the latest available AMI to provision new nodes in the cluster. Managed node groups will update the AMI during a node group update but will not update the AMI ID at node provisioning time.
For Managed Node Groups you need to update the Auto Scaling Group (ASG) launch template with new AMI IDs when they are available for patch releases. AMI minor versions (e.g. 1.23.5 to 1.24.3) will be available in the EKS console and API as upgrades for the node group. Patch release versions (e.g. 1.23.5 to 1.23.6) will not be presented as upgrades for the node groups. If you want to keep your node group up to date with AMI patch releases you need to create new launch template version and let the node group replace instances with the new AMI release.
You can find the latest available AMI from this page or use the AWS CLI.
Use multiple EBS volumes for containers¶
EBS volumes have input/output (I/O) quota based on the type of volume (e.g. gp3) and the size of the disk. If your applications share a single EBS root volume with the host this can exhaust the disk quota for the entire host and cause other applications to wait for available capacity. Applications write to disk if they write files to their overlay partition, mount a local volume from the host, and also when they log to standard out (STDOUT) depending on the logging agent used.
To avoid disk I/O exhaustion you should mount a second volume to the container state folder (e.g. /run/containerd), use separate EBS volumes for workload storage, and disable unnecessary local logging.
To mount a second volume to your EC2 instances using eksctl you can use a node group with this configuration:
managedNodeGroups: - name: al2-workers amiFamily: AmazonLinux2 desiredCapacity: 2 volumeSize: 80 additionalVolumes: - volumeName: '/dev/sdz' volumeSize: 100 preBootstrapCommands: - "systemctl stop containerd" - "mkfs -t ext4 /dev/nvme1n1" - "rm -rf /var/lib/containerd/*" - "mount /dev/nvme1n1 /var/lib/containerd/" - "systemctl start containerd"
If you are using terraform to provision your node groups please see examples in EKS Blueprints for terraform. If you are using Karpenter to provision nodes you can use
blockDeviceMappings with node user-data to add additional volumes.
To mount an EBS volume directly to your pod you should use the AWS EBS CSI driver and consume a volume with a storage class.
--- apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: ebs-sc provisioner: ebs.csi.aws.com volumeBindingMode: WaitForFirstConsumer --- apiVersion: v1 kind: PersistentVolumeClaim metadata: name: ebs-claim spec: accessModes: - ReadWriteOnce storageClassName: ebs-sc resources: requests: storage: 4Gi --- apiVersion: v1 kind: Pod metadata: name: app spec: containers: - name: app image: public.ecr.aws/docker/library/nginx volumeMounts: - name: persistent-storage mountPath: /data volumes: - name: persistent-storage persistentVolumeClaim: claimName: ebs-claim
Avoid instances with low EBS attach limits if workloads use EBS volumes¶
EBS is one of the easiest ways for workloads to have persistent storage, but it also comes with scalability limitations. Each instance type has a maximum number of EBS volumes that can be attached. Workloads need to declare what instance types they should run on and limit the number of replicas on a single instance with Kubernetes taints.
Disable unnecessary logging to disk¶
Avoid unnecessary local logging by not running your applications with debug logging in production and disabling logging that reads and writes to disk frequently. Journald is the local logging service that keeps a log buffer in memory and flushes to disk periodically. Journald is preferred over syslog which logs every line immediately to disk. Disabling syslog also lowers the total amount of storage you need and avoids needing complicated log rotation rules. To disable syslog you can add the following snippet to your cloud-init configuration:
Patch instances in place when OS update speed is a necessity¶
Patching instances in place should only be done when required. Amazon recommends treating infrastructure as immutable and thoroughly testing updates that are promoted through lower environments the same way applications are. This section applies when that is not possible.
It takes seconds to install a package on an existing Linux host without disrupting containerized workloads. The package can be installed and validated without cordoning, draining, or replacing the instance.
To replace an instance you first need to create, validate, and distribute new AMIs. The instance needs to have a replacement created, and the old instance needs to be cordoned and drained. Then workloads need to be created on the new instance, verified, and repeated for all instances that need to be patched. It takes hours, days, or weeks to replace instances safely without disrupting workloads.
Amazon recommends using immutable infrastructure that is built, tested, and promoted from an automated, declarative system, but if you have a requirement to patch systems quickly then you will need to patch systems in place and replace them as new AMIs are made available. Because of the large time differential between patching and replacing systems we recommend using AWS Systems Manager Patch Manager to automate patching nodes when required to do so.
Patching nodes will allow you to quickly roll out security updates and replace the instances on a regular schedule after your AMI has been updated. If you are using an operating system with a read-only root file system like Flatcar Container Linux or Bottlerocket OS we recommend using the update operators that work with those operating systems. The Flatcar Linux update operator and Bottlerocket update operator will reboot instances to keep nodes up to date automatically.