Karpenter Best Practices¶
Karpenter is an open-source node provisioning tool for Kubernetes. It can help improve the efficiency and cost of running large-scale data workloads. Karpenter's ability to dynamically provision optimal resources for heterogeneous workloads, efficiently manage large clusters, optimize cost through consolidation, and prioritize EC2 Spot instances makes it the preferred auto-scaler for workloads on Amazon Elastic Kubernetes Service (EKS).
We recommend using Karpenter and Bottlerocket with EKS for data workloads: this combination aligns with the EC2 Spot best practices of diversification and using an allocation strategy, and it manages the Spot instance lifecycle seamlessly. As documented in the Spark Operator with YuniKorn DoEKS blueprint, Karpenter integrates seamlessly with Kubernetes, providing automatic, real-time adjustments to the cluster size based on observed workloads and scaling events. This enables a more efficient and cost-effective EKS cluster design that adapts to the ever-changing demands of Spark applications and other data workloads.
More details can be found in these blogs [1, 2].
Cost optimization with Karpenter, Spot and Graviton¶
To achieve strong cost optimization with Spark, we recommend the following Karpenter NodePool configuration.
- key: karpenter.sh/capacity-type
  operator: In
  values:
    - on-demand
    - spot
- key: kubernetes.io/arch
  operator: In
  values:
    - amd64
    - arm64
For pod specs (under spec.affinity):
nodeAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
          - key: kubernetes.io/arch # non-deprecated form of beta.kubernetes.io/arch
            operator: In
            values:
              - arm64
With this configuration, Karpenter should select instance types in the following order, falling back to the next option immediately if capacity is not available:
- arm64 Spot
- x86 (amd64) Spot
- arm64 On-Demand
- x86 (amd64) On-Demand
Configure proper consolidation policy¶
For Spark workloads, configure the executor nodepool as follows (a minimal sketch follows this list):
- Enable pod bin packing for batch jobs
- Set Karpenter's consolidation policy to "WhenEmptyOrUnderutilized"
- Increase "consolidateAfter" if consolidation is too aggressive
- Set the expiry time to the duration of your longest-running EMR on EKS job (for example, 4 hours if that is your longest job run time across the entire workload)
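A minimal sketch of such an executor NodePool, assuming the 4-hour maximum job duration from the example above; the name and values are placeholders to adapt to your workload:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-executor # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      # set to the longest expected job duration (4h per the example above)
      expireAfter: 4h
  disruption:
    # reclaim nodes that are empty or underutilized once jobs finish
    consolidationPolicy: WhenEmptyOrUnderutilized
    # increase this if consolidation disrupts executors too aggressively
    consolidateAfter: 5m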
Capacity management and prioritizing instance types based on your workloads¶
Capacity Management
AWS offers a large number of instance types, but in some circumstances certain instance types are unavailable due to EC2 capacity constraints. We suggest configuring as many instance types as you can, especially for Spot instances. Allowing Karpenter to provision nodes from a large, diverse set of instance types helps you stay on Spot longer and lowers your costs thanks to Spot's discounted pricing. Moreover, if Spot capacity becomes constrained, this instance type diversity also increases the chances that you'll be able to continue launching On-Demand capacity for your workloads.
Multi-architecture support is also recommended for capacity management; it increases instance diversity and can improve price-performance through Graviton.
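For illustration, a requirements block like the following (a fragment, not a complete NodePool) keeps the fleet diverse by constraining broad instance categories and generations instead of pinning specific instance types; adjust the categories to your workload:

requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  - key: kubernetes.io/arch
    operator: In
    values: ["arm64", "amd64"] # multi-arch increases diversity and enables Graviton
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"] # broad categories instead of a short list of instance types
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["2"] # exclude older generations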
Prioritize instance types based on your workloads
Workloads have different requirements; for example, Spark workloads typically need a high memory-to-CPU ratio. You can create weighted NodePools to prioritize specific instance types, as in the examples below. Karpenter will first try the NodePool with the highest weight (nodepool-high-weight).
NodePool with weight 50
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: nodepool-high-weight
spec:
  template:
    metadata:
      labels:
        billing-team: my-team
      annotations:
        example.com/owner: "my-team"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws # Updated since only a single version will be served
        kind: EC2NodeClass
        name: default
      taints:
        - key: example.com/special-taint
          effect: NoSchedule
      expireAfter: Never
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
          # minValues here enforces the scheduler to consider at least that number of unique instance-category to schedule the pods.
          # This field is ALPHA and can be dropped or replaced at any time
          minValues: 2
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["r7g", "r6g"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8", "16", "32"]
        - key: "karpenter.k8s.aws/instance-hypervisor"
          operator: In
          values: ["nitro"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["2"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a", "us-west-2b"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64", "amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand", "reserved"]
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 1m # or "Never"; allows additional control over consolidation aggressiveness
    budgets:
      - nodes: 10%
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"
  limits:
    cpu: "1000"
    memory: 1000Gi
  weight: 50
NodePool with weight 10
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: nodepool-low-weight
spec:
  template:
    metadata:
      labels:
        billing-team: my-team
      annotations:
        example.com/owner: "my-team"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws # Updated since only a single version will be served
        kind: EC2NodeClass
        name: default
      taints:
        - key: example.com/special-taint
          effect: NoSchedule
      expireAfter: Never
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
          # minValues here enforces the scheduler to consider at least that number of unique instance-category to schedule the pods.
          # This field is ALPHA and can be dropped or replaced at any time
          minValues: 2
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["c5", "c6i"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8", "16", "32"]
        - key: "karpenter.k8s.aws/instance-hypervisor"
          operator: In
          values: ["nitro"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["2"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a", "us-west-2b"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64", "amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand", "reserved"]
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 1m # or "Never"; allows additional control over consolidation aggressiveness
    budgets:
      - nodes: 10%
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"
  limits:
    cpu: "1000"
    memory: 1000Gi
  weight: 10
Carefully configure resource requests and limits for workloads¶
Rightsizing and optimizing your cluster is a shared responsibility. Karpenter effectively optimizes and scales infrastructure, but the end result depends on how well you have rightsized your pod requests and any other Kubernetes scheduling constraints. Karpenter does not consider limits or resource utilization. For most workloads with non-compressible resources, such as memory, it is generally recommended to set requests==limits because if a workload tries to burst beyond the available memory of the host, an out-of-memory (OOM) error occurs. Karpenter consolidation can increase the probability of this as it proactively tries to reduce total allocatable resources for a Kubernetes cluster. For help with rightsizing your Kubernetes pods, consider exploring Kubecost, Vertical Pod Autoscaler configured in recommendation mode, or an open source tool such as Goldilocks.
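As an illustration, a Spark executor container might pin memory requests and limits to the same value while leaving CPU burstable; the sizes below are placeholders:

# Container resources in a Spark executor pod template (placeholder values)
resources:
  requests:
    cpu: "4"
    memory: 28Gi
  limits:
    memory: 28Gi # memory is non-compressible, so request == limit avoids OOM surprises after consolidation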
For each instance type, Karpenter reports the maximum allocatable resources based on some assumptions and after subtracting the instance overhead. For example, the default maximum allocatable resources for an m6g.8xlarge are as follows:
| Resource | Quantity |
|---|---|
| cpu | 31850m |
| ephemeral-storage | 17Gi |
| memory | 118253Mi |
| pods | 234 |
| vpc.amazonaws.com/pod-eni | 54 |
In some circumstances, you may want Karpenter to report more allocatable resources than the defaults, or the remaining headroom on a node may not be enough to fit one more pod. In that case you can adjust VM_MEMORY_OVERHEAD_PERCENT (for example, to 0.07); the guidance is that by tuning VM_MEMORY_OVERHEAD_PERCENT, one more pod can be scheduled on the Karpenter-provisioned node.
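One way to adjust this, assuming you install Karpenter with its Helm chart, is through the chart's settings block; verify the exact setting name against the chart version you are running:

# values.yaml for the Karpenter Helm chart
settings:
  vmMemoryOverheadPercent: 0.07 # default is 0.075; a lower value reports more allocatable memory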
Pressure testing with Karpenter¶
If you are using Karpenter in a new AWS account, or in an account that has not previously done much EC2 scaling (especially for Spark workloads), we recommend pressure testing with Karpenter before going to production so you understand your account's EC2 API throttling limits at your expected scale, because Karpenter calls the following EC2 APIs at a very high rate. A simple pressure-test sketch follows the list below.
The following EC2 API calls are made primarily by Karpenter:
- ModifyNetworkInterfaceAttribute
- CreateTags
- AssignPrivateIpAddresses
- DescribeNetworkInterfaces
- DescribeIamInstanceProfileAssociations
- DescribeInstances
- CreateFleet
- DescribeTags
- DeleteSecurityGroup
- UnassignPrivateIpAddresses
- DeleteLaunchTemplate
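As a sketch of such a pressure test, you can deploy a large number of pause pods sized so that only a few fit per node, forcing Karpenter to launch many instances at once, and then watch the Karpenter controller logs and CloudTrail for EC2 API throttling (for example, RequestLimitExceeded errors). The name, replica count, and sizes below are placeholders:

# Hypothetical scale-out test: many pause pods force Karpenter to launch a burst of nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter-pressure-test
spec:
  replicas: 500
  selector:
    matchLabels:
      app: karpenter-pressure-test
  template:
    metadata:
      labels:
        app: karpenter-pressure-test
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1" # sized so only a handful of pods fit on each node
              memory: 1Gi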
Avoid using pod anti-affinity at large scale with Karpenter¶
Karpenter needs more time to simulate pod scheduling when pod anti-affinity rules are present. Starting with Karpenter v0.32.6, there are performance improvements around hostname topologies, but we still recommend avoiding pod anti-affinity at large scale.
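If the intent of the anti-affinity rule is simply to spread pods across nodes, a topology spread constraint is a lighter-weight alternative; a sketch, with placeholder labels:

# Pod spec fragment: spread executors across nodes without pod anti-affinity
topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway # soft constraint; does not block scheduling
    labelSelector:
      matchLabels:
        app: spark-executor # placeholder label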