
Custom Scheduler for Binpacking

Binpacking directly influences the cost efficiency of your EKS cluster. Both Cluster Autoscaler and Karpenter, as node provisioners, can efficiently scale nodes in and out, but without a binpacking custom scheduler you might end up with underutilized nodes caused by inefficient pod distribution at pod scheduling time.

  • Better Binpacking: When pods are tightly packed onto fewer nodes, the node provisioner (Cluster Autoscaler or Karpenter) can more effectively scale down unused nodes, reducing costs.
  • Poor Binpacking: If pods are spread inefficiently across many nodes, the node provisioner may struggle to find opportunities to scale down, leading to resource waste.

Demonstration

Poor Binpacking with the default kube-scheduler:

Better Binpacking with a Custom Scheduler:

Before we tackle the binpacking issue in Amazon EMR on EKS, let’s install a node viewer tool, eks-node-viewer. It provides enhanced visibility into EKS node utilization, helping with real-time monitoring, debugging, and optimization. In this case, we use it to visualize dynamic node and pod allocation within an EKS cluster to track binpacking performance.

Install eks-node-viewer

eks-node-viewer: https://github.com/awslabs/eks-node-viewer

Homebrew

brew tap aws/tap
brew install eks-node-viewer

Manual

go install github.com/awslabs/eks-node-viewer/cmd/eks-node-viewer@latest

Quick Start:

eks-node-viewer --resources cpu,memory

Install Bin-packing custom scheduler

The kube-scheduler scheduling plugin NodeResourcesFit provides two scoring strategies that support bin packing of resources: MostAllocated and RequestedToCapacityRatio. We created a custom scheduler based on the MostAllocated strategy. See the Kubernetes Resource Bin Packing documentation for more details.
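
For reference, the following is a minimal sketch of what the alternative RequestedToCapacityRatio strategy would look like inside a scheduler profile; the manifest below uses MostAllocated instead, and the shape points here are only illustrative:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: NodeResourcesFitArgs
          scoringStrategy:
            type: RequestedToCapacityRatio
            # Score nodes higher as their requested/capacity ratio approaches 100%
            requestedToCapacityRatio:
              shape:
                - utilization: 0
                  score: 0
                - utilization: 100
                  score: 10
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1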

The following custom scheduler, named “my-scheduler”, is created for EKS v1.28, because the kube-scheduler container image version must match the EKS version. The KubeSchedulerConfiguration API became stable (v1) in Kubernetes 1.25; clusters prior to 1.25 should use kubescheduler.config.k8s.io/v1beta2. Adjust both accordingly if your EKS version is different.

We do not recommend building kube-scheduler yourself; instead, leverage the eks-distro kube-scheduler image. For example:

  • Amazon EKS 1.28 image: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.28.11-eks-1-28-latest
  • Amazon EKS 1.29 image: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.29.6-eks-1-29-18
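
If you are unsure which image tag matches your cluster, check the cluster’s Kubernetes version first. A quick sketch using the AWS CLI; replace the cluster name placeholder with your own:

# Print the Kubernetes version of the cluster (e.g. "1.28") so you can
# pick the matching eks-distro kube-scheduler image tag.
aws eks describe-cluster --name <YOUR_CLUSTER_NAME> --query "cluster.version" --output text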

Run the following command against EKS v1.28:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-scheduler
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: my-scheduler
rules:
- apiGroups:
  - ""
  - events.k8s.io
  resources:
  - events
  verbs:
  - create
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - list
  - update
- apiGroups:
  - coordination.k8s.io
  resourceNames:
  - kube-scheduler
  resources:
  - leases
  verbs:
  - get
  - update
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - create
- apiGroups:
  - ""
  resourceNames:
  - kube-scheduler
  resources:
  - endpoints
  verbs:
  - get
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - delete
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - pods/binding
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - replicationcontrollers
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  - extensions
  resources:
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  - persistentvolumes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - storage.k8s.io
  resources:
  - csinodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - csidrivers
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - csistoragecapacities
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-scheduler-as-kube-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: my-scheduler
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-scheduler-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: my-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-scheduler-config
  namespace: kube-system
data:
  my-scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - pluginConfig:
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1
              kind: NodeResourcesFitArgs
              scoringStrategy:
                  resources:
                      - name: cpu
                        weight: 1
                      - name: memory
                        weight: 1
                  type: MostAllocated
            name: NodeResourcesFit
        plugins:
          score:
              enabled:
                  - name: NodeResourcesFit
                    weight: 1
              disabled:
                  - name: "*"
          multiPoint:
              enabled:
                  - name: NodeResourcesFit
                    weight: 1
        schedulerName: my-scheduler
    leaderElection:
      leaderElect: true
      resourceNamespace: kube-system
      resourceName: my-scheduler
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: my-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 2
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/nodepool
                operator: DoesNotExist
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                component: scheduler
                tier: control-plane
            topologyKey: kubernetes.io/hostname
      serviceAccountName: my-scheduler
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --bind-address=0.0.0.0
        - --config=/etc/kubernetes/my-scheduler/my-scheduler-config.yaml
        - --v=5
        image: public.ecr.aws/eks-distro/kubernetes/kube-scheduler:v1.28.11-eks-1-28-latest
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
        resources:
          requests:
            cpu: '1'
        securityContext:
          privileged: false
        volumeMounts:
          - name: config-volume
            mountPath: /etc/kubernetes/my-scheduler
      hostNetwork: false
      hostPID: false
      volumes:
        - name: config-volume
          configMap:
            name: my-scheduler-config
EOF
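
Once applied, you can verify the scheduler is healthy with standard kubectl commands; one of the two replicas should hold the leader-election lease defined in the ConfigMap above:

# Both replicas should be Running (they land on non-Karpenter nodes due to the node affinity)
kubectl -n kube-system get pods -l component=scheduler,tier=control-plane

# The leader-election lease created by the custom scheduler
kubectl -n kube-system get lease my-scheduler

# Scheduler logs (verbose because of --v=5)
kubectl -n kube-system logs deployment/my-scheduler --tail=50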

Validate the Custom Scheduler

  • Step 1: Launch the node viewer in a terminal:
eks-node-viewer --resources cpu,memory
  • Step 2: Add the custom scheduler to an EMR on EKS job, either via a Spark configuration or via pod templates. For example:

sample-job.sh (last line)

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name $app_name \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-7.2.0-latest \
--job-driver '{
"sparkSubmitJobDriver": {
    "entryPoint": "local:///usr/lib/spark/examples/jars/spark-examples.jar", 
    "entryPointArguments": ["1000000"],
    "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.executor.instances=10" }}' \
--configuration-overrides '{
"applicationConfiguration": [
    {
    "classification": "spark-defaults", 
    "properties": {
        "spark.kubernetes.node.selector.provisioner": "karpenter-binpack-test",
        "spark.kubernetes.scheduler.name": "my-scheduler" }}]}'

OR sample-pod-template.yaml (line #3)

kind: Pod
spec:
  schedulerName: my-scheduler
  nodeSelector:
    provisioner: karpenter-binpack-test
  volumes:
    # EKS automatically mounts EBS or NVMe SSD volumes via
    # https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2/runtime/bootstrap.sh 
    - name: spark-local-dir-1
      emptyDir: {}
  containers:
  - name: spark-kubernetes-executor
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /data1
        readOnly: false    
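
If you use the pod template route, the template file must be reachable by the job. A sketch of the corresponding spark-defaults properties, assuming the template has been uploaded to an S3 bucket of yours (bucket and key are placeholders):

"applicationConfiguration": [
    {
    "classification": "spark-defaults",
    "properties": {
        "spark.kubernetes.driver.podTemplateFile": "s3://<YOUR_BUCKET>/sample-pod-template.yaml",
        "spark.kubernetes.executor.podTemplateFile": "s3://<YOUR_BUCKET>/sample-pod-template.yaml" }}]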

Step 3: Monitor via eks-node-viewer:

Before applying the change in the pod template:

After the change:

  • Higher resource usage per node at pod scheduling time
  • Over 50% cost reduction, since Karpenter was terminating idle EC2 nodes at the same time

Considerations:

  1. At pod launch, the custom scheduler can optimize resource utilization by fitting as many pods as possible onto a single EC2 node. This approach minimizes pod distribution across multiple nodes, leading to higher resource utilization and improved cost efficiency.
  2. Dynamic Resource Allocation (DRA) requires shuffle tracking to be enabled because Spark on Kubernetes does not yet support an external shuffle service. Consequently, we have observed that idle EC2 nodes sometimes fail to scale down due to the shuffle data tracking requirement. Using a binpacking scheduler can help reduce the likelihood of long-running idle nodes with leftover shuffle data, resulting in a better DRA outcome (the relevant Spark properties are sketched after this list).
  3. Karpenter’s consolidation policy remains essential for reducing underutilized or empty nodes. This is particularly beneficial for long-running jobs where pods are scattered across underutilized EC2 nodes.
  4. After implementing a custom scheduler for binpacking, it's recommended to adjust the consolidation frequency to be less aggressive (a NodePool fragment is sketched after this list).
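
For consideration 2, shuffle tracking is enabled through standard Spark properties. A minimal sketch for the spark-defaults classification; the timeout value is only an illustrative starting point:

"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.shuffleTracking.enabled": "true",
"spark.dynamicAllocation.shuffleTracking.timeout": "300s"

For consideration 4, consolidation frequency is controlled on the Karpenter side. A fragment of a NodePool spec, assuming the Karpenter v1 API; merge it into your existing NodePool rather than applying it standalone, and treat the 5-minute value as a starting point to tune:

spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Wait before consolidating so the binpacking scheduler has time to
    # fill existing nodes instead of Karpenter reshuffling them immediately.
    consolidateAfter: 5m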