Spark Operator Numbers
A single Spark Operator tops out at a submission rate of roughly 30 jobs per minute, and the tuning options for a single operator instance are limited in the current version. To handle a large volume of workload, the best solution is to scale horizontally by running multiple Spark Operator instances. The operators do not impact one another on the EKS cluster side, but a higher number of operators increases the overhead on the API server and etcd.
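The sketch below shows one way such sharding might look: several Helm releases of the operator, each watching its own job namespace, installed from a Python script. The release names, namespaces, and chart repository alias are assumptions, and the value key for the watched job namespaces differs across chart versions (`sparkJobNamespace` in older charts, `spark.jobNamespaces` in newer ones), so check the chart you actually deploy.

```python
import subprocess

# Illustrative sketch: shard the workload across several Spark Operator
# installations, each watching a different job namespace. Release names,
# namespaces, and chart values are assumptions, not fixed requirements.
OPERATOR_SHARDS = {
    "spark-operator-a": "spark-jobs-a",
    "spark-operator-b": "spark-jobs-b",
    "spark-operator-c": "spark-jobs-c",
}

for release, job_namespace in OPERATOR_SHARDS.items():
    subprocess.run(
        [
            "helm", "upgrade", "--install", release,
            "spark-operator/spark-operator",      # assumes this repo alias was added
            "--namespace", release, "--create-namespace",
            # Limit each operator instance to its own job namespace so the
            # instances never compete over the same SparkApplication objects.
            # The value key varies by chart version; adjust as needed.
            "--set", f"spark.jobNamespaces={{{job_namespace}}}",
        ],
        check=True,
    )
```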
Spark Operator's controllerThreads |
controllerThreads, also referred to as "workers", controls the number of concurrent threads the Spark Operator uses to process SparkApplication requests. Increasing this value improves the operator's throughput when handling requests.
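A minimal sketch of raising this setting through the Helm chart follows. The value key depends on the chart version (`controllerThreads` in older charts, `controller.workers` in newer ones), and the number 30 is only an illustrative starting point, not a recommendation.

```python
import subprocess

# Sketch: raise the operator's concurrent worker threads via Helm.
# "controllerThreads" is the value key in older charts; newer kubeflow
# charts use "controller.workers" instead. The value 30 is an example only.
subprocess.run(
    [
        "helm", "upgrade", "--install", "spark-operator",
        "spark-operator/spark-operator",
        "--namespace", "spark-operator",
        "--set", "controllerThreads=30",
    ],
    check=True,
)
```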
Binpacking |
Binpacking allocates pods efficiently to the available nodes in a Kubernetes cluster. Its primary goal is to optimize resource utilization by packing pods as tightly as possible onto nodes while still meeting resource requirements and constraints. This maximizes cluster efficiency, reduces cost, and improves overall performance by minimizing the number of active nodes required to run the workload. With binpacking enabled, the workload also spends fewer resources on network traffic between physical nodes, because most pods are placed on a single node at launch time. Over time, however, as node utilization starts to drop, we use Karpenter's consolidation feature to maximize pod density.
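As a rough illustration of the consolidation side, the sketch below creates a Karpenter NodePool with consolidation enabled using the kubernetes Python client. Field names follow the Karpenter v1 API and may need adjusting for other versions; the pool name, requirements, and EC2NodeClass name are assumptions.

```python
from kubernetes import client, config

config.load_kube_config()

# Sketch of a Karpenter NodePool with consolidation enabled, so nodes are
# repacked as utilization drops. Field names follow the Karpenter v1 API;
# the pool name, requirements, and EC2NodeClass name are placeholders.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "spark"},
    "spec": {
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "30s",
        },
        "template": {
            "spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",  # assumed EC2NodeClass name
                },
                "requirements": [
                    {
                        "key": "karpenter.sh/capacity-type",
                        "operator": "In",
                        "values": ["on-demand"],
                    }
                ],
            }
        },
    },
}

# NodePool is cluster-scoped, so use the cluster-scoped API call.
client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=node_pool
)
```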
Spark Operator timeToLiveSeconds |
timeToLiveSeconds defines the time-to-live (TTL), in seconds, for a SparkApplication after its termination. The SparkApplication object is garbage collected once more than timeToLiveSeconds have elapsed since it terminated.
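Below is a minimal SparkApplication sketch with a one-hour TTL, submitted with the kubernetes Python client. The namespace, image, service account, and application file are placeholder values for illustration.

```python
from kubernetes import client, config

config.load_kube_config()

# Minimal SparkApplication sketch with a one-hour TTL so the object is
# garbage collected after termination instead of accumulating in etcd.
# Namespace, image, service account, and application file are placeholders.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "pi-with-ttl", "namespace": "spark-jobs"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "apache/spark:3.5.1",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar",
        "sparkVersion": "3.5.1",
        "timeToLiveSeconds": 3600,  # GC the SparkApplication 1 hour after termination
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "1g"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
    body=spark_app,
)
```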
Spark Job Run Time |
Experimental observations indicate that Spark Operator performs better with longer-running jobs compared to shorter ones. This is likely due to the Operator's internal workqueue mechanism for managing job submissions and completions. The Spark Operator uses watchers to monitor SparkApplication status changes in the EKS cluster. Each status change triggers a task in the Operator's workqueue. Consequently, shorter jobs cause more frequent status changes, resulting in higher workqueue activity. This design suggests that a higher frequency of short-running jobs may impose greater processing overhead on the Spark Operator compared to fewer, longer-running jobs, even with equivalent total computation time. Understanding this behavior is important for optimizing Spark job scheduling in environments with varying job durations. |
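One way to observe this effect is to watch the operator's workqueue depth during a burst of short jobs. The sketch below port-forwards a metrics endpoint and prints the standard controller workqueue depth series; the namespace, service name, and metrics port (10254) are assumptions that vary by installation and operator version.

```python
import re
import subprocess
import time
import urllib.request

# Sketch: port-forward the Spark Operator's metrics endpoint and read the
# controller workqueue depth, which grows when bursts of short jobs flood
# the operator with status-change events. Namespace, service name, and
# port are assumptions; adjust them to your installation.
pf = subprocess.Popen(
    ["kubectl", "-n", "spark-operator", "port-forward",
     "svc/spark-operator", "10254:10254"]
)
try:
    time.sleep(3)  # give the port-forward a moment to establish
    body = urllib.request.urlopen("http://localhost:10254/metrics").read().decode()
    for line in body.splitlines():
        if re.match(r"^workqueue_depth\b", line):
            print(line)  # one series per controller workqueue
finally:
    pf.terminate()
```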
Number of executors in EKS Spark Operator |
With a higher number of executors per EMR on EKS job (for example, 10 versus 2), the number of objects created on the EKS cluster grows: pods, config maps, events, and so on. This eventually becomes a limiting factor for querying the etcd database and causes EKS cluster performance degradation.
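The boto3 sketch below submits an EMR on EKS job with the executor count set explicitly; the value 2 is only an illustration of keeping per-job object counts low, and the virtual cluster ID, role ARN, release label, and entry point are placeholders.

```python
import boto3

emr = boto3.client("emr-containers", region_name="us-east-1")

# Sketch: set spark.executor.instances explicitly so each job creates a
# known number of pods/configmaps/events on the cluster. All identifiers
# below are placeholders.
emr.start_job_run(
    virtualClusterId="abc123virtualcluster",
    name="executor-count-example",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRContainersJobRole",
    releaseLabel="emr-7.2.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/etl_job.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.instances=2 "
                "--conf spark.executor.cores=2 "
                "--conf spark.executor.memory=4g"
            ),
        }
    },
)
```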
Spark Job Config - InitContainers |
initContainers have a large impact on both the Spark Operator and the API server/etcd. With initContainers enabled, a job generates more events than a job without them. Running more Spark Operator instances helps absorb the extra load for jobs that need this setup, but on the etcd side it can still become a bottleneck when the workload is large.
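For illustration, the sketch below generates a driver pod template containing an init container, which is the kind of template that produces the extra event traffic described above. The image, volume names, and S3 path are assumptions.

```python
import yaml

# Illustrative driver pod template with an init container. Each init
# container adds its own status transitions, so pods built from templates
# like this emit more events than pods without init containers.
pod_template = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {
        "initContainers": [
            {
                "name": "prepare-config",
                "image": "public.ecr.aws/docker/library/busybox:1.36",  # placeholder image
                "command": ["sh", "-c", "echo preparing job config > /config/app.conf"],
                "volumeMounts": [{"name": "job-config", "mountPath": "/config"}],
            }
        ],
        "containers": [
            {
                # EMR on EKS expects the main driver container in a pod
                # template to be named spark-kubernetes-driver.
                "name": "spark-kubernetes-driver",
                "volumeMounts": [{"name": "job-config", "mountPath": "/config"}],
            }
        ],
        "volumes": [{"name": "job-config", "emptyDir": {}}],
    },
}

with open("driver-pod-template.yaml", "w") as f:
    yaml.safe_dump(pod_template, f, sort_keys=False)

# The template would then be referenced from the job's Spark configuration, e.g.:
#   --conf spark.kubernetes.driver.podTemplateFile=s3://my-bucket/templates/driver-pod-template.yaml
```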
EKS Spark Operator submission Rate |
The EMR on EKS job submission rate dictates the load placed on the API server. A higher submission rate creates more Kubernetes objects in the etcd database, which increases the etcd database size, raises etcd and API server request latency, and lowers EMR job submission throughput.
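A simple client-side throttle, sketched below, is one way to keep bursts of submissions from overwhelming the control plane. The rate of 20 jobs per minute and the submit_fn callable are assumptions for illustration.

```python
import time

def throttled_submit(jobs, submit_fn, max_jobs_per_minute=20):
    """Submit jobs through submit_fn while capping the submission rate.

    A simple client-side throttle: spacing submissions evenly keeps bursts
    from piling Kubernetes objects into etcd faster than the control plane
    can absorb them. The default of 20/min is illustrative, not a recommendation.
    """
    interval = 60.0 / max_jobs_per_minute
    for job in jobs:
        submit_fn(job)        # e.g. a wrapper around a boto3 start_job_run call
        time.sleep(interval)  # even pacing instead of a burst
```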
EMR Image pull policy |
The default image pull policy for the job submitter, driver, and executor pods is Always, which adds latency to pod creation. Unless the customer use case specifically requires it, set the image pull policy to IfNotPresent to reduce pod creation times.
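A hedged sketch of setting this through the job's Spark configuration follows; the virtual cluster ID, role ARN, release label, and entry point are placeholders.

```python
import boto3

emr = boto3.client("emr-containers", region_name="us-east-1")

# Sketch: switch the image pull policy to IfNotPresent via spark-defaults so
# cached images are reused and pods start faster. Identifiers are placeholders.
emr.start_job_run(
    virtualClusterId="abc123virtualcluster",
    name="pull-policy-example",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRContainersJobRole",
    releaseLabel="emr-7.2.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/etl_job.py",
        }
    },
    configurationOverrides={
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.kubernetes.container.image.pullPolicy": "IfNotPresent"
                },
            }
        ]
    },
)
```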
EKS K8s control plane scaling |
EKS autoscales the Kubernetes control plane, including the API server and etcd database instances, for a customer's EKS cluster based on resource consumption. To successfully run a large number of concurrent EMR on EKS jobs, the API server needs to scale up to handle the extra load. However, if factors such as webhook latency inhibit the metrics that the EKS API server autoscaler relies on, the control plane may not scale up properly. This impacts the health of the API server and etcd database and lowers the throughput of successfully completed jobs.
EKS etcd db size |
As you submit more concurrent EMR on EKS jobs, the number of Kubernetes objects stored in the etcd database grows, and the etcd database size grows with it. An increased etcd database size adds latency to API server requests that require cluster-wide or namespace-wide etcd read calls and reduces EMR job submission throughput. The upper bound on etcd database size specified by EKS is 8 GB, and reaching this capacity can put the EKS cluster into read-only mode. Customers should monitor their etcd database size and keep it within limits; we recommend keeping it below 7 GB. In addition, because the Spark Operator does not store metadata for all running jobs, an unhealthy or crashed etcd/API server can cause some jobs to fail or lose their running state under the Spark Operator.
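A monitoring sketch follows, reading the database size from the API server's metrics endpoint and warning near the 7 GB recommendation above. The metric name depends on the Kubernetes version: recent versions expose apiserver_storage_size_bytes, while older ones expose etcd_db_total_size_in_bytes, so the script checks both.

```python
import re
import subprocess

# Sketch: read the etcd database size from the API server metrics endpoint
# and warn when it approaches the recommended 7 GB ceiling. Metric names
# vary by Kubernetes version, so both known names are checked.
WARN_BYTES = 7 * 1024**3

metrics = subprocess.run(
    ["kubectl", "get", "--raw", "/metrics"],
    capture_output=True, text=True, check=True,
).stdout

sizes = []
for line in metrics.splitlines():
    m = re.match(
        r"^(apiserver_storage_size_bytes|etcd_db_total_size_in_bytes)(\{.*\})?\s+([0-9.eE+]+)",
        line,
    )
    if m:
        sizes.append(float(m.group(3)))

if sizes:
    db_bytes = max(sizes)
    print(f"etcd db size: {db_bytes / 1024**3:.2f} GiB")
    if db_bytes > WARN_BYTES:
        print("WARNING: etcd db size is above the recommended 7 GB threshold")
```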
EKS VPC Subnets IP pool |
The available IP addresses in the VPC subnets configured for the EKS cluster also impact EMR on EKS job throughput. Each pod needs to be assigned an IP address, so a large enough IP address pool in the cluster's VPC subnets is essential to achieve higher pod concurrency. Exhausting the IP addresses causes new pod creation requests to fail.
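The sketch below reports the free IP addresses in the cluster's subnets with boto3; the subnet IDs are placeholders for the subnets actually attached to the cluster.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Sketch: report the free IP addresses in the subnets used by the EKS
# cluster. The subnet IDs below are placeholders; substitute the subnets
# configured for your cluster and VPC CNI.
SUBNET_IDS = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]

resp = ec2.describe_subnets(SubnetIds=SUBNET_IDS)
for subnet in resp["Subnets"]:
    print(
        f"{subnet['SubnetId']} ({subnet['CidrBlock']}): "
        f"{subnet['AvailableIpAddressCount']} IPs available"
    )
```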
EKS Cluster version |
EKS has made improvements in cluster versions above 1.28 that result in higher job throughput for EMR on EKS jobs. These recommendations are based on EKS cluster version 1.30.
Pod template size |
Large pod templates, for example from a high number of sidecar containers, init containers, or numerous environment variables, increase usage of the etcd database. This increased usage can limit the number of pods that can be stored in etcd and may impact the cluster's overall performance, including the rate at which EMR jobs can be submitted.
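As a rough sanity check, the serialized size of a pod template gives a sense of how much each pod contributes to etcd. The sketch below measures an example template; the dict stands in for a real driver or executor pod template loaded from your job configuration.

```python
import json

# Rough sketch: estimate how much a pod template contributes to etcd by
# measuring its serialized size. The template below is a stand-in with many
# environment variables to illustrate how quickly the size grows.
pod_template = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {
        "containers": [
            {
                "name": "spark-kubernetes-driver",
                "env": [
                    {"name": f"APP_SETTING_{i}", "value": "x" * 64}
                    for i in range(50)
                ],
            }
        ]
    },
}

size_bytes = len(json.dumps(pod_template).encode("utf-8"))
print(f"approximate serialized pod template size: {size_bytes} bytes")
```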