
Price Performance

In the scope of this tutorial, "price-performance" refers to the monetary cost of running a given workload at a specific level of performance, measured as execution time in seconds. Evaluating price-performance is essential for understanding the impact of factors that are not easily quantifiable, such as deployment architectures, competitive offerings, container allocation strategies, and processing engines.
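
Throughout the examples that follow, cost is effectively treated as runtime multiplied by the number of nodes and a per-node price per second. The helper below is a minimal sketch of that assumed cost model; the function name and the per-node-second pricing are illustrative assumptions for this tutorial, not an AWS pricing API.

```python
def run_cost(runtime_s: float, num_nodes: int, dollars_per_node_second: float) -> float:
    """Total cost of one benchmark run under a simple pricing model:
    runtime (seconds) x number of nodes x price per node per second.
    This is an assumption used for illustration, not actual EMR pricing."""
    return runtime_s * num_nodes * dollars_per_node_second


# A 30-second run on 10 nodes at $1.00 per node-second costs $300.
print(run_cost(runtime_s=30, num_nodes=10, dollars_per_node_second=1.0))
```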

For variables that are within our control, such as infrastructure sizing or application settings, keeping them identical across all benchmarks is essential for accurate comparisons.

The following examples highlight the importance of price-performance.

Example 1: Customer wants to compare Open Source Software (OSS) Spark vs EMR Spark with different cluster sizes

|             | Cluster #1        | Cluster #2        |
|-------------|-------------------|-------------------|
| Runtime (s) | 12                | 30                |
| # of nodes  | 50                | 10                |
| Engine      | OSS Spark Runtime | EMR Spark Runtime |
| Cost ($)    | 600               | 300               |

In the above example, Cluster #1 runs OSS Spark and completes in 12 s with 50 nodes, while Cluster #2 runs EMR Spark and completes in 30 s with 10 nodes. However, when we look at total cost, Cluster #2 is cheaper than Cluster #1, making it the better option. Comparing cost in relation to the work being done accounts for the differences in the # of nodes and the engine. Assuming performance scales linearly, let's look at what happens when we increase the # of nodes in Cluster #2.
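
To make the linear-scaling step concrete, here is a short sketch that projects Cluster #2 from 10 to 50 nodes. It assumes perfectly linear scaling and a $1 per node-second rate (implied by the costs in the table above); both are assumptions for illustration.

```python
# Example 1 figures, assuming $1 per node-second for both clusters.
oss_runtime_s, oss_nodes = 12, 50
emr_runtime_s, emr_nodes = 30, 10

oss_cost = oss_runtime_s * oss_nodes * 1.0   # $600
emr_cost = emr_runtime_s * emr_nodes * 1.0   # $300

# Project Cluster #2 to 50 nodes, assuming perfectly linear scaling.
target_nodes = 50
projected_runtime_s = emr_runtime_s * emr_nodes / target_nodes   # 6 s
projected_cost = projected_runtime_s * target_nodes * 1.0        # still $300

print(oss_cost, emr_cost, projected_runtime_s, projected_cost)
```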

Example 2: Customer wants to compare Open Source Software (OSS) Spark vs EMR Spark with same cluster sizes

|             | Cluster #1        | Cluster #2        |
|-------------|-------------------|-------------------|
| Runtime (s) | 12                | 6                 |
| # of nodes  | 50                | 50                |
| Engine      | OSS Spark Runtime | EMR Spark Runtime |
| Cost ($)    | 600               | 300               |

After increasing the # of nodes to be the same across both clusters, the runtime on Cluster #2 drops to 6 seconds and the cost remains $300. Our conclusion from the first example holds: Cluster #2 is the better option from a price-performance perspective.

It’s important to note that price-performance is not always linear. This is often seen when workloads have data skew. In these cases, adding more compute does not reduce runtime proportionally and only adds cost.

Example 3: Same workload across different # of nodes - data skew

|             | Run #1            | Run #2            |
|-------------|-------------------|-------------------|
| Runtime (s) | 100               | 75                |
| # of nodes  | 10                | 20                |
| Engine      | EMR Spark Runtime | EMR Spark Runtime |
| Cost ($)    | 1000              | 1500              |

In the above example, performance is not linear. While runtime dropped to 75 s, the overall cost increased. In cases like this, it’s important to ensure the # of nodes is the same on both sides of the comparison.
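
The sketch below quantifies why Example 3 is not linear: doubling the nodes should ideally halve the runtime, but with skew the observed speedup falls short, so cost goes up. The $1 per node-second rate is an assumption backed out of Run #1's cost.

```python
# Example 3 figures, assuming $1 per node-second (implied by Run #1: 100 s x 10 nodes = $1000).
run1_runtime_s, run1_nodes, run1_cost = 100, 10, 1000
run2_runtime_s, run2_nodes, run2_cost = 75, 20, 1500

ideal_runtime_s = run1_runtime_s * run1_nodes / run2_nodes  # 50 s if scaling were linear
observed_speedup = run1_runtime_s / run2_runtime_s          # ~1.33x
ideal_speedup = run2_nodes / run1_nodes                     # 2x
scaling_efficiency = observed_speedup / ideal_speedup       # ~0.67

print(f"ideal runtime: {ideal_runtime_s} s, scaling efficiency: {scaling_efficiency:.2f}")
print(f"cost change: ${run2_cost - run1_cost}")             # +$500 for a 25 s improvement
```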

Another scenario where price-performance is useful is when comparing different pricing models or vendors. Take the example below:

Example 4: Same workload across different pricing models

|                | EMR Spark Runtime | Vendor |
|----------------|-------------------|--------|
| Runtime (s)    | 50                | 40     |
| # of nodes     | 10                | 10     |
| $/s (per node) | 1                 | 1.5    |
| Cost ($)       | 500               | 600    |

In the above example, the same workload runs in 40 s on the vendor's platform, while EMR Spark completes it in 50 s. Although the vendor may seem faster, when we factor in price-performance we see that total cost is lower with EMR. If runtime is a key requirement, we can increase the # of nodes in proportion to the required performance, as illustrated in Example 5.

Example 5: Same workload across different pricing models with different # of nodes

|                | EMR Spark Runtime | EMR Spark Runtime (linear performance) | Vendor |
|----------------|-------------------|----------------------------------------|--------|
| Runtime (s)    | 50                | 25                                     | 40     |
| # of nodes     | 10                | 20                                     | 10     |
| $/s (per node) | 1                 | 1                                      | 1.5    |
| Cost ($)       | 500               | 500                                    | 600    |
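
As a check on Example 5, the sketch below applies the same linear-scaling assumption to the EMR cluster and compares it with the vendor, using the per-node-second rates from the table (again, an assumed pricing model for illustration).

```python
# Example 4/5 figures.
emr_runtime_s, emr_nodes, emr_rate = 50, 10, 1.0              # $1.00 per node-second (assumed)
vendor_runtime_s, vendor_nodes, vendor_rate = 40, 10, 1.5     # $1.50 per node-second (assumed)

emr_cost = emr_runtime_s * emr_nodes * emr_rate               # $500
vendor_cost = vendor_runtime_s * vendor_nodes * vendor_rate   # $600

# Scale the EMR cluster to 20 nodes, assuming linear performance.
scaled_nodes = 20
scaled_runtime_s = emr_runtime_s * emr_nodes / scaled_nodes   # 25 s
scaled_cost = scaled_runtime_s * scaled_nodes * emr_rate      # still $500

print(emr_cost, vendor_cost, scaled_runtime_s, scaled_cost)
```

Under that linear-scaling assumption, EMR completes the workload in 25 s for $500, beating the vendor on both runtime and cost.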

The goal with benchmarking should always be to make like-for-like comparisons. This is especially true for factors you can control, such as application configuration settings (e.g., executor sizes), input and output datasets, cluster size, and instance types. However, factors such as vendor/AWS pricing models, engine optimizations, and schedulers cannot be made the same. For those, it’s important to use price-performance as a key comparison metric.