Optimizing performance
This section describes optimization suggestions to try on Graviton-based instances to attain higher performance for your service. Each sub-section gives recommendations that can help improve performance if you see a particular signature after measuring performance using the previous checklists.
Optimizing for large instruction footprint
- On C/C++ applications, -flto, -Os, and Feedback Directed Optimization can help with code layout using GCC.
- On Java, -XX:-TieredCompilation, -XX:ReservedCodeCacheSize and -XX:InitialCodeCacheSize can be tuned to reduce the pressure the JIT places on the instruction footprint. The JDK sets up a 256MB region for the code cache by default, which over time can fill and become fragmented, leaving live code sparse.
  - We recommend setting the code cache initially to:
        -XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M
    and then tuning the size up or down as required.
  - Experiment with setting -XX:+TieredCompilation to gain faster start-up time and better optimized code.
  - When tuning the JVM code cache, watch for "code cache full" error messages in the logs indicating that the cache has been set too small. A full code cache can lead to worse performance.
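The starting point for the Java code cache can be scripted so that the size is easy to vary between runs. The sketch below is only illustrative: the function names `code_cache_flags` and `code_cache_full` are hypothetical helpers, and the log pattern is an assumption (HotSpot typically prints a warning containing "CodeCache is full", but the exact wording may differ between JDK versions).

```shell
# Sketch: build the suggested code-cache flags for a given size in MB and
# scan a JVM log for signs that the cache was set too small.

# code_cache_flags [SIZE_MB] -> prints the three flags on one line (default 64).
code_cache_flags() {
    size_mb="${1:-64}"
    printf '%s %s %s\n' \
        "-XX:-TieredCompilation" \
        "-XX:ReservedCodeCacheSize=${size_mb}M" \
        "-XX:InitialCodeCacheSize=${size_mb}M"
}

# code_cache_full LOGFILE -> exit status 0 if the log mentions a full code cache.
# The pattern is an assumption about the message wording, matched loosely.
code_cache_full() {
    grep -qiE 'code ?cache (is )?full' "$1"
}

# Example use (app.jar and jvm.log are placeholder names):
#   java $(code_cache_flags 64) -jar app.jar > jvm.log 2>&1
#   code_cache_full jvm.log && echo "code cache too small, try a larger size"
```

If the check fires, rerun with a larger size (for example `code_cache_flags 128`) and compare performance again.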
Optimizing for high TLB miss rates
A TLB (translation lookaside buffer) is a cache that holds recent virtual address to physical address translations for the CPU to use. Reducing the number of misses in this cache can improve application performance.
- Enable Transparent Huge Pages (THP):
      echo always > /sys/kernel/mm/transparent_hugepage/enabled
  or
      echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
- On Linux kernels >=6.9, Transparent Huge Pages (THP) has been extended with folios that create 16kB and 64kB huge pages in addition to 2MB pages. This allows the Linux kernel to use huge pages in more places to increase performance by reducing TLB pressure. Each folio size can be set to inherit to use the top-level THP setting, or set independently to select the sizes to use. Each folio size can also be set to never, always or madvise.
  - To use 16kB pages:
        echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
  - To use 64kB pages:
        echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  - To use 2MB pages:
        echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
- If your application can use pinned hugepages because it uses mmap directly, try reserving huge pages directly via the OS. This can be done by two methods.
  - At runtime:
        sysctl -w vm.nr_hugepages=X
  - At boot time by specifying on the kernel command line in /etc/default/grub:
        hugepagesz=2M hugepages=512
- For Java, hugepages can be used for both the code-heap and data-heap by adding the below flags to your JVM command line:
  - -XX:+UseTransparentHugePages when THP is set to at least madvise
  - -XX:+UseLargePages if you have pre-allocated huge pages through sysctl or the kernel command line.
Using huge-pages should generally improve performance on all EC2 instance types, but there can be cases where using exclusively huge-pages may lead to performance degradation. Therefore, it is always recommended to fully test your application after enabling and/or allocating huge-pages.
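Before and after changing the settings above, it can help to record which THP mode is actually active, since the sysfs files list every option and mark the selected one in square brackets. The following is a minimal sketch; `thp_active_mode` and `report_thp` are hypothetical helper names, and the optional directory argument exists only so the functions can be pointed at a copy of the sysfs tree.

```shell
# Sketch: report the active THP mode and, where present, the per-size modes.

# thp_active_mode FILE -> prints the bracketed (active) value, e.g. "madvise"
# for a file containing "always [madvise] never".
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$1"
}

# report_thp [BASE_DIR] -> prints the top-level mode and one line per folio size.
# Per-size directories only exist on kernels that support them.
report_thp() {
    base="${1:-/sys/kernel/mm/transparent_hugepage}"
    if [ -r "$base/enabled" ]; then
        echo "THP: $(thp_active_mode "$base/enabled")"
    fi
    for f in "$base"/hugepages-*kB/enabled; do
        [ -r "$f" ] || continue          # glob did not match, or file unreadable
        size=$(basename "$(dirname "$f")")
        echo "${size#hugepages-}: $(thp_active_mode "$f")"
    done
}

# Example use: report_thp
```

Keeping this output alongside benchmark results makes it easier to tell which huge page configuration a given measurement was taken with.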
Porting and optimizing assembly routines
- If you need to port an optimized routine that uses x86 vector instruction intrinsics to Graviton’s vector instructions (called NEON instructions), you can use the SSE2NEON library to assist in the porting. While SSE2NEON won’t produce optimal code, it generally gets close enough to reduce the performance penalty of not using the vector intrinsics.
- For additional information on the vector instructions used on Graviton
Optimizing synchronization heavy workloads
- Look for specialized back-off routines for custom locks tuned using x86 PAUSE or the equivalent x86 rep; nop sequence. Graviton2 should use a single ISB instruction as a drop-in replacement. For an example and explanation, see the recent commit to the Wired Tiger storage layer.
- If a locking routine tries to acquire a lock in a fast path before forcing the thread to sleep via the OS to wait, try modifying the fast path to attempt the acquisition a few additional times before taking the slow path. An example of this is in the Finagle code-base, where on Graviton2 we spin longer for a lock before sleeping.
- If you do not intend to run your application on Graviton1, try compiling your code on GCC using -march=armv8.2-a instead of using -moutline-atomics to reduce the overhead of using synchronization builtins.
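The choice between the two GCC flags in the last item can be automated in a build script. The sketch below assumes the build host has the same CPU generation as the oldest machine you deploy to, which may not hold for your fleet. It relies on the `atomics` entry in the `Features` line of `/proc/cpuinfo`, which Linux on arm64 reports when the CPU supports the Armv8.1 atomic instructions; `atomics_cflags` is a hypothetical helper name.

```shell
# Sketch: pick the atomics-related GCC flag based on the CPU features of the
# build host. Assumes the build host matches the oldest deployment target.

# atomics_cflags [CPUINFO] -> prints "-march=armv8.2-a" when the CPU reports
# the "atomics" feature, otherwise "-moutline-atomics".
atomics_cflags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    if grep -E '^Features' "$cpuinfo" 2>/dev/null | grep -qw atomics; then
        echo "-march=armv8.2-a"
    else
        echo "-moutline-atomics"
    fi
}

# Example use (lock.c is a placeholder file name):
#   gcc -O2 $(atomics_cflags) -c lock.c
```

On a host without the `atomics` feature, the fallback keeps the binary runnable on Graviton1 at the cost of the extra indirection that outline atomics add.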
Network heavy workload optimizations
- Check ENA device tunings with ethtool -c ethN, where N is the device number, and check the Adaptive RX setting. By default on instances without extra ENI’s this will be eth0.
  - Set to ethtool -C ethN adaptive-rx off for a latency sensitive workload.
  - ENA tunings via ethtool can be made permanent by editing /etc/sysconfig/network-scripts/ifcfg-ethN files.
- Disable irqbalance from dynamically moving IRQ processing between vCPUs and set dedicated cores to process each IRQ. Example script below:
# Assign eth0 ENA interrupts to the first N-1 cores
systemctl stop irqbalance
irqs=$(grep "eth0-Tx-Rx" /proc/interrupts | awk -F':' '{print $1}')
cpu=0
for i in $irqs; do
    # Pin each Tx/Rx queue interrupt to its own core, in order
    echo $cpu > /proc/irq/$i/smp_affinity_list
    cpu=$((cpu + 1))
done
- Disable Receive Packet Steering (RPS) to avoid contention and extra IPIs.
  - Run cat /sys/class/net/ethN/queues/rx-N/rps_cpus and verify they are set to 0. In general RPS is not needed on Graviton2 and newer.
  - You can try using RPS if your situation is unique. Read the documentation on RPS to understand further how it might help. Also refer to Optimizing network intensive workloads on Amazon EC2 A1 Instances for concrete examples.
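The Adaptive RX and RPS checks above can be gathered into a small read-only script. This is a sketch under a few assumptions: the `ethtool -c` output is assumed to contain a line of the form `Adaptive RX: <value>  TX: <value>`, a disabled RPS mask is assumed to consist only of zeros (with commas on wide masks), and the helper names `adaptive_rx_state` and `rps_enabled_queues` are hypothetical.

```shell
# Sketch: read-only checks for the ENA tunings discussed above.

# adaptive_rx_state -> reads `ethtool -c` output on stdin and prints the
# Adaptive RX value (for example "on" or "off").
adaptive_rx_state() {
    sed -n 's/^Adaptive RX: *\([^ ]*\).*/\1/p'
}

# rps_enabled_queues [DEV] [SYSFS_NET] -> prints each rx queue of DEV whose
# rps_cpus mask is non-zero; prints nothing when RPS is disabled everywhere.
rps_enabled_queues() {
    dev="${1:-eth0}"
    net="${2:-/sys/class/net}"
    for f in "$net/$dev"/queues/rx-*/rps_cpus; do
        [ -r "$f" ] || continue
        # Strip zeros, commas and newlines; anything left means a CPU bit is set.
        if tr -d '0,\n' < "$f" | grep -q .; then
            basename "$(dirname "$f")"
        fi
    done
}

# Example use:
#   ethtool -c eth0 | adaptive_rx_state
#   rps_enabled_queues eth0
```

Neither function changes any setting, so the script can be run on a production instance to record its current state before experimenting.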
Metal instance IO optimizations
- If on Graviton2 and newer metal instances, try disabling the System MMU (Memory Management Unit) to speed up IO handling:
%> cd ~/aws-graviton-getting-started/perfrunbook/utilities
# Configure the SMMU to be off on metal, which is the default on x86.
# Leave the SMMU on if you require the additional security protections it offers.
# Virtualized instances do not expose an SMMU to instances.
%> sudo ./configure_graviton_metal_iommu.sh off
%> sudo shutdown now -r