I use a CPU-optimized trainer to train a variety of machine-learning models. Given the cost savings reported by AWS customers (up to 30% when using Graviton instances), I decided to experiment with Graviton for my long-running CPU jobs.

Initial Results

Without any modifications to the CPU trainer or my jobs, I launched ML training jobs on Graviton instances. This initial attempt yielded an 8% cost reduction, though the jobs ran 10% slower than on x86 instances. This was a promising start, but not ideal, because some models suffer when they are refreshed less often.
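To see how a job can be both slower and cheaper, here is a minimal sketch of the arithmetic. It assumes that job cost is simply hourly price times runtime on the same number of machines, and it reads "10% slower" as a 10% longer runtime; both are my assumptions, and the function name is hypothetical.

```c
/* Sketch only: assumes job cost = hourly price x runtime, with the same
 * number of machines on both architectures. */

/* Returns the implied hourly price ratio (Graviton / x86) given the
 * observed job cost ratio and runtime ratio. */
double implied_price_ratio(double cost_ratio, double runtime_ratio) {
    return cost_ratio / runtime_ratio;
}
```

Under those assumptions, implied_price_ratio(0.92, 1.10) is roughly 0.84: the per-hour price advantage is larger than the 8% saving per job, and the longer runtime uses up part of it.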

Following AWS recommendations, I applied the optimizations described below, and they were a game changer:

Compiler Optimization

Graviton2 is built on Neoverse N1 cores, an arm64 microarchitecture designed for cloud servers. By default, when compiling binaries for arm64, the compiler only emits instructions that every arm64 CPU supports. However, arm64 covers many microarchitectures, each optimized for specific use cases. Targeting a specific microarchitecture at compile time lets the compiler use the instructions and tuning available on that CPU, resulting in better performance. This is done with the -mcpu flag and other flags that enable architecture-specific features.

For instance, AWS recommends using the -mcpu=neoverse-n1 flag for c6g instances to optimize performance. Additionally, AWS advises using the latest compiler versions whenever possible to take advantage of the most recent optimizations and features.
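One way to check what a flag such as -mcpu=neoverse-n1 changes is to look at the feature macros the compiler predefines. The sketch below is illustrative only: the function name is hypothetical, and it assumes a GCC or Clang toolchain that defines the ACLE macros __ARM_FEATURE_ATOMICS, __ARM_FEATURE_CRC32 and __ARM_FEATURE_DOTPROD when the corresponding features are enabled, which is worth verifying for your compiler version.

```c
#include <stdio.h>

/* Target architecture, taken from compiler-predefined macros. */
#if defined(__aarch64__)
#define TARGET_ARCH "arm64"
#elif defined(__x86_64__)
#define TARGET_ARCH "x86_64"
#else
#define TARGET_ARCH "other"
#endif

/* ACLE feature macros; assumed to be defined only when the matching
 * feature is enabled for the build (for example through -mcpu). */
#ifdef __ARM_FEATURE_ATOMICS
#define HAS_LSE 1
#else
#define HAS_LSE 0
#endif

#ifdef __ARM_FEATURE_CRC32
#define HAS_CRC32 1
#else
#define HAS_CRC32 0
#endif

#ifdef __ARM_FEATURE_DOTPROD
#define HAS_DOTPROD 1
#else
#define HAS_DOTPROD 0
#endif

/* Writes a one-line summary of the build target into buf and returns
 * the number of characters snprintf would have written. */
int describe_build_target(char *buf, size_t size) {
    return snprintf(buf, size, "arch=%s lse=%d crc32=%d dotprod=%d",
                    TARGET_ARCH, HAS_LSE, HAS_CRC32, HAS_DOTPROD);
}
```

Building this once with the default arm64 target and once with -mcpu=neoverse-n1, then printing the summary, should show which optional features the second build is allowed to use.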

ARM NEON

SSE (Streaming SIMD Extensions) is a SIMD instruction set extension for the x86 architecture that speeds up floating-point operations by processing several values per instruction. ARM NEON is the equivalent for ARM CPUs, providing similar performance improvements. Applications that use SSE must add NEON code paths to get the same benefit on arm64 CPUs. This typically means adding NEON implementations that are selected when the relevant architecture flags are detected at build time.

While examining my CPU trainer codebase, I discovered multiple places where SSE was used without a corresponding NEON implementation, causing the training process to fall back to scalar (non-SIMD) code on arm64. To address this quickly, I integrated the SSE2NEON library, a header-only translation layer that maps SSE intrinsics to their NEON equivalents, so existing SSE code can be compiled for arm64 by including the library header. This gave me NEON coverage with little effort and improved performance on arm64 CPUs.
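The pattern of guarding SIMD code behind detected architecture flags can be sketched as follows. This is an illustration, not code from my trainer: add_f32 is a hypothetical function, and the sketch assumes a GCC or Clang toolchain that predefines __SSE__ on x86 and __ARM_NEON on arm64. Without the NEON branch, an arm64 build would run only the scalar loop at the end, which is the fallback described above.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* x86 SSE intrinsics */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM NEON intrinsics */
#endif

/* Element-wise out[i] = a[i] + b[i], four floats at a time when a SIMD
 * path is available, with a scalar loop for the remainder. */
void add_f32(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail; also the only path when no SIMD macro is defined. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With SSE2NEON, the hand-written NEON branch is not needed: the SSE branch is kept as-is, and on arm64 the sse2neon.h header is included in place of the x86 intrinsics header.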

vCPU to Physical Core Mapping

While running benchmarks, I noticed that the ARM machines were not reaching 90% CPU usage, unlike the x86 machines. The cause was my practice of running the CPU trainer on only half of the available vCPUs, a setting I had chosen for x86 instances because of Simultaneous Multi-Threading (SMT).

A key difference between AWS Graviton2 instances and most x86 instance types is the vCPU to physical core mapping. Each vCPU on a Graviton2 processor is a physical core, so vCPUs are fully isolated from one another and there is no SMT. In contrast, vCPUs on most x86 instances are hyper-threads that share a physical core.

As a result, the same instance size on Graviton2 offers more actual physical cores, which I could leverage for training. This allowed me to reduce the overall number of machines used compared to x86 instances, enhancing cost-effectiveness.
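A minimal sketch of aligning the worker count with this mapping is shown below. The function names are hypothetical, and the defaults are assumptions: two hardware threads per core on x86 and one on arm64. The arm64 assumption matches Graviton2 but does not hold for every arm64 CPU, so treat it as a starting point.

```c
#include <unistd.h>

/* Number of workers to run so that each physical core gets one worker.
 * threads_per_core is 2 for an SMT-enabled x86 instance, 1 for Graviton2. */
long workers_for(long vcpus, int threads_per_core) {
    if (threads_per_core < 1) {
        threads_per_core = 1;
    }
    long cores = vcpus / threads_per_core;
    return cores > 0 ? cores : 1;
}

/* Picks a default from the online CPU count and the build architecture.
 * Assumption: arm64 means no SMT, anything else means 2 threads per core. */
long default_worker_count(void) {
    long vcpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (vcpus < 1) {
        vcpus = 1;
    }
#if defined(__aarch64__)
    return workers_for(vcpus, 1);
#else
    return workers_for(vcpus, 2);
#endif
}
```

For a 16-vCPU instance, workers_for(16, 2) gives 8 workers on SMT-enabled x86, while workers_for(16, 1) gives 16 on Graviton2, which is where the extra capacity per instance comes from.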

Results

After applying these optimizations, I achieved approximately 35% cost savings and a 10% performance gain. Results varied by job, with some seeing up to 40% cost reduction, while others were 20% slower on Graviton. Overall, most jobs cost less on Graviton compared to x86, and although some were slower, this trade-off was acceptable for non-time-sensitive tasks.

Conclusion

Using ARM-based instances for CPU ML training resulted in a performance improvement of 5% to 20% and cost savings of 25% to 35%. To achieve similar results, consider the following:

Utilize the latest versions of compilers, language runtimes, and applications.
Use appropriate compilation flags tailored to your specific ARM microarchitecture.
Optimize floating-point and integer operations with ARM NEON where possible.
Align the level of parallelism in your job with the vCPU to physical core mapping.

Resources

For more information, visit the AWS Graviton Getting Started Guide.