Benchmarking DevitoPRO on Intel® Xeon® 6 Processors

Optimizing Elastic Finite-Difference Seismic Simulations

Introduction

High-performance computing (HPC) plays a critical role in scientific and engineering applications, from seismic imaging to simulation. DevitoPRO is designed to help researchers and engineers automate HPC code generation, allowing them to focus on algorithmic development for their domain-specific challenges while ensuring their simulations run efficiently on modern hardware.

As part of our ongoing efforts to optimize performance across multiple architectures, this post presents benchmark results for DevitoPRO on Intel® Xeon® 6 6980P processors, comparing them to the previous-generation 5th Gen Intel® Xeon® Platinum 8592+ processors. These results provide insights into computational throughput, data transfer performance, and mixed-precision acceleration—all key factors in achieving high-performance finite-difference seismic imaging kernels.

Benchmarking Setup

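The goal of this benchmarking study is to evaluate computational throughput, data transfer performance, and mixed-precision acceleration on the two processor generations.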

Benchmarks

The benchmarks use two workloads:

  1. Acoustic anisotropic propagator (acoustic TTI model, where TTI stands for tilted transverse isotropy) – A widely used model in seismic imaging for energy applications, used here to measure floating-point operations per second (FLOPS) and data transfer rates.
  2. Elastic propagator – A more complex model incorporating mixed-precision techniques (FP32/FP16) to evaluate performance improvements in multi-node MPI environments.

Processor Specifications

Processor | Physical Cores | Sockets | Architecture | HBM | Memory
5th Gen Intel® Xeon® Platinum 8592+ (Emerald Rapids) | 128 | 2 | x86_64 | No | Supports DDR5 memory with an eight-channel interface
Intel® Xeon® 6 6980P (Granite Rapids) | 256 | 2 | x86_64 | No | Supports DDR5 memory with a twelve-channel interface

Benchmarks were conducted using identical compilers, software environments, and simulation parameters to ensure fair comparisons.

Performance Results

Generational Performance Gains for Acoustic TTI

The acoustic TTI benchmark measures the efficiency of seismic wave propagation simulations, a key workload in geophysical exploration. The benchmark was conducted using a 1024×2048×1024 computational grid with 5000 time steps, ensuring a realistic and computationally demanding test case. The simulations were executed using DevitoPRO, leveraging optimized code generation for modern CPU architectures.

Metric | 5th Gen Intel® Xeon® Platinum 8592+ (Emerald Rapids) | Intel® Xeon® 6 6980P (Granite Rapids) | Improvement
Floating-point performance | 2.28 TFLOPS | 5.96 TFLOPS | 2.6x faster
FD throughput | 7.45 GPts/s | 15.74 GPts/s | 2.1x faster
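
As a rough cross-check of the table, the reported throughput can be converted into an estimated wall-clock time for the run described above. The sketch below assumes that GPts/s means billions of grid-point updates per second (grid points × time steps / runtime), which this post does not define explicitly. The function name is illustrative only.

```python
# Figures copied from the acoustic TTI results table.
GRID_SHAPE = (1024, 2048, 1024)
TIME_STEPS = 5000
GPTS_PER_S = {"Emerald Rapids": 7.45, "Granite Rapids": 15.74}
TFLOPS = {"Emerald Rapids": 2.28, "Granite Rapids": 5.96}


def estimated_runtime_s(grid_shape, time_steps, gpts_per_s):
    """Estimate runtime in seconds.

    ASSUMPTION: 1 GPts/s = 1e9 grid-point updates per second.
    """
    nx, ny, nz = grid_shape
    point_updates = nx * ny * nz * time_steps
    return point_updates / (gpts_per_s * 1e9)


runtimes = {
    cpu: estimated_runtime_s(GRID_SHAPE, TIME_STEPS, rate)
    for cpu, rate in GPTS_PER_S.items()
}
throughput_speedup = GPTS_PER_S["Granite Rapids"] / GPTS_PER_S["Emerald Rapids"]
flops_speedup = TFLOPS["Granite Rapids"] / TFLOPS["Emerald Rapids"]

for cpu, seconds in runtimes.items():
    print(f"{cpu}: ~{seconds:.0f} s ({seconds / 60:.1f} min)")
print(f"Throughput speedup: {throughput_speedup:.2f}x")
print(f"FLOPS speedup: {flops_speedup:.2f}x")
```

Under that assumption, the propagation itself takes roughly 24 minutes on Emerald Rapids and roughly 11 minutes on Granite Rapids, and the ratios reproduce the 2.1x and 2.6x figures in the table.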

To fully exploit the hardware capabilities, the benchmark used a NUMA-aware hybrid MPI-OpenMP configuration, optimizing both computation and data locality:

The NUMA-aware process placement ensured that:

Key Factors Driving Performance Gains

Granite Rapids (GNR) delivered more than twice the performance of Emerald Rapids (EMR) due to:

  1. Higher Memory Bandwidth:
    • GNR features twelve DDR5 memory channels per socket, compared with eight on EMR, improving data movement efficiency.
  2. Increased Core Count:
    • GNR doubles the physical core count (256 vs. 128) compared to EMR, significantly boosting parallel execution.

These architectural improvements collectively delivered 2.6x higher floating-point performance and 2.1x higher finite-difference throughput (GPts/s), demonstrating the generational leap from Emerald Rapids to Granite Rapids.
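
To see how the measured gain compares with the raw hardware scaling, the throughput figures can be normalized by core count and by DDR5 channel count. The sketch below uses only numbers from the tables in this post. It assumes the channel counts are per socket on a two-socket system (16 and 24 channels in total), and the dictionary layout is purely illustrative.

```python
# Values from the processor and acoustic TTI tables in this post.
# ASSUMPTION: memory channel counts are per socket.
SYSTEMS = {
    "EMR": {"cores": 128, "sockets": 2, "channels_per_socket": 8, "gpts": 7.45},
    "GNR": {"cores": 256, "sockets": 2, "channels_per_socket": 12, "gpts": 15.74},
}


def normalized_throughput(system):
    """Return throughput per physical core and per DDR5 channel (GPts/s)."""
    channels = system["sockets"] * system["channels_per_socket"]
    return {
        "gpts_per_core": system["gpts"] / system["cores"],
        "gpts_per_channel": system["gpts"] / channels,
    }


norm = {name: normalized_throughput(spec) for name, spec in SYSTEMS.items()}

core_ratio = SYSTEMS["GNR"]["cores"] / SYSTEMS["EMR"]["cores"]
channel_ratio = (
    SYSTEMS["GNR"]["channels_per_socket"] / SYSTEMS["EMR"]["channels_per_socket"]
)
throughput_ratio = SYSTEMS["GNR"]["gpts"] / SYSTEMS["EMR"]["gpts"]

for name, values in norm.items():
    print(
        f"{name}: {values['gpts_per_core']:.4f} GPts/s per core, "
        f"{values['gpts_per_channel']:.3f} GPts/s per channel"
    )
print(f"Core ratio: {core_ratio:.1f}x, channel ratio: {channel_ratio:.1f}x, "
      f"throughput ratio: {throughput_ratio:.2f}x")
```

Under these assumptions, throughput grows slightly faster than the core count and noticeably faster than the channel count. The sketch does not account for memory speed, cache size, or clock frequency, none of which are listed in the tables.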

Mixed-Precision Performance Gains for Isotropic Elastic on Granite Rapids (Intel® Xeon® 6)

The isotropic elastic benchmark evaluates the efficiency of multi-component wave propagation simulations, a crucial workload in geophysical imaging. This test was conducted on a 1024 × 2048 × 1024 computational grid with 5000 time steps.

Unlike acoustic TTI, which was benchmarked exclusively in FP32 precision, isotropic elastic simulations were tested in both FP32-only and mixed FP32/FP16 modes. The introduction of mixed precision yielded significant computational and memory efficiency improvements.

We use the same hybrid MPI-OpenMP parallelism with NUMA-aware pinning as with the acoustic TTI benchmark.

Precision Mode | Floating-point performance | Compute Throughput | Improvement
FP32 | 1.02 TFLOPS | 3.59 GPts/s | -
FP32/FP16 (Mixed) | 2.37 TFLOPS | 8.30 GPts/s | ~2.3x faster

Why Does Mixed Precision Matter?

Switching from FP32-only computation to a mixed FP32/FP16 approach provided a ~2.3x speedup by choosing storage precision and arithmetic precision separately. Since finite-difference methods are typically bound by memory bandwidth, the key to performance gains lies in reducing memory bandwidth pressure while maintaining numerical accuracy.

The mixed-precision strategy used in DevitoPRO follows this principle: FP16 is used for storage, while arithmetic is carried out in FP32.

This approach ensures that precision is maintained where it matters most while taking advantage of FP16’s efficiency for memory operations.
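
A minimal sketch of the idea, storing values in FP16 and doing the arithmetic at higher precision, is shown below. It is an illustration only, not DevitoPRO's implementation. It uses Python's standard `struct` module to round values to IEEE half and single precision, and a three-point averaging stencil stands in for a real finite-difference kernel. All helper names are hypothetical.

```python
import math
import struct


def pack_field(values, fmt):
    """Pack floats into a bytes buffer; fmt 'e' is FP16, 'f' is FP32."""
    return struct.pack(f"<{len(values)}{fmt}", *values)


def unpack_field(buffer, fmt):
    """Load a packed buffer back into a list of Python floats."""
    itemsize = struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{len(buffer) // itemsize}{fmt}", buffer))


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def average_neighbors(field):
    """Three-point average with every operation rounded to FP32."""
    third = to_fp32(1.0 / 3.0)
    out = []
    for i in range(1, len(field) - 1):
        acc = to_fp32(field[i - 1] + field[i])
        acc = to_fp32(acc + field[i + 1])
        out.append(to_fp32(acc * third))
    return out


# Smooth test signal with values in [-1, 1], well inside the FP16 range.
signal = [math.sin(0.05 * i) for i in range(256)]

fp32_storage = pack_field(signal, "f")
fp16_storage = pack_field(signal, "e")

# Same FP32 arithmetic in both cases; only the storage precision differs.
reference = average_neighbors(unpack_field(fp32_storage, "f"))
mixed = average_neighbors(unpack_field(fp16_storage, "e"))
max_abs_err = max(abs(a - b) for a, b in zip(reference, mixed))

print(f"FP32 storage: {len(fp32_storage)} bytes")
print(f"FP16 storage: {len(fp16_storage)} bytes")
print(f"Max abs difference: {max_abs_err:.2e}")
```

In this toy case the storage size halves, while the result differs from the FP32-storage version by no more than roughly the FP16 rounding error of the inputs. Real propagators need more care, for example keeping fields within FP16's limited range, which this sketch does not address.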

Key Benefits of Mixed Precision

  1. Lower Memory Footprint:
    • FP16 values take up half the memory of FP32, doubling effective cache capacity and reducing pressure on memory bandwidth.
    • More wavefield data can fit in fast-access memory, improving locality and cache reuse.
  2. Reduced Communication Overhead:
    • In MPI-based distributed environments, using FP16 for storage reduces halo exchange size, halving inter-rank data transfer costs.
    • This is particularly beneficial in multi-node scaling scenarios, where communication is a major bottleneck.
  3. Faster Computation:
    • Intel Xeon 6 processors feature optimized FP16 vector and tensor operations, accelerating data movement and memory loads.
    • Arithmetic remains in FP32, avoiding excessive rounding errors while still benefiting from higher memory bandwidth efficiency.

By carefully combining FP16 for storage and FP32 for arithmetic, DevitoPRO achieves significant speedups while ensuring numerical stability, making this approach ideal for large-scale elastic wave simulations.
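
The footprint and halo arguments above can be made concrete with some back-of-the-envelope arithmetic on the 1024 × 2048 × 1024 grid. The sketch assumes a 3D velocity-stress formulation with nine wavefields (three velocities, six stresses), a halo four points wide, and a domain split once along the first axis. None of these is stated in this post, so treat them as placeholders.

```python
GRID_SHAPE = (1024, 2048, 1024)
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2}
N_WAVEFIELDS = 9  # ASSUMPTION: 3 velocity + 6 stress components
HALO_WIDTH = 4    # ASSUMPTION: halo points per side, not given in the post

GIB = 2**30
MIB = 2**20


def field_bytes(shape, itemsize):
    """Bytes needed to store one time level of one wavefield."""
    nx, ny, nz = shape
    return nx * ny * nz * itemsize


def face_halo_bytes(shape, halo_width, itemsize):
    """Bytes exchanged across one x-face for one wavefield."""
    _, ny, nz = shape
    return ny * nz * halo_width * itemsize


for name, itemsize in BYTES_PER_VALUE.items():
    # One time level per wavefield; extra time buffers and model
    # parameters are ignored here.
    total = N_WAVEFIELDS * field_bytes(GRID_SHAPE, itemsize)
    halo = face_halo_bytes(GRID_SHAPE, HALO_WIDTH, itemsize)
    print(
        f"{name}: {total / GIB:.0f} GiB for {N_WAVEFIELDS} wavefields, "
        f"{halo / MIB:.0f} MiB per wavefield per x-face halo"
    )
```

With these placeholder values, FP16 storage cuts the wavefield footprint from 72 GiB to 36 GiB and each face exchange from 32 MiB to 16 MiB per wavefield.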

Impact on Elastic Wave Simulations

Elastic wave simulations require multiple coupled wavefields, significantly increasing memory and computational demands. By leveraging mixed FP32/FP16 precision, DevitoPRO achieves:

These optimizations ensure Granite Rapids delivers superior elastic wave simulation performance, making it a compelling choice for next-generation geophysical imaging workloads.

Key Takeaways

Looking Ahead

At Devito Codes, we remain committed to hardware-neutral performance optimization. While this benchmarking study focuses on Intel Xeon 6, we are actively working on:

Users and organizations interested in seismic imaging, medical imaging, or wave-based simulations can explore DevitoPRO’s capabilities at https://www.devitocodes.com/features/.

Would you like more details? Feel free to reach out!