GSoC Week 2: Investigating Performance Bottlenecks in SU2


TL;DR

To optimize performance in a large codebase like SU2, you must:

  1. Identify performance bottlenecks using profiling tools.
  2. Locate the exact function(s) consuming the most time.
  3. Understand their internals using call graphs.
  4. Micro-measure time-critical regions.
  5. Apply improvements and verify the impact.

This post focuses on profiling with gprof and visualizing the call graph using valgrind and kcachegrind.


Introduction

When working on performance-critical applications like SU2, identifying and addressing bottlenecks is essential. While this blog uses SU2 as the context, the process discussed here is applicable to most large C++ projects.

Profiling with gprof

SU2 supports profiling via gprof. To enable it, compile SU2 with -Denable-gprof=true.

./meson.py build_gsoc -Denable-pywrapper=true -Denable-mlpcpp=true -Denable-gprof=true --prefix=/home/divyaprakash/SU2_DP
./ninja -C build_gsoc install

Then run the desired executable normally:

/path/to/binary/SU2_CFD config.cfg

This generates a gmon.out file in the working directory. Analyze it using:

gprof /path/to/binary/SU2_CFD > analysis.txt
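
For reference, the same flow can be reproduced on any C++ project without SU2's meson option. The commands below are an illustrative sketch rather than SU2-specific instructions, and assume g++ (clang++ works the same way):

g++ -pg -O2 -o app main.cpp           # -pg instruments the binary for gprof
./app                                 # running it writes gmon.out in the working directory
gprof ./app gmon.out > analysis.txt   # gmon.out may also be passed explicitly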

Interpreting the Output

In analysis.txt, the flat profile lists the most time-consuming functions:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 47.43      2.03     2.03                             MLPToolbox::CNeuralNetwork::Predict(std::vector<double, std::allocator<double> >&)

Here, MLPToolbox::CNeuralNetwork::Predict emerges as the top time consumer. However, it may not be straightforward to locate this function in the source code directly, especially in a large project like SU2.

Finding the Function in Source Code

To locate the function in context, use gdb:

gdb --args ~/SU2_DP/serial/bin/SU2_CFD config_NICFD_PINN.cfg
(gdb) break MLPToolbox::CNeuralNetwork::Predict
(gdb) run

This sets a breakpoint at the function; when it is hit, gdb reports its exact location in the source tree. In my case, the following output pointed to the file and line:

Breakpoint 1, MLPToolbox::CNeuralNetwork::Predict (this=0x209d500, inputs=std::vector of length 2, capacity 2 = {...}) at ../SU2_CFD/src/fluid/../../include/fluid/../../../subprojects/MLPCpp/include/CNeuralNetwork.hpp:651
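
If the fully qualified name is hard to guess, or you also want to see how execution reaches the function, a couple of additional standard gdb commands help. They are shown here as a sketch and assume the binary was built with debug symbols:

(gdb) info functions Predict    # list symbols matching a regular expression
(gdb) backtrace                 # after the breakpoint hits, show the call chain into Predict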

Going Beyond: Understanding the Call Graph

Profiling alone doesn’t reveal why a function is expensive. The time could be due to the function itself or its callees. While gprof is supposed to provide call graph information, it did not in this case—possibly due to function inlining or compiler optimizations, which are known to interfere with accurate profiling output.
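
If inlining is the suspected culprit, one workaround is to rebuild with optimization relaxed so that calls stay visible to the profiler. The flags below are standard GCC/Clang flags; whether they reach the compiler through the environment like this depends on the build setup, so treat it as a sketch rather than a verified recipe:

CXXFLAGS="-Og -fno-inline" ./meson.py build_gsoc -Denable-pywrapper=true -Denable-mlpcpp=true -Denable-gprof=true --prefix=/home/divyaprakash/SU2_DP
./ninja -C build_gsoc install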

A more direct route, however, is a profiler that records the call graph itself. This is where valgrind and kcachegrind come into the picture.


Visualizing the Call Graph with Valgrind + KCachegrind

To analyze function calls and their time share:

valgrind --tool=callgrind ~/SU2_DP/serial/bin/SU2_CFD config_NICFD_PINN.cfg

This creates an output file such as callgrind.out.50464. You can visualize it with:

kcachegrind callgrind.out.50464
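
If a GUI is not available, the same profile can be summarized in the terminal with callgrind_annotate, which ships with valgrind. The extra callgrind options shown below are standard and record instruction-level and jump information that kcachegrind can use for finer-grained views:

callgrind_annotate callgrind.out.50464
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes ~/SU2_DP/serial/bin/SU2_CFD config_NICFD_PINN.cfg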

Interpreting the Call Graph

Using kcachegrind, we navigate the call graph and observe that the Predict function spends most of its time in two callees:

  • ComputePsi
  • ComputeChi

(Figure: call graph of Predict in kcachegrind, with ComputePsi and ComputeChi as its main callees.)

This pinpointed the actual performance bottleneck, enabling targeted optimization efforts.
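
For follow-up measurements, callgrind can also restrict collection to a single function so that only the region of interest is profiled. The wildcard pattern below is an assumption about how the symbol appears in the profile, so adjust it to match the names reported by kcachegrind:

valgrind --tool=callgrind --toggle-collect='*CNeuralNetwork*Predict*' ~/SU2_DP/serial/bin/SU2_CFD config_NICFD_PINN.cfg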


Final Notes

While valgrind and kcachegrind are powerful tools for analyzing code structure and bottlenecks, they are not ideal for runtime performance measurements due to the overhead introduced. In future posts, I will explore tracy, a high-performance real-time profiler that provides detailed runtime metrics with minimal overhead.

To dig deeper into kcachegrind, refer to the official documentation.

