GSoC Week 2: Investigating Performance Bottlenecks in SU2
TL;DR
To optimize performance in a large codebase like SU2, you must:
- Identify performance bottlenecks using profiling tools.
- Locate the exact function(s) consuming the most time.
- Understand their internals using call graphs.
- Micro-measure time-critical regions.
- Apply improvements and verify the impact.
This post focuses on profiling and visualizing the call graph using valgrind and kcachegrind.
Introduction
When working on performance-critical applications like SU2, identifying and addressing bottlenecks is essential. While this blog uses SU2 as the context, the process discussed here is applicable to most large C++ projects.
Profiling with gprof
SU2 supports profiling via gprof. To enable it, compile SU2 with -Denable-gprof=true:
./meson.py build_gsoc -Denable-pywrapper=true -Denable-mlpcpp=true -Denable-gprof=true --prefix=/home/divyaprakash/SU2_DP
./ninja -C build_gsoc install
Then run the desired executable normally:
/path/to/binary/SU2_CFD config.cfg
This generates a gmon.out file in the working directory. Analyze it using:
gprof /path/to/binary/SU2_CFD > analysis.txt
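By default, gprof looks for gmon.out in the current directory; if the profile data lives somewhere else, the file can also be named explicitly (the paths below are illustrative):
gprof /path/to/binary/SU2_CFD /path/to/rundir/gmon.out > analysis.txt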
Interpreting the Output
In analysis.txt, the flat profile lists the most time-consuming functions:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
47.43      2.03      2.03                             MLPToolbox::CNeuralNetwork::Predict(std::vector<double, std::allocator<double> >&)
Here, MLPToolbox::CNeuralNetwork::Predict emerges as the top time consumer. However, it may not be straightforward to locate this function in the source code directly, especially in a large project like SU2.
Finding the Function in Source Code
To locate the function in context, use gdb:
gdb --args ~/SU2_DP/serial/bin/SU2_CFD config_NICFD_PINN.cfg
(gdb) break MLPToolbox::CNeuralNetwork::Predict
(gdb) run
This sets a breakpoint and reveals where the function lives in the source tree. In my case, the following output pointed to the exact location:
Breakpoint 1, MLPToolbox::CNeuralNetwork::Predict (this=0x209d500, inputs=std::vector of length 2, capacity 2 = {...}) at ../SU2_CFD/src/fluid/../../include/fluid/../../../subprojects/MLPCpp/include/CNeuralNetwork.hpp:651
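A plain text search over the source tree can serve as a quick alternative. A rough sketch, run from the SU2 root and assuming the MLPCpp subproject layout visible in the breakpoint path above:
grep -rn "Predict(" subprojects/MLPCpp/include/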
Going Beyond: Understanding the Call Graph
Profiling alone doesn't reveal why a function is expensive. The time could be due to the function itself or its callees. While gprof is supposed to provide call graph information, it did not in this case, possibly due to function inlining or compiler optimizations, which are known to interfere with accurate profiling output.
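One workaround, which I have not tried for this setup, would be to rebuild the instrumented binary with inlining disabled so that gprof can attribute callers correctly, for example by passing extra compiler flags at configure time (the build directory name here is illustrative):
CXXFLAGS="-g -fno-inline" ./meson.py build_noinline -Denable-gprof=true --prefix=/home/divyaprakash/SU2_DP
An easier route, though, is to use a profiler that records the call graph directly.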
This is where valgrind and kcachegrind come into the picture.
Visualizing the Call Graph with Valgrind + KCachegrind
To analyze function calls and their time share:
valgrind --tool=callgrind ~/SU2_DP/serial/bin/SU2_CFD config_NICFD_PINN.cfg
This creates an output file such as callgrind.out.50464. You can visualize it with:
kcachegrind callgrind.out.50464
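If a GUI is not available, valgrind also ships a text-based summarizer that prints per-function costs from the same output file:
callgrind_annotate callgrind.out.50464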
Interpreting the Call Graph
Using kcachegrind, we navigate the call graph and observe that the Predict function spends most of its time in two callees:
- ComputePsi
- ComputeChi
This pinpointed the actual performance bottleneck, enabling targeted optimization efforts.
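With the hot spots identified, the next TL;DR step is to micro-measure the time-critical region before and after any change. The snippet below is a generic sketch, not SU2 code; ComputeSomething is a hypothetical stand-in for whatever region you want to time:
#include <chrono>
#include <iostream>

// Hypothetical stand-in for the hot region identified by the profiler.
void ComputeSomething() { /* ... work under test ... */ }

int main() {
  using clock = std::chrono::steady_clock;

  const int iterations = 1000;  // repeat to average out timer noise
  const auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    ComputeSomething();
  }
  const auto stop = clock::now();

  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  std::cout << "Average time per call: "
            << elapsed.count() / iterations << " us\n";
}
A measurement like this, taken outside the profiler, helps confirm whether an optimization to a callee such as ComputePsi or ComputeChi actually pays off.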
Final Notes
While valgrind and kcachegrind are powerful tools for analyzing code structure and bottlenecks, they are not ideal for runtime performance measurements due to the overhead they introduce. In future posts, I will explore tracy, a high-performance real-time profiler that provides detailed runtime metrics with minimal overhead.
To dig deeper into kcachegrind, refer to the official documentation.