Docs: Using LIKWID with NEURON for performance analysis

- Added instructions about how to build NEURON with LIKWID and how to use it with low overhead (see #2983)
- Added some info about performance regression in master
neuronsimulator · Jul 14, 2024 · e8b8ac9 · e8b8ac9
1 parent 3a2336e
commit e8b8ac9
Showing 3 changed files with 323 additions and 1 deletion.
diff --git a/docs/install/debug.md b/docs/install/debug.md
@@ -579,4 +579,326 @@ It's important to note that higher latencies do not necessarily indicate false s
 instance, indirect-memory accesses with a strided pattern could lead to increased latencies. 
 One should examine how the memory object is used within a function to determine if false sharing 
 is a potential issue. Additionally, comparing access latencies with scenarios like single-threaded 
-execution or versions without such issues can provide further insights.
+execution or versions without such issues can provide further insights.
+
+#### Using LIKWID With NEURON
+
+As described in the previous section, Intel VTune is a powerful tool for node-level performance analysis.
+However, we might need alternative options like the LIKWID profiling tools in scenarios such as:
+1) VTune is not available due to non-Intel CPUs or lack of necessary permissions.
+2) We prefer to precisely mark the code regions of interest rather than using a sampling-based approach.
+3) The raw events/performance counters shown by VTune are overwhelming, and we want high-level metrics that are typically used in HPC/Scientific Computing.
+4) We simply want to cross-check VTune's results with another tool like LIKWID to ensure there are no false positives.
+
+LIKWID offers a comprehensive set of tools for performance measurement on HPC platforms. It supports Intel, AMD, and ARM CPUs.
+However, as it is not vendor-specific, it may lack support for specific CPUs (e.g., Intel Alder Lake). Despite this, LIKWID is still a valuable tool.
+Let’s quickly see how we can use LIKWID with NEURON.
+
+We assume LIKWID is installed with the necessary permissions (via `accessDaemon` or Linux `perf` mode).
+Usage of LIKWID is covered in multiple other tutorials like [this](https://github.com/RRZE-HPC/likwid/wiki/TutorialStart) and [this](https://hpc.fau.de/research/tools/likwid/).
+Here, we focus on its usage with NEURON.
+
+LIKWID can measure performance counters for any binary, including NEURON. For example, in the execution below,
+we measure metrics from LIKWID's `FLOPS_DP` group, i.e., metrics related to double-precision floating-point operations:
+
+```console
+$ likwid-perfctr -C 0 -g FLOPS_DP ./x86_64/special -python test.py
+--------------------------------------------------------------------------------
+CPU name:	Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
+...
+--------------------------------------------------------------------------------
+Group 1: FLOPS_DP
++------------------------------------------+---------+------------+
+|                   Event                  | Counter | HWThread 0 |
++------------------------------------------+---------+------------+
+|             INSTR_RETIRED_ANY            |  FIXC0  | 8229299612 |
+|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 3491693048 |
+|           CPU_CLK_UNHALTED_REF           |  FIXC2  | 2690601456 |
+| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |    1663849 |
+|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |  740794589 |
+| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
+| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |      -     |
++------------------------------------------+---------+------------+
+
++----------------------+------------+
+|        Metric        | HWThread 0 |
++----------------------+------------+
+|  Runtime (RDTSC) [s] |     1.4446 |
+| Runtime unhalted [s] |     1.5181 |
+|      Clock [MHz]     |  2984.7862 |
+|          CPI         |     0.4243 |
+|     DP [MFLOP/s]     |   515.0941 |
+|   AVX DP [MFLOP/s]   |          0 |
+|  AVX512 DP [MFLOP/s] |          0 |
+|   Packed [MUOPS/s]   |     1.1517 |
+|   Scalar [MUOPS/s]   |   512.7906 |
+|  Vectorization ratio |     0.2241 |
++----------------------+------------+
+```
+
+In this execution, we see information like the total number of retired instructions and the contributions from scalar, SSE (128-bit), AVX/AVX2 (256-bit), and AVX-512 (512-bit) double-precision floating-point instructions.
+
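The derived metrics in the second table can be recomputed from the raw counters in the first. The sketch below uses the formulas of LIKWID's `FLOPS_DP` group as we understand them (a packed 128-bit instruction counted as 2 FLOPs, 256-bit as 4, 512-bit as 8); treat these weights as an assumption and check the group file shipped with your LIKWID installation. The dictionary keys are shortened, illustrative names, and the 512-bit counter, reported as `-` above, is treated as zero.

```python
# Raw values copied from the FLOPS_DP report above (HWThread 0).
counters = {
    "instr_retired": 8229299612,   # INSTR_RETIRED_ANY
    "clk_unhalted": 3491693048,    # CPU_CLK_UNHALTED_CORE
    "fp_128b_packed": 1663849,     # FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE
    "fp_scalar": 740794589,        # FP_ARITH_INST_RETIRED_SCALAR_DOUBLE
    "fp_256b_packed": 0,           # FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
    "fp_512b_packed": 0,           # reported as "-" above; assumed to be 0 here
}
runtime_s = 1.4446                 # Runtime (RDTSC) [s]


def derived_metrics(c, runtime):
    """Recompute a few FLOPS_DP metrics from raw counter values."""
    packed = c["fp_128b_packed"] + c["fp_256b_packed"] + c["fp_512b_packed"]
    # Assumed weights: 2, 4 and 8 double-precision FLOPs per packed instruction.
    flops = (c["fp_scalar"] + 2 * c["fp_128b_packed"]
             + 4 * c["fp_256b_packed"] + 8 * c["fp_512b_packed"])
    return {
        "cpi": c["clk_unhalted"] / c["instr_retired"],
        "dp_mflops": flops / runtime / 1e6,
        "vectorization_ratio_percent": 100 * packed / (packed + c["fp_scalar"]),
    }


metrics = derived_metrics(counters, runtime_s)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The results agree with the reported `CPI`, `DP [MFLOP/s]` and `Vectorization ratio` up to the rounding of the printed runtime. Note that the vectorization ratio is a percentage, so only about 0.22% of the floating-point instructions in this run were packed.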
+However, in the context of NEURON model execution, this is not sufficient because the above measurements summarize the full execution,
+including the model building and simulation phases. For detailed performance analysis, we need more granular information. For example:
+1) We want to compare hardware counters for phases like current update (`BREAKPOINT`) vs. state update (`DERIVATIVE`) due to their different properties (memory-bound vs. compute-bound).
+2) We might want to analyze the performance of a specific MOD file and its kernels.
+
+This is where [LIKWID's Marker API](https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC) comes into play.
+Currently, Caliper doesn't integrate LIKWID as a service, but NEURON's profiler interface supports enabling LIKWID markers via the same
+[Instrumentor API](https://github.com/neuronsimulator/nrn/blob/master/src/coreneuron/utils/profile/profiler_interface.h).
+
+###### Building NEURON With LIKWID Support
+
+Like Caliper, building LIKWID support in NEURON is easy:
+
+```console
+cmake .. \
+    -DNRN_ENABLE_PROFILING=ON \
+    -DNRN_PROFILER=likwid \
+    -DCMAKE_PREFIX_PATH=<likwid-install-prefix>/share/likwid \
+    -DCMAKE_INSTALL_PREFIX=`pwd`/install \
+    -DCMAKE_BUILD_TYPE=RelWithDebInfo
+make -j && make install
+```
+
+Once built this way, LIKWID markers are added for the various simulation phases, similar to those shown in the Caliper and VTune sections.
+
+###### Measurement with LIKWID
+
+Measuring different metrics with LIKWID is easy, as seen earlier. By building LIKWID support via CMake, we now have enabled LIKWID markers 
+that help us see the performance counters for different phases of simulations. In the example below, we added the `-m` CLI option to enable markers:
+
+```console
+$ likwid-perfctr -C 0 -m -g FLOPS_DP ./x86_64/special -python test.py
+...
+...
+Region psolve, Group 1: FLOPS_DP
++-------------------+------------+
+|    Region Info    | HWThread 0 |
++-------------------+------------+
+| RDTSC Runtime [s] |  10.688310 |
+|     call count    |          1 |
++-------------------+------------+
+
++------------------------------------------+---------+------------+
+|                   Event                  | Counter | HWThread 0 |
++------------------------------------------+---------+------------+
+|             INSTR_RETIRED_ANY            |  FIXC0  | 5496569000 |
+|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 2753500000 |
+|           CPU_CLK_UNHALTED_REF           |  FIXC2  | 2111229000 |
+| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |     489202 |
+|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |  730000000 |
+| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
+| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |      -     |
++------------------------------------------+---------+------------+
+...
+...
+Region state-cdp5StCmod, Group 1: FLOPS_DP
++-------------------+------------+
+|    Region Info    | HWThread 0 |
++-------------------+------------+
+| RDTSC Runtime [s] |   0.353965 |
+|     call count    |        400 |
++-------------------+------------+
+
++------------------------------------------+---------+------------+
+|                   Event                  | Counter | HWThread 0 |
++------------------------------------------+---------+------------+
+|             INSTR_RETIRED_ANY            |  FIXC0  | 2875111000 |
+|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 1057608000 |
+|           CPU_CLK_UNHALTED_REF           |  FIXC2  |  810826000 |
+| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |     380402 |
+|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |  358722700 |
+| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
+| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |      -     |
++------------------------------------------+---------+------------+
+...
+Region state-Cav2_3, Group 1: FLOPS_DP
++-------------------+------------+
+|    Region Info    | HWThread 0 |
++-------------------+------------+
+| RDTSC Runtime [s] |   0.002266 |
+|     call count    |        400 |
++-------------------+------------+
+...
+...
+```
+
+Here, we see performance counters for the `psolve` region, which includes the full simulation loop,
+and a breakdown into channel-specific kernels like `state-cdp5StCmod` and `state-Cav2_3`.
+
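Dividing a region's runtime by its call count gives a feel for how small the instrumented regions are. The sketch below does this for the two kernels reported above; the dictionary layout and function name are illustrative only.

```python
# Runtime and call count copied from the two region reports above.
regions = {
    "state-cdp5StCmod": {"runtime_s": 0.353965, "calls": 400},
    "state-Cav2_3": {"runtime_s": 0.002266, "calls": 400},
}


def time_per_call_us(region):
    """Average time spent in one invocation of the region, in microseconds."""
    return region["runtime_s"] / region["calls"] * 1e6


for name, region in regions.items():
    print(f"{name}: {time_per_call_us(region):.1f} us per call")
```

In this run, one `state-cdp5StCmod` call takes roughly 885 microseconds on average, while one `state-Cav2_3` call takes only a few microseconds. For regions that short, the cost of starting and stopping the counters is no longer negligible relative to the work being measured.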
+###### Avoiding Measurement Overhead with `NRN_PROFILE_REGIONS`
+
+When profiling with a tool like Caliper, merely measuring the execution time has negligible overhead.
+However, with LIKWID, starting and stopping measurements using the high-level `LIKWID_MARKER_START()` and
+`LIKWID_MARKER_STOP()` API can introduce significant overhead, especially when the instrumentation covers small code regions.
+This is the case in NEURON, where we instrument all simulation phases as well as the state and current update kernels of individual mechanisms.
+In small examples, such overhead could slow down execution by 10x.
+
+To avoid this, NEURON provides an environment variable `NRN_PROFILE_REGIONS` to enable profiling for specific code regions.
+For example, let's assume we want to understand hardware performance counters for two phases:
+- `psolve`: entire simulation phase
+- `state-hh`: one specific state update phase where we call the `DERIVATIVE` block of the `hh.mod` file
+
+These names are the same as those instrumented in the code using the `Instrumentor::phase` API (see previous step).
+We can specify the regions to profile as a *comma-separated list* via the `NRN_PROFILE_REGIONS` environment variable:
+
+```console
+$ export NRN_PROFILE_REGIONS=psolve,state-hh
+$ likwid-perfctr -C 0 -m -g FLOPS_DP ./x86_64/special -python test.py
+```
+
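When comparing several region sets or builds, it can be convenient to assemble this invocation from a script. The helper below is a hypothetical convenience wrapper, not part of NEURON or LIKWID. It only builds the environment and argument list from the flags shown above, and leaves the actual launch to the caller.

```python
import os
import shlex


def likwid_invocation(regions, group="FLOPS_DP", core=0,
                      binary="./x86_64/special", script="test.py"):
    """Build (env, argv) for a marker-enabled likwid-perfctr run.

    Hypothetical helper: it only mirrors the command line shown above.
    """
    env = dict(os.environ)
    # NEURON expects the region names as a comma-separated list.
    env["NRN_PROFILE_REGIONS"] = ",".join(regions)
    argv = ["likwid-perfctr", "-C", str(core), "-m", "-g", group,
            binary, "-python", script]
    return env, argv


env, argv = likwid_invocation(["psolve", "state-hh"])
print(f"NRN_PROFILE_REGIONS={env['NRN_PROFILE_REGIONS']} {shlex.join(argv)}")
# To launch the measurement: subprocess.run(argv, env=env, check=True)
```

Keeping the region list in one place makes it harder to profile two builds with different `NRN_PROFILE_REGIONS` settings by accident.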
+Now, we should get the performance counter report only for these two regions with relatively small execution overhead:
+
+```console
+Region psolve, Group 1: FLOPS_DP
++-------------------+------------+
+|    Region Info    | HWThread 0 |
++-------------------+------------+
+| RDTSC Runtime [s] |  10.688310 |
+|     call count    |          1 |
++-------------------+------------+
+
++------------------------------------------+---------+------------+
+|                   Event                  | Counter | HWThread 0 |
++------------------------------------------+---------+------------+
+|             INSTR_RETIRED_ANY            |  FIXC0  | 5496569000 |
+|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 2753500000 |
+...
+...
+Region state-hh, Group 1: FLOPS_DP
++-------------------+------------+
+|    Region Info    | HWThread 0 |
++-------------------+------------+
+| RDTSC Runtime [s] |   0.180081 |
+|     call count    |        400 |
++-------------------+------------+
+
++------------------------------------------+---------+------------+
+|                   Event                  | Counter | HWThread 0 |
++------------------------------------------+---------+------------+
+|             INSTR_RETIRED_ANY            |  FIXC0  | 1341553000 |
+|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |  540962900 |
+...
+```
+
+> NOTE: Currently, NEURON uses marker APIs `LIKWID_MARKER_START()` and `LIKWID_MARKER_STOP()` from LIKWID.
+> We should consider using `LIKWID_MARKER_REGISTER()` API to reduce overhead and prevent incorrect report counts for tiny code regions.
+
+### An Example Of Cross-Checking VTune Results with LIKWID
+
+This section isn't about comparing Intel VTune with LIKWID, but rather about how one can use LIKWID to confirm
+performance results initially observed with Intel VTune.
+
+While investigating a performance regression in the upcoming NEURON v9.0 (see [#2787](https://github.com/neuronsimulator/nrn/issues/2787)),
+we compared the performance of the master branch with [an earlier commit](https://github.com/neuronsimulator/nrn/commit/b4ade55096d81c31d7e5db099cce949d56162490)
+(referred to as `presoa`). This commit precedes the merging of the [new SoA data structure PR](https://github.com/neuronsimulator/nrn/pull/2027).
+When applying a fix from [#2966](https://github.com/neuronsimulator/nrn/pull/2966) to the `presoa` version, we observed a 10-20% performance
+regression in the master branch compared to the `presoa` version.
+
+To investigate this, we generated profiles for both versions and compared them using VTune:
+
+![VTune Comparison](images/nrn_likwid_vtune_instruction_comparison.png)
+
+We noticed that the `psolve` region, which represents the entire simulation phase, was about 2.3 seconds slower.
+However, the `state-cdp5StCmod` and `state-Cav2_3` kernels showed a huge discrepancy in the total number
+of retired instructions (~142 billion vs. ~80 million) and execution time. This was puzzling, as such a huge
+difference is unexpected unless the new implementation added many extra instructions or introduced a bug.
+The `cdp5StCmod.mod` file, with its complex, large `KINETIC` block and `sparse` solver, made it challenging
+to determine if the solver implementation had caused a regression in the #2027 PR.
+
+Uncertain whether VTune's sampling-based mechanism had accurately captured the executed instructions, we decided to
+cross-check the measurements with LIKWID. Note that the LIKWID
+measurements were done on a different machine (due to a permissions issue) and with a different simulation duration, so
+the absolute LIKWID results cannot be compared with VTune's.
+
+Using the master branch we get:
+
+```console
+Region state-cdp5StCmod, Group 1: FLOPS_DP
++-------------------+------------+
+|    Region Info    | HWThread 0 |
++-------------------+------------+
+| RDTSC Runtime [s] |   0.369202 |
+|     call count    |        400 |
++-------------------+------------+
+
++------------------------------------------+---------+------------+
+|                   Event                  | Counter | HWThread 0 |
++------------------------------------------+---------+------------+
+|             INSTR_RETIRED_ANY            |  FIXC0  | 2875244000 |
+|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 1102965000 |
+|           CPU_CLK_UNHALTED_REF           |  FIXC2  |  845614900 |
+| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |     380402 |
+|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |  358722700 |
+| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
+| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |      -     |
++------------------------------------------+---------+------------+
+```
+
+whereas with the `presoa`:
+
+```console
+Region state-cdp5StCmod, Group 1: FLOPS_DP
++-------------------+------------+
+|    Region Info    | HWThread 0 |
++-------------------+------------+
+| RDTSC Runtime [s] |   0.000044 |
+|     call count    |        400 |
++-------------------+------------+
+
++------------------------------------------+---------+------------+
+|                   Event                  | Counter | HWThread 0 |
++------------------------------------------+---------+------------+
+|             INSTR_RETIRED_ANY            |  FIXC0  |    1560877 |
+|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |    6131154 |
+|           CPU_CLK_UNHALTED_REF           |  FIXC2  |    4697060 |
+| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |          0 |
+|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |       4400 |
+| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
+| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |      -     |
++------------------------------------------+---------+------------+
+```
+
+Here, LIKWID also shows a huge difference in execution. This prompted us to examine the implementation in the
+[presoa commit](https://github.com/neuronsimulator/nrn/blob/d2f3b8e79fa4bca98b25440e713018ba0e263629/src/nrnoc/fadvance.cpp#L777),
+where we realized the instrumentation was incorrect. Instead of measuring the execution of the state-update kernel,
+it measured only the retrieval of the function pointer!
+
+```cpp
+nrn::Instrumentor::phase_begin(mechname.c_str());
+Pvmi s = memb_func[tml->index].state;
+nrn::Instrumentor::phase_end(mechname.c_str());
+
+// !! begin should be here instead
+(*s)(_nt, tml->ml, tml->index);
+// !! end should be here instead
+```
+
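A quick plausibility check on the counters could have flagged this earlier: normalizing the LIKWID counts by the call count shows how much work each build appears to do per kernel invocation. The sketch below uses the values from the two `state-cdp5StCmod` reports above; the dictionary layout and function name are illustrative.

```python
CALLS = 400  # call count reported for state-cdp5StCmod in both builds

# Counter values copied from the two LIKWID reports above.
builds = {
    "master": {"instructions": 2_875_244_000,
               "fp_scalar": 358_722_700, "fp_128b_packed": 380_402},
    "presoa": {"instructions": 1_560_877,
               "fp_scalar": 4_400, "fp_128b_packed": 0},
}


def per_call(build, calls=CALLS):
    """Average instruction counts for a single invocation of the region."""
    fp_total = build["fp_scalar"] + build["fp_128b_packed"]
    return {
        "instructions": build["instructions"] / calls,
        "fp_instructions": fp_total / calls,
    }


summary = {name: per_call(build) for name, build in builds.items()}
for name, values in summary.items():
    print(f"{name}: {values['instructions']:.0f} instructions, "
          f"{values['fp_instructions']:.0f} FP instructions per call")
```

With the faulty markers, `presoa` appears to execute about 11 floating-point instructions per call, which is implausibly few for a large `KINETIC` block solved with the `sparse` solver. Numbers that small suggest the region contains only marker bookkeeping and a pointer load.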
+### Identifying Regression
+
+Fixing the instrumentation in the `presoa` version helped us get comparable performance measurements, but that still didn't explain why
+the new SoA data structure change caused the regression. To investigate further, we used LIKWID to compare various performance metrics.
+The same analysis could be done in VTune if you are more familiar with it. However, if you prefer LIKWID, using it via CLI might be
+easier than using a more feature-heavy tool like Intel VTune.
+
+Here is an example comparing the master branch and `presoa` for the `state-cdp5StCmod` kernel:
+
+```console
+export NRN_PROFILE_REGIONS=state-cdp5StCmod
+likwid-perfctr -C 0 -m -g FLOPS_DP x86_64/special -python protocols/01_SS.py
+```
+
+And then comparing LIKWID's output side-by-side:
+
+![LIKWID Comparison](images/nrn_likwid_presoa_master_comparison.png)
+
+There are two important observations worth mentioning:
+
+1. The number of instructions increased from 2.27 billion in `presoa` to 2.87 billion in the `master` branch (~26% more instructions).
+2. At the same time, the number of floating-point instructions changed from 358 million in `presoa` to 361 million in the `master` branch (i.e., ~1% more floating-point instructions).
+
+This indicates that while the amount of "useful work" (floating-point instructions) is almost the same, the newer data structure implementation adds extra instructions.
+Next, we need to examine the generated assembly code to determine which aspect of the C++ implementation causes this increase, if any.
+
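The percentages quoted above follow directly from the rounded counts. The sketch below recomputes them and also derives the increase in non-floating-point instructions, which is where the extra work appears to be. The inputs are the rounded values from the text, so the results are approximate.

```python
# Rounded counts quoted in the two observations above.
presoa = {"instructions": 2.27e9, "fp_instructions": 358e6}
master = {"instructions": 2.87e9, "fp_instructions": 361e6}


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100


instr_increase = percent_increase(presoa["instructions"], master["instructions"])
fp_increase = percent_increase(presoa["fp_instructions"], master["fp_instructions"])

# Everything that is not a floating-point instruction: loads, stores,
# address arithmetic, branches, and so on.
non_fp_presoa = presoa["instructions"] - presoa["fp_instructions"]
non_fp_master = master["instructions"] - master["fp_instructions"]
non_fp_increase = percent_increase(non_fp_presoa, non_fp_master)

print(f"all instructions:    +{instr_increase:.1f}%")
print(f"FP instructions:     +{fp_increase:.1f}%")
print(f"non-FP instructions: +{non_fp_increase:.1f}%")
```

Since the floating-point count barely changes, the non-floating-point instructions grow by roughly 31%, which is consistent with the extra cost coming from data access rather than from the numerical work itself.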
+> NOTE: Some VTune analysis hints at additional instructions due to the introduction of `mechanism_range.hpp`, but this needs further investigation.
diff --git a/docs/install/images/nrn_likwid_presoa_master_comparison.png b/docs/install/images/nrn_likwid_presoa_master_comparison.png
diff --git a/docs/install/images/nrn_likwid_vtune_instruction_comparison.png b/docs/install/images/nrn_likwid_vtune_instruction_comparison.png