Skip to content

Commit

Permalink
Docs: Using LIKWID with NEURON for performance analysis
Browse files Browse the repository at this point in the history
- Added insructions about how to build NEURON with LIKWID
  and how to use it with low overhead (see #2983)
- Added some info about performance regression in master
  • Loading branch information
pramodk committed Jul 14, 2024
1 parent 3a2336e commit e8b8ac9
Show file tree
Hide file tree
Showing 3 changed files with 323 additions and 1 deletion.
324 changes: 323 additions & 1 deletion docs/install/debug.md
Original file line number Diff line number Diff line change
Expand Up @@ -579,4 +579,326 @@ It's important to note that higher latencies do not necessarily indicate false s
instance, indirect-memory accesses with a strided pattern could lead to increased latencies.
One should examine how the memory object is used within a function to determine if false sharing
is a potential issue. Additionally, comparing access latencies with scenarios like single-threaded
execution or versions without such issues can provide further insights.
execution or versions without such issues can provide further insights.

#### Using LIKWID With NEURON

As described in the previous section, Intel VTune is a powerful tool for node-level performance analysis.
However, we might need alternative options like LIKWID profiling tools in scenarios such as:
1) VTune is not available due to non-Intel CPUs or lack of necessary permissions.
2) We prefer to precisely mark the code regions of interest rather than using a sampling-based approach.
3) The raw events/performance counters shown by VTune are overwhelming, and we want high-level metrics that typically used in HPC/Scientific Computing.
4) Or simply we want to cross-check VTune's results with another tool like LIKWID to ensure there are no false positives.

LIKWID offers a comprehensive set of tools for performance measurement on HPC platforms. It supports Intel, AMD, and ARM CPUs.
However, as it is not vendor-specific, it may lack support for specific CPUs (e.g., Intel Alder Lake). Despite this, LIKWID is still a valuable tool.
Let’s quickly see how we can use LIKWID with NEURON.

We assume LIKWID is installed with the necessary permissions (via `accessDaemon` or Linux `perf` mode).
Usage of LIKWID is covered in multiple other tutorials like [this](https://github.com/RRZE-HPC/likwid/wiki/TutorialStart) and [this](https://hpc.fau.de/research/tools/likwid/).
Here, we focus on its usage with NEURON.

LIKWID can measure performance counters on any binary like NEURON. For example, in the execution below,
we measure metrics from `FLOPS_DP` LIKWID group, i.e., double precision floating point related metrics:

```console
$ likwid-perfctr -C 0 -g FLOPS_DP ./x86_64/special -python test.py
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
...
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 8229299612 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3491693048 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 2690601456 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 1663849 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 740794589 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | - |
+------------------------------------------+---------+------------+

+----------------------+------------+
| Metric | HWThread 0 |
+----------------------+------------+
| Runtime (RDTSC) [s] | 1.4446 |
| Runtime unhalted [s] | 1.5181 |
| Clock [MHz] | 2984.7862 |
| CPI | 0.4243 |
| DP [MFLOP/s] | 515.0941 |
| AVX DP [MFLOP/s] | 0 |
| AVX512 DP [MFLOP/s] | 0 |
| Packed [MUOPS/s] | 1.1517 |
| Scalar [MUOPS/s] | 512.7906 |
| Vectorization ratio | 0.2241 |
+----------------------+------------+
```

In this execution, we see information like the total number of instructions executed and contributions from SSE, AVX2, and AVX-512 instructions.

But, in the context of NEURON model execution, this is not sufficient because the above measurements summarize the full execution,
including model building and simulation phases. For detailed performance analysis, we need granular information. For example:
1) We want to compare hardware counters for phases like current update (`BREAKPOINT`) vs. state update (`DERIVATIVE`) due to their different properties (memory-bound vs. compute-bound).
2) We might want to analyze the performance of a specific MOD file and it's kernels.

This is where [LIKWID's Marker API](https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC) comes into play.
Currently, Caliper doesn't integrate LIKWID as a service, but NEURON's profiler interface supports enabling LIKWID markers via the same
[Instrumentor API](https://github.com/neuronsimulator/nrn/blob/master/src/coreneuron/utils/profile/profiler_interface.h).

###### Building NEURON With LIKWID Support

Like Caliper, building LIKWID support in NEURON is easy:

```console
cmake .. \
-DNRN_ENABLE_PROFILING=ON \
-DNRN_PROFILER=likwid \
-DCMAKE_PREFIX_PATH=<likwid-install-prefix>/share/likwid \
-DCMAKE_INSTALL_PREFIX=`pwd`/install \
-DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j && make install
```

Once built this way, LIKWID markers are added for the various simulation phases similar to those shown in Caliper and Vtune section.

###### Measurement with LIKWID

Measuring different metrics with LIKWID is easy, as seen earlier. By building LIKWID support via CMake, we now have enabled LIKWID markers
that help us see the performance counters for different phases of simulations. In the example below, we added the `-m` CLI option to enable markers:

```console
$ likwid-perfctr -C 0 -m -g FLOPS_DP ./x86_64/special -python test.py
...
...
Region psolve, Group 1: FLOPS_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 10.688310 |
| call count | 1 |
+-------------------+------------+

+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 5496569000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 2753500000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 2111229000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 489202 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 730000000 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | - |
+------------------------------------------+---------+------------+
...
...
Region state-cdp5StCmod, Group 1: FLOPS_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.353965 |
| call count | 400 |
+-------------------+------------+

+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2875111000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 1057608000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 810826000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 380402 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 358722700 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | - |
+------------------------------------------+---------+------------+
...
Region state-Cav2_3, Group 1: FLOPS_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.002266 |
| call count | 400 |
+-------------------+------------+
...
...
```

Here, we see performance counters for the `psolve` region, which includes the full simulation loop,
and a breakdown into channel-specific kernels like `state-cdp5StCmod` and `state-Cav2_3`.

###### Avoiding Measurement Overhead with `NRN_PROFILE_REGIONS`

For profiling like Caliper, merely measuring the execution time has negligible overhead.
However, with LIKWID, starting and stopping measurement using high-level API like `LIKWID_MARKER_START()` and
`LIKWID_MARKER_STOP()` can introduce significant overhead, especially when instrumentation covers small code regions.
This is the case in NEURON, where we instrument all simulation phases and individual mechanisms' state and current update kernels.
In small examples, such overhead could slow down execution by 10x.

To avoid this, NEURON provides an environment variable `NRN_PROFILE_REGIONS` to enable profiling for specific code regions.
For example, let's assume we want to understand hardware performance counters for two phases:
- `psolve`: entire simulation phase
- `state-hh`: one specific state update phase where we call `DERIVATIVE` block of the `hh.mod` file

These names are the same as those instrumented in the code using `Instrumentor::phase` API (see previous step).
We can specify these regions to profile as a *comma-separated list* via the `NRN_PROFILE_REGIONS` environment variable:

```console
$ export NRN_PROFILE_REGIONS=psolve,state-hh
$ likwid-perfctr -C 0 -m -g FLOPS_DP ./x86_64/special -python test.py
```

Now, we should get the performance counter report only for these two regions with relatively small execution overhead:

```console
Region psolve, Group 1: FLOPS_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 10.688310 |
| call count | 1 |
+-------------------+------------+

+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 5496569000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 2753500000 |
...
...
Region state-hh, Group 1: FLOPS_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.180081 |
| call count | 400 |
+-------------------+------------+

+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 1341553000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 540962900 |
...
```

> NOTE: Currently, NEURON uses marker APIs `LIKWID_MARKER_START()` and `LIKWID_MARKER_STOP()` from LIKWID.
> We should consider using `LIKWID_MARKER_REGISTER()` API to reduce overhead and prevent incorrect report counts for tiny code regions.
### An Example Of Cross-Checking VTune Results with LIKWID

This section isn't about comparing Intel VTune with LIKWID, but rather about how one can use LIKWID to confirm
performance results initially observed with Intel VTune.

While investigating a performance regression in the upcoming NEURON v9.0 (see [#2787](https://github.com/neuronsimulator/nrn/issues/2787)),
we compared the performance of the master branch with [an earlier commit](https://github.com/neuronsimulator/nrn/commit/b4ade55096d81c31d7e5db099cce949d56162490)
(referred to as `presoa`). This commit precedes the merging of the [new SoA data structure PR](https://github.com/neuronsimulator/nrn/pull/2027).
When applying a fix from [#2966](https://github.com/neuronsimulator/nrn/pull/2966) to the `presoa` version, we observed a 10-20% performance
regression in the master branch compared to the `presoa` version.

To investigate this, we generated profiles for both versions and compared them using VTune:

![VTune Comparison](images/nrn_likwid_vtune_instruction_comparison.png)

We noticed that the `psolve` region, which represents the entire simulation phase, was about 2.3 seconds slower.
However, the `state-cdp5StCmod` and `state-Cav2_3` kernels showed a huge discrepancy in the total number
of retired instructions (~142 billion vs. ~80 million) and execution time. This was puzzling, as such a huge
difference is unexpected unless the new implementation added many extra instructions or introduced a bug.
The `cdp5StCmod.mod` file, with its complex, large `KINETIC` block and `sparse` solver, made it challenging
to determine if the solver implementation had caused a regression in the #2027 PR.

Uncertain if VTune's sampling-based mechanism had accurately captured the executed instructions, we decided to
cross-check the measurements with LIKWID. Using LIKWID, we observed the following measurements. Note that this
measurement was done on a different machine (due to permissions issue) with different simulation duration and
hence the absolute results of LIKWID can not be compared with Vtune.

Using the master branch we get:

```console
Region state-cdp5StCmod, Group 1: FLOPS_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.369202 |
| call count | 400 |
+-------------------+------------+

+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2875244000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 1102965000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 845614900 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 380402 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 358722700 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | - |
+------------------------------------------+---------+------------+
```

whereas with the `presoa`:

```console
Region state-cdp5StCmod, Group 1: FLOPS_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.000044 |
| call count | 400 |
+-------------------+------------+

+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 1560877 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 6131154 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 4697060 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 0 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 4400 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | - |
+------------------------------------------+---------+------------+
```

here, with LIKWID also we see a huge difference in execution. This prompted us to examine the implementation in the
[presoa commit](https://github.com/neuronsimulator/nrn/blob/d2f3b8e79fa4bca98b25440e713018ba0e263629/src/nrnoc/fadvance.cpp#L777),
where we realized the instrumentation was incorrect. Instead of measuring the state-update kernel launch, it is
always measuring access to the function pointer, not the function execution itself!

```cpp
nrn::Instrumentor::phase_begin(mechname.c_str());
Pvmi s = memb_func[tml->index].state;
nrn::Instrumentor::phase_end(mechname.c_str());

// !! begin should be here instead
(*s)(_nt, tml->ml, tml->index);
// !! end should be here instead
```
### Identifying Regression
Fixing the instrumentation in the `presoa` version helped us to get similar performance measurements but that still didn't explain why
that new SoA data structure change has caused the regression. To investigate further, we sed LIKWID to compare various performance metrics.
The same analysis could be done in VTune if you are more familiar with it. However, if you prefer LIKWID, using it via CLI might be
easier than using a more feature-heavy tool like Intel VTune.
Here is an example comparing the master branch and `presoa` for the `state-cdp5StCmod` kernel:
```console
export NRN_PROFILE_REGIONS=state-cdp5StCmod
likwid-perfctr -C 0 -m -g FLOPS_DP x86_64/special -python protocols/01_SS.py
```

And then comparing LIKWID's output side-by-side:

![VTune Comparison](images/nrn_likwid_presoa_master_comparison.png)

There are two important observations worth mentioning:

1. The number of instructions increased from 2.27 billion in the `presoa` to 2.87 billion in the `master` branch (~26% more instructions).
2. At the same time, the number of floating instructions changed from 358 million in the `presoa` to 361 million in the `master` branch (i.e. ~1% extra floating point instructions).

This indicates that while the amount of "useful work" (floating point instructions) is almost the same, the newer data structure implementation adds extra instructions.
Next, we need to examine the generated code/assembly code to determine which aspect of C++ implementation causing such increases if any.

> NOTE: Some VTune analysis hints at additional instructions due to the introduction of `mechanism_range.hpp`, but this needs further investigation.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit e8b8ac9

Please sign in to comment.