Multigroup Per-Thread Cache Conversion #2591
Conversation
…y one MG XS cache (rather than one per material), and refactoring functions needed to support calculate_xs in MG mode.
Thanks for this fix @jtramm and the very thorough explanation of the whole situation! Other than the very minor line comments below, one thing that I would like to see before we merge is the addition of a test to cover this case so we don't mess it up in the future.
I've added a test and can confirm that it fails with develop in event-based mode. However, the CI is giving failures that seem to be unrelated to anything I've added. Any ideas on how to resolve the CI issue?
Argh, sometimes I hate CI 😞 I'll look into that and see if I can get to the bottom of it.
@jtramm I think you'll need to push a new commit to this branch to get CI passing (can even be an empty commit).
I added a comment-only commit, and it looks like CI is passing now.
Thanks @jtramm!
Overview
This PR converts the per-thread multigroup cross section caches into particle-owned caches. It is a full fix for #2590 (which it closes) and a partial remedy for #2160, which will remain open since this PR does not solve all of the performance issues.
As discussed in Issue #2160, the use of the per-thread cache involved access patterns like:
(openmc/src/mgxs.cpp, lines 428 to 439 at 33ead74)
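Since the embedded snippet is not reproduced here, the following is a paraphrased sketch of the pattern it pointed at; the names are approximate, not the verbatim `mgxs.cpp` source:

```cpp
// Paraphrased sketch of the old per-thread caching pattern; an
// approximation of the idea, not the verbatim contents of src/mgxs.cpp.
#include <omp.h>
#include <vector>

struct CacheDataMG {
  int index_temp {-1};  // cached temperature index
  int index_angle {-1}; // cached angle index
};

class Mgxs {
public:
  // One entry per thread, stored contiguously: entries for different
  // threads can share a 64-byte cache line, so writes by one thread
  // invalidate the line for its neighbors (false sharing).
  std::vector<CacheDataMG> cache;

  void update_angle_index(int computed_index)
  {
    // Each thread reads/writes its own slot via its OpenMP thread number.
    cache[omp_get_thread_num()].index_angle = computed_index;
  }
};
```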
This existing strategy was problematic for two reasons:
1. It is a canonical case of multithreaded "false sharing": multiple cache objects fit on the same cache line but are read/written by different threads (Poor thread scaling in multi-group mode, #2160). See the illustration after this list.
2. It generates incorrect results in event-based mode (Multigroup MC Gives Wrong Answer in Event-Based Mode, #2590), as per-thread particle state like this is meaningless when a thread no longer maps to a single particle's history.
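For reference on point 1, false sharing arises because adjacent per-thread entries occupy the same cache line. A generic mitigation is to pad each entry out to a full line; this is illustration only, not what this PR does (the PR removes the per-thread array entirely):

```cpp
// Illustration only: padding each per-thread entry to its own 64-byte
// cache line avoids false sharing at the cost of extra memory. This PR
// instead removes the per-thread array and moves the cache into the
// particle, which fixes both problems at once.
struct alignas(64) PaddedCacheDataMG {
  int index_temp {-1};
  int index_angle {-1};
};
static_assert(sizeof(PaddedCacheDataMG) == 64, "one entry per cache line");
```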
To fix these issues, this PR simply moves the cache into the `ParticleData` class. I also experimented with removing MGXS caching altogether, but found that retaining the cache in the particle performed better.

Other Necessary Changes
Number of Caches
The original scheme kept per-thread caches for all materials and for all nuclides (if present). For simplicity, to reduce the memory footprint, and because determining the temperature and angle parameters is not that expensive, the new particle-owned scheme maintains only a single cache entry, corresponding to the most recently visited material. I.e., in the previous implementation, a particle might traverse:

Material A -> Material B -> Material A

without changing energy and still reuse a cache entry when re-entering Material A, since each thread had a separate cache for every material. With the new scheme in this PR, the cache is overwritten when the particle traverses from A -> B, so the temperature and angle indices must be recomputed when re-entering Material A. As the performance analysis section shows, the new scheme is faster overall, so trading a few extra lookups for the memory savings appears to be a net benefit.
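For concreteness, here is a minimal sketch of what a particle-owned, single-entry cache can look like; the field and class names are assumptions loosely modeled on `ParticleData`, not a verbatim copy of the implementation:

```cpp
// Minimal sketch of a particle-owned, single-entry MGXS cache (field and
// class names are assumptions, loosely modeled on ParticleData).
struct CacheDataMG {
  int material {-1};    // material the cached indices correspond to
  double sqrtkT {-1.0}; // temperature, as sqrt(kT), used for the lookup
  int index_temp {-1};  // cached temperature index
  int index_angle {-1}; // cached angle index
};

class ParticleData {
public:
  // Exactly one entry per particle: no two threads ever touch the same
  // cache object, and traversing A -> B overwrites the entry, so the
  // indices are recomputed when the particle re-enters material A.
  CacheDataMG& mg_xs_cache() { return mg_xs_cache_; }

private:
  CacheDataMG mg_xs_cache_;
};
```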
Interfaces
Previously, the `MGXS::get_xs()` function that utilized caches had an interface with no index arguments; this PR adds two integer arguments for the temperature index and the angle index (an approximate before/after is sketched below).
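The snippets showing the two interfaces did not survive here, so the following is paraphrased from the surrounding description; the argument lists are illustrative, not the exact OpenMC declarations:

```cpp
// Approximate before/after of Mgxs::get_xs() (paraphrased; the argument
// lists are illustrative, not the exact OpenMC declarations).
enum class MgxsType : int; // selects the reaction type (total, fission, ...)

// Before: the temperature and angle indices came from the internal
// per-thread cache, so the caller never saw them.
double get_xs(MgxsType xstype, int gin, const int* gout, const double* mu,
  const int* dg);

// After: the caller supplies the cached temperature index t and angle
// index a, read from (and kept up to date in) the particle-owned cache.
double get_xs(MgxsType xstype, int gin, const int* gout, const double* mu,
  const int* dg, int t, int a);
```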
This means the caller needs to read/update the particle's cache itself and provide the indices to the `MGXS::get_xs()` function. Thankfully, this is only done in two places:

- the `Mgxs::calculate_xs(Particle& p)` function
- the `score_general_mg()` function

While refactoring the `calculate_xs()` function was straightforward, refactoring the tally scoring routines ended up being more painful, as each score typically makes a few calls to `get_xs()`.
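As a rough illustration of the caller-side pattern, here is a sketch under assumed names and accessors, not the verbatim OpenMC code:

```cpp
// Sketch of the caller-side pattern: refresh the particle-owned cache
// only when it is stale, then pass the cached indices to every lookup.
// All names below are assumptions for illustration.
struct CacheDataMG {
  int material {-1};
  double sqrtkT {-1.0};
  int index_temp {-1};
  int index_angle {-1};
};

// Assumed helpers standing in for the temperature/angle searches and
// for Mgxs::get_xs() with the new (t, a) arguments.
int find_temperature_index(double sqrtkT);
int find_angle_index(double mu);
double get_xs_total(int gin, int t, int a);

double calculate_xs(CacheDataMG& cache, int material, double sqrtkT,
  double mu, int gin)
{
  // The cache is keyed on the most recent material (and temperature):
  // re-entering a previously visited material after an A -> B -> A walk
  // pays for a fresh index search.
  if (cache.material != material || cache.sqrtkT != sqrtkT) {
    cache.index_temp = find_temperature_index(sqrtkT);
    cache.index_angle = find_angle_index(mu);
    cache.material = material;
    cache.sqrtkT = sqrtkT;
  }
  // Each score/lookup now receives the indices explicitly.
  return get_xs_total(gin, cache.index_temp, cache.index_angle);
}
```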
Correctness
I re-used the multi-group multi-temperature pincell from #2590 and ran the same tests on the new branch. As shown in the table below, the parallel event-based mode now gives the correct answer (i.e., the same as running with a single thread in history-based mode). Therefore, this PR closes #2590.
Performance Analysis
I tested the performance of this PR against develop on the C5G7 benchmark. I also tested a version of the code with no MGXS cache at all (i.e., the angle and temperature indices are looked up on the fly each time they are needed). Given that the C5G7 MGXS data is isotropic and isothermal, I expected this to be fast, but it turned out to carry a measurable performance cost.
I ran each configuration 10 times to get a sense of the variance in performance timings. I used two test systems: a dual-socket Xeon 8180M node with 56 cores/112 threads total, and a dual-socket Xeon 6336Y node with 48 cores/96 threads total. The 8180M node is the only node so far that has shown the poor thread scaling, whereas the 6336Y node behaves well (as most architectures I've tested do on this problem).
2D C5G7 - 5 inactive - 5 active - 100k particles/batch - gcc 12.2.0 - 2x 8180M (56c/112t)
2D C5G7 - 5 inactive - 5 active - 100k particles/batch - gcc 12.2.0 - 2x 6336Y (48c/96t)
Basically, converting the thread-owned XS cache to a particle-owned one does not really change the performance picture. Moving the cache to the particle gives a minor speedup on the troublesome 8180M system, but also shows a minor slowdown on the 6336Y node. Removing the cache completely was also not a great idea, as it performed much more slowly than the particle-owned cache on both systems. Thus, #2160 will need to remain open, as this PR did not really address whatever the underlying cause of poor thread scaling on the 8180M is.
Other Potential Areas of Memory Contention to Investigate on the 8180M node
There are a few other areas that may have been causing memory contention:

- `Particle` initialization (due to calls to `new` or `malloc` in its associated constructors)

To test whether these were the root causes, I implemented algorithms to remove all contention from these areas, but did not observe any additional speedups. Analysis with `perf` and `vtune` was somewhat confounding: they may not have been running correctly on this system due to cluster permission issues, as they identified seemingly odd locations as implausibly expensive. Manually adding OpenMP timers in history-based mode (and examining the performance timers in event-based mode) revealed that essentially all major areas of the code ran slowly when many threads were active. I.e., the slowdown did not appear to be limited to a particular location; rather, the entire program seemed to be running slowly.

I also experimented with using `numactl` to manually manage where allocations happen and to set thread-affinity policies. While this had marginal benefit, it did not explain the overall poor thread scaling.

Overall, my guess is that there may be a system configuration issue with the 8180M node that is affecting us for some reason, or it may simply be a quirk of the Skylake cache architecture that has been remedied in newer architectures (as 2017's 8180M is now 6 years old). As discussed in #2160, the other dual-socket CPUs I've tested have shown good thread scaling even before this PR, so right now it doesn't seem worth worrying about further. However, if we see similar thread scaling issues on newer chips, I plan to commit more time to solving this problem. For now, I consider this a partial remedy for #2160 and will leave that issue open.
Checklist