Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Commented out false sharing debug counters #436

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ryandmaclean
Copy link

@ryandmaclean ryandmaclean commented Oct 3, 2017

These changes fix the false sharing of a debug counter in the triangle intersection test of the ray tracer (which is heavily used, particularly for ambient lighting, in vrad.exe). Note that the debug counter doesn't work correctly regardless as the memory operations are non-atomic, thus it's completely useless.

(Note: For testing purposes, max thread count was increased to 32 on both branches, but not included in this pull request. I'd request that would also be done. MAX_THREADS and MAX_TOOLS_THREADS need to be increased from '16' to '32'.)

As threads cross CCX boundaries, cache coherency traffic from false sharing becomes a major bottleneck and overall execution times slow down. Similar effects are seen on dual socket Xeon configurations. On lower core count processors without a split cache hierarchy, only diminishing returns are seen with increased thread counts (thus poor but not negative scaling). This largely solves the scaling problem and even makes multi-socket configurations viable for vrad.

Below: Testing on a Ryzen Threadripper (16 cores, 32 threads) with map "nmo_cleopas" in No More Room in Hell. Build options: -threads 32 -both -final -StaticPropPolys

vrad_thread_scaling

Note: In all of the above tests, affinity masks are set in Process Lasso to simulate the behavior of different CPU core topologies. '8 threads' test is masked to 1 CCX (simulating a typical quad core w/ SMT), '16 threads' test is masked to 2 Ryzen CCX (simulating a Ryzen 7 1800x), 32 threads map to all 4 CCX's (representing a 16 core / 32 thread Threadripper 1950x)

  • One user reports a performance improvement of ~60% on a 6 core / 6 thread AMD FX-6300.

  • Another user testing this fix reports a performance improvement of ~38% on an i7-7700k on their own test map (outdoor maps generally see the most benefit due to large number of ambient light sampling).

Thus, even on lower core count configurations or CPU configurations with poor cache performance (like AMD FX), performance improvement can be significant, especially when using -final in vrad compile options. Ryzen CPUs benefit the most due to their unusual cache topology.

EDIT: I realized that my title is a bit of a misnomer, the threads aren't really 'false sharing', it's just 'sharing' (with a race condition), but the effect is identical and unintended (i'd imagine), so very much in the same vein.

EDIT 2: Perf improvement on fx-6300 was initially over-stated (I misread the results somehow). I expect the speedup is higher on an FX-8350, however. Note that if you really crank up -extrasky you can approach near-linear scaling as BuildFaceLights takes up a larger fraction of the total execution time, and since BuildFaceLights seems to scale almost linearly with core count now. (assuming sufficiently large scene such that the total work item count is high enough to keep all cores saturated)

@SharpOB
Copy link

SharpOB commented Oct 3, 2017

Nicely done. Given Valve's history with the repo (year since last closed PR), I doubt it'll be added, but nothing is stopping anyone from using the code regardless.

SCell555 added a commit to SCell555/DownFall that referenced this pull request Oct 4, 2017
@TravisWehrman
Copy link

Excellent find! I'm curious how you discovered this, was this revealed indirectly by a profiler?

@ryandmaclean
Copy link
Author

TravisWehrman commented 23 minutes ago
Excellent find! I'm curious how you discovered this, was this revealed indirectly by a profiler?

Yes, though I did a lot of testing to isolate it. :)

TheEMP added a commit to TheEMP/source-sdk-2013 that referenced this pull request Oct 12, 2017
As seen in this merge request
ValveSoftware#436 can improve
the vrad compile time by a massive margin.
@prototype5885
Copy link

could you upload the compiled vrad_dll.dll for source 2013?

@ryandmaclean
Copy link
Author

could you upload the compiled vrad_dll.dll for source 2013?

Sure! This one is compiled for sdk 2013 mp (I'm not sure if there's a difference between the mp and sp vrads? Haven't really checked). I've also made patched versions for CS:Go and CS:S.

Note: Due to an issue with the sdk 2013 vrad launcher failing to load the new dll (something to do with valve's dll loader?), I've also included a launcher that points to the 'vrad_dll-optimized.dll'. You'll need to point to the new vrad_optimized.exe in your vrad build tools path in the Hammer config to use it.

www.content.tophattwaffle.com/Content/modded_vrad_dll.zip

(Download hosted by TopHattWaffle, he has a lot of good content for level design tutorials etc. as well)

@ghost
Copy link

ghost commented Dec 12, 2017

Good job!

@ghost
Copy link

ghost commented Apr 12, 2018

After a recent CSGO update(2018/03/29),the vrad needs to update.

@ryandmaclean
Copy link
Author

Hey colorsxy, Valve included my fix in a patch about a month after I made the initial pull request, so any manually patched vrad dll's I may have released within that period are no longer relevant to cs: go map makers.

TotallyMehis added a commit to zm-reborn/zmr-game that referenced this pull request May 13, 2018
@ac1996
Copy link

ac1996 commented Aug 16, 2018

Hi, thank you so much for your work. Can you please compile fixed VRAD for SDK 2013 Single Player.
When compile map with VRAD_OPTIMIZED in SP - VRAD log say: "Cannot load the static props... encountered a stale map version. Re-vbsp the map."

If set VBSP, VVIS from MP and VRAD_OPTIMIZED -> COMPILE SUCCESSFUL, but hl2.exe application closes, but does not show an error.

I think it would be useful not only for me.
And once again, thank you for the work done - you made our life easier.

"After three hourse"

vrad,vbsp-threat-fix.zip

I change raytrace.cpp, released compile - vrad.exe,vrad_dll.dll, vbsp.exe and now it's work fine for SDK 2013 Single Player (SP).

P.S. Thank you. (AMD FX 8320, Compiling speed increased two or even three times)

@ellemti
Copy link

ellemti commented Dec 20, 2018

is there is just cs-go version?
if not is possible to make gmod/garry's mod version

@Kefta
Copy link

Kefta commented Dec 20, 2018

@ellemti This change was already merged into GMod.

@ellemti
Copy link

ellemti commented Dec 20, 2018

@Kefta i fell like compile are still longer lol

@cblpop
Copy link

cblpop commented Jul 14, 2019

My friend, maybe you have information, is this update hit SFM's vrad ???

untodesu pushed a commit to untodesu/refraction that referenced this pull request May 18, 2020
TotallyMehis added a commit to zm-reborn/zmr-game that referenced this pull request Oct 2, 2020
untodesu pushed a commit to untodesu/refraction that referenced this pull request Dec 22, 2020
wolfestridershooter added a commit to wolfestridershooter/source-sdk-2013-ce that referenced this pull request Nov 3, 2023
tschumann added a commit to tschumann/basis-source that referenced this pull request Jan 7, 2024
@jmcpit
Copy link

jmcpit commented Apr 16, 2024

Is this optimized vrad file also available for HL2DM?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants