Make async runtime scale better on SMT machines #850

Minoru · 2021-05-30T19:08:53Z

#844 added a new async runtime, but on SMT machines (eg. Intel's HyperThreading), it doesn't scale too well past the number of cores. Details are in #844 (comment), and there are some ideas further down the thread.

For now, the workaround is to use +RTS -Nx to limit the number of threads to the number of cores.

The text was updated successfully, but these errors were encountered:

fishtreesugar · 2021-05-31T13:09:21Z

According to the benchmark from scheduler, unliftio's pooledMapConcurrently's quite better than async's mapConcurrently, maybe we could give it a try

Minoru · 2021-05-31T20:49:54Z

Thanks for the pointer! That benchmark focuses on speed, not scalability, so I'm sceptical that it'll make any different. I don't have energy to invest into this right now, but if you do, please try and report back the results!

vaibhavsagar · 2021-07-18T09:41:34Z

I investigated using pooledMapConcurrently with this patch on top of #863 and it didn't seem to improve the SMT scaling (my laptop uses an Intel i7-8550U with 4 cores and 8 threads):

[nix-shell:~/code/website]$ hyperfine --parameter-scan threads 1 10 --prepare './result/bin/site clean' './result/bin/site build +RTS -N{threads}' 
Benchmark #1: ./result/bin/site build +RTS -N1
  Time (mean ± σ):      3.409 s ±  0.074 s    [User: 3.279 s, System: 0.138 s]
  Range (min … max):    3.295 s …  3.525 s    10 runs
 
Benchmark #2: ./result/bin/site build +RTS -N2
  Time (mean ± σ):      2.196 s ±  0.053 s    [User: 3.799 s, System: 0.362 s]
  Range (min … max):    2.113 s …  2.265 s    10 runs
 
Benchmark #3: ./result/bin/site build +RTS -N3
  Time (mean ± σ):      1.886 s ±  0.060 s    [User: 4.571 s, System: 0.583 s]
  Range (min … max):    1.790 s …  1.963 s    10 runs
 
Benchmark #4: ./result/bin/site build +RTS -N4
  Time (mean ± σ):      1.885 s ±  0.049 s    [User: 5.487 s, System: 0.897 s]
  Range (min … max):    1.833 s …  1.976 s    10 runs
 
Benchmark #5: ./result/bin/site build +RTS -N5
  Time (mean ± σ):      2.164 s ±  0.098 s    [User: 7.585 s, System: 1.639 s]
  Range (min … max):    2.014 s …  2.294 s    10 runs
 
Benchmark #6: ./result/bin/site build +RTS -N6
  Time (mean ± σ):      2.348 s ±  0.096 s    [User: 9.346 s, System: 2.417 s]
  Range (min … max):    2.174 s …  2.506 s    10 runs
 
Benchmark #7: ./result/bin/site build +RTS -N7
  Time (mean ± σ):      2.487 s ±  0.047 s    [User: 11.058 s, System: 3.255 s]
  Range (min … max):    2.414 s …  2.536 s    10 runs
 
Benchmark #8: ./result/bin/site build +RTS -N8
  Time (mean ± σ):      2.746 s ±  0.132 s    [User: 13.610 s, System: 4.138 s]
  Range (min … max):    2.565 s …  3.064 s    10 runs
 
Benchmark #9: ./result/bin/site build +RTS -N9
  Time (mean ± σ):      3.251 s ±  0.138 s    [User: 16.528 s, System: 4.896 s]
  Range (min … max):    3.097 s …  3.506 s    10 runs
 
Benchmark #10: ./result/bin/site build +RTS -N10
  Time (mean ± σ):      3.668 s ±  0.240 s    [User: 19.285 s, System: 5.075 s]
  Range (min … max):    3.385 s …  4.166 s    10 runs
 
Summary
  './result/bin/site build +RTS -N4' ran
    1.00 ± 0.04 times faster than './result/bin/site build +RTS -N3'
    1.15 ± 0.06 times faster than './result/bin/site build +RTS -N5'
    1.17 ± 0.04 times faster than './result/bin/site build +RTS -N2'
    1.25 ± 0.06 times faster than './result/bin/site build +RTS -N6'
    1.32 ± 0.04 times faster than './result/bin/site build +RTS -N7'
    1.46 ± 0.08 times faster than './result/bin/site build +RTS -N8'
    1.73 ± 0.09 times faster than './result/bin/site build +RTS -N9'
    1.81 ± 0.06 times faster than './result/bin/site build +RTS -N1'
    1.95 ± 0.14 times faster than './result/bin/site build +RTS -N10'

frasertweedale · 2021-09-07T01:59:34Z

I am also experiencing this scaling issue. The increased userland CPU time when using more
capabilities is strange. I would expect to see small (certainly sublinear) increases in CPU time
for additional capabilities. Instead, per @vaibhavsagar's benchmark above, the CPU time
appears to have superlinear growth and quickly overwhelms the advantage gained by parallel
execution.

Profiling didn't reveal anything interesting, the profiles looking overwhelmingly similar for
different numbers of capabilities, apart from total time.

I'm beginning to wonder if this issue might be in the GHC RTS, rather than Hakyll...

vaibhavsagar · 2021-09-07T05:42:12Z

@frasertweedale I did some investigation with ThreadScope afterwards that wasn't especially insightful, which is why I didn't mention it here, but it did show that some of the overhead was GC-related. When I minimised garbage collection using some of the suggestions here the observed performance did seem to scale better. I'm a relative novice when it comes to parallel Haskell so it's entirely possible that there's something simple that I'm missing.

frasertweedale · 2021-09-07T06:07:29Z

@vaibhavsagar thanks for the additional info. It is always helpful to mention the dead ends in the investigation. That way, people will know it has been done, and won't waste their time doing the same thing :)

frasertweedale · 2021-09-11T06:33:10Z

When using multiple capabilities, on GHC 8.8 I get the best results with +RTS -N -qg1, which disables parallel GC for the first generation. On my site this achieves ~70% productivity compared to using compared to ~40% for the default (parallel GC for all generations).

There must be something about Hakyll's design that makes parallel GC particularly inefficient. When actually using multiple capabilities there was an improvement in wall time GCing the second generation, although productivity still decreases considerably. For the first generation, the parallel GC performance is quite terrible.

I'd be interested to see how GHC 8.10+'s --nonmoving-gc RTS option performs, but it cannot be used for the first generation.

I'm suspending my investigation at this point. Single-threaded performance is good enough for me and even with -qg1 I gain little advantage from using multiple capabilities. I've only done these measurements on my Hakyll blog site. YMMV.

gwern · 2022-05-03T23:05:06Z

FWIW, I ran into severe performance problems apparently related to these changes when I recently upgraded Hakyll after a while. My writeup: https://groups.google.com/g/hakyll/c/5_evK9wCb7M/m/3oQYlX9PAAAJ

jaspervdj · 2022-05-20T07:48:36Z

I would like to look into this during ZuriHac 2022, I'm not sure if I'll have time before that. My current suspicion is that the combination of an MVar (I'm pretty sure an IORef + strict atomicModifyIORef would be enough) together with the use of big maps in this value are contributing towards this, but I haven't tested anything out yet.

Minoru · 2022-05-20T19:43:51Z

@jaspervdj #903 is much more pressing, if you're in the mood to dig into hard issues :) Sadly I didn't have enough energy to do that, even though I promised. It looks like the Store is not suitable for multithreaded use (which is not entirely surprising), and causes some random problems to us. I've been dragging my feet, but it looks like the async runtime has to be yanked altogether while we're looking for a fix to the Store and investigate Gwern's report.

jaspervdj · 2022-05-20T21:13:23Z

Yeah, I wonder if we should just roll back the concurrent runtime for now given these issues. Is the slight speedup for some sites worth the overhead for others? I’m not sure.

Doing a concurrent runtime still seems doable and worthwhile and I think we can get it with minimal overhead but it just requires a bit more investigation to update or remove some existing abstractions like Store.

jaspervdj · 2022-08-04T12:19:37Z

I have an implementation in https://github.com/jaspervdj/hakyll/tree/async-scheduler which is a bit rough but should generally work and allow us to scale much better. A few things like error handling and checking for cyclic deps still need to be improved though.

vaibhavsagar · 2023-08-26T10:17:39Z

Does #946 resolve this issue?

Minoru · 2023-08-26T14:51:37Z

@vaibhavsagar Not really, see the benchmark results here: #946 (review)

gwern · 2024-05-08T01:54:31Z

FWIW, I ran into severe performance problems

Update: I've been using a fork all this time, as mentioned, and so haven't seen any effect of the new scheduler. My Threadripper workstation has died, so I can no longer test high core counts. I've been restarting on a Ubuntu 24 laptop with just 8 virtual cores (4 real, IIRC), and running with 5-7 threads has not shown any major issues with the 4.14.0.0 HEAD (GHC 9.4.7).

Minoru added the bug label May 30, 2021

Minoru mentioned this issue May 30, 2021

Async runtime with graph-based dependency cycle checks #844

Merged

Minoru mentioned this issue Jul 17, 2021

Hakyll.Core.Runtime: use MVar instead of TVar #863

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make async runtime scale better on SMT machines #850

Make async runtime scale better on SMT machines #850

Minoru commented May 30, 2021

fishtreesugar commented May 31, 2021

Minoru commented May 31, 2021

vaibhavsagar commented Jul 18, 2021

frasertweedale commented Sep 7, 2021

vaibhavsagar commented Sep 7, 2021

frasertweedale commented Sep 7, 2021

frasertweedale commented Sep 11, 2021

gwern commented May 3, 2022

jaspervdj commented May 20, 2022

Minoru commented May 20, 2022

jaspervdj commented May 20, 2022

jaspervdj commented Aug 4, 2022

vaibhavsagar commented Aug 26, 2023

Minoru commented Aug 26, 2023

gwern commented May 8, 2024

Make async runtime scale better on SMT machines #850

Make async runtime scale better on SMT machines #850

Comments

Minoru commented May 30, 2021

fishtreesugar commented May 31, 2021

Minoru commented May 31, 2021

vaibhavsagar commented Jul 18, 2021

frasertweedale commented Sep 7, 2021

vaibhavsagar commented Sep 7, 2021

frasertweedale commented Sep 7, 2021

frasertweedale commented Sep 11, 2021

gwern commented May 3, 2022

jaspervdj commented May 20, 2022

Minoru commented May 20, 2022

jaspervdj commented May 20, 2022

jaspervdj commented Aug 4, 2022

vaibhavsagar commented Aug 26, 2023

Minoru commented Aug 26, 2023

gwern commented May 8, 2024