store forwarding does not have fixed latency #81

Open

amonakov opened this issue Jul 27, 2020 · 11 comments

@amonakov

You note the effect for Skylake on the wiki ("Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve this"), but I'm seeing a similar effect on earlier CPUs too; tested with Sandybridge and Ivybridge so far.

loop:
mov [rsp], edi
dec ecx
mov edi, [rsp]
jnz loop

runs at 6.3 cyc/iter

loop: 
mov edi, [rsp] 
dec ecx 
mov [rsp], edi 
jnz loop

runs at 5 cyc/iter

loop:
mov [rsp], edi
dec ecx
nop
nop
nop
nop
nop
nop
mov edi, [rsp]
jnz loop

runs at 5 cyc/iter

I'm also seeing stores apparently being re-issued a lot. For the second variant:

        5003448550      cycles:u                                                    
        4000194964      instructions:u            #    0.80  insn per cycle         
         666535816       uops_dispatched_port.port_0:u                                   
         333645348       uops_dispatched_port.port_1:u                                   
         833721986       uops_dispatched_port.port_2:u                                   
        1166399269       uops_dispatched_port.port_3:u                                   
        4976661537       uops_dispatched_port.port_4:u                                   
        1000143676       uops_dispatched_port.port_5:u                                   
        8977107631      uops_dispatched.core:u                                      
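
(For reproduction, counters like these can be collected with a perf invocation along the following lines; the ./loop binary name is a placeholder:)

perf stat -e cycles:u,instructions:u,uops_dispatched_port.port_0:u,uops_dispatched_port.port_1:u,uops_dispatched_port.port_2:u,uops_dispatched_port.port_3:u,uops_dispatched_port.port_4:u,uops_dispatched_port.port_5:u,uops_dispatched.core:u ./loop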

What are your thoughts on this? Have you seen instruction replay explained somewhere? Did you see variable forwarding latency mentioned anywhere?

@amonakov (Author)

Oh, and

loop: 
mov edi, [rsp] 
dec ecx 
mov [rsp], edi 
nop 
nop 
nop 
nop 
nop 
nop 
nop 
nop 
nop 
nop 
nop 
nop 
jnz loop 

is even faster: runs at 4.1 cyc/iter on Sandybridge!

@amonakov (Author)

It's also possible to have two interleaved chains, each forwarded in exactly 4 cycles on Sandybridge, though I didn't manage to add a third one:

loop:
mov eax, [rsp]
mov [rsp], eax 
nop 
nop 
 
nop 
nop 
nop 
nop 
 
nop 
nop 
nop 
nop 
 
dec ebp 
mov ebx, [rsp+4] 
mov [rsp+4], ebx 
jnz loop

@amonakov (Author)

Apparently this got improved in Ivybridge, as not just three, but four interleaved 4-cycle chains are easily achieved:

loop:
mov eax, [rsp]
mov [rsp], eax
nop
nop

mov edx, [rsp+12]
mov [rsp+12], edx
nop
nop

mov ecx, [rsp+8]
mov [rsp+8], ecx
nop
nop

dec ebp
mov ebx, [rsp+4]
mov [rsp+4], ebx
jnz loop

@amonakov (Author)

Ooh, and I can stick an ALU op in the dependency chain without increasing overall latency, making best-case forwarding latency 3 cycles on SNB and IVB:

loop:
mov eax, [rsp]
not eax
mov [rsp], eax
nop

nop
nop
nop
nop

mov ebx, [rsp+4]
not ebx
mov [rsp+4], ebx
nop

dec ebp
nop
nop
jnz loop

@travisdowns (Owner) commented Jul 28, 2020

From Twitter:

> Yeah, that's my point. If you look at a load-store-load-store... chain, counters show that, as loads are not replayed, the "replay horizon" is just the single store insn, and yet instead of a p4 = 2×iters counter value we see p4 = 5×iters.

Right, well, when I talk about replay and horizons it is only really clear how it works (and even then not really that clear) for register dependencies. So this applies to load-load or load-ALU chains, etc., and there things mostly make sense. It also applies to load-store (e.g., when the store data comes from the load, via a register).

It doesn't, however, apply to store-load, since that dependency is through memory and there is some "other" mechanism to handle it. So I think a load-store-load-store chain (with the dependency through the load and store data, as in your examples) is quite different. From a reg-reg scheduling-dependency PoV I would see it as many decoupled load-store pairs, with some "other" mechanism (described in part here) handling the store-load part.

So at least to me it is not surprising there are no load replays here.

Of course, this doesn't really explain why there are so many store replays. But a long replay chain involving multiple stores (which I think you are suggesting does not occur) is not the only possibility: the same store could be replayed multiple times. E.g., maybe in your second example the store is dispatched 2 or 3 cycles after the load (best-case latency?), then again every cycle thereafter when it finds that it didn't get its data. Or maybe multiple stores are awoken by one load, because of a limitation in the way waiting uops are tracked in the "load matrix" which holds uops waiting on loads.
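
(As a rough check of the per-store replay idea against the counters above: the second variant's loop is 4 instructions, so ~4.0e9 instructions:u is ~1.0e9 iterations; p4 at ~4.98e9 is then ~5 store-uop dispatches per iteration, i.e. roughly one per cycle of the 5 cyc/iter loop, consistent with the single store being replayed every cycle until the forwarded data arrives.)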

@travisdowns (Owner)

> Ooh, and I can stick an ALU op in the dependency chain without increasing overall latency, making best-case forwarding latency 3 cycles on SNB and IVB:

So this loop runs in 4 cycles per iteration?

@amonakov (Author)

> So this loop runs in 4 cycles per iteration?

Amazingly, yes!

@travisdowns (Owner)

> Amazingly, yes!

Yeah, I think that indicates best-case latency is 3 cycles on those chips too, better than I had thought. Not that I ever tested it: it's just that SKL advertised improvements in store-forwarding latency, but maybe that was in the "other" scenario where the load arrives too early (I generally measure 4.5 cycles in that case).

FWIW, maybe you can create a 3-cycle loop without the ALU op if you space out the load from the earlier store using a dependency chain rather than nops, since a chain removes a lot of variables about what executes when. That's how I did all my testing.
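
For illustration, a minimal sketch of that idea (my guess at the general shape of such a test, not actual uarch-bench code): dependent adds on the address register form a loop-carried chain that paces when each load's address becomes ready, here by 3 cycles:

loop:
mov [rsp], rdi   ; store (data chain through rdi)
add rsp, 0       ; three dependent 1-cycle adds: a loop-carried
add rsp, 0       ; 3-cycle chain on the address register, leaving
add rsp, 0       ; rsp unchanged but delaying the load's address
mov rdi, [rsp]   ; load that forwards from the store
dec ecx
jnz loop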

@travisdowns (Owner)

You can check out, e.g., the fwd_lat_delay_0 and fwd_lat_delay_1 tests in uarch-bench for an example.

BTW, this effect is a cause of several "I added extra code and things sped up" paradoxes that I looked into. The extra code (say, an extra function call) changes the load from arriving "too early" (slower forwarding) to "just right" (data is available, fast forwarding). For example, it explained this semi-famous case, which is where I originally started looking into this. It explains several similar questions on StackOverflow.

@amonakov (Author)

Indeed, this runs at 3 cycles per iteration too. Perfection.

loop:
mov [rsp], rdi
imul rsp, 1
mov rdi, [rsp]
dec ecx
jnz loop
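
(Note: imul rsp, 1 is an identity multiply, so rsp is unchanged; but imul's 3-cycle latency makes each load's address ready 3 cycles into the loop-carried chain, which evidently lands the load exactly when the forwarded data is available.)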

@travisdowns (Owner) commented Jul 28, 2020

BTW, if you want to run the basic store forwarding tests in uarch bench, they are available at:

./uarch-bench.sh --test-name=memory/store-fwd/*

Which gives results on my Skylake like:

** Running group memory/store-fwd : Store forwaring latency and throughput **
                               Benchmark    Cycles     Nanos
           Store forward latency delay 0      4.47      1.73
           Store forward latency delay 1      4.48      1.73
           Store forward latency delay 2      4.48      1.73
           Store forward latency delay 3      3.00      1.16
           Store forward latency delay 4      3.99      1.54
           Store forward latency delay 5      4.99      1.93
            Store fwd tput concurrency 1      4.47      1.72
            Store fwd tput concurrency 2      2.32      0.90
            Store fwd tput concurrency 3      1.62      0.63
            Store fwd tput concurrency 4      1.00      0.39
            Store fwd tput concurrency 5      1.00      0.39
            Store fwd tput concurrency 6      1.00      0.39
            Store fwd tput concurrency 7      1.00      0.39
            Store fwd tput concurrency 8      1.00      0.39
            Store fwd tput concurrency 9      1.00      0.39
           Store fwd tput concurrency 10      1.00      0.39

The "delay" is how many cycles of delay between the store and the load that will read that store, enforced using dependency chain to the load address to slow it down relative to the store. The magic number of 3 is clear here, as is the + "0.5" behavior of the shorter delays. The concurrency test is running multiple chains as you did, showing that there is no real throughput limit to store forwarding: you can get to 1 store-load per cycle once you have 4 chains.

Here are the same results with p2347 ops shown, run using:

./uarch-bench.sh --test-name=memory/store-fwd/* --timer=perf --extra-events=uops_dispatched_port.port_2#p2,uops_dispatched_port.port_3#p3,uops_dispatched_port.port_4#p4,uops_dispatched_port.port_7#p7
** Running group memory/store-fwd : Store forwaring latency and throughput **
                               Benchmark    Cycles        p2        p3        p4        p7
           Store forward latency delay 0      4.47      0.75      0.77      4.27      0.48
           Store forward latency delay 1      4.50      0.64      0.71      4.30      0.65
           Store forward latency delay 2      4.48      0.65      0.78      4.30      0.57
           Store forward latency delay 3      3.00      0.68      0.64      1.00      0.68
           Store forward latency delay 4      4.00      0.72      0.72      1.00      0.57
           Store forward latency delay 5      5.00      0.76      0.77      1.00      0.47
            Store fwd tput concurrency 1      4.48      0.75      0.78      4.27      0.48
            Store fwd tput concurrency 2      2.32      0.74      0.74      2.25      0.52
            Store fwd tput concurrency 3      1.63      0.76      0.72      1.58      0.53
            Store fwd tput concurrency 4      1.00      0.73      0.71      1.00      0.56
            Store fwd tput concurrency 5      1.00      0.74      0.72      1.00      0.54
            Store fwd tput concurrency 6      1.00      0.71      0.73      1.00      0.56
            Store fwd tput concurrency 7      1.00      0.71      0.73      1.00      0.56
            Store fwd tput concurrency 8      1.00      0.72      0.73      1.00      0.54
            Store fwd tput concurrency 9      1.00      0.72      0.74      1.00      0.54
           Store fwd tput concurrency 10      1.00      0.73      0.74      1.00      0.53

You can clearly see the extra p4 uops, almost 1 per cycle. They disappear at delay 3, as soon as you switch into the "other" mode. It would also be interesting to set up some benchmarks where the store address is dependent on the previous store, instead of the data. I did play around with that at one point, and recall the performance was much worse: I guess it's a less-optimized case and harder to handle, because the hardware is not even sure whether forwarding is going to happen until the address arrives (though the disambiguation predictors should help).
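
For the record, a hypothetical shape for that experiment (untested; assumes the location at [rsp] holds 0 so the address chain stays in place): the loaded value feeds the next store's address instead of its data:

loop:
mov rax, [rsp]       ; load forwards from the previous iteration's store
mov [rsp+rax], rax   ; store whose *address* depends on the loaded value
dec ecx
jnz loop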
