Intel Performance Quirks


Weird performance quirks discovered (not necessarily by me) on modern Intel chips. We don't cover things already mentioned in Agner's guides or the Intel optimization manual.

When multiple L1-miss loads hit the same L2 line, loads other than the first have a longer latency of 19 cycles vs 12

It seems to happen only if the two loads issue in the same cycle: if the loads issue in different cycles, the penalty appears to drop to 1 or 2 cycles. The effect goes back (probably) to Sandy Bridge, but not Nehalem (which can't issue two loads per cycle).

Details at RWT.
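For concreteness, here's a minimal C sketch of the kind of access pattern involved (the struct layout, names, and buffer sizing are illustrative, not taken from the RWT thread): a pointer chase through L2-resident lines where each step also loads a second value from the same line, so the two loads can issue in the same cycle.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE 64

/* One node per cache line: the chase pointer and a payload share a
   line, so the two loads per iteration target the same L2 line. */
typedef struct node {
    struct node *next;
    uint64_t payload;
    char pad[LINE - sizeof(struct node *) - sizeof(uint64_t)];
} node;

/* Assumes the nodes form a random cyclic permutation over a buffer
   sized to miss L1 but fit L2 (e.g. ~128 KiB; setup not shown).
   Both loads miss L1 and hit the same L2 line; when they issue in
   the same cycle, the second reportedly sees ~19 cycles vs ~12. */
uint64_t chase(node *p, size_t iters) {
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; i++) {
        node *next = p->next;   /* first load to the line (the chase) */
        sum += p->payload;      /* second, independent load, same line */
        p = next;
    }
    return sum;
}
```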

adc with a zero immediate, i.e., adc reg, 0, is twice as fast as with any other immediate or register source on Haswell-ish machines

Normally adc is 2 uops to p0156 and 2 cycles of latency, but in the special case where the immediate zero is used, it takes only 1 uop and 1 cycle of latency on Haswell machines. This is a pretty important optimization for adc, since adc reg, 0 is a common pattern used to accumulate the results of comparisons and other branchless techniques. Presumably the same optimization applies to sbb reg, 0. On Broadwell and beyond, adc is always a single uop with 1 cycle of latency, regardless of the immediate or register source.

Discussion at the bottom of this SO answer and in the comments. The test in uarch-bench can be run with --test-name=misc/adc*.
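As an illustration of the accumulate-a-comparison pattern mentioned above, here's a sketch using GCC/Clang inline asm (the function name and overall shape are made up for this example): the compare sets CF, and adc reg, 0 folds it into the running count.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless "count elements below a threshold" built on the
   cmp + adc reg, 0 idiom. */
uint64_t count_below(const uint32_t *a, size_t n, uint32_t thresh) {
    uint64_t count = 0;
    for (size_t i = 0; i < n; i++) {
        asm("cmp %[t], %[v]\n\t"  /* sets CF iff a[i] < thresh (unsigned) */
            "adc $0, %[c]"        /* count += CF: 1 uop on Haswell only
                                     because the immediate is zero */
            : [c] "+r"(count)
            : [v] "r"(a[i]), [t] "r"(thresh)
            : "cc");
    }
    return count;
}
```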

Minimum store-forwarding latency is 3 cycles on new(ish) chips, but the load has to arrive at exactly the right time to achieve it

In particular, if the load address is ready as soon as the store address is, or at most 1 or 2 cycles later (you have to look at the dependency chains leading into each address to determine this), you'll get a variable store-forwarding delay of 4 or 5 cycles (seemingly a 50% chance of each, for an average of 4.5 cycles). If the load address becomes available exactly 3 cycles later, you'll get the faster 3-cycle store forwarding (i.e., the load isn't delayed any further at all). This can lead to weird effects, like adding extra work speeding up a loop or call/ret sequence, because the extra work delays the load just enough to hit the "ideal" store-forwarding latency.

Initial discussion here.
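Here's a rough sketch of how you might poke at this from C, assuming GCC/Clang on x86-64 (the helper name and the asm sequence are our own illustration, not the original experiment): the opaque asm keeps the load address truly dependent on the loop-carried value, and each extra add shifts the load's address readiness by one cycle, which is enough to flip between the ~4-5 cycle and the fast 3-cycle forwarding paths.

```c
#include <stdint.h>

/* Returns 0, but via a chain the compiler can't fold away: `and $0`
   zeroes idx while keeping a true hardware dependency on x (unlike
   xor reg, reg), and each `add $0` adds one ALU cycle of delay. */
static inline uint64_t delay_addr(uint64_t x) {
    uint64_t idx = x;
    asm("and $0, %0\n\t"
        "add $0, %0\n\t"   /* tune the number of these to move the  */
        "add $0, %0"       /* load address earlier or later          */
        : "+r"(idx));
    return idx;
}

/* Loop-carried store -> forwarded load -> add chain. */
uint64_t forward_chain(uint64_t iters) {
    uint64_t buf[1] = {0};
    uint64_t x = 1;
    for (uint64_t i = 0; i < iters; i++) {
        buf[0] = x;                    /* store */
        x = buf[delay_addr(x)] + 1;    /* load forwarded from the store */
    }
    return x;
}
```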

Stores to L1-miss but L2-hit cache lines are unexpectedly slow if interleaved with stores to other lines

When stores that miss in L1 but hit in L2 are interleaved with stores that hit in L1 (for example), the throughput is much lower than one might expect (considering L2 latencies, fill buffer counts, and store buffer prefetching), and it is also weirdly bi-modal: each L2 store (paired with an L1 store) might take 9 or 18 cycles. The effect can be eliminated by prefetching the lines into L1 before the store (or by using a demand load to achieve the same effect), at which point you get the expected performance (similar to what you'd get for independent loads). A corollary is that this only happens with "blind stores", i.e., stores that write a location without first reading it: if the location is read first, the load determines the caching behavior and the problem goes away.

Discussion at RWT and StackOverflow.
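A sketch of the problematic pattern (buffer names and the setup are assumed, not taken from the linked discussions): blind stores to lines that have been evicted to L2, interleaved with stores to an L1-hot line, with the prefetch workaround left commented out.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch */

#define LINE 64

/* Assumes `cold` spans lines that have been pushed out to L2 (but no
   further) and `hot` stays L1-resident; setup not shown. The blind
   stores to the cold lines run at the slow bi-modal ~9/18-cycle rate;
   enabling the prefetch (or reading the line first) brings each line
   into L1 ahead of its store and removes the penalty. */
void interleaved_stores(char *cold, char *hot, size_t nlines) {
    for (size_t i = 0; i < nlines; i++) {
        char *p = cold + i * LINE;
        /* _mm_prefetch(p, _MM_HINT_T0); */   /* the workaround */
        *p     = 1;   /* blind store: misses L1, hits L2 */
        hot[0] = 2;   /* store that hits L1 */
    }
}
```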
