
Intel Performance Quirks


Weird performance quirks discovered (not necessarily by me) on modern Intel chips. We don't cover things already mentioned in Agner's guides or the Intel optimization manual.

When multiple L1-miss loads hit the same L2 line, loads other than the first have a longer latency of 19 cycles vs 12

It seems to happen only if the two loads issue in the same cycle: if the loads issue in different cycles the penalty appears to be reduced to 1 or 2 cycles. Goes back to Sandy Bridge (probably), but not Nehalem (which can't issue two loads per cycle).
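For example (a hypothetical sketch, not the uarch-bench test itself; assume rdi points to a line that is not in L1 but is in L2), both loads below have their addresses ready at the same time and so can issue in the same cycle on the two load ports:

mov rax, [rdi]       ; first load to the line: L1 miss, L2 hit, ~12 cycle latency
mov rcx, [rdi + 8]   ; second load to the same 64-byte line, issued in the same
                     ; cycle: observed latency is ~19 cycles rather than 12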

Benchmarks in uarch-bench can be run with --test-name=studies/memory/l2-doubleload/*.

Details at RWT.

adc with a zero immediate, i.e., adc reg, 0 is twice as fast as with any other immediate or register source on Haswell-ish machines

Normally adc is 2 uops to p0156 and 2 cycles of latency, but in the special case that an immediate zero is used, it takes only 1 uop and 1 cycle of latency on Haswell machines. This is a pretty important optimization for adc since adc reg, 0 is a common pattern used to accumulate the result of comparisons and other branchless techniques. Presumably the same optimization applies to sbb reg, 0. In Broadwell and beyond, adc is usually a single uop with 1 cycle latency, regardless of the immediate or register source.
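For example, a common branchless pattern accumulates the carry flag from a comparison, as in this sketch (the registers and loop structure are made up for illustration: rdi = current element pointer, rsi = end pointer, edx = threshold):

xor ecx, ecx          ; counter of elements below the threshold
.count:
cmp [rdi], edx        ; sets CF if the dword at [rdi] is below edx (unsigned)
adc ecx, 0            ; ecx += CF: 1 uop / 1 cycle on Haswell thanks to the
                      ; zero-immediate special case (2 uops with any other source)
add rdi, 4
cmp rdi, rsi
jb  .count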

Discussion at the bottom of this SO answer and the comments. The test in uarch-bench can be run with --test-name=misc/adc*.

Short form adc and sbb using the accumulator (rax, eax, ax, al) are two uops on Broadwell and Skylake

In Broadwell and Skylake, most uses of adc and sbb take only one uop, versus two on prior generations. However, the "short form" specially encoded versions, which use the rax register (and its sub-registers eax, ax and al), still take 2 uops.

This is likely to occur in practice with any immediate that doesn't fit in a single byte: for smaller immediates, assemblers prefer the even shorter sign-extended imm8 encoding, which is not accumulator-specific and so doesn't suffer the penalty.
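For example (a sketch; the encodings are the standard ones, and the uop counts are those reported in the sources below), an assembler will typically pick the accumulator short form only when the destination is rax/eax/ax/al and the immediate needs more than a sign-extended byte:

adc eax, 1000    ; short form (opcode 15 + imm32): still 2 uops on Broadwell/Skylake
adc ecx, 1000    ; normal form (81 /2 + imm32): 1 uop
adc eax, 1       ; fits in a byte, assembler emits 83 /2 + imm8 instead: 1 uop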

I first saw it mentioned by Andreas Abel in this comment thread.

Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve this

In particular, if the load address is ready to go as soon as the store address is, or at most 1 or 2 cycles later (you have to look at the dependency chains leading into each address to determine this), you'll get a variable store-forwarding delay of 4 or 5 cycles (it seems to be a 50% chance of each, giving an average of 4.5 cycles). If the load address is available exactly 3 cycles later, you'll get the faster 3-cycle store forwarding (i.e., the load won't be further delayed at all). This can lead to weird effects, like adding extra work speeding up a loop or a call/ret sequence, because the extra work delays the load enough to get the "ideal" store-forwarding latency.
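As a sketch of the kind of loop where this shows up (not the code from the linked discussion; just a memory-resident counter of the sort a compiler emits at -O0), the counter round-trips through the store buffer every iteration:

.loop:
mov eax, [rsp - 4]    ; load forwards from the store of the previous iteration
add eax, 1
mov [rsp - 4], eax    ; store the counter back
cmp eax, 100000000
jb  .loop

Here both addresses are always ready, so the load probes as early as possible and sees the slower 4-5 cycle forwarding; inserting unrelated extra work between the store and the next load can delay the load into the 3-cycle window and, counter-intuitively, speed up the loop.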

Initial discussion here.

Stores to a cache line that is an L1-miss but L2-hit are unexpectedly slow if interleaved with stores to other lines

When stores that miss in L1 but hit in L2 are interleaved with stores that hit in L1 (for example), the throughput is much lower than one might expect (considering L2 latencies, fill buffer counts, and store buffer prefetching), and also weirdly bi-modal. In particular, each L2 store (paired with an L1 store) might take 9 or 18 cycles. This effect can be eliminated by prefetching the lines into L1 before the store (or using a demand load to achieve the same effect), at which point you'll get the expected performance (similar to what you'd get for independent loads). A corollary of this is that it only happens with "blind stores": stores that write to a location without first reading it: if it is read first, the load determines the caching behavior and the problem goes away.
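A sketch of the pattern (registers, stride, and the fixed L1-resident line are made up for illustration): rdi walks a buffer that is resident in L2 but not L1, while rsi points at a line that stays in L1:

.loop:
; prefetcht0 [rdi]      ; uncommenting this (or doing a demand load of [rdi] first)
                        ; brings the line into L1 and removes the slowdown
mov [rdi], rax          ; blind store that misses L1 but hits L2
mov [rsi], rax          ; store to a fixed line that stays in L1
add rdi, 64             ; advance to the next line of the L2-resident buffer
dec ecx
jnz .loop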

Discussion at RWT and StackOverflow.

The 4-cycle load-to-load latency applies only in the load-feeds-load case

A load using an addressing mode like [reg + offset] may have a latency of 4 cycles, which is the often-quoted best-case latency for recent Intel L1 caches - but this only applies when the value of reg was the result of an earlier load, e.g., in a pure pointer-chasing loop. If the value of reg was instead calculated by an ALU instruction, the fast path is not applicable, so the load will take 5 cycles (in addition to whatever latency the ALU operation adds).

For example, the following pointer-chase runs at 4 cycles per iteration in a loop:

mov rax, [rax + 128]

While the following equivalent code, which adds only a single-cycle latency ALU operation to the dependency chain, runs in 6 cycles per iteration rather than the 5 you might expect:

mov rax, [rax]
add rax, 128

Because the add instruction feeds the subsequent load, the 4-cycle L1-hit latency path doesn't apply.

Peter Cordes mentioned this behavior somewhere on Stack Overflow but I can't find the place at the moment.

The 4-cycle best-case load latency fails and the load must be replayed when the base register points to a different page

Consider again a pointer-chasing loop like the following:

top:
mov  rax, [rax + 128]
test rax, rax
jnz  top

Since this meets the documented conditions for the 4-cycle best-case latency (simple addressing mode, non-negative offset < 2048) we expect to see it take 4 cycles per iteration. Indeed, it usually does: but in the case that the base address in rax and the full address rax + 128 fall into different 4K pages, the load actually takes 9 or 10 cycles on recent Intel and dispatches twice (i.e., it is replayed after the different-page condition is detected).

Apparently, to achieve the 4-cycle latency the TLB lookup happens based on the base register alone, even before the full address is calculated (which presumably happens in parallel); the result is then checked, and the load has to be replayed if the offset puts the load in a different page. On Haswell loads can "mispredict" continually in this manner, but on Skylake a mispredicted load forces the next load to use the normal full 5-cycle path, which can't mispredict. This leads to alternating 5- and 10-cycle loads when the full address always lands in a different page, for an average latency of 7.5 cycles.

Originally reported by user harold on StackOverflow and investigated in depth by Peter Cordes.

Lines in L3 are faster to access if their last access by another core was a write

As reported in Evaluating the Cost of Atomic Operations on Modern Architectures, the speed of accessing a line in the L3 depends on whether the last accesses were reads by some other core(s), or a write by another core. In the read case, the line may be silently evicted from the L1/L2 of the reading core, which doesn't update the "core valid" bits, and hence on a new access the L1/L2 of the core(s) that earlier accessed the line must be snooped and invalidated even if they no longer contain the line. In the case of a modified line, it is written back when evicted, and hence the core valid bits are updated (cleared), so no invalidation of other cores' private caches needs to occur. Quoting from the above paper, pages 6-7:

In the S/E states executing an atomic on the data held by a different core (on the same CPU) is not influenced by the data location (L1, L2 or L3) ... The data is evicted silently, with neither writebacks nor updating the core valid bit in L3. Thus, all [subsequent] accesses snoop L1/L2, making the latency identical .... M cache lines are written back when evicted updating the core valid bits. Thus, there is no invalidation when reading an M line in L3 that is not present in any local cache. This explains why M lines have lower latency in L3 than E lines.

An address that would otherwise be complex may be treated as simple if the index register is zeroed via idiom

General-purpose register loads have a best-case latency of either 4 or 5 cycles, depending mostly on whether the addressing mode is simple or complex. In general, a simple address has the form [reg + offset] where offset < 2048, and a complex address is anything with a too-large offset, or which involves an index register, like [reg1 + reg2 * 4]. However, in the special case that the index register has been zeroed via a zeroing idiom, the address is treated as simple and is eligible for the 4-cycle latency.

So a pointer-chasing loop like this:

xor esi, esi
.loop:
mov rdi, [rdi + rsi*4]
test rdi, rdi ;  exit on null ptr
jnz .loop

Can run at 4 cycles per iteration, but the identical loop runs at 5 cycles per iteration if the initial zeroing of rsi is changed from xor esi, esi to mov esi, 0, since the former is a zeroing idiom while the latter is not.

This is mostly a curiosity, since if you know the index register rsi is always zero (as in the above example), you'd simply omit it from the addressing entirely. However, perhaps you have a scenario where rsi is often, but not always, zero; in that case a check for zero followed by an explicit xor-zeroing could speed things up by a cycle when the index is zero:

test rsi, rsi
jnz notzero
xor esi, esi  ; semantically redundant, but speeds things up!
notzero:
; loop goes here

Perhaps more interesting than this fairly obscure optimization possibility is the implication for the micro-architecture. It implies that the decision on whether an address generation is simple or complex is made at least as late as the rename stage, since that is where this zeroed-register information is available (dynamically). It means that simple-vs-complex is not decided earlier, near the decode stage, based on the static form of the address - as one might have expected (as I did).

Reported and discussed on RWT.

Unconfirmed

I haven't confirmed quirks in this section myself, but they have been reported elsewhere.

Dirty data in the L2 comes into L1 in the dirty state so it needs to be written back when evicted

This seems like it would obviously affect L1 <-> L2 throughput, since you would need an additional L1 access to read the line being written back and an additional L2 access to accept the evicted line (and probably also an L2 -> L3 writeback); but in the linked post the claim was actually that it increased the latency of L2-resident pointer chasing. It isn't clear on what uarch the results were obtained.

It is possible to have 100% 4K aliasing even when no loads alias

This is the "Only Y" case in this StackOverflow answer, where the Y load should always be before the store, and at the same relative offset, so the aliasing should be zero - but Hadi reports 100% 4k false aliasing (as determined by the performance counter). Essentially, a loop with a body like this:

z[i] = y[i];

Shows aliasing even when (z - y) % 4096 == 0, which is not what you would expect: the load of y[i] comes before the store to z[i], and it is those two accesses from the same iteration that have "aliasing" potential (same bottom 12 bits), but since the load comes first there should be no aliasing...
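In assembly terms the body looks something like this sketch (registers are made up: rsi = y, rdi = z, rcx = i, r8 = element count, with rdi - rsi a multiple of 4096), with the load clearly preceding the only store whose bottom 12 address bits match it:

.loop:
mov eax, [rsi + rcx*4]   ; load y[i]: comes first in program order
mov [rdi + rcx*4], eax   ; store z[i]: same bottom 12 address bits as the load above
inc rcx
cmp rcx, r8
jb  .loop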
