rep stos appearing in benchmarked region #32

Open
travisdowns opened this issue Feb 19, 2018 · 0 comments


travisdowns commented Feb 19, 2018

If you look at the core region of the innermost method of a benchmark in the libpfc case, you find a rep stos instruction inside the timed region, as follows:

  40792a:       shl    rdx,0x20
  40792e:       or     rdx,rax
  407931:       add    QWORD PTR [rbp+0x28],rdx
  407935:       mov    rcx,0x3
  40793c:       rdpmc  
  40793e:       shl    rdx,0x20
  407942:       or     rdx,rax
  407945:       add    QWORD PTR [rbp+0x30],rdx
  407949:       lfence 
  40794c:       mov    rdi,QWORD PTR [rsp]
  407950:       mov    rsi,QWORD PTR [rsp+0x8]
  407955:       call   47f680 <dep_add_rax_rax>
  40795a:       mov    rdi,rbx
  40795d:       mov    rax,r12
  407960:       mov    ecx,0x7
  407965:       rep stos QWORD PTR es:[rdi],rax    <<< this guy
  407968:       lfence 
  40796b:       mov    rcx,0x40000000
  407972:       rdpmc  
  407974:       shl    rdx,0x20
  407978:       or     rdx,rax
  40797b:       add    QWORD PTR [rbx],rdx
  40797e:       mov    rcx,0x40000001
  407985:       rdpmc  
  407987:       shl    rdx,0x20
  40798b:       or     rdx,rax
  40798e:       add    QWORD PTR [rbx+0x8],rdx
  407992:       mov    rcx,0x40000002
  407999:       rdpmc  
  40799b:       shl    rdx,0x20

The code before and after is issuing rdpmc to read the performance counters, and the actual timed call is dep_add_rax_rax, but the presence of the rep stos is unfortunate, since it's slow, invokes microcode and so on. It's there because of:

struct LibpfcNow {
    PFC_CNT cnt[TOTAL_COUNTERS];
    ...

and

static now_t now() {
        LibpfcNow now = {};

which zero-initializes the counter array. The existing macros either add to (PFC_END, as shown above) or sub from the array location, so zero init is required: otherwise whatever garbage is already in the array gets picked up. In principle, though, the array could simply be overwritten with the current value, which would make the zero init unnecessary: we could have a new PFC_ macro which just movs in the absolute value.
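To make the difference concrete, here is a minimal sketch of the two styles, not the actual libpfc macros. `Snapshot`, `read_raw`, `accumulate`, `store_absolute` and `now_absolute` are hypothetical names, `read_raw` is a fake counter source standing in for rdpmc, and the counter count of 7 is only inferred from the `mov ecx,0x7` before the rep stos in the listing.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for PFC_CNT and TOTAL_COUNTERS (not the real definitions).
using Count = std::uint64_t;
constexpr std::size_t kCounters = 7;  // assumption: inferred from ecx = 0x7 above
using Raw = std::array<Count, kCounters>;

struct Snapshot {
    Raw cnt;  // deliberately no default initializer
};

// Fake counter source standing in for rdpmc: counter i reads as base + i.
inline Raw read_raw(Count base) {
    Raw raw{};
    for (std::size_t i = 0; i < kCounters; ++i) raw[i] = base + i;
    return raw;
}

// Accumulating style (add/sub): only correct if cnt was zeroed beforehand.
inline void accumulate(Snapshot& s, const Raw& raw) {
    for (std::size_t i = 0; i < kCounters; ++i) s.cnt[i] += raw[i];
}

// Absolute style (mov): overwrites every slot, so prior contents don't matter.
inline void store_absolute(Snapshot& s, const Raw& raw) {
    for (std::size_t i = 0; i < kCounters; ++i) s.cnt[i] = raw[i];
}

// A now() variant without "= {}": nothing here asks the compiler to zero the array.
inline Snapshot now_absolute(Count base) {
    Snapshot s;                         // uninitialized on purpose
    store_absolute(s, read_raw(base));  // every element is written before any read
    return s;
}
```

With the absolute style every element is written before it is read, so the `= {}` can be dropped without picking up garbage. Whether the compiler then emits plain stores in the timed region would still need to be checked in the disassembly.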

In principle, the effect is cancelled out by the use of dummy_bench (or any other bench), but it would still be nice to eliminate all unnecessary code in the benchmarked region, especially rep instructions and those which modify memory.
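The cancellation mentioned above amounts to subtracting a baseline. The sketch below illustrates the idea only: how the harness actually applies dummy_bench is not shown in this issue, and `Delta` and `subtract_baseline` are hypothetical names. Clamping at zero is an added assumption, so that a noisy baseline can't wrap the unsigned result.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Count = std::uint64_t;
constexpr std::size_t kCounters = 7;  // assumption: same count as in the listing
using Delta = std::array<Count, kCounters>;

// Remove the fixed overhead (rdpmc, lfence, rep stos, call/ret) measured with an
// empty benchmark from the counts measured for the real benchmark.
inline Delta subtract_baseline(const Delta& measured, const Delta& baseline) {
    Delta out{};
    for (std::size_t i = 0; i < kCounters; ++i) {
        // Clamp at zero rather than letting unsigned subtraction wrap around.
        out[i] = measured[i] > baseline[i] ? measured[i] - baseline[i] : 0;
    }
    return out;
}
```

This only cancels the overhead to the extent that it is the same in both runs, which is one reason to keep variable-latency instructions such as rep stos out of the timed region in the first place.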
