Some benchmarks are measured with wrong approach #96

Open
radarek opened this issue Jan 25, 2016 · 17 comments

radarek commented Jan 25, 2016

tl;dr When we measure small things, we have to remove all overheads that have an impact. Wrapping the measured code in a method and a block call adds quite a big overhead for most of the benchmarks in this repository.

Suppose you want to compare the performance of "2 + 2" vs "2 * 2". If you do it with the approach used in this repository, you will get these results:

require "benchmark/ips"

def slow
  2 * 2
end

def fast
  2 + 2
end

Benchmark.ips do |x|
  x.report("2 * 2") { slow }
  x.report("2 + 2") { fast }
  x.compare!
end

(simplified output)

Comparison:
               2 + 2:  8304760.0 i/s
               2 * 2:  7535516.6 i/s - 1.10x slower

But there is one problem. Calling a method plus the surrounding block ({ slow }) has a bigger overhead than calling Fixnum#+ or Fixnum#* itself. The easiest way to observe it is to repeat the benchmarked operation several times in one call, like this:

require "benchmark/ips"

def slow
  2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2;
end

def fast
  2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2;
end

Benchmark.ips do |x|
  x.report("2 * 2") { slow }
  x.report("2 + 2") { fast }
  x.compare!
end
Comparison:
               2 + 2:  4680545.3 i/s
               2 * 2:  3468681.3 i/s - 1.35x slower

See how the results changed? Writing our benchmarks in this way would be quite problematic. Fortunately, the benchmark-ips gem has an answer for that. The Benchmark::IPS::Job#report method allows passing a string, which will be compiled before the benchmark is run. Passing the right string lets us measure this properly:

require "benchmark/ips"

Benchmark.ips do |x|
  x.report("2 * 2", "2 * 2;" * 1_000)
  x.report("2 + 2", "2 + 2;" * 1_000)
  x.compare!
end
Comparison:
               2 + 2:    91567.5 i/s
               2 * 2:    57994.4 i/s - 1.58x slower

This is greatly explained here: https://docs.omniref.com/ruby/2.2.1/symbols/Benchmark/bm#annotation=4095926&line=182
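
As a quick sanity check of how much the block overhead contributes, you can also benchmark an empty block alongside the tiny operation; a minimal sketch (not in the original issue):

require "benchmark/ips"

Benchmark.ips do |x|
  # An empty block approximates the pure harness + block-call overhead per iteration.
  x.report("empty block") { }
  # A tiny operation: most of its measured time is still that same overhead.
  x.report("2 + 2") { 2 + 2 }
  x.compare!
end

If the two numbers come out close, the measurement mostly reflects the harness rather than the operation itself.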

How does this affect fast-ruby benchmarks? All benchmarks that call small things are flawed. One of them is the Array#length vs Array#size vs Array#count benchmark. Here is the original code and the result obtained on my computer:

require 'benchmark/ips'

ARRAY = [*1..100]

Benchmark.ips do |x|
  x.report("Array#length") { ARRAY.length }
  x.report("Array#size") { ARRAY.size }
  x.report("Array#count") { ARRAY.count }
  x.compare!
end
Comparison:
          Array#size:  8679483.2 i/s
        Array#length:  8664450.7 i/s - 1.00x slower
         Array#count:  7237299.5 i/s - 1.20x slower

The same benchmark measured with the described approach gives different numbers:

require 'benchmark/ips'

ARRAY = [*1..100]

Benchmark.ips do |x|
  x.report("Array#length", "ARRAY.length;" * 1_000)
  x.report("Array#size",   "ARRAY.size;"   * 1_000)
  x.report("Array#count",  "ARRAY.count;"  * 1_000)
  x.compare!
end
Comparison:
          Array#size:   113902.4 i/s
        Array#length:   113655.9 i/s - 1.00x slower
         Array#count:    28753.4 i/s - 3.96x slower

Difference: 1.20x slower vs 3.96x slower.

My guess is that this affects most of the benchmarks.


radarek commented Jan 25, 2016

tl;dr Some benchmarks have additional overhead. For example, call vs send vs method_missing claims that send is only 1.14x slower and method_missing 1.42x slower than a direct call. After removing all overheads we end up with a result saying that send is 2.48x slower and method_missing 4.73x slower.

Another thing is that some benchmarks add even more overhead. Let's consider call vs send vs method_missing. The current version is the following:

require "benchmark/ips"

class MethodCall
  def method
  end

  def method_missing(_method,*args)
    method
  end
end

def fastest
  method = MethodCall.new
  method.method
end

def slow
  method = MethodCall.new
  method.send(:method)
end

def slowest
  method = MethodCall.new
  method.not_exist
end

Benchmark.ips do |x|
  x.report("call")           { fastest }
  x.report("send")           { slow    }
  x.report("method_missing") { slowest }
  x.compare!
end

And the result on my computer is:

Comparison:
                call:  3998871.4 i/s
                send:  3497710.2 i/s - 1.14x slower
      method_missing:  2807780.4 i/s - 1.42x slower

The first problem is that every call creates a new object. Let's remove it. The easiest way is to use class methods instead of instance methods.

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

def fastest
  MethodCall.method
end

def slow
  MethodCall.send(:method)
end

def slowest
  MethodCall.not_exist
end

Benchmark.ips do |x|
  x.report("call")           { fastest }
  x.report("send")           { slow    }
  x.report("method_missing") { slowest }
  x.compare!
end
Comparison:
                call:  7291953.3 i/s
                send:  5741528.9 i/s - 1.27x slower
      method_missing:  4081021.3 i/s - 1.79x slower

The next step is to use the trick described in the first comment (pass the code as a string to the report method).

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

Benchmark.ips do |x|
  x.report("call",           "MethodCall.method;"        * 1_000)
  x.report("send",           "MethodCall.send(:method);" * 1_000)
  x.report("method_missing", "MethodCall.not_exist;"     * 1_000)
  x.compare!
end
Comparison:
                call:    32605.5 i/s
                send:    14935.5 i/s - 2.18x slower
      method_missing:     7816.2 i/s - 4.17x slower

We can go even further. Every call of the form Foo.bar needs a constant lookup. We can remove it by using a global variable, which is a little bit faster than a constant lookup:

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

$MethodCall = MethodCall

Benchmark.ips do |x|
  x.report("call",           "$MethodCall.method;"        * 1_000)
  x.report("send",           "$MethodCall.send(:method);" * 1_000)
  x.report("method_missing", "$MethodCall.not_exist;"     * 1_000)
  x.compare!
end
Comparison:
                call:    38591.5 i/s
                send:    15539.0 i/s - 2.48x slower
      method_missing:     8165.6 i/s - 4.73x slower

We started with the original benchmark claiming that send is only 1.14x slower and method_missing 1.42x slower than a direct call. After removing all overheads we ended up with a result saying that send is 2.48x slower and method_missing 4.73x slower. Big difference, right?

@nateberkopec (Collaborator)

👏 This is great work! I'll wait for the other collaborators to weigh in here, but I think this is great research. PRs welcome with these changes.

@radarek radarek changed the title Some benchmarks are measured with wrong methodology Some benchmarks are measured with wrong approach Jan 25, 2016
radarek added a commit to radarek/fast-ruby that referenced this issue Jan 25, 2016
@jonathanhefner (Collaborator)

Something you might also consider is the CPU instruction cache. Although I'm not an expert in the black art of optimizing code for the CPU cache, I do know that cached instructions can skew benchmarks. I typically sidestep this issue by calling the method in question on multiple different pieces of data.

To further your first example:

require "benchmark/ips"

NUMS = (1..1000)
FAST = NUMS.map{|n| "#{n} + #{n}" }.join(';')
SLOW = NUMS.map{|n| "#{n} * 2" }.join(';')

Benchmark.ips do |x|
  x.report("n + n", FAST)
  x.report("n * 2", SLOW)
  x.compare!
end

This results in:

Calculating -------------------------------------
               n + n     68.816k (± 0.3%) i/s -    350.829k in   5.098103s
               n * 2     57.513k (± 0.2%) i/s -    290.680k in   5.054154s

Comparison:
               n + n:    68816.4 i/s
               n * 2:    57513.3 i/s - 1.20x slower

ixti commented May 25, 2016

👏 🙇 Great work!


eregon commented Jun 21, 2016

@radarek Sounds like a good way to reduce overhead.
Of course, the overhead for Fixnum#+ will always be whatever surrounds it (otherwise it would be billions of i/s) :)

Regarding $MethodCall = MethodCall: why use a global variable here instead of the constant?
I think a constant would be both more practical and convey the fact that it does not change.


eregon commented Jun 22, 2016

Every call of a form Foo.bar needs constant lookup. We can remove it using global variable, which is a little bit faster than constant lookup

@radarek I missed that, sorry. This is interesting. I would not count on it for other implementations or future improvements, though; reading constants ought to be as good as reading global variables. Global variables are also not the right design tool here.

@nateberkopec (Collaborator)

reading constants ought to be as good as reading global variables.

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

I agree that in normal practice, constants would make more sense, but in the case of benchmarking, we're trying to eliminate as much overhead as possible.
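
The two lookups can also be compared directly with the same string-report trick used above; a minimal sketch (not in the original comment, and the numbers will vary by implementation):

require "benchmark/ips"

CONST = Object.new
$glob = CONST

Benchmark.ips do |x|
  # Read a top-level constant vs. a global variable many times per iteration,
  # so the lookup itself dominates the measurement.
  x.report("constant read", "CONST;" * 1_000)
  x.report("global read",   "$glob;" * 1_000)
  x.compare!
end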

@thedarkone

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

AFAIK this is completely backwards for JRuby+Truffle. I think this was an intentional design choice: constants are supposed to be constant (e.g. redefining a const emits a warning, you can't lexically do X = 1 inside normal methods, etc.). There are no comparable constraints on globals, therefore JRuby+Truffle emits code that expects globals to change.

Re nesting, I would expect there to be no perf difference between A::B::C::D::F::G.foo and X.foo on JRuby+Truffle.

(This is of course only true if the code in question isn't continually changing constants.)


eregon commented Jun 22, 2016

@nateberkopec

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

This seems to be a limitation of current MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check.
JIT implementations with deoptimization can actually fold Ruby constants into actual constants in the machine code. Nesting is just a different way to do the lookup, but it should not affect the caching (it just makes the invalidation more complex).

Actually, I am most worried that the benchmarks would use totally unidiomatic code.

@thedarkone Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.

@nateberkopec (Collaborator)

@thedarkone TIL. Paging @chrisseaton on JRuby+Truffle. I thought the way constant lookup worked in Ruby was a language-level limitation, but it looks like that may be wrong!

In any case, if constant lookup and global lookup are equally fast in some implementations, and constant lookup is slower in some implementations, we should still use globals.

@chrisseaton

JRuby+Truffle's constant lookup and caching uses almost exactly the same mechanism and code as method calls, and technically it's perfect as far as I'm aware - it leaves no guards, no checks, no indirection, in the generated code.

JRuby+Truffle's globals however are fields on a hidden globals object. We read them like an instance variable and, as @eregon says, we don't do anything to turn them into constants at the moment. It doesn't happen for free as the globals object is, unsurprisingly, globally visible. We should fix that now and make globals constant. It would be a good beginner task if anyone is interested.

To comment more generally on this thread, the problem you're trying to solve is a really tricky one. I've submitted a few PRs to benchmark-ips about how deep the rabbit hole goes (evanphx/benchmark-ips#58) when you start to involve a sophisticated JIT.


thedarkone commented Jun 22, 2016

@chrisseaton I vaguely remember @pitr-ch changing globals into direct volatiles (as per the proposed memory model). I thought my reasoning above was why JRuby+Truffle is not speculating on the "constant-ness" of globals.

Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.

@eregon I would propose treating globals as constants (compilation constants) until proven otherwise (keep a modification counter and, after 2 or 10 changes, switch to repeated volatile reads).
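
Very loosely, the proposal could be sketched like this in plain Ruby (hypothetical names only, nothing to do with the actual Truffle implementation):

class GlobalSlot
  SPECULATION_LIMIT = 2  # tolerate this many writes before abandoning the speculation

  def initialize(value)
    @value = value
    @writes = 0
    @assumed_constant = true
  end

  # While @assumed_constant holds, a JIT could fold reads of @value into compiled
  # code as a true constant; afterwards it falls back to plain (volatile) reads.
  def read
    @value
  end

  def write(new_value)
    @writes += 1
    # Breaking the assumption would deoptimize any compiled code built on it.
    @assumed_constant = false if @writes > SPECULATION_LIMIT
    @value = new_value
  end
end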

@chrisseaton

Yes the modification counter is probably the key.

We're also missing the feature in Ruby that lets you hook into globals to observe reads and writes - so we should really dig into the globals implementation in general.


eregon commented Jun 22, 2016

@thedarkone For the memory model, speculating on them as constants would still be fine, as invalidation is thread-safe (it waits for every thread to throw away the compiled code before continuing).


eregon commented Jun 22, 2016

This seems a limitation currently in MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check.

It seems I was wrong about MRI :) There is an inline cache for constants as one would expect:

 ruby --dump=insns -e 'A=42; loop {A}'
...
0000 trace            256                                             (   1)
0002 trace            1
0004 getinlinecache   11, <is:0>
0007 getconstant      :A
0009 setinlinecache   <is:0>
0011 trace            512
0013 leave            

Instruction 0004 is the inline cache (but getconstant itself has no cache), checking a couple things and jumping to 0011 if it succeeds. This might be a bit more expensive than getglobal which only calls a function pointer which itself reads the global's value.
The difference between the two seems fairly small on my machine, less than 5% on 40M i/s.

@stereobooster

Another type of overhead is unintentional allocations, like this:

# h is some (possibly nested) hash being looked up
# each {} is a new object - {}.object_id == {}.object_id => false
h.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil)

# freeze is to make sure we only read from the object and never write to it
EMPTY_OBJECT = {}.freeze

# same code without unintentional allocations
h.fetch(:a, EMPTY_OBJECT).fetch(:b, EMPTY_OBJECT).fetch(:c, EMPTY_OBJECT).fetch(:d, EMPTY_OBJECT).fetch(:e, nil)

Those objects can trigger GC, which will make the benchmark slower.
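
A minimal sketch (not in the original comment) of how the two variants could be compared; H, EMPTY_OBJECT and the keys here are made up for illustration:

require "benchmark/ips"

H = {}                  # empty on purpose, so every fetch falls through to its default
EMPTY_OBJECT = {}.freeze

Benchmark.ips do |x|
  # Each literal {} default allocates a fresh hash on every call.
  x.report("literal {} defaults",
           "H.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil);" * 100)
  # A single frozen hash is shared, so there is no per-call allocation (and no GC pressure).
  x.report("shared frozen default",
           "H.fetch(:a, EMPTY_OBJECT).fetch(:b, EMPTY_OBJECT).fetch(:c, EMPTY_OBJECT).fetch(:d, EMPTY_OBJECT).fetch(:e, nil);" * 100)
  x.compare!
end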


scooler commented Dec 31, 2021

I guess this is a bit of a stale topic, but I didn't want to create a new issue.
This example IMO is also a bit misleading: https://github.com/JuanitoFatas/fast-ruby#normal-way-to-apply-method-vs-method-code

The difference claimed (4x) goes down quite a bit when you remove the method call (for me it went from 2.58x to 2.17x) for:

require "benchmark/ips"

def do_something(n)
  4*n + 2
end

def fast
  [1, 2, 3].map { |n| do_something(n) }
end

METHOD_HANDLE = method(:do_something)
def slow
  [1, 2, 3].map(&METHOD_HANDLE)
end

Benchmark.ips do |x|
  x.report("normal")  { fast }
  x.report("&method") { slow }
  x.compare!
end

I guess the above suffers from the issue described originally in this thread, but I didn't want to compare the method-calling, block-using version against the method-caching, string-using version.

Also when I did something like this:

require "benchmark/ips"

def fast
  [1, 2, 3].map { |n| n.even? }
end

def slow
  [1, 2, 3].map(&:even?)
end

Benchmark.ips do |x|
  x.report("normal")  { fast }
  x.report("&method") { slow }
  x.compare!
end

the diff went down to 1.25x (on ruby 2.7.4)
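
For completeness, here is a sketch (not in the original comment) of the first comparison rewritten with the string-report trick from the top of this thread, which removes the wrapping block and method call:

require "benchmark/ips"

def do_something(n)
  4 * n + 2
end

METHOD_HANDLE = method(:do_something)

Benchmark.ips do |x|
  # The code is passed as a string, so benchmark-ips compiles it into the timing loop
  # instead of calling a wrapper block on every iteration.
  x.report("normal",  "[1, 2, 3].map { |n| do_something(n) };" * 100)
  x.report("&method", "[1, 2, 3].map(&METHOD_HANDLE);" * 100)
  x.compare!
end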
