Some benchmarks are measured with wrong approach #96

Open
radarek opened this issue Jan 25, 2016 · 17 comments

radarek commented Jan 25, 2016

tl;dr When we measure small things, we have to remove all overheads that have an impact. Wrapping the measured code in a method and a block call adds quite a big overhead for most of the benchmarks in this repository.

Suppose you want to compare the performance of "2 + 2" vs "2 * 2". If you do it with the approach used in this repository, you will get these results:

require "benchmark/ips"

def slow
  2 * 2
end

def fast
  2 + 2
end

Benchmark.ips do |x|
  x.report("2 * 2") { slow }
  x.report("2 + 2") { fast }
  x.compare!
end

(simplified output)

Comparison:
               2 + 2:  8304760.0 i/s
               2 * 2:  7535516.6 i/s - 1.10x slower

But there is one problem. Calling a method plus the surrounding block ({ slow }) has a bigger overhead than calling Fixnum#+ or Fixnum#* itself. The easiest way to observe it is to repeat the benchmarked operation several times in one call, like this:

require "benchmark/ips"

def slow
  2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2; 2*2;
end

def fast
  2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2; 2+2;
end

Benchmark.ips do |x|
  x.report("2 * 2") { slow }
  x.report("2 + 2") { fast }
  x.compare!
end
Comparison:
               2 + 2:  4680545.3 i/s
               2 * 2:  3468681.3 i/s - 1.35x slower

See how the results changed? Writing our benchmarks in this way would be quite problematic. Fortunately, the benchmark-ips gem has an answer for that. The Benchmark::IPS::Job#report method allows passing a string, which will be compiled before the benchmark is run. Passing the right string lets us measure this properly:

require "benchmark/ips"

Benchmark.ips do |x|
  x.report("2 * 2", "2 * 2;" * 1_000)
  x.report("2 + 2", "2 + 2;" * 1_000)
  x.compare!
end
Comparison:
               2 + 2:    91567.5 i/s
               2 * 2:    57994.4 i/s - 1.58x slower

This is greatly explained here: https://docs.omniref.com/ruby/2.2.1/symbols/Benchmark/bm#annotation=4095926&line=182
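
As a quick sanity check of how much the block overhead contributes, you can also benchmark an empty block alongside the tiny operation; a minimal sketch (not in the original issue):

require "benchmark/ips"

Benchmark.ips do |x|
  # An empty block approximates the pure harness + block-call overhead per iteration.
  x.report("empty block") { }
  # A tiny operation: most of its measured time is still that same overhead.
  x.report("2 + 2") { 2 + 2 }
  x.compare!
end

If the two numbers come out close, the measurement mostly reflects the harness rather than the operation itself.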

How does this affect fast-ruby benchmarks? All benchmarks that call small things are flawed. One of them is the Array#length vs Array#size vs Array#count benchmark. Here is the original code and the result obtained on my computer:

require 'benchmark/ips'

ARRAY = [*1..100]

Benchmark.ips do |x|
  x.report("Array#length") { ARRAY.length }
  x.report("Array#size") { ARRAY.size }
  x.report("Array#count") { ARRAY.count }
  x.compare!
end
Comparison:
          Array#size:  8679483.2 i/s
        Array#length:  8664450.7 i/s - 1.00x slower
         Array#count:  7237299.5 i/s - 1.20x slower

The same benchmark measured with the described approach gives different numbers:

require 'benchmark/ips'

ARRAY = [*1..100]

Benchmark.ips do |x|
  x.report("Array#length", "ARRAY.length;" * 1_000)
  x.report("Array#size",   "ARRAY.size;"   * 1_000)
  x.report("Array#count",  "ARRAY.count;"  * 1_000)
  x.compare!
end
Comparison:
          Array#size:   113902.4 i/s
        Array#length:   113655.9 i/s - 1.00x slower
         Array#count:    28753.4 i/s - 3.96x slower

Difference: 1.20x slower vs 3.96x slower.

My guess is that this affects most of the benchmarks.


radarek commented Jan 25, 2016

tl;dr Some benchmarks have additional overhead. For example, call vs send vs method_missing claims that send is only 1.14x slower and method_missing 1.42x slower than a direct call. After removing all overheads we end up with a result saying that send is 2.48x slower and method_missing 4.73x slower.

Another thing is that some benchmarks add even more overhead. Let's consider call vs send vs method_missing. The current version is the following:

require "benchmark/ips"

class MethodCall
  def method
  end

  def method_missing(_method,*args)
    method
  end
end

def fastest
  method = MethodCall.new
  method.method
end

def slow
  method = MethodCall.new
  method.send(:method)
end

def slowest
  method = MethodCall.new
  method.not_exist
end

Benchmark.ips do |x|
  x.report("call")           { fastest }
  x.report("send")           { slow    }
  x.report("method_missing") { slowest }
  x.compare!
end

And the result on my computer is:

Comparison:
                call:  3998871.4 i/s
                send:  3497710.2 i/s - 1.14x slower
      method_missing:  2807780.4 i/s - 1.42x slower

The first problem is that every call creates a new object. Let's remove it. The easiest way is to use class methods instead of instance methods.

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

def fastest
  MethodCall.method
end

def slow
  MethodCall.send(:method)
end

def slowest
  MethodCall.not_exist
end

Benchmark.ips do |x|
  x.report("call")           { fastest }
  x.report("send")           { slow    }
  x.report("method_missing") { slowest }
  x.compare!
end
Comparison:
                call:  7291953.3 i/s
                send:  5741528.9 i/s - 1.27x slower
      method_missing:  4081021.3 i/s - 1.79x slower

The next step is to use the trick described in the first comment (pass the code as a string to the report method).

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

Benchmark.ips do |x|
  x.report("call",           "MethodCall.method;"        * 1_000)
  x.report("send",           "MethodCall.send(:method);" * 1_000)
  x.report("method_missing", "MethodCall.not_exist;"     * 1_000)
  x.compare!
end
Comparison:
                call:    32605.5 i/s
                send:    14935.5 i/s - 2.18x slower
      method_missing:     7816.2 i/s - 4.17x slower

We can go even further. Every call of the form Foo.bar needs a constant lookup. We can remove it by using a global variable, which is a little bit faster than a constant lookup:

require "benchmark/ips"

class MethodCall
  class<<self
    def method
    end

    def method_missing(_method,*args)
      method
    end
  end
end

$MethodCall = MethodCall

Benchmark.ips do |x|
  x.report("call",           "$MethodCall.method;"        * 1_000)
  x.report("send",           "$MethodCall.send(:method);" * 1_000)
  x.report("method_missing", "$MethodCall.not_exist;"     * 1_000)
  x.compare!
end
Comparison:
                call:    38591.5 i/s
                send:    15539.0 i/s - 2.48x slower
      method_missing:     8165.6 i/s - 4.73x slower

We started with the original benchmark claiming that send is only 1.14x slower and method_missing 1.42x slower than a direct call. After removing all overheads we ended up with a result saying that send is 2.48x slower and method_missing 4.73x slower. Big difference, right?

@nateberkopec (Collaborator)

👏 This is great work! I'll wait for the other collaborators to weigh in here, but I think this is great research. PRs welcome with these changes.

@radarek radarek changed the title Some benchmarks are measured with wrong methodology Some benchmarks are measured with wrong approach Jan 25, 2016
radarek added a commit to radarek/fast-ruby that referenced this issue Jan 25, 2016
@jonathanhefner (Collaborator)

Something you might also consider is the CPU instruction cache. Although I'm not an expert in the black art of optimizing code for the CPU cache, I do know that cached instructions can skew benchmarks. I typically sidestep this issue by calling the method in question on multiple different pieces of data.

To further your first example:

require "benchmark/ips"

NUMS = (1..1000)
FAST = NUMS.map{|n| "#{n} + #{n}" }.join(';')
SLOW = NUMS.map{|n| "#{n} * 2" }.join(';')

Benchmark.ips do |x|
  x.report("n + n", FAST)
  x.report("n * 2", SLOW)
  x.compare!
end

This results in:

Calculating -------------------------------------
               n + n     68.816k (± 0.3%) i/s -    350.829k in   5.098103s
               n * 2     57.513k (± 0.2%) i/s -    290.680k in   5.054154s

Comparison:
               n + n:    68816.4 i/s
               n * 2:    57513.3 i/s - 1.20x slower

ixti commented May 25, 2016

👏 🙇 Great work!


eregon commented Jun 21, 2016

@radarek Sounds like a good way to reduce overhead.
Of course, the overhead for Fixnum#+ will always be whatever surrounds it (otherwise it would be billions of i/s) :)

Regarding $MethodCall = MethodCall: why use a global variable here instead of the constant?
I think a constant would be both more practical and convey the fact that it does not change.


eregon commented Jun 22, 2016

Every call of a form Foo.bar needs constant lookup. We can remove it using global variable, which is a little bit faster than constant lookup

@radarek I missed that, sorry. This is interesting. I would not count on it for other implementations or future improvements, though; reading constants ought to be as good as reading global variables. Global variables are also not the right design tool here.

@nateberkopec (Collaborator)

reading constants ought to be as good as reading global variables.

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

I agree that in normal practice, constants would make more sense, but in the case of benchmarking, we're trying to eliminate as much overhead as possible.
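
The two lookups can also be compared directly with the same string-report trick used above; a minimal sketch (not in the original comment, and the numbers will vary by implementation):

require "benchmark/ips"

CONST = Object.new
$glob = CONST

Benchmark.ips do |x|
  # Read a top-level constant vs. a global variable many times per iteration,
  # so the lookup itself dominates the measurement.
  x.report("constant read", "CONST;" * 1_000)
  x.report("global read",   "$glob;" * 1_000)
  x.compare!
end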

@thedarkone

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

AFAIK this is completely backwards for JRuby+Truffle. I think this was an intentional design choice: constants are supposed to be constant (e.g. redefining a const emits a warning, you can't lexically do X = 1 inside normal methods, etc.). There are no comparable constraints on globals, therefore JRuby+Truffle emits code that expects globals to change.

Re nesting, I would expect there to be no perf difference between A::B::C::D::F::G.foo and X.foo on JRuby+Truffle.

(This is of course only true if the code in question isn't continually changing constants.)


eregon commented Jun 22, 2016

@nateberkopec

I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals will not. This will be true regardless of implementation.

This seems to be a limitation of current MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check.
JIT implementations with deoptimization can actually fold Ruby constants into actual constants in the machine code. Nesting is just a different way to do the lookup, but it should not affect the caching (it just makes the invalidation more complex).

Actually, I am most worried that the benchmarks would use totally unidiomatic code.

@thedarkone Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.

@nateberkopec (Collaborator)

@thedarkone TIL. Paging @chrisseaton on JRuby+Truffle. I thought the way constant lookup worked in Ruby was a language-level limitation, but it looks like that may be wrong!

In any case, if constant lookup and global lookup are equally fast in some implementations, and constant lookup is slower in some implementations, we should still use globals.

@chrisseaton

JRuby+Truffle's constant lookup and caching uses almost exactly the same mechanism and code as method calls, and technically it's perfect as far as I'm aware - it leaves no guards, no checks, no indirection, in the generated code.

JRuby+Truffle's globals however are fields on a hidden globals object. We read them like an instance variable and, as @eregon says, we don't do anything to turn them into constants at the moment. It doesn't happen for free as the globals object is, unsurprisingly, globally visible. We should fix that now and make globals constant. It would be a good beginner task if anyone is interested.

To comment more generally on this thread, the problem you're trying to solve is a really tricky one. I've submitted a few PRs to benchmark-ips about how deep the rabbit hole goes (evanphx/benchmark-ips#58) when you start to involve a sophisticated JIT.


thedarkone commented Jun 22, 2016

@chrisseaton I vaguely remember @pitr-ch changing globals into direct volatiles (as per the proposed memory model). I thought my reasoning above was why JRuby+Truffle is not speculating on the "constant-ness" of globals.

Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals being not modified. Then it would be the same performance if such assumption holds, but it just seems the wrong tool in the first place.

@eregon I would propose treating globals as constants (compilation constants) until proven otherwise (keep a modification counter and, after 2 or 10 changes, switch to repeated volatile reads).
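
Very loosely, the proposal could be sketched like this in plain Ruby (hypothetical names only, nothing to do with the actual Truffle implementation):

class GlobalSlot
  SPECULATION_LIMIT = 2  # tolerate this many writes before abandoning the speculation

  def initialize(value)
    @value = value
    @writes = 0
    @assumed_constant = true
  end

  # While @assumed_constant holds, a JIT could fold reads of @value into compiled
  # code as a true constant; afterwards it falls back to plain (volatile) reads.
  def read
    @value
  end

  def write(new_value)
    @writes += 1
    # Breaking the assumption would deoptimize any compiled code built on it.
    @assumed_constant = false if @writes > SPECULATION_LIMIT
    @value = new_value
  end
end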

@chrisseaton

Yes the modification counter is probably the key.

We're also missing the feature in Ruby that lets you hook into globals to observe reads and writes - so we should really dig into the globals implementation in general.


eregon commented Jun 22, 2016

@thedarkone For the memory model, speculating on them as constants would still be fine, as invalidation is thread-safe (it waits for every thread to throw away the compiled code before continuing).


eregon commented Jun 22, 2016

This seems a limitation currently in MRI, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check.

It seems I was wrong about MRI :) There is an inline cache for constants as one would expect:

 ruby --dump=insns -e 'A=42; loop {A}'
...
0000 trace            256                                             (   1)
0002 trace            1
0004 getinlinecache   11, <is:0>
0007 getconstant      :A
0009 setinlinecache   <is:0>
0011 trace            512
0013 leave            

Instruction 0004 is the inline cache (but getconstant itself has no cache), checking a couple things and jumping to 0011 if it succeeds. This might be a bit more expensive than getglobal which only calls a function pointer which itself reads the global's value.
The difference between the two seems fairly small on my machine, less than 5% on 40M i/s.

@stereobooster

Another type of overhead is unintentional allocations, like this:

# h is some (possibly nested) hash being looked up
# each {} is a new object - {}.object_id == {}.object_id => false
h.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil)

# freeze is to make sure we only read from the object and never write to it
EMPTY_OBJECT = {}.freeze

# same code without unintentional allocations
h.fetch(:a, EMPTY_OBJECT).fetch(:b, EMPTY_OBJECT).fetch(:c, EMPTY_OBJECT).fetch(:d, EMPTY_OBJECT).fetch(:e, nil)

Those objects can trigger GC, which will make the benchmark slower.
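
A minimal sketch (not in the original comment) of how the two variants could be compared; H, EMPTY_OBJECT and the keys here are made up for illustration:

require "benchmark/ips"

H = {}                  # empty on purpose, so every fetch falls through to its default
EMPTY_OBJECT = {}.freeze

Benchmark.ips do |x|
  # Each literal {} default allocates a fresh hash on every call.
  x.report("literal {} defaults",
           "H.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil);" * 100)
  # A single frozen hash is shared, so there is no per-call allocation (and no GC pressure).
  x.report("shared frozen default",
           "H.fetch(:a, EMPTY_OBJECT).fetch(:b, EMPTY_OBJECT).fetch(:c, EMPTY_OBJECT).fetch(:d, EMPTY_OBJECT).fetch(:e, nil);" * 100)
  x.compare!
end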


scooler commented Dec 31, 2021

I guess this is a bit of a stale topic, but I didn't want to create a new issue.
This example IMO is also a bit misleading: https://github.com/JuanitoFatas/fast-ruby#normal-way-to-apply-method-vs-method-code

The difference claimed (4x) goes down quite a bit when you remove the method call (for me it went from 2.58x to 2.17x) for:

require "benchmark/ips"

def do_something(n)
  4*n + 2
end

def fast
  [1, 2, 3].map { |n| do_something(n) }
end

METHOD_HANDLE = method(:do_something)
def slow
  [1, 2, 3].map(&METHOD_HANDLE)
end

Benchmark.ips do |x|
  x.report("normal")  { fast }
  x.report("&method") { slow }
  x.compare!
end

I guess the above suffers from the issue described originally in this thread, but I didn't want to compare the method-calling, block-using version against the method-caching, string-using version.

Also when I did something like this:

require "benchmark/ips"

def fast
  [1, 2, 3].map { |n| n.even? }
end

def slow
  [1, 2, 3].map(&:even?)
end

Benchmark.ips do |x|
  x.report("normal")  { fast }
  x.report("&method") { slow }
  x.compare!
end

the diff went down to 1.25x (on ruby 2.7.4)
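
For completeness, here is a sketch (not in the original comment) of the first comparison rewritten with the string-report trick from the top of this thread, which removes the wrapping block and method call:

require "benchmark/ips"

def do_something(n)
  4 * n + 2
end

METHOD_HANDLE = method(:do_something)

Benchmark.ips do |x|
  # The code is passed as a string, so benchmark-ips compiles it into the timing loop
  # instead of calling a wrapper block on every iteration.
  x.report("normal",  "[1, 2, 3].map { |n| do_something(n) };" * 100)
  x.report("&method", "[1, 2, 3].map(&METHOD_HANDLE);" * 100)
  x.compare!
end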
