Some benchmarks are measured with wrong approach #96
tl;dr Some benchmarks have other overhead. Another thing is that some benchmarks add even more overhead. Let's consider:
require "benchmark/ips"
class MethodCall
def method
end
def method_missing(_method,*args)
method
end
end
def fastest
method = MethodCall.new
method.method
end
def slow
method = MethodCall.new
method.send(:method)
end
def slowest
method = MethodCall.new
method.not_exist
end
Benchmark.ips do |x|
x.report("call") { fastest }
x.report("send") { slow }
x.report("method_missing") { slowest }
x.compare!
end
And the result on my computer is:
The first problem is that every call creates a new object. Let's remove that. The easiest way is to use class methods instead of instance methods.
require "benchmark/ips"
class MethodCall
class<<self
def method
end
def method_missing(_method,*args)
method
end
end
end
def fastest
MethodCall.method
end
def slow
MethodCall.send(:method)
end
def slowest
MethodCall.not_exist
end
Benchmark.ips do |x|
x.report("call") { fastest }
x.report("send") { slow }
x.report("method_missing") { slowest }
x.compare!
end
The next step is to use the trick described in the first comment (passing the code as a string to the report method).
require "benchmark/ips"
class MethodCall
class<<self
def method
end
def method_missing(_method,*args)
method
end
end
end
Benchmark.ips do |x|
x.report("call", "MethodCall.method;" * 1_000)
x.report("send", "MethodCall.send(:method);" * 1_000)
x.report("method_missing", "MethodCall.not_exist;" * 1_000)
x.compare!
end
We can go even further. Every call of a form
require "benchmark/ips"
class MethodCall
class<<self
def method
end
def method_missing(_method,*args)
method
end
end
end
$MethodCall = MethodCall
Benchmark.ips do |x|
x.report("call", "$MethodCall.method;" * 1_000)
x.report("send", "$MethodCall.send(:method);" * 1_000)
x.report("method_missing", "$MethodCall.not_exist;" * 1_000)
x.compare!
end
We started with the original benchmark, which claimed that send is only 1.14x slower and method_missing 1.42x slower than a direct call. After removing all the overhead, we ended up with a result saying that send is 2.48x slower and method_missing 4.73x slower. Big difference, right?
👏 This is great work! I'll wait for the other collaborators to weigh in here, but I think this is great research. PRs welcome with these changes.
…chmark. See fastruby#96 for detailed explanation
Something you might also consider is the CPU instruction cache. Although I'm not an expert in the black art of optimizing code for the CPU cache, I do know cached instructions can skew benchmarks. I typically sidestep this issue by calling the method in question on multiple different pieces of data. To further your first example:
require "benchmark/ips"
NUMS = (1..1000)
FAST = NUMS.map{|n| "#{n} + #{n}" }.join(';')
SLOW = NUMS.map{|n| "#{n} * 2" }.join(';')
Benchmark.ips do |x|
x.report("n + n", FAST)
x.report("n * 2", SLOW)
x.compare!
end
This results in:
👏 🙇 Great work!
@radarek Sounds like a good way to reduce overhead.
@radarek I missed that, sorry. This is interesting. I would not count on it for other implementations or future improvements, though; reading constants ought to be as good as reading global variables. Global variables are also not the right design tool here.
I don't think so. Constants will always have to go through the lookup chain, because constants can be nested inside of each other, whereas globals cannot. This will be true regardless of implementation. I agree that in normal practice, constants would make more sense, but in the case of benchmarking, we're trying to eliminate as much overhead as possible.
AFAIK this is completely backwards for JRuby+Truffle. I think this was an intentional design choice: constants are supposed to be constant (e.g. redefining a const emits a warning, you can't lexically do Re nesting, I would expect there to be no perf difference between (This is of course only true if the code in question isn't continually changing constants.)
This seems to be a limitation of MRI at the moment, but there is no reason why constants cannot have an inline cache too. Then the only difference is a cheap version check. Actually, I am most worried that the benchmarks would use totally unidiomatic code. @thedarkone Maybe JRuby+Truffle should do the same trick as JRuby and speculate on globals not being modified. Then the performance would be the same as long as that assumption holds, but it just seems the wrong tool in the first place.
@thedarkone TIL. Paging @chrisseaton on JRuby+Truffle. I thought the way constant lookup worked in Ruby was a language-level limitation, but it looks like that may be wrong! In any case, if constant lookup and global lookup are equally fast in some implementations, and constant lookup is slower in others, we should still use globals.
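The disagreement above is easy to check on whichever implementation is at hand. The following is a minimal sketch that uses only the standard library's Benchmark module rather than benchmark-ips; the `Target` class, the `ping` method and the `$target` global are hypothetical names introduced for illustration, and the timings will vary by Ruby implementation and version.

```ruby
require "benchmark"

# Hypothetical receiver with an empty class method, reachable both through
# a constant (Target) and through a global variable ($target).
class Target
  def self.ping; end
end
$target = Target

REPEAT = 1_000
ROUNDS = 200

# Repeat the call many times inside one compiled proc so that the cost of
# invoking the proc itself is spread over REPEAT calls.
const_run  = eval("proc { #{'Target.ping;' * REPEAT} }")
global_run = eval("proc { #{'$target.ping;' * REPEAT} }")

const_time  = Benchmark.realtime { ROUNDS.times { const_run.call } }
global_time = Benchmark.realtime { ROUNDS.times { global_run.call } }

puts format("constant receiver: %.4fs", const_time)
puts format("global receiver:   %.4fs", global_time)
```

Running it on several implementations is a quick way to see whether the constant lookup or the global lookup is the cheaper one there.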
JRuby+Truffle's constant lookup and caching uses almost exactly the same mechanism and code as method calls, and technically it's perfect as far as I'm aware - it leaves no guards, no checks, no indirection, in the generated code. JRuby+Truffle's globals however are fields on a hidden globals object. We read them like an instance variable and, as @eregon says, we don't do anything to turn them into constants at the moment. It doesn't happen for free as the globals object is, unsurprisingly, globally visible. We should fix that now and make globals constant. It would be a good beginner task if anyone is interested. To comment more generally on this thread, the problem you're trying to solve is a really tricky one. I've submitted a few PRs to
@chrisseaton I vaguely remember @pitr-ch changing globals into direct volatiles (as per the proposed mem model). I thought my reasoning above is why JRubyTruffle is not speculating on the "constant-ness" of globals.
@eregon I would propose treating globals as constants (compilation constants) until proven otherwise (have a mod counter, and after 2 or 10 changes switch to repeated volatile reads).
Yes, the modification counter is probably the key. We're also missing the feature in Ruby that lets you hook into globals to observe reads and writes - so we should really dig into the globals implementation in general.
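To make the modification-counter idea concrete, here is a toy model in plain Ruby. It is purely illustrative and is not JRuby+Truffle code: the `SpeculativeGlobal` class, its threshold of 2 writes and the `:deoptimize` marker are all assumptions made up for this sketch. A real implementation would invalidate compiled code instead of returning a marker.

```ruby
# Toy model of "treat a global as a constant until it is written too often".
class SpeculativeGlobal
  # Hypothetical threshold: after this many writes we stop speculating.
  MAX_WRITES = 2

  def initialize(value)
    @value = value
    @writes = 0   # the modification counter
  end

  # True while it is still worth assuming the value never changes.
  def constant?
    @writes < MAX_WRITES
  end

  def write(value)
    @value = value
    @writes += 1
  end

  # Stands in for compiled code. While speculating, the value is baked into
  # the reader and only a cheap counter check remains; otherwise the reader
  # performs a real read every time.
  def compile_reader
    return -> { @value } unless constant?

    cached = @value
    seen = @writes
    -> { seen == @writes ? cached : :deoptimize }
  end
end

counter = SpeculativeGlobal.new(1)
reader = counter.compile_reader
puts reader.call.inspect   # value baked in while no write has happened
counter.write(2)
puts reader.call.inspect   # the stale reader notices the write
```

After the second write, `constant?` turns false and `compile_reader` falls back to a plain read, which mirrors the "switch to repeated reads" step proposed above.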
@thedarkone For the memory model, speculating them as constants would still be fine as invalidation is thread-safe (it waits for every thread to throw the compiled code before continuing).
It seems I was wrong about MRI :) There is an inline cache for constants, as one would expect:
ruby --dump=insns -e 'A=42; loop {A}'
...
0000 trace 256 ( 1)
0002 trace 1
0004 getinlinecache 11, <is:0>
0007 getconstant :A
0009 setinlinecache <is:0>
0011 trace 512
0013 leave
Instruction 0004 is the inline cache (but
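The same kind of listing can be produced from inside a Ruby program, which is handy for comparing Ruby versions. This is a small sketch that assumes MRI, since `RubyVM::InstructionSequence` is not available on other implementations; the instruction names in the output differ between MRI versions, so do not expect them to match the listing above exactly.

```ruby
# MRI-only: compile a snippet without running it and print its bytecode.
listing = RubyVM::InstructionSequence.compile("A = 42; A").disasm
puts listing
```

Look for the instructions surrounding the read of `A` to see how a given MRI version caches the constant.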
Another type of overhead is unintentional allocations, like this:
# each {} is a new object: {}.object_id == {}.object_id => false
h.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil)
# freeze is to make sure we only read from object, but not write to it
EMPTY_OBJECT = {}.freeze
# same code without unintentional allocations
h.fetch(:a, EMPTY_OBJECT).fetch(:b, EMPTY_OBJECT).fetch(:c, EMPTY_OBJECT).fetch(:d, EMPTY_OBJECT).fetch(:e, nil)
Those objects can trigger GC, which will make the benchmark slower.
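The allocation difference can be observed directly. The sketch below assumes MRI, whose `GC.stat` exposes a `:total_allocated_objects` counter; the `allocated_objects` helper, the `nested` hash and the `EMPTY_HASH` constant are names introduced here for illustration.

```ruby
EMPTY_HASH = {}.freeze
nested = { a: { b: { c: { d: { e: 1 } } } } }

# Counts how many objects are allocated while the block runs.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Each {} default is built on every call, even though it is never used.
with_literals = allocated_objects do
  1_000.times do
    nested.fetch(:a, {}).fetch(:b, {}).fetch(:c, {}).fetch(:d, {}).fetch(:e, nil)
  end
end

# The frozen constant is created once and reused.
with_constant = allocated_objects do
  1_000.times do
    nested.fetch(:a, EMPTY_HASH).fetch(:b, EMPTY_HASH).fetch(:c, EMPTY_HASH).fetch(:d, EMPTY_HASH).fetch(:e, nil)
  end
end

puts "literal defaults: #{with_literals} objects allocated"
puts "frozen constant:  #{with_constant} objects allocated"
```

Checking the counter before and after a benchmark body is a cheap way to spot allocations that were not meant to be part of the measurement.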
I guess this is a bit stale for me, in terms of topics, but I didn't want to create a new one. The difference claimed (4x) goes down quite a bit when you remove the
I guess the above suffers from the issue described originally in this thread, but I didn't want to compare the method-calling, block-using version with the method-call-caching, string-using version. Also, when I did something like this:
the diff went down to 1.25x (on Ruby 2.7.4)
tl;dr When we measure small things, we have to remove every overhead that affects the result. Wrapping the code in a method and a block call is quite a big overhead for most of the benchmarks in this repository.
Suppose you want to compare the performance of "2 + 3" vs "2 * 3" calls. If you do it with the approach used in this repository, you will get these results:
(simplified output)
But there is one problem. Calling a method plus the surrounding block ({ slow }) has a bigger overhead than calling Fixnum#+ or Fixnum#* itself. The easiest way to observe this is to repeat the benchmarked operations in one call. Like this:
See how the results changed? Writing our benchmarks in this way would be quite problematic. Fortunately, the benchmark-ips gem has an answer for that. The Benchmark::IPS::Job#report method accepts a string, which is compiled before the benchmark is run. Passing the right string allows us to measure it properly:
This is explained well here: https://docs.omniref.com/ruby/2.2.1/symbols/Benchmark/bm#annotation=4095926&line=182
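To see why the string form helps, here is a rough sketch built only on the standard library's Benchmark module. It approximates the idea of compiling the code into a plain loop; it is not benchmark-ips's actual implementation, and the `compile_loop` helper and the `add` method are hypothetical names used for illustration.

```ruby
require "benchmark"

ITERATIONS = 100_000

# Block form: every iteration pays for a block call plus a method call.
def add
  2 + 3
end
block_form = proc { add }

# String form: the source is compiled once into a proc whose body is a bare
# while loop, so nothing but the measured code runs on each iteration.
def compile_loop(source)
  eval("proc { |n| i = 0; while i < n; #{source}; i += 1; end }")
end
string_form = compile_loop("2 + 3")

block_time  = Benchmark.realtime { ITERATIONS.times { block_form.call } }
string_time = Benchmark.realtime { string_form.call(ITERATIONS) }

puts format("block form:  %.4fs", block_time)
puts format("string form: %.4fs", string_time)
```

The gap between the two timings is overhead that has nothing to do with `2 + 3` itself, which is the distortion described above.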
How does this affect the fast-ruby benchmarks? All benchmarks that call small things are flawed. One of them is the Array#length vs Array#size vs Array#count benchmark. Here is the original code and the result obtained on my computer:
The same benchmark measured with the described approach gives different numbers:
Difference: 1.20x slower vs 3.96x slower.
My guess is that it affects most of the benchmarks.