Benchmarking some very simple Flux models #2069

Open · mcabbott opened this issue Nov 6, 2024 · 5 comments
mcabbott (Contributor) commented Nov 6, 2024

On some extremely simple Flux models, Enzyme seems to be slower than Zygote for me. What's going wrong here?

julia> using Flux, Enzyme, Test, BenchmarkTools

julia> mlp = Chain(Flux.flatten, Dense(28^2 => 32, tanh), Dense(32 => 10));

julia> img = rand32(28, 28, 1, 128);

julia> @inferred mlp(img);  # type-stable

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 -15.980308
   6.2900686
 -79.44746

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 -15.980312
   6.2900686
 -79.44745

julia> @btime $mlp($img);
  min 10.958 μs, mean 14.119 μs (6 allocations, 43.09 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);
  min 38.250 μs, mean 67.356 μs (86 allocations, 596.27 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);
  min 75.125 μs, mean 119.919 μs (55 allocations, 579.61 KiB)

# a slightly bigger model

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, relu),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, relu),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, relu),
           Dense(120 => 84, relu), 
           Dense(84 => 10),
       );

julia> @inferred lenet(img);  # type-stable

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
 10.119315
  0.0
...

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
 10.119322
  0.0
...

julia> @btime $lenet($img);
  min 655.583 μs, mean 1.107 ms (160 allocations, 5.60 MiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 4.979 ms, mean 6.300 ms (558 allocations, 14.18 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 8.347 ms, mean 9.752 ms (538 allocations, 15.42 MiB)

# tweak Enzyme to see if details matter...

julia> tmp_loss(m,x) = sum(abs2, m(x));  # give it a name

julia> @btime Enzyme.gradient(Reverse, tmp_loss, $lenet, $img);
  min 8.260 ms, mean 9.766 ms (538 allocations, 15.42 MiB)

julia> @btime Enzyme.gradient(Reverse, tmp_loss, $lenet, Const($img));
  min 8.030 ms, mean 9.235 ms (479 allocations, 14.75 MiB)

julia> @btime Enzyme.autodiff(Reverse, tmp_loss, Active, $(Duplicated(lenet, deepcopy(lenet))), Const($img));
  min 7.642 ms, mean 8.638 ms (359 allocations, 14.57 MiB)
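(Aside: with autodiff the gradient is accumulated into the shadow passed to Duplicated, so to read it off and re-run fairly, the shadow should start at zero and be reset between calls. A rough sketch, assuming Enzyme.make_zero / Enzyme.make_zero! behave as documented:)

julia> dlenet = Enzyme.make_zero(lenet);  # zero-initialised shadow model

julia> Enzyme.autodiff(Reverse, tmp_loss, Active, Duplicated(lenet, dlenet), Const(img));

julia> dlenet.layers[1].bias  # gradient w.r.t. the first conv bias lives in the shadow

julia> Enzyme.make_zero!(dlenet);  # reset before the next call, since gradients accumulate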

Versions:

(jl_w98UzC) pkg> st Enzyme
Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_w98UzC/Project.toml`
  [7da242da] Enzyme v0.13.14

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 4 default, 0 interactive, 2 GC (on 5 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4

mcabbott (Contributor, Author) commented Nov 6, 2024

Trying this on another computer, with Julia 1.11, I see a similar slowdown on the small model and a failure on the larger one.

julia> @btime $mlp($img);
  173.251 μs (13 allocations: 42.36 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);
  494.602 μs (69 allocations: 588.97 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);
  884.058 μs (91 allocations: 586.92 KiB)

# Larger model fails:

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
ERROR: 
No create nofree of empty function (julia.gc_loaded) julia.gc_loaded)
 at context:   call fastcc void @julia__PoolDims_14_107488({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }* noalias nocapture nofree noundef nonnull writeonly sret({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }) align 8 dereferenceable(104) %5, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(64) %35, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(96) %34, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(32) %44, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(112) %36) #268, !dbg !297 (julia__PoolDims_14_107488)

Stacktrace:
 [1] PoolDims
   @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20
 [2] PoolDims
   @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43
 [3] MaxPool
   @ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728
 [4] macro expansion
   @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
 [5] _applychain
   @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53


Stacktrace:
  [1] PoolDims
    @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20 [inlined]
  [2] PoolDims
    @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43 [inlined]
  [3] MaxPool
    @ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53 [inlined]
  [5] _applychain
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
  [6] Chain
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:51 [inlined]
  [7] #19
    @ ./REPL[31]:1 [inlined]
  [8] diffejulia__19_105996_inner_242wrap
    @ ./REPL[31]:0
  [9] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
 [10] enzyme_call
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
 [11] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7641 [inlined]
 [12] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:491 [inlined]
 [13] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:512 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1678 [inlined]
 [15] gradient(rm::ReverseMode{…}, f::var"#19#20", x::Chain{…}, args::Array{…})
    @ Enzyme ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1661
 [16] top-level scope
    @ REPL[31]:1
Some type information was truncated. Use `show(err)` to see complete types.

(jl_KEzUxT) pkg> st Enzyme
Status `/tmp/jl_KEzUxT/Project.toml`
  [7da242da] Enzyme v0.13.14

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, broadwell)
Threads: 4 default, 0 interactive, 2 GC (on 12 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4
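(A guess at a smaller reproducer, not tested here: the stack trace points at PoolDims inside MaxPool, so a chain of just Conv + MaxPool may be enough to hit the same julia.gc_loaded error.)

julia> small = Chain(Conv((5, 5), 1=>6, relu), MaxPool((2, 2)));

julia> x = rand32(28, 28, 1, 1);

julia> Enzyme.gradient(Reverse, (m, z) -> sum(abs2, m(z)), small, x);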

wsmoses (Member) commented Nov 10, 2024

Sorry, finally getting around to this.

So for the first case, I don't see that much of a gap (though it would definitely be good to improve):


julia> fn(m, x) = sum(abs2, m(x))
fn (generic function with 2 methods)

julia> @btime $fn($mlp, $img);
  250.295 μs (6 allocations: 42.59 KiB)

julia> @btime Flux.gradient($fn, $mlp, $img);
  713.314 μs (84 allocations: 595.17 KiB)

julia> dmlp = Enzyme.make_zero(mlp);

julia> dimg = Enzyme.make_zero(img);

julia> @btime Enzyme.autodiff(Reverse, $fn, $(Duplicated(mlp, dmlp)), $(Duplicated(img, dimg)));
  800.866 μs (11 allocations: 85.16 KiB)

mcabbott (Contributor, Author) commented:

I'm surprised how different those numbers are. I realised I have AppleAccelerate loaded; if I run with --startup-file=no so that OpenBLAS is used instead, the relative difference is much smaller. (In fact the absolute difference is almost cut in half too.)
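(For reference, a quick way to confirm which BLAS backend is actually loaded, via the standard-library LinearAlgebra API:)

julia> using LinearAlgebra

julia> BLAS.get_config()  # lists the active LBT backends, e.g. OpenBLAS vs Accelerate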

julia> @btime $mlp($img);
  min 104.833 μs, mean 109.179 μs (6 allocations, 43.09 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);  # Zygote, allocating
  min 243.792 μs, mean 305.012 μs (84 allocations, 596.17 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);  # allocating
  min 266.292 μs, mean 329.010 μs (55 allocations, 579.61 KiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(mlp, Enzyme.make_zero(mlp))), $(Duplicated(img, Enzyme.make_zero(img))));  # pre-allocated
  min 256.916 μs, mean 270.453 μs (11 allocations, 86.16 KiB)

(Same machine & versions as above.)

wsmoses (Member) commented Nov 11, 2024

Huh, so what exactly causes it to be slow? AppleAccelerate itself?

mcabbott (Contributor, Author) commented Nov 11, 2024

Don't know. For the other model, switching to OpenBLAS gives a slightly larger time difference instead (and a slightly smaller ratio).

julia> @btime $lenet($img);  # was min 655.583 μs, mean 1.107 ms with AppleAccelerate above
  min 839.916 μs, mean 1.910 ms (160 allocations, 5.60 MiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 7.980 ms, mean 9.273 ms (556 allocations, 14.18 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 11.960 ms, mean 13.037 ms (538 allocations, 15.42 MiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(lenet, Enzyme.make_zero(lenet))), $(Duplicated(img, Enzyme.make_zero(img))));
  min 12.017 ms, mean 13.615 ms (415 allocations, 14.85 MiB)

The times in #2069 (comment) above, on a different computer, also don't involve AppleAccelerate.
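(One more variable that may be worth pinning down, a suggestion rather than something measured above: OpenBLAS and Accelerate handle threading differently, so fixing BLAS to one thread before timing would make the comparison cleaner.)

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)  # rule out BLAS threading differences before re-running @btime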
