Benchmarking some very simple Flux models #2069

Open · mcabbott opened this issue Nov 6, 2024 · 5 comments
mcabbott (Contributor) commented Nov 6, 2024

On some extremely simple Flux models, Enzyme seems to be slower than Zygote for me. What's going wrong here?

julia> using Flux, Enzyme, Test, BenchmarkTools

julia> mlp = Chain(Flux.flatten, Dense(28^2 => 32, tanh), Dense(32 => 10));

julia> img = rand32(28, 28, 1, 128);

julia> @inferred mlp(img);  # type-stable

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 -15.980308
   6.2900686
 -79.44746

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 -15.980312
   6.2900686
 -79.44745

julia> @btime $mlp($img);
  min 10.958 μs, mean 14.119 μs (6 allocations, 43.09 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);
  min 38.250 μs, mean 67.356 μs (86 allocations, 596.27 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);
  min 75.125 μs, mean 119.919 μs (55 allocations, 579.61 KiB)

# a slightly bigger model

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, relu),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, relu),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, relu),
           Dense(120 => 84, relu), 
           Dense(84 => 10),
       );

julia> @inferred lenet(img);  # type-stable

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
 10.119315
  0.0
...

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
 10.119322
  0.0
...

julia> @btime $lenet($img);
  min 655.583 μs, mean 1.107 ms (160 allocations, 5.60 MiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 4.979 ms, mean 6.300 ms (558 allocations, 14.18 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 8.347 ms, mean 9.752 ms (538 allocations, 15.42 MiB)

# tweak Enzyme to see if details matter...

julia> tmp_loss(m,x) = sum(abs2, m(x));  # give it a name

julia> @btime Enzyme.gradient(Reverse, tmp_loss, $lenet, $img);
  min 8.260 ms, mean 9.766 ms (538 allocations, 15.42 MiB)

julia> @btime Enzyme.gradient(Reverse, tmp_loss, $lenet, Const($img));
  min 8.030 ms, mean 9.235 ms (479 allocations, 14.75 MiB)

julia> @btime Enzyme.autodiff(Reverse, tmp_loss, Active, $(Duplicated(lenet, deepcopy(lenet))), Const($img));
  min 7.642 ms, mean 8.638 ms (359 allocations, 14.57 MiB)
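(Aside: with autodiff the gradient is accumulated into the shadow passed to Duplicated, so to read it off and re-run fairly, the shadow should start at zero and be reset between calls. A rough sketch, assuming Enzyme.make_zero / Enzyme.make_zero! behave as documented:)

julia> dlenet = Enzyme.make_zero(lenet);  # zero-initialised shadow model

julia> Enzyme.autodiff(Reverse, tmp_loss, Active, Duplicated(lenet, dlenet), Const(img));

julia> dlenet.layers[1].bias  # gradient w.r.t. the first conv bias lives in the shadow

julia> Enzyme.make_zero!(dlenet);  # reset before the next call, since gradients accumulate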

Versions:

(jl_w98UzC) pkg> st Enzyme
Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_w98UzC/Project.toml`
  [7da242da] Enzyme v0.13.14

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 4 default, 0 interactive, 2 GC (on 5 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4

mcabbott (Contributor, Author) commented Nov 6, 2024

Trying this on another computer, with Julia 1.11, I see a similar slowdown on the small model and a failure on the larger one.

julia> @btime $mlp($img);
  173.251 μs (13 allocations: 42.36 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);
  494.602 μs (69 allocations: 588.97 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);
  884.058 μs (91 allocations: 586.92 KiB)

# Larger model fails:

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
ERROR: 
No create nofree of empty function (julia.gc_loaded) julia.gc_loaded)
 at context:   call fastcc void @julia__PoolDims_14_107488({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }* noalias nocapture nofree noundef nonnull writeonly sret({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }) align 8 dereferenceable(104) %5, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(64) %35, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(96) %34, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(32) %44, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(112) %36) #268, !dbg !297 (julia__PoolDims_14_107488)

Stacktrace:
 [1] PoolDims
   @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20
 [2] PoolDims
   @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43
 [3] MaxPool
   @ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728
 [4] macro expansion
   @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
 [5] _applychain
   @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53


Stacktrace:
  [1] PoolDims
    @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20 [inlined]
  [2] PoolDims
    @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43 [inlined]
  [3] MaxPool
    @ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53 [inlined]
  [5] _applychain
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
  [6] Chain
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:51 [inlined]
  [7] #19
    @ ./REPL[31]:1 [inlined]
  [8] diffejulia__19_105996_inner_242wrap
    @ ./REPL[31]:0
  [9] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
 [10] enzyme_call
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
 [11] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7641 [inlined]
 [12] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:491 [inlined]
 [13] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:512 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1678 [inlined]
 [15] gradient(rm::ReverseMode{…}, f::var"#19#20", x::Chain{…}, args::Array{…})
    @ Enzyme ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1661
 [16] top-level scope
    @ REPL[31]:1
Some type information was truncated. Use `show(err)` to see complete types.

(jl_KEzUxT) pkg> st Enzyme
Status `/tmp/jl_KEzUxT/Project.toml`
  [7da242da] Enzyme v0.13.14

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, broadwell)
Threads: 4 default, 0 interactive, 2 GC (on 12 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4
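(A guess at a smaller reproducer, not tested here: the stack trace points at PoolDims inside MaxPool, so a chain of just Conv + MaxPool may be enough to hit the same julia.gc_loaded error.)

julia> small = Chain(Conv((5, 5), 1=>6, relu), MaxPool((2, 2)));

julia> x = rand32(28, 28, 1, 1);

julia> Enzyme.gradient(Reverse, (m, z) -> sum(abs2, m(z)), small, x);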

wsmoses (Member) commented Nov 10, 2024

Sorry, finally getting around to this.

So for the first case, I don't see that much of a gap (though it would definitely be good to improve):


julia> fn(m, x) = sum(abs2, m(x))
fn (generic function with 2 methods)

julia> @btime $fn($mlp, $img);
  250.295 μs (6 allocations: 42.59 KiB)

julia> @btime Flux.gradient($fn, $mlp, $img);
  713.314 μs (84 allocations: 595.17 KiB)

julia> dmlp = Enzyme.make_zero(mlp);

julia> dimg = Enzyme.make_zero(img);

julia> @btime Enzyme.autodiff(Reverse, $fn, $(Duplicated(mlp, dmlp)), $(Duplicated(img, dimg)));
  800.866 μs (11 allocations: 85.16 KiB)

mcabbott (Contributor, Author) commented:

I'm surprised how different those numbers are. I realised I have AppleAccelerate loaded; if I run with --startup-file=no so that OpenBLAS is used instead, the relative difference is much smaller. (In fact the absolute difference is almost cut in half too.)
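(For reference, a quick way to confirm which BLAS backend is actually loaded, via the standard-library LinearAlgebra API:)

julia> using LinearAlgebra

julia> BLAS.get_config()  # lists the active LBT backends, e.g. OpenBLAS vs Accelerate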

julia> @btime $mlp($img);
  min 104.833 μs, mean 109.179 μs (6 allocations, 43.09 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);  # Zygote, allocating
  min 243.792 μs, mean 305.012 μs (84 allocations, 596.17 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);  # allocating
  min 266.292 μs, mean 329.010 μs (55 allocations, 579.61 KiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(mlp, Enzyme.make_zero(mlp))), $(Duplicated(img, Enzyme.make_zero(img))));  # pre-allocated
  min 256.916 μs, mean 270.453 μs (11 allocations, 86.16 KiB)

(Same machine & versions as above.)

wsmoses (Member) commented Nov 11, 2024

Huh, so what exactly causes it to be slow? AppleAccelerate itself?

mcabbott (Contributor, Author) commented Nov 11, 2024

Don't know. For the other model, switching to OpenBLAS gives a slightly larger time difference instead (and a slightly smaller ratio).

julia> @btime $lenet($img);  # was min 655.583 μs, mean 1.107 ms with AppleAccelerate above
  min 839.916 μs, mean 1.910 ms (160 allocations, 5.60 MiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 7.980 ms, mean 9.273 ms (556 allocations, 14.18 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 11.960 ms, mean 13.037 ms (538 allocations, 15.42 MiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(lenet, Enzyme.make_zero(lenet))), $(Duplicated(img, Enzyme.make_zero(img))));
  min 12.017 ms, mean 13.615 ms (415 allocations, 14.85 MiB)

The times in #2069 (comment) above, on a different computer, also don't involve AppleAccelerate.
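(One more variable that may be worth pinning down, a suggestion rather than something measured above: OpenBLAS and Accelerate handle threading differently, so fixing BLAS to one thread before timing would make the comparison cleaner.)

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)  # rule out BLAS threading differences before re-running @btime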
