Check performance of libtrixi against standard Trixi.jl #141

Open
2 of 9 tasks
benegee opened this issue Oct 24, 2023 · 9 comments

@benegee
Collaborator

benegee commented Oct 24, 2023

Is the runtime of the shim library variant and the PackageCompiler variant comparable to running Trixi.jl in a pure Julia setting?

  • Define a suitable elixir testcase
    • 3D Euler baroclinic instability
    • p4est for now, should be changed to t8code
  • Metrics
    • Raw performance (rhs! or PID)
    • Same but with N, 2N and 3N time steps
    • Time of a single step --> to figure out recurring overhead
    • Time of a single function call to Trixi.jl (e.g., ndims) --> to get precise number for overhead
    • Time to first solution --> how long until we get a first result (or total startup latency)
    • Time to second solution --> to figure out how much of startup latency is for compilation
  • Gather metrics from all ranks when using MPI
benegee added the testing and performance labels Oct 24, 2023
sloede changed the title from "Check performance of PackageCompiler.jl libtrixi" to "Check performance of libtrixi against standard Trixi.jl" Oct 24, 2023
@benegee
Collaborator Author

benegee commented Oct 25, 2023

To get a first impression, here are some results obtained on my local machine:

Command for running Trixi.jl: time julia --project=. -e 'using Trixi; trixi_include("elixir.jl")'

Command for running LibTrixi.jl: time JULIA_DEPOT_PATH=~/install/libtrixi-julia/julia-depot julia --project=./libtrixi-julia ~/libtrixi/examples/trixi_controller_simple.jl libelixir.jl

Command for running libtrixi: time ./bin/trixi_controller_simple_c ../libtrixi-julia libelixir.jl (with libtrixi compiled using the shim library and PackageCompiler, respectively)

Output of time:

| Test case | Trixi.jl | LibTrixi.jl | shim | PackageCompiler |
|---|---|---|---|---|
| tree2d dgsem advection amr (15 steps) | 18.36 | 20.57 | 21.62 | 17.64 |
| p4est2d dgsem euler sedov (554 steps) | 26.37 | 28.83 | 30.66 | 27.91 |

Trixi's summary callback total time:

| Test case | Trixi.jl | LibTrixi.jl | shim | PackageCompiler |
|---|---|---|---|---|
| tree2d dgsem advection amr (15 steps) | 0.58 | 1.96 | 1.96 | 1.95 |
| p4est2d dgsem euler sedov (554 steps) | 2.6 | 4.0 | 3.99 | 5.32 |
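
Read together, the two tables above allow a rough estimate of how much wall-clock time is spent outside the timed simulation (Julia startup, package loading, compilation). The sketch below subtracts the summary callback total from the `time` output. It assumes that both tables are in seconds and that the summary callback total covers only the simulation itself; the variable and function names are illustrative.

```python
# Wall-clock time reported by `time` and Trixi's summary callback total,
# copied from the tables above (assumed to be in seconds).
variants = ["Trixi.jl", "LibTrixi.jl", "shim", "PackageCompiler"]

wall_clock = {
    "tree2d dgsem advection amr": [18.36, 20.57, 21.62, 17.64],
    "p4est2d dgsem euler sedov": [26.37, 28.83, 30.66, 27.91],
}
summary_total = {
    "tree2d dgsem advection amr": [0.58, 1.96, 1.96, 1.95],
    "p4est2d dgsem euler sedov": [2.6, 4.0, 3.99, 5.32],
}


def time_outside_simulation(wall, summary):
    """Return the wall-clock time not covered by the summary callback, per variant."""
    return [round(w - s, 2) for w, s in zip(wall, summary)]


outside = {
    case: time_outside_simulation(wall_clock[case], summary_total[case])
    for case in wall_clock
}

for case, values in outside.items():
    for variant, value in zip(variants, values):
        print(f"{case:30s} {variant:16s} {value:6.2f} s outside the timed simulation")
```

For these small cases, most of the wall-clock time falls outside the timed simulation, so the `time` output here mainly reflects startup latency rather than solver performance.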

@benegee
Collaborator Author

benegee commented Oct 25, 2023

Rocinante

Output of time:

| Test case | Trixi.jl | LibTrixi.jl | shim | PackageCompiler |
|---|---|---|---|---|
| tree2d dgsem advection amr (15 steps) | 24.14 | 26.56 | 28.91 | |
| p4est2d dgsem euler sedov (554 steps) | 36.60 | 39.83 | 43.05 | |
| p4est3d dgsem euler taylor green (692 steps) | 228 | 227 | 229 | |

Trixi's summary callback total time:

| Test case | Trixi.jl | LibTrixi.jl | shim | PackageCompiler |
|---|---|---|---|---|
| tree2d dgsem advection amr (15 steps) | 0.65 | 2.84 | 2.86 | |
| p4est2d dgsem euler sedov (554 steps) | 3.84 | 6.07 | 6.07 | |
| p4est3d dgsem euler taylor green (692 steps) | 190 | 187 | 186 | |

@benegee
Collaborator Author

benegee commented Oct 26, 2023

So, my first impression is that we are losing some time compared to the pure Julia setting, but we already lose it with LibTrixi.jl.
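
To make that impression concrete, the sketch below computes the absolute and relative difference to Trixi.jl from the rocinante summary callback totals above. It assumes that the three values per row belong to the Trixi.jl, LibTrixi.jl and shim columns (the table lists no PackageCompiler values) and that the times are in seconds; the helper name `overhead` is made up for illustration.

```python
# Trixi's summary callback total time on rocinante (assumed seconds), per test case:
# (Trixi.jl, LibTrixi.jl, shim). The column assignment is an assumption.
summary_total = {
    "tree2d dgsem advection amr": (0.65, 2.84, 2.86),
    "p4est2d dgsem euler sedov": (3.84, 6.07, 6.07),
    "p4est3d dgsem euler taylor green": (190.0, 187.0, 186.0),
}


def overhead(reference, value):
    """Absolute (s) and relative (%) difference of `value` with respect to `reference`."""
    diff = value - reference
    return round(diff, 2), round(100.0 * diff / reference, 1)


for case, (trixi, libtrixi_jl, shim) in summary_total.items():
    for name, value in (("LibTrixi.jl", libtrixi_jl), ("shim", shim)):
        abs_diff, rel_diff = overhead(trixi, value)
        print(f"{case:34s} {name:12s} {abs_diff:+7.2f} s  {rel_diff:+7.1f} %")
```

The two 2D cases show a nearly constant offset of about 2.2 s, while the 3D case shows no slowdown. This would be consistent with a fixed one-time cost rather than a per-step overhead, although these numbers alone cannot confirm the cause.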

@benegee
Collaborator Author

benegee commented Oct 26, 2023

Here is the detailed Trixi.jl output on rocinante for the p4est3d dgsem euler taylor green test case:

Trixi.jl

Output
────────────────────────────────────────────────────────────────────────────────────────────────────
 Simulation running 'CompressibleEulerEquations3D' with DGSEM(polydeg=3)
────────────────────────────────────────────────────────────────────────────────────────────────────
 #timesteps:                692                run time:       1.90082541e+02 s
 Δt:             2.43871376e-03                └── GC time:    0.00000000e+00 s (0.000%)
 sim. time:      2.00000000e+00 (100.000%)     time/DOF/rhs!:  1.96351542e-07 s
                                               PID:            2.07344420e-07 s
 #DOFs per field:        262144                alloc'd memory:        524.510 MiB
 #elements:                4096

 Variable:       rho              rho_v1           rho_v2           rho_v3           rho_e
 L2 error:       8.95069643e-05   3.10321961e-01   3.10321961e-01   1.35978802e-03   3.46208376e-01
 Linf error:     5.37090306e-04   8.77762329e-01   8.77762329e-01   4.05597986e-03   1.04587655e+00
 ∑∂S/∂U ⋅ Uₜ :  -5.46066761e-05
────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 692 (accepted), 692 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

 ─────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                     Time                    Allocations
                                ───────────────────────   ────────────────────────
        Tot / % measured:             190s /  95.8%           37.6MiB /  58.4%

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────
 rhs!                    3.46k     179s   98.0%  51.6ms   9.33KiB    0.0%    2.76B
   volume integral       3.46k    91.6s   50.3%  26.5ms     0.00B    0.0%    0.00B
   interface flux        3.46k    50.6s   27.7%  14.6ms     0.00B    0.0%    0.00B
   prolong2interfaces    3.46k    17.1s    9.4%  4.95ms     0.00B    0.0%    0.00B
   surface integral      3.46k    14.1s    7.7%  4.08ms     0.00B    0.0%    0.00B
   reset ∂u/∂t           3.46k    2.69s    1.5%   776μs     0.00B    0.0%    0.00B
   Jacobian              3.46k    2.51s    1.4%   725μs     0.00B    0.0%    0.00B
   ~rhs!~                3.46k   41.2ms    0.0%  11.9μs   9.33KiB    0.0%    2.76B
   prolong2boundaries    3.46k   3.19ms    0.0%   922ns     0.00B    0.0%    0.00B
   mortar flux           3.46k   1.05ms    0.0%   303ns     0.00B    0.0%    0.00B
   prolong2mortars       3.46k    632μs    0.0%   183ns     0.00B    0.0%    0.00B
   boundary flux         3.46k    410μs    0.0%   118ns     0.00B    0.0%    0.00B
   source terms          3.46k   77.3μs    0.0%  22.3ns     0.00B    0.0%    0.00B
 calculate dt              693    2.13s    1.2%  3.07ms     0.00B    0.0%    0.00B
 analyze solution            8    1.46s    0.8%   182ms   21.9MiB  100.0%  2.74MiB
 ─────────────────────────────────────────────────────────────────────────────────


real    3m47.490s

LibTrixi.jl

Output
────────────────────────────────────────────────────────────────────────────────────────────────────
 Simulation running 'CompressibleEulerEquations3D' with DGSEM(polydeg=3)
────────────────────────────────────────────────────────────────────────────────────────────────────
 #timesteps:                692                run time:       1.86016236e+02 s
 Δt:             2.43871376e-03                └── GC time:    0.00000000e+00 s (0.000%)
 sim. time:      2.00000000e+00 (100.000%)     time/DOF/rhs!:  1.94844512e-07 s
                                               PID:            2.06342051e-07 s
 #DOFs per field:        262144                alloc'd memory:        624.456 MiB
 #elements:                4096

 Variable:       rho              rho_v1           rho_v2           rho_v3           rho_e
 L2 error:       8.95069643e-05   3.10321961e-01   3.10321961e-01   1.35978802e-03   3.46208376e-01
 Linf error:     5.37090306e-04   8.77762329e-01   8.77762329e-01   4.05597986e-03   1.04587655e+00
 ∑∂S/∂U ⋅ Uₜ :  -5.46066761e-05
────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 692 (accepted), 692 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

*** Trixi controller ***   Finalize Trixi simulation
 ─────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                     Time                    Allocations
                                ───────────────────────   ────────────────────────
        Tot / % measured:             186s /  95.2%           87.5MiB /  25.1%

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────
 rhs!                    3.46k     174s   98.0%  50.2ms   9.33KiB    0.0%    2.76B
   volume integral       3.46k    91.1s   51.4%  26.3ms     0.00B    0.0%    0.00B
   interface flux        3.46k    47.3s   26.7%  13.7ms     0.00B    0.0%    0.00B
   prolong2interfaces    3.46k    16.6s    9.3%  4.79ms     0.00B    0.0%    0.00B
   surface integral      3.46k    13.6s    7.7%  3.93ms     0.00B    0.0%    0.00B
   reset ∂u/∂t           3.46k    2.66s    1.5%   770μs     0.00B    0.0%    0.00B
   Jacobian              3.46k    2.49s    1.4%   720μs     0.00B    0.0%    0.00B
   ~rhs!~                3.46k   44.7ms    0.0%  12.9μs   9.33KiB    0.0%    2.76B
   prolong2boundaries    3.46k   2.71ms    0.0%   784ns     0.00B    0.0%    0.00B
   mortar flux           3.46k   1.38ms    0.0%   399ns     0.00B    0.0%    0.00B
   prolong2mortars       3.46k    983μs    0.0%   284ns     0.00B    0.0%    0.00B
   boundary flux         3.46k    390μs    0.0%   113ns     0.00B    0.0%    0.00B
   source terms          3.46k    110μs    0.0%  31.9ns     0.00B    0.0%    0.00B
 calculate dt              693    2.12s    1.2%  3.06ms     0.00B    0.0%    0.00B
 analyze solution            8    1.46s    0.8%   183ms   21.9MiB  100.0%  2.74MiB
 ─────────────────────────────────────────────────────────────────────────────────


real    3m46.918s

Shim

Output
────────────────────────────────────────────────────────────────────────────────────────────────────
 Simulation running 'CompressibleEulerEquations3D' with DGSEM(polydeg=3)
────────────────────────────────────────────────────────────────────────────────────────────────────
 #timesteps:                692                run time:       1.85059574e+02 s
 Δt:             2.43871376e-03                └── GC time:    0.00000000e+00 s (0.000%)
 sim. time:      2.00000000e+00 (100.000%)     time/DOF/rhs!:  1.90740414e-07 s
                                               PID:            2.02063818e-07 s
 #DOFs per field:        262144                alloc'd memory:        624.088 MiB
 #elements:                4096

 Variable:       rho              rho_v1           rho_v2           rho_v3           rho_e
 L2 error:       8.95069643e-05   3.10321961e-01   3.10321961e-01   1.35978802e-03   3.46208376e-01
 Linf error:     5.37090306e-04   8.77762329e-01   8.77762329e-01   4.05597986e-03   1.04587655e+00
 ∑∂S/∂U ⋅ Uₜ :  -5.46066761e-05
────────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 692 (accepted), 692 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────


*** Trixi controller ***   Finalize Trixi simulation
 ─────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                     Time                    Allocations
                                ───────────────────────   ────────────────────────
        Tot / % measured:             185s /  95.2%           87.0MiB /  25.2%

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────
 rhs!                    3.46k     173s   98.0%  49.9ms   9.33KiB    0.0%    2.76B
   volume integral       3.46k    91.3s   51.7%  26.4ms     0.00B    0.0%    0.00B
   interface flux        3.46k    45.9s   26.0%  13.3ms     0.00B    0.0%    0.00B
   prolong2interfaces    3.46k    16.4s    9.3%  4.73ms     0.00B    0.0%    0.00B
   surface integral      3.46k    14.1s    8.0%  4.08ms     0.00B    0.0%    0.00B
   reset ∂u/∂t           3.46k    2.56s    1.5%   740μs     0.00B    0.0%    0.00B
   Jacobian              3.46k    2.53s    1.4%   731μs     0.00B    0.0%    0.00B
   ~rhs!~                3.46k   49.3ms    0.0%  14.2μs   9.33KiB    0.0%    2.76B
   prolong2boundaries    3.46k   2.48ms    0.0%   718ns     0.00B    0.0%    0.00B
   prolong2mortars       3.46k   1.33ms    0.0%   385ns     0.00B    0.0%    0.00B
   mortar flux           3.46k   1.28ms    0.0%   369ns     0.00B    0.0%    0.00B
   source terms          3.46k    362μs    0.0%   105ns     0.00B    0.0%    0.00B
   boundary flux         3.46k   81.6μs    0.0%  23.6ns     0.00B    0.0%    0.00B
 calculate dt              693    2.11s    1.2%  3.05ms     0.00B    0.0%    0.00B
 analyze solution            8    1.44s    0.8%   180ms   21.9MiB  100.0%  2.74MiB
 ─────────────────────────────────────────────────────────────────────────────────


*** Trixi controller ***   Finalize Trixi

real    3m49.557s

@sloede
Member

sloede commented Oct 26, 2023

Thanks a lot for getting these numbers! A few notes:

  • First, we should define a suitable elixir test case. I would argue for a 3D Euler simulation on a P4estMesh, since this is the most relevant one in practice.
  • When comparing raw performance, we should only consider rhs! timer output or PID.
  • Besides raw performance (which is arguably the most important measure for real-world problems), we should carefully consider what we want to measure, and then how we can make sure that we really measure it and not something else (or a mixture).
  • The following metrics could be interesting:
    • Raw performance (rhs! or PID) --> most important for performance
    • Time of a single step --> to figure out recurring overhead
    • Time of a single function call to Trixi.jl (e.g., ndims) --> to get precise number for overhead
    • Time to first solution --> how long until we get a first result (or total startup latency)
    • Time to second solution --> to figure out how much of startup latency is for compilation

There might be other measures that are interesting as well (or maybe better than the ones I posted). In any case, I think it would be worthwhile to discuss if we want to do this properly now and if yes, do it, or if we just want to get some rough numbers.
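
The "time to first solution" and "time to second solution" metrics can be separated with a simple timing harness: run the same workload twice in one process and attribute the difference to one-time costs such as compilation. The sketch below illustrates the idea in Python with a stand-in workload, `fake_simulation`, which is purely hypothetical and merely imitates a one-time warm-up cost; it does not call Trixi.jl or libtrixi.

```python
import time

_warmed_up = False


def fake_simulation():
    """Hypothetical stand-in workload: slow on the first call, fast afterwards."""
    global _warmed_up
    if not _warmed_up:
        time.sleep(0.2)   # imitates a one-time cost (e.g., compilation)
        _warmed_up = True
    time.sleep(0.02)      # imitates the actual work


def timed(func):
    """Return the wall-clock duration of a single call to `func` in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


first = timed(fake_simulation)    # time to first solution
second = timed(fake_simulation)   # time to second solution
one_time_cost = first - second    # estimate of the startup/compilation share

print(f"first: {first:.3f} s, second: {second:.3f} s, one-time cost: {one_time_cost:.3f} s")
```

In a real measurement, the stand-in would be replaced by one complete simulation run, and the same harness could wrap a single cheap API call to estimate the per-call overhead.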

@benegee
Collaborator Author

benegee commented Oct 26, 2023

Thanks for the plan!

I am all for doing it properly.

BUT: I just checked a 3D test case as you suggested and it seems that with the increased load, the differences just vanish. See the updated numbers for rocinante above.

@sloede
Member

sloede commented Oct 26, 2023

> BUT: I just checked a 3D test case as you suggested and it seems that with the increased load, the differences just vanish. See the updated numbers for rocinante above.

I'd say that this is to some extent expected, but at the same time good news: Performance differences that only occur for toy problems generally do not matter (as much). I think it would also be good to have the PID, not just the integral numbers.

Once we add some select measurements on specific parts (as suggested above), we hopefully have enough to support the claim "as fast as pure Julia for all practical purposes".
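
As a first step in that direction, the PID values from the detailed p4est3d outputs on rocinante above can be compared directly. The sketch below only does arithmetic on the reported numbers; no PackageCompiler value is included because none was posted for this case.

```python
# PID (performance index) in seconds, copied from the detailed rocinante
# outputs for the p4est3d dgsem euler taylor green case.
pid = {
    "Trixi.jl": 2.07344420e-07,
    "LibTrixi.jl": 2.06342051e-07,
    "shim": 2.02063818e-07,
}

reference = pid["Trixi.jl"]

# Relative deviation from pure Trixi.jl in percent (negative means faster).
relative_deviation = {
    name: 100.0 * (value - reference) / reference for name, value in pid.items()
}

for name, deviation in relative_deviation.items():
    print(f"{name:12s} PID = {pid[name]:.4e} s  ({deviation:+.2f} % vs. Trixi.jl)")
```

For this case both library variants stay within about 3% of pure Trixi.jl and are nominally slightly faster, which for single runs is plausibly within run-to-run variation.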

@benegee
Collaborator Author

benegee commented May 31, 2024

Juwels

Trixi's summary callback rhs! time, minimum of 5 runs in seconds

| Test case | Trixi.jl | LibTrixi.jl | shim | PackageCompiler |
|---|---|---|---|---|
| p4est2d dgsem euler sedov | 7.9 | 11.3 | 11.2 | 11.1 |
| p4est3d dgsem euler baroclinic instability | 327.0 | 332.0 | 329.0 | 334.0 |

@sloede
Member

sloede commented May 31, 2024

Thank you for sharing (and recording) these numbers! Surely this is not the time for a single call to rhs!, or is it? I think it might be interesting to see, e.g., the 3D example with N, 2N and 3N time steps. I'd assume that it would show a constant initial overhead that gets amortized over time (or where do the differences come from 🤔?).
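
One way to test that assumption: if the run time is modeled as `t(n) = overhead + per_step * n`, measurements at N, 2N and 3N time steps allow a least-squares fit of both parameters. The sketch below uses made-up timings (not measurements from this issue) and a hypothetical helper `fit_overhead`.

```python
def fit_overhead(steps, times):
    """Least-squares fit of times = overhead + per_step * steps.

    Returns the tuple (overhead, per_step).
    """
    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, times))
    per_step = sxy / sxx
    overhead = mean_y - per_step * mean_x
    return overhead, per_step


# Made-up example: N = 100 steps, 0.5 s per step, 3 s constant overhead.
steps = [100, 200, 300]
times = [53.0, 103.0, 153.0]

overhead, per_step = fit_overhead(steps, times)
print(f"constant overhead: {overhead:.2f} s, cost per step: {per_step:.3f} s")
```

Running the fit separately for each variant would show whether the variants differ in the constant term, in the per-step term, or in both.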
