
Improvements to Normalization Functions #95

Open
2 of 5 tasks
avik-pal opened this issue Jul 20, 2024 · 0 comments
avik-pal commented Jul 20, 2024

Current Status

Native Julia Implementations

  • groupnorm has custom kernels for the forward and backward pass, implemented by restructuring all arrays into 4D and writing either plain loops or KA kernels (a restructuring sketch follows this list). This gives roughly a 2-3x performance boost in standard cases (verified against Flux as well).
  • instancenorm -- implemented very similarly to groupnorm.
  • layernorm
    • We should change the default to match what PyTorch does. That case is simple to optimize using LoopVectorization on the CPU and KernelAbstractions on the GPU (a layernorm sketch also follows this list).
    • The general broadcasting case is very hard to optimize; the best we could do is fuse it into a single GPU kernel, which is not worth much.
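
A minimal sketch of the 4D restructuring idea behind the groupnorm statistics; the function name and the exact reshape convention here are illustrative assumptions, not the actual LuxLib kernels:

```julia
using Statistics

# Hypothetical illustration: collapse a (W, H, C, N) input into 4D groups so the
# per-group statistics reduce to a mean/var over the leading two dimensions.
function groupnorm_stats(x::AbstractArray{T,4}, groups::Int) where {T}
    W, H, C, N = size(x)
    @assert C % groups == 0
    # Restructure into (spatial, channels-per-group, groups, batch)
    x4 = reshape(x, W * H, C ÷ groups, groups, N)
    μ = mean(x4; dims=(1, 2))
    σ² = var(x4; mean=μ, dims=(1, 2), corrected=false)
    return x4, μ, σ²
end
```

Once the arrays are in this layout, the normalization and the backward pass become simple reductions/broadcasts over the first two dimensions, which is what makes the loop/KA-kernel versions straightforward to write.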
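And a sketch of what a PyTorch-style layernorm default would mean in column-major Julia terms, i.e. normalizing over the leading feature dimensions for each sample; the dimension convention and names are assumptions for illustration only:

```julia
using Statistics

# Hypothetical sketch: normalize over all dimensions except the last (batch) one,
# analogous to PyTorch's nn.LayerNorm(normalized_shape) default up to memory layout.
function layernorm_pytorch_style(x::AbstractArray{T,N}, γ, β; ϵ=1f-5) where {T,N}
    dims = ntuple(identity, N - 1)   # reduce over feature dims, keep the batch dim
    μ = mean(x; dims=dims)
    σ² = var(x; mean=μ, dims=dims, corrected=false)
    return γ .* (x .- μ) ./ sqrt.(σ² .+ ϵ) .+ β
end
```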

Integration into vendor libraries

  • MIOpen for AMDGPU batchnorm. With the new batchnorm kernels, we should test whether this is still worth it, though I would need someone with access to ROCm-capable GPUs to run the comparison.
    • The kernels seem quite performant, and we are close to cuDNN in performance even with the naive way the kernels are written (a minimal sketch of that kernel style follows below).
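
For context on the "naive" kernel style referenced above, here is a minimal KernelAbstractions sketch of an elementwise affine-normalization kernel that would run on AMDGPU/ROCm as well as CUDA backends; the names, signature, and channel convention are assumptions for illustration, not the actual LuxLib batchnorm kernels:

```julia
using KernelAbstractions

# Hypothetical sketch: apply precomputed per-channel statistics (μ, σ²) plus an
# affine transform, with the channel as the second-to-last dimension (NN layout).
@kernel function affine_normalize_kernel!(y, @Const(x), @Const(μ), @Const(σ²),
                                          @Const(γ), @Const(β), ϵ)
    I = @index(Global, Cartesian)
    c = I[ndims(x) - 1]   # channel index
    @inbounds y[I] = γ[c] * (x[I] - μ[c]) / sqrt(σ²[c] + ϵ) + β[c]
end

function affine_normalize!(y, x, μ, σ², γ, β; ϵ=1f-5)
    backend = KernelAbstractions.get_backend(x)
    kernel! = affine_normalize_kernel!(backend)
    kernel!(y, x, μ, σ², γ, β, ϵ; ndrange=size(y))
    KernelAbstractions.synchronize(backend)
    return y
end
```

The same kernel body works unchanged on a ROCArray or a CuArray, which is what would make a head-to-head benchmark against the MIOpen path straightforward for anyone with ROCm hardware.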