These examples show beginners how to write their own high-performance AI operators. We introduce optimization tricks such as shared memory and pipeline rearrangement to maximize throughput, and we provide an example of using CUTLASS to implement a fused FC + ReLU operator.
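To preview the shared-memory trick mentioned above, here is a minimal sketch of a tiled GEMM kernel. All names (`TILE`, `sgemm_tiled`) are illustrative and not taken from this repository; the actual implementations live in the example directories.

```cuda
#define TILE 16  // illustrative tile width, not the repo's tuned value

// Each thread block computes one TILE x TILE tile of C = A * B by
// staging tiles of A and B through shared memory, cutting global
// memory traffic compared with the naive one-load-per-multiply kernel.
__global__ void sgemm_tiled(const float *A, const float *B, float *C,
                            int M, int N, int K) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.f;

  for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
    // Cooperatively load one tile of A and one tile of B.
    int aCol = t * TILE + threadIdx.x;
    int bRow = t * TILE + threadIdx.y;
    As[threadIdx.y][threadIdx.x] =
        (row < M && aCol < K) ? A[row * K + aCol] : 0.f;
    Bs[threadIdx.y][threadIdx.x] =
        (bRow < K && col < N) ? B[bRow * N + col] : 0.f;
    __syncthreads();

    // Multiply the staged tiles entirely out of shared memory.
    for (int k = 0; k < TILE; ++k)
      acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();
  }
  if (row < M && col < N)
    C[row * N + col] = acc;
}
```

Pipeline rearrangement builds on this pattern by overlapping the load of the next tile with the computation on the current one (e.g. via double buffering), which is covered by the optimized variants in the repository.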
- Eigen: CPU linear algebra template library
- OpenMP: Enables multi-threaded acceleration on the CPU
- CUDA toolkit: Compiles GPU kernels and analyzes GPU execution
- Gflags: Command-line flags library released by Google
- CUTLASS: GPU GEMM template library
- Eigen: Use a package manager, e.g. `apt install libeigen3-dev`, or download it from the official website and build from source.
- OpenMP: Most compilers ship with OpenMP support. If yours does not, try `apt install libgomp-dev` or `apt install libomp-dev` for GCC or Clang, respectively.
- CUDA toolkit: We recommend installing it by following the official instructions.
- Gflags: Use a package manager, e.g. `apt install libgflags-dev`, or download it from the official website and build from source.
- CUTLASS: It is registered as a git submodule, so you do not have to install it yourself.
Once you have installed the dependencies, you can compile the project as follows:
```shell
git clone [email protected]:openmlsys/openmlsys-cuda.git
cd openmlsys-cuda
git submodule sync && git submodule update --init
mkdir build && cd build
cmake ..
make -j4
```
- `first_attempt`: The naive implementation
- `gemm`: Collection of implementations using different optimization tricks
- `fc_relu`: Example of fusing FC and ReLU using CUTLASS
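To illustrate what the FC + ReLU fusion buys, here is a hypothetical sketch of the idea as a plain kernel: the bias add and ReLU are applied in the same pass that writes the GEMM output, instead of launching a separate element-wise kernel. Names (`bias_relu_epilogue`) are illustrative; the `fc_relu` example expresses the same idea through CUTLASS's epilogue mechanism rather than a hand-written kernel.

```cuda
// Hypothetical fused epilogue: C already holds the GEMM result A * W;
// we add the per-column bias and apply ReLU in one pass, saving a
// full extra read and write of C over global memory.
__global__ void bias_relu_epilogue(float *C, const float *bias,
                                   int M, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < M * N) {
    float v = C[idx] + bias[idx % N];  // bias is broadcast along rows
    C[idx] = v > 0.f ? v : 0.f;        // ReLU fused with the bias add
  }
}
```

With CUTLASS, this fusion is typically expressed by choosing a ReLU-enabled epilogue functor for the GEMM template, so the activation is applied while the accumulators are still in registers.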