Skip to content

Latest commit

 

History

History
115 lines (88 loc) · 3.8 KB

INSTALL.md

File metadata and controls

115 lines (88 loc) · 3.8 KB

Build Instructions

We strongly recommend starting with a release tarball available on the GitHub Release Page when building from source for production uses.

aws-ofi-nccl requires a working installation of Libfabric (v1.18.0 or newer). You can find the instructions for installing libfabric at libfabric installation.

The plugin uses GNU autotools for its build system. You can build it as follows:

$ ./configure
$ make
$ sudo make install

If you want to install the plugin in a custom path, use the --prefix configure flag to provide the path. You can also point the build to custom dependencies with the following flags:

  --with-libfabric=PATH   Path to non-standard libfabric installation
  --with-cuda=PATH        Path to non-standard CUDA installation
  --with-mpi=PATH         Path to non-standard MPI installation
  --with-hwloc=PATH       Path to non-standard HWLOC installation

By default, the configure script attempts to auto-detect whether it is running on an AWS EC2 instance, and if so enables AWS-specific optimizations. These optimizations can be enabled regardless of build machine with the following config option:

  --enable-platform-aws   Enable AWS-specific configuration and optimizations.
                          (default: Enabled if on EC2 instance)

To enable trace messages for debugging (disabled by default), use the following config option:

   --enable-trace         Enable printing trace messages

To enable UBSAN (Undefined Behaviour Sanitizer), use the following config option:

   --enable-ubsan         Enable undefined behaviour checks with UBSAN

To enable memory access checks with ASAN (disabled by default), use the following config option:

   --enable-asan           Enable ASAN memory access checks

In case plugin is configured with --enable-asan and the executable binary is not compiled and linked with ASAN support, it is required to preload the ASAN library, i.e., run the application with export LD_PRELOAD=<path to libasan.so>.

In case plugin is configured with --enable-asan and the plugin is run within a CUDA application, environment variable ASAN_OPTIONS needs to include protect_shadow_gap=0. Otherwise, ASAN will crash on out-of-memory. NCCL currently has some memory leaks and ASAN reports memory leaks by default on process exit. To avoid warnings on such memory leaks, e.g., only invalid memory accesses are of interest, add detect_leaks=0 to ASAN_OPTIONS.

To enable memory access checks with valgrind (disabled by default), use the following config option:

   -with-valgrind[=PATH]  Enable valgrind memory access checks

Use optional parameter PATH to point the build to a custom path where valgrind is installed. The memory access checkers ASAN and valgrind are mutually exclusive.

In case plugin allocates a block of memory to store multiple structures, redzones are added between adjacent objects such that memory access checker can detect access out of the boundaries of these objects. The redzones are dedicated memory areas that are marked as not accessible by memory access checkers. The default size of redzones is 16 bytes in case memory access checks are enabled and 0 otherwise. To control the size of redzones, use the following config option:

   MEMCHECK_REDZONE_SIZE=REDZONE_SIZE   Size of redzones in bytes

Redzones are required to be a multiple of 8 due to ASAN shadow-map granularity.

LTTNG tracing is documented in the doc/tracing.md file.

To enable LTTNG tracing, use the following configuration option:

  --with-lttng=PATH       Path to LTTNG installation

By default, tests are built. To disable building tests, use the following config option:

   --disable-tests        Disable build of tests.