
CUB 2.0.0

Released by @alliepiper on 15 Aug 16:53

Summary

The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.

This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.

Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.

Breaking Changes

  • #448: Add a dependency on libcu++ (v1.8.0+).
  • #448: The following macros are no longer defined by default. They can be re-enabled by defining CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release.
    • CUB_IS_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_IS_DEVICE_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_DEVICE_CODE: Replace with NV_IF_TARGET.
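A minimal sketch of the migration, assuming placeholder functions host_path and device_path:

```cpp
#include <cstdio>
#include <nv/target> // provided by libcu++ 1.8.0+

__host__ void host_path()     { std::printf("host pass\n"); }
__device__ void device_path() { /* device-side work */ }

__host__ __device__ void dispatch()
{
  // Formerly guarded with `#if CUB_IS_HOST_CODE` / `#if CUB_IS_DEVICE_CODE`;
  // NV_IF_TARGET picks one branch per compilation target.
  NV_IF_TARGET(NV_IS_HOST,
               (host_path();),
               (device_path();));
}
```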
  • #486: CUB’s CUDA Runtime support macros have been updated to support NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
    • CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.
      • Old behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled:
          • NVCC host pass: Defined to __host__ __device__
          • NVCC device pass: Defined to __host__
      • New behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled: Defined to __host__
    • CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled:
        • NVCC host pass: Macro is defined.
        • NVCC device pass: Macro is not defined.
    • CUB_RDC_ENABLED: New macro, may be combined with NV_IF_TARGET to replace most usages of CUB_RUNTIME_ENABLED. Behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled: Macro is not defined.
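As an illustration of combining the two macros, a hedged sketch in which CUB_RDC_ENABLED gates a device-side kernel launch while NV_IF_TARGET handles the host/device split; child_kernel and the LAUNCH_FROM_DEVICE helper macro are placeholders:

```cpp
#include <cub/cub.cuh>
#include <nv/target>

__global__ void child_kernel() {}

// Without RDC there is no device-side launch support, so the device
// branch expands to nothing.
#ifdef CUB_RDC_ENABLED
  #define LAUNCH_FROM_DEVICE child_kernel<<<1, 1>>>();
#else
  #define LAUNCH_FROM_DEVICE
#endif

CUB_RUNTIME_FUNCTION void launch_child()
{
  NV_IF_TARGET(NV_IS_HOST,
               (child_kernel<<<1, 1>>>();), // host pass: always available
               (LAUNCH_FROM_DEVICE));       // device pass: RDC builds only
}
```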
  • #509: A compile-time error is now emitted when a __device__-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).
    • Due to limitations in the CUDA programming model, the result of this query is unreliable and will silently be incorrect. This leads to difficult-to-debug errors.
    • When using libcu++ ≥ 1.9.0, an error is emitted with information about workarounds:
      • Use a named function object with a __device__-only implementation of operator().
      • Use a __host__ __device__ lambda.
      • Use cuda::proclaim_return_type (added in libcu++ 1.9.0).
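For example (a sketch; requires compiling with nvcc's --extended-lambda flag, and the function name is illustrative), a reduction lambda can proclaim its return type explicitly:

```cpp
#include <cfloat>
#include <cub/cub.cuh>
#include <cuda/functional> // cuda::proclaim_return_type (libcu++ 1.9.0+)

void max_reduce(const float *d_in, float *d_out, int num_items)
{
  // The lambda's return type can no longer be inspected from host code,
  // so proclaim it explicitly.
  auto max_op = cuda::proclaim_return_type<float>(
      [] __device__(float a, float b) { return a > b ? a : b; });

  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, max_op, -FLT_MAX);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, max_op, -FLT_MAX);
  cudaFree(d_temp_storage);
}
```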
  • #509: Use the result type of the binary reduction operator for accumulating intermediate results in the DeviceReduce algorithm, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch*Reduce layer have changed:
      • DispatchReduce:
        • Now accepts accumulator type as last parameter.
        • Now accepts initializer type instead of output iterator value type.
        • The constructor now takes the initial value using the initializer type rather than the output iterator's value type.
      • DispatchSegmentedReduce:
        • Accepts accumulator type as last parameter.
        • Accepts initializer type instead of output iterator value type.
    • The following thread operators now accept parameters of different types: Equality, Inequality, InequalityWrapper, Sum, Difference, Division, Max, ArgMax, Min, ArgMin.
    • ThreadReduce now accepts an accumulator type and uses a different type for the prefix.
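A sketch illustrating the new accumulator rule; the WideSum functor is illustrative. Because the operator returns int, partial results accumulate in int rather than in the 8-bit input type:

```cpp
#include <cub/cub.cuh>

// Illustrative functor: takes and returns a wide type, so the algorithm
// accumulates partial sums in int instead of unsigned char.
struct WideSum
{
  __host__ __device__ int operator()(int a, int b) const { return a + b; }
};

void sum_bytes(const unsigned char *d_in, int *d_out, int num_items)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, WideSum{}, 0);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, WideSum{}, 0);
  cudaFree(d_temp_storage);
}
```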
  • #511: Use the result type of the binary operator for accumulating intermediate results in the DeviceScan, DeviceScanByKey, and DeviceReduceByKey algorithms, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch layer have changed:
      • DispatchScan now accepts accumulator type as a template parameter.
      • DispatchScanByKey now accepts accumulator type as a template parameter.
      • DispatchReduceByKey now accepts accumulator type as the last template parameter.
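The same rule carries over to scans; a hedged sketch with an illustrative IntSum functor whose int result type sets the accumulator for 16-bit inputs:

```cpp
#include <cub/cub.cuh>

struct IntSum
{
  __host__ __device__ int operator()(int a, int b) const { return a + b; }
};

void scan_shorts(const short *d_in, int *d_out, int num_items)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, IntSum{}, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, IntSum{}, num_items);
  cudaFree(d_temp_storage);
}
```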
  • #527: Deprecate the debug_synchronous flags on device algorithms.
    • This flag no longer has any effect. Define CUB_DEBUG_SYNC during compilation to enable these checks.
    • Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
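A migration sketch (the build invocation is illustrative):

```cpp
#include <cub/cub.cuh>

void sum(const int *d_in, int *d_out, int num_items, cudaStream_t stream)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // Pre-2.0 callers passed a trailing `debug_synchronous` argument here;
  // the flag is now a no-op. Instead, define the macro at build time:
  //   nvcc -DCUB_DEBUG_SYNC app.cu -o app
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items, stream);
  cudaFree(d_temp_storage);
}
```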

New Features

  • #514: Support __half in DeviceHistogram.
  • #516: Add support for single-threaded invocations of WarpReduce.
  • #516: Use __reduce_add_sync hardware acceleration for WarpReduce on supported architectures.
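A sketch of the single-thread configuration; the kernel is illustrative and assumes a 32-thread block. On Ampere and later architectures, full-warp integer sums can additionally use the __reduce_add_sync intrinsic:

```cpp
#include <cub/cub.cuh>

// Launched with 32 threads per block: each thread forms its own logical
// "warp" of one thread, so the reduction returns the thread's own value.
__global__ void single_thread_warps(const int *d_in, int *d_out)
{
  using WarpReduce = cub::WarpReduce<int, /* LOGICAL_WARP_THREADS */ 1>;
  __shared__ typename WarpReduce::TempStorage temp_storage[32];

  int sum = WarpReduce(temp_storage[threadIdx.x]).Sum(d_in[threadIdx.x]);
  d_out[threadIdx.x] = sum; // equals d_in[threadIdx.x]
}
```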

Bug Fixes

  • #481: Fix the device-wide radix sort implementations to simply copy the input to the output when begin_bit == end_bit.
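A sketch of the now-handled degenerate case: with begin_bit == end_bit, no key bits participate in the comparison, so the call behaves as a copy (the bit value below is arbitrary):

```cpp
#include <cub/cub.cuh>

void copy_via_sort(const unsigned *d_in, unsigned *d_out, int num_items)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  int bit = 16; // begin_bit == end_bit: zero bits are sorted on
  cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, num_items, bit, bit);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, num_items, bit, bit);
  cudaFree(d_temp_storage);
}
```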
  • #487: Fix DeviceHistogram::Even for a variety of edge cases:
    • Bin ids are now correctly computed when mixing different types for SampleT and LevelT.
    • Bin ids are now correctly computed when LevelT is an integral type and the number of levels does not evenly divide the level range.
  • #508: Ensure that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
  • #508: Remove excessive calls to the binary operator given to the AdjacentDifferenceCopy device algorithms.
  • #533: Fix debugging utilities when RDC is disabled.

Other Enhancements

  • #448: Remove special-case code for unsupported CUDA architectures.
  • #448: Replace several usages of __CUDA_ARCH__ with <nv/target> to handle host/device code divergence.
  • #448: Mark unused PTX arch parameters as legacy.
  • #476: Enable additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
  • #480: Add CUB_DISABLE_BF16_SUPPORT to avoid including the cuda_bf16.h header or using the __nv_bfloat16 type.
  • #486: Add debug log messages for post-kernel debug synchronizations.
  • #490: Clarify documentation for in-place usage of DeviceScan algorithms.
  • #494: Clarify documentation for in-place usage of DeviceHistogram algorithms.
  • #495: Clarify documentation for in-place usage of DevicePartition algorithms.
  • #499: Clarify documentation for in-place usage of Device*Sort algorithms.
  • #500: Clarify documentation for in-place usage of DeviceReduce algorithms.
  • #501: Clarify documentation for in-place usage of DeviceRunLengthEncode algorithms.
  • #503: Clarify documentation for in-place usage of DeviceSelect algorithms.
  • #518: Fix typo in WarpMergeSort documentation.
  • #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.