
CUB 2.0.0

Released by @alliepiper on 15 Aug 16:53

Summary

The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.

This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.

Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.

Breaking Changes

  • #448: Add a dependency on libcu++ (v1.8.0+).
  • #448: The following macros are no longer defined by default. They can be re-enabled by defining CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release.
    • CUB_IS_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_IS_DEVICE_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_DEVICE_CODE: Replace with NV_IF_TARGET.
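A minimal sketch of the migration, assuming placeholder functions host_path and device_path:

```cpp
#include <cstdio>
#include <nv/target> // provided by libcu++ 1.8.0+

__host__ void host_path()     { std::printf("host pass\n"); }
__device__ void device_path() { /* device-side work */ }

__host__ __device__ void dispatch()
{
  // Formerly guarded with `#if CUB_IS_HOST_CODE` / `#if CUB_IS_DEVICE_CODE`;
  // NV_IF_TARGET picks one branch per compilation target.
  NV_IF_TARGET(NV_IS_HOST,
               (host_path();),
               (device_path();));
}
```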
  • #486: CUB’s CUDA Runtime support macros have been updated to support NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
    • CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.
      • Old behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled:
          • NVCC host pass: Defined to __host__ __device__
          • NVCC device pass: Defined to __host__
      • New behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled: Defined to __host__
    • CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled:
        • NVCC host pass: Macro is defined.
        • NVCC device pass: Macro is not defined.
    • CUB_RDC_ENABLED: New macro, may be combined with NV_IF_TARGET to replace most usages of CUB_RUNTIME_ENABLED. Behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled: Macro is not defined.
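As an illustration of combining the two macros, a hedged sketch in which CUB_RDC_ENABLED gates a device-side kernel launch while NV_IF_TARGET handles the host/device split; child_kernel and the LAUNCH_FROM_DEVICE helper macro are placeholders:

```cpp
#include <cub/cub.cuh>
#include <nv/target>

__global__ void child_kernel() {}

// Without RDC there is no device-side launch support, so the device
// branch expands to nothing.
#ifdef CUB_RDC_ENABLED
  #define LAUNCH_FROM_DEVICE child_kernel<<<1, 1>>>();
#else
  #define LAUNCH_FROM_DEVICE
#endif

CUB_RUNTIME_FUNCTION void launch_child()
{
  NV_IF_TARGET(NV_IS_HOST,
               (child_kernel<<<1, 1>>>();), // host pass: always available
               (LAUNCH_FROM_DEVICE));       // device pass: RDC builds only
}
```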
  • #509: A compile-time error is now emitted when a __device__-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).
    • Due to limitations in the CUDA programming model, the result of this query is unreliable and will silently be incorrect. This leads to difficult-to-debug errors.
    • When using libcu++ ≥ 1.9.0, an error is emitted with information about workarounds:
      • Use a named function object with a __device__-only implementation of operator().
      • Use a __host__ __device__ lambda.
      • Use cuda::proclaim_return_type (added in libcu++ 1.9.0).
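For example (a sketch; requires compiling with nvcc's --extended-lambda flag, and the function name is illustrative), a reduction lambda can proclaim its return type explicitly:

```cpp
#include <cfloat>
#include <cub/cub.cuh>
#include <cuda/functional> // cuda::proclaim_return_type (libcu++ 1.9.0+)

void max_reduce(const float *d_in, float *d_out, int num_items)
{
  // The lambda's return type can no longer be inspected from host code,
  // so proclaim it explicitly.
  auto max_op = cuda::proclaim_return_type<float>(
      [] __device__(float a, float b) { return a > b ? a : b; });

  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, max_op, -FLT_MAX);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, max_op, -FLT_MAX);
  cudaFree(d_temp_storage);
}
```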
  • #509: Use the result type of the binary reduction operator for accumulating intermediate results in the DeviceReduce algorithm, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch*Reduce layer have changed:
      • DispatchReduce:
        • Now accepts accumulator type as last parameter.
        • Now accepts initializer type instead of output iterator value type.
        • The constructor now takes the initial value using the initializer type rather than the output iterator's value type.
      • DispatchSegmentedReduce:
        • Accepts accumulator type as last parameter.
        • Accepts initializer type instead of output iterator value type.
    • The following thread operators now accept parameters of different types: Equality, Inequality, InequalityWrapper, Sum, Difference, Division, Max, ArgMax, Min, ArgMin.
    • ThreadReduce now accepts an accumulator type and uses a different type for the prefix.
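A sketch illustrating the new accumulator rule; the WideSum functor is illustrative. Because the operator returns int, partial results accumulate in int rather than in the 8-bit input type:

```cpp
#include <cub/cub.cuh>

// Illustrative functor: takes and returns a wide type, so the algorithm
// accumulates partial sums in int instead of unsigned char.
struct WideSum
{
  __host__ __device__ int operator()(int a, int b) const { return a + b; }
};

void sum_bytes(const unsigned char *d_in, int *d_out, int num_items)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, WideSum{}, 0);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
                            d_in, d_out, num_items, WideSum{}, 0);
  cudaFree(d_temp_storage);
}
```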
  • #511: Use the result type of the binary operator for accumulating intermediate results in the DeviceScan, DeviceScanByKey, and DeviceReduceByKey algorithms, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch layer have changed:
      • DispatchScan now accepts accumulator type as a template parameter.
      • DispatchScanByKey now accepts accumulator type as a template parameter.
      • DispatchReduceByKey now accepts accumulator type as the last template parameter.
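The same rule carries over to scans; a hedged sketch with an illustrative IntSum functor whose int result type sets the accumulator for 16-bit inputs:

```cpp
#include <cub/cub.cuh>

struct IntSum
{
  __host__ __device__ int operator()(int a, int b) const { return a + b; }
};

void scan_shorts(const short *d_in, int *d_out, int num_items)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, IntSum{}, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, IntSum{}, num_items);
  cudaFree(d_temp_storage);
}
```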
  • #527: Deprecate the debug_synchronous flags on device algorithms.
    • This flag no longer has any effect. Define CUB_DEBUG_SYNC during compilation to enable these checks.
    • Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
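A migration sketch (the build invocation is illustrative):

```cpp
#include <cub/cub.cuh>

void sum(const int *d_in, int *d_out, int num_items, cudaStream_t stream)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // Pre-2.0 callers passed a trailing `debug_synchronous` argument here;
  // the flag is now a no-op. Instead, define the macro at build time:
  //   nvcc -DCUB_DEBUG_SYNC app.cu -o app
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items, stream);
  cudaFree(d_temp_storage);
}
```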

New Features

  • #514: Support __half in DeviceHistogram.
  • #516: Add support for single-threaded invocations of WarpReduce.
  • #516: Use __reduce_add_sync hardware acceleration for WarpReduce on supported architectures.
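A sketch of the single-thread configuration; the kernel is illustrative and assumes a 32-thread block. On Ampere and later architectures, full-warp integer sums can additionally use the __reduce_add_sync intrinsic:

```cpp
#include <cub/cub.cuh>

// Launched with 32 threads per block: each thread forms its own logical
// "warp" of one thread, so the reduction returns the thread's own value.
__global__ void single_thread_warps(const int *d_in, int *d_out)
{
  using WarpReduce = cub::WarpReduce<int, /* LOGICAL_WARP_THREADS */ 1>;
  __shared__ typename WarpReduce::TempStorage temp_storage[32];

  int sum = WarpReduce(temp_storage[threadIdx.x]).Sum(d_in[threadIdx.x]);
  d_out[threadIdx.x] = sum; // equals d_in[threadIdx.x]
}
```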

Bug Fixes

  • #481: Fix the device-wide radix sort implementations to simply copy the input to the output when begin_bit == end_bit.
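A sketch of the now-handled degenerate case: with begin_bit == end_bit, no key bits participate in the comparison, so the call behaves as a copy (the bit value below is arbitrary):

```cpp
#include <cub/cub.cuh>

void copy_via_sort(const unsigned *d_in, unsigned *d_out, int num_items)
{
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  int bit = 16; // begin_bit == end_bit: zero bits are sorted on
  cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, num_items, bit, bit);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, num_items, bit, bit);
  cudaFree(d_temp_storage);
}
```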
  • #487: Fix DeviceHistogram::Even for a variety of edge cases:
    • Bin ids are now correctly computed when mixing different types for SampleT and LevelT.
    • Bin ids are now correctly computed when LevelT is an integral type and the number of levels does not evenly divide the level range.
  • #508: Ensure that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
  • #508: Remove excessive calls to the binary operator given to the AdjacentDifferenceCopy device algorithms.
  • #533: Fix debugging utilities when RDC is disabled.

Other Enhancements

  • #448: Remove special-case code for unsupported CUDA architectures.
  • #448: Replace several usages of __CUDA_ARCH__ with <nv/target> to handle host/device code divergence.
  • #448: Mark unused PTX arch parameters as legacy.
  • #476: Enable additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
  • #480: Add CUB_DISABLE_BF16_SUPPORT to avoid including the cuda_bf16.h header or using the __nv_bfloat16 type.
  • #486: Add debug log messages for post-kernel debug synchronizations.
  • #490: Clarify documentation for in-place usage of DeviceScan algorithms.
  • #494: Clarify documentation for in-place usage of DeviceHistogram algorithms.
  • #495: Clarify documentation for in-place usage of DevicePartition algorithms.
  • #499: Clarify documentation for in-place usage of Device*Sort algorithms.
  • #500: Clarify documentation for in-place usage of DeviceReduce algorithms.
  • #501: Clarify documentation for in-place usage of DeviceRunLengthEncode algorithms.
  • #503: Clarify documentation for in-place usage of DeviceSelect algorithms.
  • #518: Fix typo in WarpMergeSort documentation.
  • #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.