Thrust-test

A set of simple examples showing how to call Thrust algorithms on both raw and wrapped pointers.

The examples are organized in two folders:

old: these examples use cudaMalloc and cudaMemcpy (pre-CUDA 6)
new: these examples use Unified Memory with cudaMallocManaged (CUDA 6 or newer)

NOTE: using Thrust with managed memory requires the latest development version of Thrust (v1.8), available from https://github.com/thrust/thrust (the CUDA Toolkit ships only Thrust v1.7).

Unified Memory and Thrust

By default, Thrust relies on implicit algorithm dispatch, using tags associated with its vector containers. For example, the system tag for the iterators of thrust::device_vector is thrust::cuda::tag, so algorithms dispatched on such iterators are parallelized in the CUDA system. This does not work with memory allocated through cudaMallocManaged: the raw pointer it returns carries no such tag, so implicit dispatch cannot select the CUDA system for it. To avoid introducing new vector types, or wrapping existing managed memory, just to call a parallel algorithm, Thrust algorithms can be invoked with an explicitly specified execution policy. This approach is illustrated in the example below, where the array mA could also be passed directly, as is, to a host function or a CUDA kernel.

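#include <cmath>
#include <iostream>
#include <cuda_runtime.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/omp/execution_policy.h>

const int ARRAY_SIZE = 1000;

int main(int argc, char **argv) {
    // Managed memory is addressable from both the host and the device.
    double* mA;
    cudaMallocManaged(&mA, ARRAY_SIZE * sizeof(double));
    // No policy is given and mA is a raw pointer, so this fill is dispatched
    // to the host: mA = 1, 2, ..., ARRAY_SIZE.
    thrust::sequence(mA, mA + ARRAY_SIZE, 1);
    // Explicit CUDA policy: the reduction runs on the GPU.
    double maximumGPU = thrust::reduce(thrust::cuda::par, mA, mA + ARRAY_SIZE, 0.0,
                                       thrust::maximum<double>());
    // Make sure the GPU is done before the host touches mA again.
    cudaDeviceSynchronize();
    // Explicit OpenMP policy: same data, reduced on CPU threads
    // (the compiler must have OpenMP enabled).
    double maximumCPU = thrust::reduce(thrust::omp::par, mA, mA + ARRAY_SIZE, 0.0,
                                       thrust::maximum<double>());
    // The largest element of the sequence is ARRAY_SIZE.
    std::cout << "GPU reduce: "
              << (std::fabs(maximumGPU - ARRAY_SIZE) < 1e-10 ? "Passed" : "Failed")
              << std::endl;
    std::cout << "CPU reduce: "
              << (std::fabs(maximumCPU - ARRAY_SIZE) < 1e-10 ? "Passed" : "Failed")
              << std::endl;
    cudaFree(mA);
    return 0;
}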

With this model, the programmer specifies only the Thrust backend of interest (how the algorithm should be parallelized), without having to worry about whether that system can dereference the iterators passed to the algorithm (where the data "lives"). This is consistent with the simpler programming and memory management enabled by Unified Memory.
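The difference between implicit dispatch and an explicit execution policy can be pictured with a toy model in standard C++. This is only a sketch of the idea, not Thrust's implementation: host_tag, device_tag, tagged_ptr, system_of, unwrap and run_on are hypothetical names invented for illustration, and run_on merely reports which backend would be chosen instead of running an algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model only: hypothetical names, NOT Thrust's real types or functions.
struct host_tag {};
struct device_tag {};

// A "wrapped" pointer carries its system tag in its type.
template <typename T, typename Tag>
struct tagged_ptr {
    T* raw;
};

// Implicit dispatch deduces the system from the iterator type.
// A raw pointer carries no tag, so in this model it falls back to the host.
template <typename T>
host_tag system_of(T*) { return {}; }

template <typename T, typename Tag>
Tag system_of(tagged_ptr<T, Tag>) { return {}; }

// Recover the underlying raw pointer from either kind of iterator.
template <typename T>
T* unwrap(T* p) { return p; }

template <typename T, typename Tag>
T* unwrap(tagged_ptr<T, Tag> p) { return p.raw; }

inline std::string backend_name(host_tag) { return "host"; }
inline std::string backend_name(device_tag) { return "device"; }

// Explicit form: the caller names the policy, whatever the iterator type is.
template <typename Policy, typename Iter>
std::string run_on(Policy policy, Iter first, std::size_t n) {
    assert(n == 0 || unwrap(first) != nullptr);
    return backend_name(policy);
}

// Implicit form: the policy is deduced from the iterator's tag.
template <typename Iter>
std::string run_on(Iter first, std::size_t n) {
    return run_on(system_of(first), first, n);
}
```

In this model, given double data[4], the call run_on(data, 4) selects the host because a raw pointer has no tag, run_on(tagged_ptr<double, device_tag>{data}, 4) selects the device through the wrapper, and run_on(device_tag{}, data, 4) selects the device for the same raw pointer without any wrapping. The last form corresponds to how the example above passes thrust::cuda::par together with a managed raw pointer.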
