Thrust简单入门

原英文教程地址https://docs.nvidia.com/cuda/thrust/index.html#introduction
将其简单翻译压缩一下

简单介绍:
Thrust 是一个类似STL的 CUDA C++ 模板库
如果装了CUDA的话他不用进一步安装即可使用

thrust提供两个矢量容器
host_vector 和 device_vector。
host_vector 存储在主机内存中
device_vector居住在GPU设备内存中。
类似于stl::vector

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(void)
{
    // H存储了四个整数
    thrust::host_vector<int> H(4);

    // 初始化
    H[0] = 14;
    H[1] = 20;
    H[2] = 38;
    H[3] = 46;

    // 返回数组长度
    std::cout << "H has size " << H.size() << std::endl;

    // 打印内容
    for(int i = 0; i < H.size(); i++)
        std::cout << "H[" << i << "] = " << H[i] << std::endl;

    // 改变大小
    H.resize(2);

    std::cout << "H now has size " << H.size() << std::endl;

    // 内存到显存的数据拷贝
    thrust::device_vector<int> D = H;

    // 修改内容
    D[0] = 99;
    D[1] = 88;

    // 打印D的内容
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    // 函数返回时,H,D的资源会自动释放
    return 0;
}

这个例子说了 “=”可以用于显存上的数组与内存上的数组内容互相拷贝
还要注意一个单独的元素device_vector可以使用标准括号表示来访问。但是，因为每个访问都需要呼叫cudaMemcpy，他们应该谨慎使用。稍后会给出一些更高效的方式。

将矢量的所有元素初始化为特定值或将一组特定值从一个矢量复制到另一个矢量通常很有用。推力提供了几种方法来完成这些操作。

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>

#include <iostream>

int main(void)
{
    // 将数组初始化为1
    thrust::device_vector<int> D(10, 1);

    // 用9设置D的前7个元素
    thrust::fill(D.begin(), D.begin() + 7, 9);

    // 用D的前五个元素初始化H
    thrust::host_vector<int> H(D.begin(), D.begin() + 5);

    // 将 H 元素修改为 0, 1, 2, 3, ...
    thrust::sequence(H.begin(), H.end());

    // 拷贝H的元素到D
    thrust::copy(H.begin(), H.end(), D.begin());

    // 打印D
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    return 0;
}

这里我们已经介绍了 fill, copy,和 sequence的使用方式。

2.1
在这里我们需要注意命名冲突,命名空间的使用是很重要的

2.2
上面的迭代器与一般的指针和迭代器不同的是他们携带了他们是用于内存/显存的信息
这决定了算法的调用是使用何种实现
这种类型的判断是在编译的时候进行的,这被称为static dispatching(静态调用)

如果使用原生指针去调用thrust的算法,会默认调用host版
如果原生指针使用的是显存,那么需要先用thrust::device_ptr进行封装

size_t N = 10;

// 原生指针指向显存
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));

// 用 device_ptr 封装原生指针
thrust::device_ptr<int> dev_ptr(raw_ptr);

thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

device_ptr转换为原生指针:

size_t N = 10;

thrust::device_ptr<int> dev_ptr = thrust::device_malloc<int>(N);

int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);

区分迭代器与原生指针的另一个原因是可以用于遍历多种数据结构
如std::list
虽然thrust没有提供这种容器的实现,但兼容他们

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <list>
#include <vector>

int main(void)
{
    // create an STL list with 4 values
    std::list<int> stl_list;

    stl_list.push_back(10);
    stl_list.push_back(20);
    stl_list.push_back(30);
    stl_list.push_back(40);

    // initialize a device_vector with the list
    thrust::device_vector<int> D(stl_list.begin(), stl_list.end());

    // copy a device_vector into an STL vector
    std::vector<int> stl_vector(D.size());
    thrust::copy(D.begin(), D.end(), stl_vector.begin());

    return 0;
}

3.算法
thrust提供了很多公用的并行算法
thrust中的所有算法都有device和host两种实现
除了thrust:copy之外,其他的算法的迭代器参数理应位于同一个位置(显存or内存),不能混用
如果违反,编译器将报错

3.1 Transformations
下面展示了一些算法以及仿函数的使用
除了下面展示的negate,modulus,thrust
还提供了plus and multiplies等多种仿函数在thrust/functional.h

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/replace.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    // allocate three device_vectors with 10 elements
    thrust::device_vector<int> X(10);
    thrust::device_vector<int> Y(10);
    thrust::device_vector<int> Z(10);

    // initialize X to 0,1,2,3, ....
    thrust::sequence(X.begin(), X.end());

    //  Y = -X
    thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

    // fill Z with twos
    thrust::fill(Z.begin(), Z.end(), 2);

    //  Y = X mod 2
    thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());

    // replace all the ones in Y with tens
    thrust::replace(Y.begin(), Y.end(), 1, 10);

    // print Y
    thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));

    return 0;    
}

下面展示了 saxpy (ax+y) 的两种实现

struct saxpy_functor
{
    const float a;

    saxpy_functor(float _a) : a(_a) {}

    __host__ __device__
        float operator()(const float& x, const float& y) const { 
            return a * x + y;
        }
};

void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    // Y <- A * X + Y
    thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));
}

void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y)
{
    thrust::device_vector<float> temp(X.size());

    // temp <- A
    thrust::fill(temp.begin(), temp.end(), A);

    // temp <- A * X
    thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());

    // Y <- A * X + Y
    thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());
}

如果忽略temp的分配开销
saxpy_fast共有2N次读,1N写的开销
saxpy_slow有4N次读,3N次写的开销
在像SAXPY这样的算法中，通常值得应用kernel fusion（将多个操作组合到单个kernel中）以最小化内存读写的开销

thrust::transform 只支持具有一个或两个输入参数的转换（例如, f(x) → y and f(x,x)->y）。
当转换使用两个以上的输入参数时，有必要使用不同的方法。例如thrust::zip_iterator 和 thrust::for_each

3.2 Reductions(归约)
归约算法使用二元操作符将一个序列的值转换为单个值
如求和,求最大值

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus());

这里前两个参数提供归约序列,后两个提供初始值和归约运算符
下面三个是等价的0和plus是默认参数

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus());
int sum = thrust::reduce(D.begin(), D.end(), (int) 0);
int sum = thrust::reduce(D.begin(), D.end())

虽然thrust::reduce足以实现各种各样的归约，Thrust提供了一些便利的附加功能。
例如，thrust::count 以给定顺序返回特定值的实例数量：

#include <thrust/count.h>
#include <thrust/device_vector.h>
...
// put three 1s in a device_vector
thrust::device_vector<int> vec(5,0);
vec[1] = 1;
vec[3] = 1;
vec[4] = 1;

// count the 1s
int result = thrust::count(vec.begin(), vec.end(), 1);
// result is three

thrust还提供了其他归约运算如thrust::count_if, thrust::min_element, thrust::max_element, thrust::is_sorted, thrust::inner_product, 等等见参考文档

下面用一个计算向量范数的实现方案来介绍reduce与transform的合并操作

#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cmath>

// square<T> computes the square of a number f(x) -> x*x
template <typename T>
struct square
{
    __host__ __device__
        T operator()(const T& x) const { 
            return x * x;
        }
};

int main(void)
{
    // initialize host array
    float x[4] = {1.0, 2.0, 3.0, 4.0};

    // transfer to device
    thrust::device_vector<float> d_x(x, x + 4);

    // setup arguments
    square<float>        unary_op;
    thrust::plus<float> binary_op;
    float init = 0;

    // compute norm
    float norm = std::sqrt( thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op) );

    std::cout << norm << std::endl;

    return 0;
}

这里需要提供单元运算符和二元运算符

3.3 Prefix-Sums
数组前缀和和扫描操作是许多并行算法重要的一块
下面展示一个扫描操作inclusive scan (默认使用加运算)

#include <thrust/scan.h>

int data[6] = {1, 0, 2, 2, 1, 3};

thrust::inclusive_scan(data, data + 6, data); // in-place scan

// data is now {1, 1, 3, 5, 6, 9}

Sn = a₁ + … a_n

有一个类似的操作exclusive scan,不过他的结果向右偏移了一个位置

#include <thrust/scan.h>

int data[6] = {1, 0, 2, 2, 1, 3};

thrust::exclusive_scan(data, data + 6, data); // in-place scan

// data is now {0, 1, 1, 3, 5, 6}

Sn = a₁ + … a_n-1

3.4. Reordering

Thrust 给分区(partitioning) 和流压缩(stream compaction)提供了以下算法支持

copy_if : 通过判定条件复制元素

partition : 根据判定条件对元素重新排序(错例在后)

remove 和 remove_if : 移除判定条件为false的元素

unique: 去重

具体使用详见文档

3.5 sorting (排序)
thrust提供了thrust::sort 和 thrust::stable_sort 类比与stl的

#include <thrust/sort.h>

...
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};

thrust::sort(A, A + N);

// A is now {1, 2, 4, 5, 7, 8}

通过键值排序 thrust::sort_by_key 和 thrust::stable_sort_by_key

#include <thrust/sort.h>

...
const int N = 6;
int    keys[N] = {  1,   4,   2,   8,   5,   7};
char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};

thrust::sort_by_key(keys, keys + N, values);

// keys is now   {  1,   2,   4,   5,   7,   8}
// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

他们还接受一个比较运算符(仿函数)

#include <thrust/sort.h>
#include <thrust/functional.h>

...
const int N = 6;
int A[N] = {1, 4, 2, 8, 5, 7};

thrust::stable_sort(A, A + N, thrust::greater<int>());

// A is now {8, 7, 5, 4, 2, 1}

4.魔幻迭代器

4.1 constant_iterator
这种迭代器用于返回常量,无论索引为何

#include <thrust/iterator/constant_iterator.h>
...
// create iterators
thrust::constant_iterator<int> first(10);
thrust::constant_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 10
first[100] // returns 10

// sum of [first, last)
thrust::reduce(first, last);   // returns 30 (i.e. 3 * 10)

对于常数序列,这是一个高效便利的方式

4.2 counting_iterator
如果需要一个自增的序列,这个迭代器是一个很好的选择

#include <thrust/iterator/counting_iterator.h>
...
// create iterators
thrust::counting_iterator<int> first(10);
thrust::counting_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 11
first[100] // returns 110

// sum of [first, last)
thrust::reduce(first, last);   // returns 33 (i.e. 10 + 11 + 12)

constant_iterator 和 counting_iterator并无存储数组,只是实时计算返回

4.3 transform_iterator

#include <thrust/iterator/transform_iterator.h>
// initialize vector
thrust::device_vector<int> vec(3);
vec[0] = 10; vec[1] = 20; vec[2] = 30;

// create iterator (type omitted)
...
first = thrust::make_transform_iterator(vec.begin(), negate<int>());
...
last  = thrust::make_transform_iterator(vec.end(),   negate<int>());

first[0]   // returns -10
first[1]   // returns -20
first[2]   // returns -30

// sum of [first, last)
thrust::reduce(first, last);   // returns -60 (i.e. -10 + -20 + -30)

这种迭代器行如其名
这里没有使用transform_iterator而是用make_transform_iterator是因为考虑到过长的迭代器类型
下面的代码避免了first与last的创造后的不必要的存储操作

thrust::reduce(thrust::make_transform_iterator(vec.begin(), negate<int>()),
           thrust::make_transform_iterator(vec.end(),   negate<int>()));

4.4 permutation_iterator
这种迭代器通过一个键值来间接访问原序列,即打乱顺序

#include <thrust/iterator/permutation_iterator.h>

...

// gather locations
thrust::device_vector<int> map(4);
map[0] = 3;
map[1] = 1;
map[2] = 0;
map[3] = 5;

// array to gather from
thrust::device_vector<int> source(6);
source[0] = 10;
source[1] = 20;
source[2] = 30;
source[3] = 40;
source[4] = 50;
source[5] = 60;

// fuse gather with reduction: 
//   sum = source[map[0]] + source[map[1]] + ...
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map),
                         thrust::make_permutation_iterator(source.begin(), map.end()));

迭代器的初始和结束由map决定

4.5 zip_iterator
和python的zip差不多吧
结合多个独立序列

#include <thrust/iterator/zip_iterator.h>
...
// initialize vectors
thrust::device_vector<int>  A(3);
thrust::device_vector<char> B(3);
A[0] = 10;  A[1] = 20;  A[2] = 30;
B[0] = 'x'; B[1] = 'y'; B[2] = 'z';

// create iterator (type omitted)
first = thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin()));
last  = thrust::make_zip_iterator(thrust::make_tuple(A.end(),   B.end()));

first[0]   // returns tuple(10, 'x')
first[1]   // returns tuple(20, 'y')
first[2]   // returns tuple(30, 'z')

// maximum of [first, last)
thrust::maximum< tuple<int,char> > binary_op;
thrust::tuple<int,char> init = first[0];
thrust::reduce(first, last, init, binary_op); // returns tuple(30, 'z')