PyCUDA Learning Notes

A recent project at work needs GPU acceleration and nobody on the team knows CUDA programming, so I started learning it. For deep learning work, CUDA programming is pretty much a must-have skill anyway.


1. Introduction

CUDA is NVIDIA's API for parallel computing on the GPU. PyCUDA is a Python wrapper library: the kernel functions that do the actual computation still have to be written in CUDA C/C++, but everything else, such as memory allocation, device control, and compilation, can be done from Python, which is a considerable convenience. After digging through the available material, I suggest the following learning order:

  1. Start by reading CUDA编程入门极简教程, which introduces the principles of CUDA programming, the heterogeneous execution model, the programming basics, and the logical and physical layers of CUDA
  2. Work through the PyCUDA Tutorial to get familiar with the PyCUDA API
  3. Study the PyCUDA Examples to see how the code looks in real applications

The official documentation lives at https://documen.tician.de/pycuda/.

This post is essentially a translation of the official documentation and examples, with a few annotations and thoughts of my own. You can read it on its own or go through the official material listed above, but either way it is best to read CUDA编程入门极简教程 first.

After finishing this post, I also tried implementing one of the GPU algorithms we need at work. The coordinate systems are easy to mix up: the image width and height, the block x/y coordinates, and the thread x/y coordinates. As long as the goal is just working code, not maximum efficiency or parallelism, the rest is not particularly hard.

2. Installation

The official PyCUDA installation steps on Linux are roughly:

  1. Make sure CUDA is installed and configured correctly
  2. Install the gcc version matching your CUDA version
  3. Install Boost C++
  4. Install numpy
  5. Install PyCUDA

But deep learning work usually happens inside Docker, and images that ship with a deep learning framework typically already include CUDA, Boost, numpy, and so on. In practice only step 5 is needed:

# Install the dependencies
apt-get install build-essential python-dev python-setuptools libboost-python-dev libboost-thread-dev -y

# Install PyCUDA
pip install pycuda

Quite simple and convenient.
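
As a quick sanity check (my own addition, not part of the official steps), the following confirms that PyCUDA can see the GPU:

# check_pycuda.py - verify that PyCUDA can talk to the GPU
import pycuda.driver as cuda
import pycuda.autoinit  # initializes CUDA and creates a context on the default device

print(cuda.get_version())     # CUDA version PyCUDA was built against
print(cuda.Device(0).name())  # name of the first GPU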

3. Tutorial

This chapter is a translation of the official Tutorial, kept here for later reference.

Getting started

Imports and initialization

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

Transferring Data

Transferring data.

The first step is to move data from the host (the CPU and main memory) to the device (the GPU and its memory). The data is usually a numpy array, but in fact anything satisfying the Python buffer interface will do, even str.

Create a $4 \times 4$ array of random numbers:

import numpy
a = numpy.random.randn(4,4)

a contains double-precision numbers, but most NVIDIA devices only support single precision:

a = a.astype(numpy.float32)

Allocate memory on the device for a:

a_gpu = cuda.mem_alloc(a.nbytes)

Transfer the data to the GPU:

cuda.memcpy_htod(a_gpu, a)

Executing a Kernel

The CUDA computation: the kernel function.

In this section we double every value in a_gpu.

First, write the corresponding CUDA C code and feed it into pycuda.compiler.SourceModule:

mod = SourceModule("""
  __global__ void doublify(float *a)
  {
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
  }
  """)

If there are no errors, the code has been compiled and loaded onto the device. We then get a reference to our pycuda.driver.Function and call it, with a_gpu as the argument and a block size of $4 \times 4$:

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))

Finally, we fetch the result back from the GPU and print it:

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)

The output:

[[ 0.51360393  1.40589952  2.25009012  3.02563429]
 [-0.75841576 -1.18757617  2.72269917  3.12156057]
 [ 0.28826082 -2.92448163  1.21624792  2.86353827]
 [ 1.57651746  0.63500965  2.21570683 -0.44537592]]
[[ 0.25680196  0.70294976  1.12504506  1.51281714]
 [-0.37920788 -0.59378809  1.36134958  1.56078029]
 [ 0.14413041 -1.46224082  0.60812396  1.43176913]
 [ 0.78825873  0.31750482  1.10785341 -0.22268796]]

Shortcuts for Explicit Memory Copies

The argument handlers pycuda.driver.In, pycuda.driver.Out, and pycuda.driver.InOut can simplify these memory transfers. In the example above, if overwriting a is acceptable, there is no need to create a_gpu at all:

func(cuda.InOut(a), block=(4, 4, 1))

Bonus: Abstracting Away the Complications

A shortcut that avoids writing any kernel C++ code.

Using pycuda.gpuarray.GPUArray, the previous example can be reproduced with even less code:

import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy

a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print(a_doubled)
print(a_gpu)
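
GPUArray goes well beyond doubling: many numpy-style operations run directly on the GPU without any kernel code. Below is a small sketch of my own; the specific operations chosen are just illustrations:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import numpy

a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
b_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))

c_gpu = a_gpu + b_gpu          # elementwise add, stays on the device
s = gpuarray.sum(a_gpu).get()  # reduction; .get() brings the result back to the host
d_gpu = cumath.sin(a_gpu)      # elementwise sin from pycuda.cumath
print(c_gpu)
print(s)
print(d_gpu)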

Advanced Topics

Structures

Suppose we have the following structure, used to double each entry of a variable-length array. ptr points to the array's data, datalen gives its length, and __padding is there so that 64-bit pointers stay aligned:

mod = SourceModule("""
struct DoubleOperation {
    int datalen, __padding; // so 64-bit ptrs can be aligned
    float *ptr;
};

__global__ void double_array(DoubleOperation *a) {
    a = &a[blockIdx.x];
    for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x) {
        a->ptr[idx] *= 2;
    }
}
""")

Each block in the grid (blocks and grids are CUDA concepts; see the official documentation or the introductory post recommended in section 1) will double one of the arrays. The for loop allows more data elements than threads to be processed, though it is not efficient if one can guarantee that there will be enough threads.

Next, we create a wrapper class for the structure and instantiate two arrays:

class DoubleOpStruct:
    # Size of the struct: 8 fixed bytes for datalen and __padding, plus the size of
    # a pointer. numpy.intp is numpy's pointer-sized integer type; using its nbytes
    # instead of hard-coding 4 or 8 keeps this independent of 32-/64-bit platforms.
    mem_size = 8 + numpy.intp(0).nbytes
    # Constructor: `array` is the numpy array, `struct_arr_ptr` the device address
    # the struct is written to.
    def __init__(self, array, struct_arr_ptr):
        self.data = cuda.to_device(array)
        self.shape, self.dtype = array.shape, array.dtype
        # copy datalen into the struct
        cuda.memcpy_htod(int(struct_arr_ptr), numpy.getbuffer(numpy.int32(array.size)))
        # copy the device pointer to the array data into the struct
        cuda.memcpy_htod(int(struct_arr_ptr) + 8, numpy.getbuffer(numpy.intp(int(self.data))))
    def __str__(self):
        return str(cuda.from_device(self.data, self.shape, self.dtype))

# Allocate memory for two structs; mem_alloc returns a device pointer
struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
# Pointer to the second struct, obtained by offsetting
do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size

# Create two wrappers, for [1, 2, 3] and [0, 4]
array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
print("original arrays", array1, array2)

The code above uses the pycuda.driver.to_device() and pycuda.driver.from_device() functions to allocate and copy values, and demonstrates how offsets into an already-allocated block of device memory can be used.

The following code first doubles both arrays, then doubles only the second one:

func = mod.get_function("double_array")
func(struct_arr, block = (32, 1, 1), grid=(2, 1))
print("doubled arrays", array1, array2)

func(numpy.intp(do2_ptr), block = (32, 1, 1), grid=(1, 1))
print("doubled second only", array1, array2, "\n")

Note that when both arrays are doubled the grid size is (2, 1), while for a single array it is (1, 1): each block in the grid handles one array, so doubling both arrays takes only one kernel launch because the grid contains two blocks. Also note that a pointer obtained by offsetting must be wrapped with numpy.intp() before being passed to the kernel.

One slightly confusing point is the block size, which is (32, 1, 1) for both launches. In the first example, doubling a 4x4 matrix used block=(4, 4, 1), i.e. 4x4 threads with each thread doubling exactly one element. Here, because the kernel's for loop strides by blockDim.x, the block size affects only performance, not correctness, so the 32 (one warp) is largely a matter of convention. I tried (1, 1, 1) and it works just as well. If my understanding here is wrong, corrections are welcome.

This is the end of the official Tutorial. For further study, it recommends the Device Interface documentation and the official examples.

4. Device Interface

Device-related interfaces. I only note the highlights here rather than translate the whole page; see Device Interface for details.

It covers version queries, the exception classes, constants, the Device and Context classes, functions for concurrency and streams, memory-related functions, and the Module and Function classes needed to run code on the device.
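
As a quick illustration (my own sketch, not taken from the documentation), a few of the device-query calls look like this:

import pycuda.driver as cuda
import pycuda.autoinit  # initializes the driver and creates a context on device 0

dev = cuda.Device(0)
print(dev.name())
print(dev.compute_capability())                    # e.g. (7, 5)
print(dev.total_memory() // (1024 * 1024), "MiB")
# the full attribute dictionary: max threads per block, warp size, ...
for attr, value in dev.get_attributes().items():
    print(attr, value)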

5. Metaprogramming

In conventional software development, a human writes a program that accomplishes a task. With metaprogramming, a human writes a program that writes a program that accomplishes a task.

That sounds convoluted, so let's first look at why it is a good idea.

Why metaprogramming?

Automated Tuning

CUDA programmers spend a lot of their time tuning code:

  • How many threads per block give the best performance?
  • How much data should I process at a time?
  • Which data should be loaded into shared memory, and how large should the corresponding blocks be?

With luck you can settle on a heuristic pattern and quickly pick out the fastest version. But when the next generation of hardware arrives, that pattern stops working.

PyCUDA's answer:

Forget heuristics. Benchmark at run time and use whatever works fastest.

This is also an important advantage of PyCUDA over plain CUDA: it lets you make these decisions while your code is running.
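
To make "benchmark at run time" concrete, here is a small sketch of my own (not from the docs): compile one kernel, time it with CUDA events for several candidate block sizes, and keep whichever is fastest.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}
""")
scale = mod.get_function("scale")

n = 1 << 20
a = numpy.ones(n, dtype=numpy.float32)
a_gpu = drv.mem_alloc(a.nbytes)
drv.memcpy_htod(a_gpu, a)

timings = {}
for block_size in (32, 64, 128, 256, 512):
    grid_size = (n + block_size - 1) // block_size
    start, end = drv.Event(), drv.Event()
    start.record()
    scale(a_gpu, numpy.int32(n), block=(block_size, 1, 1), grid=(grid_size, 1))
    end.record()
    end.synchronize()
    timings[block_size] = start.time_till(end)  # milliseconds

best = min(timings, key=timings.get)
print("fastest block size:", best, timings)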

Data Types

At run time your code may have to deal with several data types, for example single- and double-precision floating point. You could precompile one version for each, but why bother? Simply generate whatever code is needed at the moment it is needed.
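
A minimal sketch of this idea (my own, using pycuda.tools.dtype_to_ctype for the dtype-to-C-type mapping): generate the kernel for whatever dtype the input happens to have.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda.tools import dtype_to_ctype

def make_doubler(dtype):
    ctype = dtype_to_ctype(dtype)  # e.g. "float" or "double"
    src = """
    __global__ void doubler(%(ctype)s *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] *= 2;
    }
    """ % {"ctype": ctype}
    return SourceModule(src).get_function("doubler")

a = numpy.random.randn(256).astype(numpy.float64)  # double precision this time
doubler = make_doubler(a.dtype)
doubler(drv.InOut(a), block=(256, 1, 1))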

Specialize Code for the Given Problem

Suppose you are writing a library that has to perform certain tasks. Imagine how much better it would be if you could generate code specifically for the task actually requested, instead of keeping the code needlessly general and therefore slow. PyCUDA makes that a reality.

Constants are Faster than Variables

If your problem size varies a lot between runs, but you perform a large number of kernel invocations on data of a single size, consider compiling the data size into your code as a constant. This can bring significant performance benefits, mainly from reduced fetch times and lower register pressure. Moreover, multiplying by a constant is executed more efficiently than a variable-times-variable multiplication.
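
A tiny sketch of my own showing the idea: the size is formatted into the kernel source as a literal instead of being passed as an argument.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

n = 4096  # assumed fixed across many launches in this run
mod = SourceModule("""
__global__ void scale(float *a)
{
    const int n = %(n)d;  // now a literal the compiler can optimize around
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 3.0f;
}
""" % {"n": n})

a = numpy.ones(n, dtype=numpy.float32)
mod.get_function("scale")(drv.InOut(a), block=(256, 1, 1), grid=(n // 256, 1))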

Loop Unrolling

The CUDA programming manual describes nvcc and its loop-unrolling capability. According to the PyCUDA author's experiments, however, this no longer holds as of version 2.1. With metaprogramming, loops can instead be unrolled automatically to whatever size is needed.

Metaprogramming using a Templating Engine

If your metaprogramming needs are simple, a templating engine is a good choice. Python has many templating engines; Jinja 2 and Cheetah are recommended.

The code below is a simple metaprogram that performs vector addition with a configurable block size:

from jinja2 import Template
from pycuda.compiler import SourceModule

# the official snippet leaves these to the caller; example values:
# thread_block_size = threads per block, block_size = chunks handled per
# thread, so each block covers thread_block_size * block_size elements
thread_block_size = 256
block_size = 4

tpl = Template("""
    __global__ void add(
            {{ type_name }} *tgt,
            {{ type_name }} *op1,
            {{ type_name }} *op2)
    {
      int idx = threadIdx.x +
        {{ thread_block_size }} * {{block_size}}
        * blockIdx.x;

      {% for i in range(block_size) %}
              {% set offset = i*thread_block_size %}
              tgt[idx + {{ offset }}] =
                op1[idx + {{ offset }}]
                + op2[idx + {{ offset }}];
          {% endfor %}
    }""")

rendered_tpl = tpl.render(
    type_name="float", block_size=block_size,
    thread_block_size=thread_block_size)

mod = SourceModule(rendered_tpl)

Runnable code is in examples/demo_meta_template.py. There is also a matrix-multiplication example using Cheetah templates: demo_meta_matrixmul_cheetah.py and demo_meta_matrixmul_cheetah.template.cu.

Metaprogramming using codepy

For more complicated metaprogramming, a templating engine is no longer enough. codepy provides a way to generate CUDA source code from Python data structures.

The example below does the same thing as the previous one:

import numpy
from pycuda.compiler import SourceModule
from codepy.cgen import FunctionBody, \
    FunctionDeclaration, Typedef, POD, Value, \
    Pointer, Module, Block, Initializer, Assign
from codepy.cgen.cuda import CudaGlobal

# as in the template example, these are assumed to be chosen elsewhere
dtype = numpy.float32
thread_block_size = 256
block_size = 4

mod = Module([
    FunctionBody(
        CudaGlobal(FunctionDeclaration(
            Value("void", "add"),
            arg_decls=[Pointer(POD(dtype, name))
                for name in ["tgt", "op1", "op2"]])),
        Block([
            Initializer(
                POD(numpy.int32, "idx"),
                "threadIdx.x + %d*blockIdx.x"
                % (thread_block_size*block_size)),
            ]+[
            Assign(
                "tgt[idx+%d]" % (o*thread_block_size),
                "op1[idx+%d] + op2[idx+%d]" % (
                    o*thread_block_size,
                    o*thread_block_size))
            for o in range(block_size)]))])

mod = SourceModule(mod)

The full code is in examples/demo_meta_codepy.py.

The official documentation goes on to cover the profiler, JIT compilation, OpenGL interoperability, GPU Arrays, and various other interfaces. I won't go through them all; let's move on to some of the examples.

6. Examples

The official repository provides 42 demos; this post only discusses the ones I found worth studying.

hello_gpu.py

This demo uses CUDA to compute the elementwise product of two random vectors of length 400, then subtracts the product computed in plain Python; the result should be the zero vector.

from __future__ import print_function
from __future__ import absolute_import
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1))

print(dest-a*b)

demo.py

The code from the Tutorial: doubling a 4x4 matrix three ways, using the original approach, the InOut shortcut, and the kernel-free gpuarray version.

# Sample source code from the Tutorial Introduction in the documentation.
from __future__ import print_function
from __future__ import absolute_import
import pycuda.driver as cuda
import pycuda.autoinit  # noqa
from pycuda.compiler import SourceModule

import numpy
a = numpy.random.randn(4, 4)

a = a.astype(numpy.float32)

a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)

cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1), grid=(1, 1), shared=0)

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print("original array:")
print(a)
print("doubled with kernel:")
print(a_doubled)

# alternate kernel invocation -------------------------------------------------

func(cuda.InOut(a), block=(4, 4, 1))
print("doubled with InOut:")
print(a)

# part 2 ----------------------------------------------------------------------

import pycuda.gpuarray as gpuarray
a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()

print("original array:")
print(a_gpu)
print("doubled with gpuarray:")
print(a_doubled)

demo_struct.py

The code from the Tutorial's Advanced Topics section.

# prepared invocations and structures -----------------------------------------
from __future__ import print_function
from __future__ import absolute_import
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule

class DoubleOpStruct:
    mem_size = 8 + numpy.uintp(0).nbytes
    def __init__(self, array, struct_arr_ptr):
        self.data = cuda.to_device(array)
        self.shape, self.dtype = array.shape, array.dtype
        """
        numpy.getbuffer() needed due to lack of new-style buffer interface for
        scalar numpy arrays as of numpy version 1.9.1

        see: https://github.com/inducer/pycuda/pull/60
        """
        cuda.memcpy_htod(int(struct_arr_ptr),
                         numpy.getbuffer(numpy.int32(array.size)))
        cuda.memcpy_htod(int(struct_arr_ptr) + 8,
                         numpy.getbuffer(numpy.uintp(int(self.data))))

    def __str__(self):
        return str(cuda.from_device(self.data, self.shape, self.dtype))


struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size

array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)

print("original arrays")
print(array1)
print(array2)

mod = SourceModule("""
    struct DoubleOperation {
        int datalen, __padding; // so 64-bit ptrs can be aligned
        float *ptr;
    };


    __global__ void double_array(DoubleOperation *a)
    {
        a = a + blockIdx.x;
        for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x)
        {
            float *a_ptr = a->ptr;
            a_ptr[idx] *= 2;
        }
    }
    """)
func = mod.get_function("double_array")
func(struct_arr, block=(32, 1, 1), grid=(2, 1))

print("doubled arrays")
print(array1)
print(array2)

func(numpy.uintp(do2_ptr), block=(32, 1, 1), grid=(1, 1))
print("doubled second only")
print(array1)
print(array2)

if cuda.get_version() < (4, ):
    func.prepare("P", block=(32, 1, 1))
    func.prepared_call((2, 1), struct_arr)
else:
    func.prepare("P")
    block = (32, 1, 1)
    func.prepared_call((2, 1), block, struct_arr)


print("doubled again")
print(array1)
print(array2)

if cuda.get_version() < (4, ):
    func.prepared_call((1, 1), do2_ptr)
else:
    func.prepared_call((1, 1), block, do2_ptr)


print("doubled second only again")
print(array1)
print(array2)

Note that, as the comment in the code explains, newer numpy versions lack a new-style buffer interface for scalar arrays, so numpy.getbuffer(numpy.uintp(int(self.data))) is used to transfer the data.
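
numpy.getbuffer() only exists under Python 2. Under Python 3, the same two transfers can be written with the scalar's tobytes() method; the following is my own adaptation of the two calls inside __init__, not part of the official example:

cuda.memcpy_htod(int(struct_arr_ptr),
                 numpy.int32(array.size).tobytes())
cuda.memcpy_htod(int(struct_arr_ptr) + 8,
                 numpy.uintp(int(self.data)).tobytes())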

demo_elementwise.py

A weighted linear combination of two vectors implemented with pycuda.elementwise.ElementwiseKernel. All of ElementwiseKernel's arguments are strings: the first declares the kernel's parameters; the second, called operation, defines how each output element is computed, and here it calls a device function declared in the preamble; the third is the kernel's name; and the preamble argument can include other files or define helper functions used by operation.

from __future__ import absolute_import
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand

a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = my_f(a*x[i], b*y[i])",
        "linear_combination",
        preamble="""
        __device__ float my_f(float x, float y)
        { 
          return sin(x*y);
        }
        """)

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm(c_gpu.get() - numpy.sin((5*a_gpu*6*b_gpu).get())) < 1e-5
print(c_gpu)

SimpleSpeedTest.py

Compares the speed of GPU and CPU computation. Both in the timings quoted in the author's comments and in my own tests, the GPU is far ahead of the CPU. Among the GPU variants, SourceModule performs best and GPUArray worst, and of the two ElementwiseKernel versions, looping in C is much faster than looping in Python. The conclusion is that writing the kernel entirely in CUDA C source is still the best option.

# Very simple speed testing code
# Shows you how to run a loop over sin() using different methods
# with a note of the time each method takes
# For the GPU this uses SourceModule, ElementwiseKernel, GPUArray
# For the CPU this uses numpy
# Ian@IanOzsvald.com

# Using a WinXP Intel Core2 Duo 2.66GHz CPU (1 CPU used)
# with a 9800GT GPU I get the following timings (smaller is better):
#
# Using nbr_values == 8192
# Calculating 100000 iterations
# SourceModule time and first three results:
# 0.166590s, [ 0.005477  0.005477  0.005477]
# Elementwise time and first three results:
# 0.171657s, [ 0.005477  0.005477  0.005477]
# Elementwise Python looping time and first three results:
# 1.487470s, [ 0.005477  0.005477  0.005477]
# GPUArray time and first three results:
# 4.740007s, [ 0.005477  0.005477  0.005477]
# CPU time and first three results:
# 32.933660s, [ 0.005477  0.005477  0.005477]
#
#
# Using Win 7 x64, GTX 470 GPU, X5650 Xeon,
# Driver v301.42, CUDA 4.2, Python 2.7 x64,
# PyCuda 2012.1 gave the following results:
#
# Using nbr_values == 8192
# Calculating 100000 iterations
# SourceModule time and first three results:
# 0.058321s, [ 0.005477  0.005477  0.005477]
# Elementwise time and first three results:
# 0.102110s, [ 0.005477  0.005477  0.005477]
# Elementwise Python looping time and first three results:
# 2.428810s, [ 0.005477  0.005477  0.005477]
# GPUArray time and first three results:
# 8.421861s, [ 0.005477  0.005477  0.005477]
# CPU time measured using :
# 5.905661s, [ 0.005477  0.005477  0.005477]


import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import pycuda.cumath
from pycuda.elementwise import ElementwiseKernel

blocks = 64
block_size = 128
nbr_values = blocks * block_size

print "Using nbr_values ==", nbr_values

# Number of iterations for the calculations,
# 100 is very quick, 2000000 will take a while
n_iter = 100000
print "Calculating %d iterations" % (n_iter)

# create two timers so we can speed-test each approach
start = drv.Event()
end = drv.Event()

######################
# SourceModule SECTION
# We write the C code and the indexing and we have lots of control

mod = SourceModule("""
__global__ void gpusin(float *dest, float *a, int n_iter)
{
  const int i = blockDim.x*blockIdx.x + threadIdx.x;
  for(int n = 0; n < n_iter; n++) {
    a[i] = sin(a[i]);
  }
  dest[i] = a[i];
}
""")

gpusin = mod.get_function("gpusin")

# create an array of 1s
a = numpy.ones(nbr_values).astype(numpy.float32)
# create a destination array that will receive the result
dest = numpy.zeros_like(a)

start.record() # start timing
gpusin(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1), block=(block_size,1,1) )
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "SourceModule time and first three results:"
print "%fs, %s" % (secs, str(dest[:3]))


#####################
# Elementwise SECTION
# use an ElementwiseKernel with sin in a for loop all in C call from Python
kernel = ElementwiseKernel(
   "float *a, int n_iter",
   "for(int n = 0; n < n_iter; n++) { a[i] = sin(a[i]);}",
   "gpusin")

a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
kernel(a_gpu, numpy.int32(n_iter))  # int32 matches the "int n_iter" kernel argument
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "Elementwise time and first three results:"
print "%fs, %s" % (secs, str(a_gpu.get()[:3]))


####################################
# Elementwise Python looping SECTION
# as Elementwise but the for loop is in Python, not in C
kernel = ElementwiseKernel(
   "float *a",
   "a[i] = sin(a[i]);",
   "gpusin")

a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
for i in range(n_iter):
    kernel(a_gpu)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "Elementwise Python looping time and first three results:"
print "%fs, %s" % (secs, str(a_gpu.get()[:3]))


##################
# GPUArray SECTION
# The result is copied back to main memory on each iteration, this is a bottleneck

a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
for i in range(n_iter):
    a_gpu = pycuda.cumath.sin(a_gpu)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "GPUArray time and first three results:"
print "%fs, %s" % (secs, str(a_gpu.get()[:3]))


#############
# CPU SECTION
# use numpy to calculate the result on the CPU for reference

a = numpy.ones(nbr_values).astype(numpy.float32)
start.record() # start timing
start.synchronize()

for i in range(n_iter):
    a = numpy.sin(a)

end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "CPU time and first three results:"
print "%fs, %s" % (secs, str(a[:3]))

ThreadsAndBlocks.py

This demo illustrates the relationship between grid, block, and thread.

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
    #include <stdio.h>

    __global__ void say_hi()
    {
      printf("I am %dth thread in threadIdx.x:%d.threadIdx.y:%d  blockIdx.:%d blockIdx.y:%d blockDim.x:%d blockDim.y:%d\\n",(threadIdx.x+threadIdx.y*blockDim.x+(blockIdx.x*blockDim.x*blockDim.y)+(blockIdx.y*blockDim.x*blockDim.y)),threadIdx.x, threadIdx.y,blockIdx.x,blockIdx.y,blockDim.x,blockDim.y);
    }
    """)

func = mod.get_function("say_hi")
func(block=(4,4,1),grid=(2,2,1))

The printf in the kernel is hard to read as a single line, so here it is spread out (for readability only; this is not valid C as written):

#include <stdio.h>

    __global__ void say_hi()
    {
      printf("I am %dth thread 
      in threadIdx.x:%d.
      threadIdx.y:%d  
      blockIdx.:%d 
      blockIdx.y:%d 
      blockDim.x:%d 
      blockDim.y:%d\\n"
      ,(threadIdx.x+threadIdx.y*blockDim.x+(blockIdx.x*blockDim.x*blockDim.y)+(blockIdx.y*blockDim.x*blockDim.y))
      ,threadIdx.x
      ,threadIdx.y
      ,blockIdx.x
      ,blockIdx.y
      ,blockDim.x
      ,blockDim.y);
    }

The following built-in variables are available inside the kernel and are used to divide up the work:

  • blockDim.x
  • blockDim.y
  • blockIdx.x
  • blockIdx.y
  • threadIdx.x
  • threadIdx.y

Here blockDim is the dimension of each block, set by the argument block=(4,4,1); threadIdx never exceeds it, and each block contains 16 threads. blockIdx is the block's coordinate within the grid, whose dimensions are set by grid=(2,2,1), so the grid contains 4 blocks.

In the demo, the thread id is computed as threadIdx.x + threadIdx.y * blockDim.x + (blockIdx.x * blockDim.x * blockDim.y) + (blockIdx.y * blockDim.x * blockDim.y). Note that with this formula, threads in blocks that sit symmetrically across the grid's diagonal produce duplicate ids.
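
For reference, the usual convention for a unique global id over a 2D grid of 2D blocks is block_id = blockIdx.x + blockIdx.y * gridDim.x, followed by global_id = block_id * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x. A small sketch of my own with that corrected formula:

import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
    #include <stdio.h>

    __global__ void say_hi_unique()
    {
      int block_id  = blockIdx.x + blockIdx.y * gridDim.x;
      int thread_id = block_id * blockDim.x * blockDim.y
                    + threadIdx.y * blockDim.x + threadIdx.x;
      printf("global thread id: %d\\n", thread_id);
    }
    """)

mod.get_function("say_hi_unique")(block=(4, 4, 1), grid=(2, 2, 1))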

MatrixmulSimple.py

Multiplies two matrices using a single block: each thread computes the element of the result matrix at its own coordinates.

The original demo multiplies two square matrices; I modified it to compute [M*N] * [N*M].

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Multiplies two square matrices together using a *single* block of threads and
global memory only. Each thread computes one element of the resulting matrix.
"""

import numpy as np
from pycuda import driver, compiler, gpuarray, tools

# -- initialize the device
import pycuda.autoinit

# [M*N] * [N*M] matrix multiply
kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
    // 2D Thread ID (assuming that only *one* block will be executed)
    // ty selects the row of the left matrix
    // tx selects the column of the right matrix
    // each thread computes the value of one element c[ty, tx] of the result
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    // Each thread loads one row of M and one column of N,
    //   to produce one element of P.
    for (int i = 0; i < %(N)s; ++i) {
        float Aelement = a[ty * %(N)s + i];
        float Belement = b[i * %(M)s + tx];
        Pvalue += Aelement * Belement;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    c[ty * %(M)s + tx] = Pvalue;
}
"""

# define the matrix sizes (the original demo used a single square size)
#  note that we'll only use *one* block of threads here
#  as a consequence M * M (the number of threads in the block) can't exceed max_threads,
#  see http://documen.tician.de/pycuda/util.html#pycuda.tools.DeviceData
#  for more information on how to get this number for your device
M = 2
N = 3

# create two random matrices, M x N and N x M
a_cpu = np.random.randn(M, N).astype(np.float32)
b_cpu = np.random.randn(N, M).astype(np.float32)

# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)

# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)

# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((M, M), np.float32)

# get the kernel code from the template
# by substituting the constants M and N
kernel_code = kernel_code_template % {
    'M': M,
    'N': N
    }

# compile the kernel code
mod = compiler.SourceModule(kernel_code)

# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")

# call the kernel on the card
matrixmul(
    # inputs
    a_gpu, b_gpu,
    # output
    c_gpu,
    # (only one) block of M x M threads
    block = (M, M, 1),
    )

# print the results
print("-" * 80)
print("Matrix A (GPU):")
print(a_gpu.get())

print("-" * 80)
print("Matrix B (GPU):")
print(b_gpu.get())

print("-" * 80)
print("Matrix C (GPU):")
print(c_gpu.get())

print("-" * 80)
print("Matrix C (CPU):")
print(c_cpu)

print("-" * 80)
print("CPU-GPU difference:")
print(c_cpu - c_gpu.get())

print(np.allclose(c_cpu, c_gpu.get()))