A recent project at work needs GPU acceleration, and nobody on the team knows CUDA programming, so I started learning it. For deep learning work, CUDA programming is arguably an essential skill anyway.
1. Introduction
CUDA is NVIDIA's API for parallel computing on the GPU. PyCUDA is a Python wrapper around it: the kernel functions that do the actual computation still have to be written in CUDA C/C++, but everything else, such as memory allocation, device control, and compilation, can be done from Python, which is a real convenience. Based on the material I found, I suggest learning in the following order:
- Start with the introductory post CUDA编程入门极简教程, which covers the principles of CUDA programming, the heterogeneous execution model, programming basics, and CUDA's logical and physical layers
- Work through the PyCUDA Tutorial to get familiar with the PyCUDA API
- Study the PyCUDA Examples to see how the code is used in practice
The official documentation lives at http://documen.tician.de/pycuda/.
This post is essentially a translation of the official documentation and examples, with some annotations and my own thoughts added. You can read this post on its own, or study the official material listed above. Either way, it is best to read CUDA编程入门极简教程 first.
After finishing this post I also tried implementing the GPU algorithm needed at work. The easiest things to mix up are the different coordinate systems: the image width and height, the block x,y indices, and the thread x,y indices. As long as you only need working code and are not chasing maximal efficiency or parallelism, the rest is not that hard.
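To make those coordinate systems concrete, here is a minimal sketch (not the algorithm from work, just an illustration with made-up image sizes): each thread maps its block/thread x index to an image column and its y index to an image row, and threads that fall outside the image simply return.
import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale_image(float *img, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x direction -> image width
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y direction -> image height
    if (col < width && row < height)
        img[row * width + col] *= 2.0f;                // row-major, like a numpy array
}
""")
func = mod.get_function("scale_image")

height, width = 480, 640                               # made-up image size
img = numpy.random.rand(height, width).astype(numpy.float32)
block = (16, 16, 1)
grid = ((width + block[0] - 1) // block[0], (height + block[1] - 1) // block[1])
func(cuda.InOut(img), numpy.int32(width), numpy.int32(height), block=block, grid=grid)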
2. Installation
The official PyCUDA installation steps on Linux are roughly:
- Make sure CUDA is installed and configured correctly
- Install the gcc version that matches your CUDA release
- Install Boost C++
- Install numpy
- Install PyCUDA
But people doing deep learning usually work inside Docker, and images that ship a deep learning framework normally already contain CUDA, Boost, numpy and so on. In my tests only step 5 was really needed:
# install the dependency libraries
apt-get install build-essential python-dev python-setuptools libboost-python-dev libboost-thread-dev -y
# install PyCUDA
pip install pycuda
Quite simple and convenient.
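A quick sanity check that the install works (a minimal sketch, nothing project-specific): let pycuda.autoinit create a context and ask the first device a few questions.
import pycuda.autoinit              # creates a context on the first device
import pycuda.driver as cuda

print("CUDA version PyCUDA was built against:", cuda.get_version())
dev = cuda.Device(0)
print("Device 0:", dev.name())
print("Total device memory (MiB):", dev.total_memory() // (1024 * 1024))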
3. Tutorial
This chapter is a translation of the official Tutorial, kept here for later reference.
Getting started
Imports and initialization
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
Transferring Data
Transferring data.
The first step is to move data from the host (the CPU and main memory) to the device (the GPU and its memory). The data is usually a numpy array, but in fact anything satisfying the Python buffer interface will do, even a str.
Create a $4 \times 4$ matrix of random numbers:
import numpy
a = numpy.random.randn(4,4)
a contains double-precision floats, but NVIDIA devices usually only support single precision:
a = a.astype(numpy.float32)
Allocate the device memory that a needs:
a_gpu = cuda.mem_alloc(a.nbytes)
Transfer the data to the GPU:
cuda.memcpy_htod(a_gpu, a)
Executing a Kernel
CUDA computation: the kernel function
In this section we double every value in a_gpu.
First, write the corresponding CUDA C code and pass it to pycuda.compiler.SourceModule:
mod = SourceModule("""
__global__ void doublify(float *a)
{
int idx = threadIdx.x + threadIdx.y*4;
a[idx] *= 2;
}
""")
If there is no error, the code has been compiled and loaded onto the device. We then obtain a reference to our pycuda.driver.Function and call it, passing a_gpu as the argument and a block size of $4 \times 4$:
func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))
Finally, we fetch the result back from the GPU and print it:
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
The result looks like this:
[[ 0.51360393 1.40589952 2.25009012 3.02563429]
[-0.75841576 -1.18757617 2.72269917 3.12156057]
[ 0.28826082 -2.92448163 1.21624792 2.86353827]
[ 1.57651746 0.63500965 2.21570683 -0.44537592]]
[[ 0.25680196 0.70294976 1.12504506 1.51281714]
[-0.37920788 -0.59378809 1.36134958 1.56078029]
[ 0.14413041 -1.46224082 0.60812396 1.43176913]
[ 0.78825873 0.31750482 1.10785341 -0.22268796]]
Shortcuts for Explicit Memory Copies
The argument handlers pycuda.driver.In, pycuda.driver.Out, and pycuda.driver.InOut can simplify these memory transfers. In the previous example, if overwriting a is acceptable, there is no need to create a_gpu at all:
func(cuda.InOut(a), block=(4, 4, 1))
Bonus: Abstracting Away the Complications
A way to simplify the code so that no kernel C++ code needs to be written at all.
Using pycuda.gpuarray.GPUArray, the same result as the previous example can be obtained:
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print(a_doubled)
print(a_gpu)
Advanced Topics
Structures
Suppose we have the following structure, used for doubling arrays of variable length. ptr is a pointer to the array, datalen gives its length, and __padding is there so the 64-bit pointer can be aligned:
mod = SourceModule("""
struct DoubleOperation {
int datalen, __padding; // so 64-bit ptrs can be aligned
float *ptr;
};
__global__ void double_array(DoubleOperation *a) {
a = &a[blockIdx.x];
for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x) {
a->ptr[idx] *= 2;
}
}
""")
Each block in the grid (a CUDA concept; see the official documentation, or the introductory post recommended in chapter 1) doubles one of the arrays. The for loop allows more data elements than threads to be doubled, though it is not efficient if you can guarantee there will be a sufficient number of threads. (A small standalone illustration of this stride pattern follows below.)
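As a small aside (not from the tutorial), here is the same stride pattern on a plain float array, launched with far fewer threads (32) than elements (1000):
import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod_stride = SourceModule("""
__global__ void double_all(float *a, int n)
{
    // each thread starts at its own index and strides by the block size
    for (int idx = threadIdx.x; idx < n; idx += blockDim.x)
        a[idx] *= 2;
}
""")
a = numpy.ones(1000, dtype=numpy.float32)
mod_stride.get_function("double_all")(cuda.InOut(a), numpy.int32(a.size), block=(32, 1, 1))
assert (a == 2).all()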
Next, create a wrapper class for the structure and initialize two of the arrays:
class DoubleOpStruct:
    # memory footprint: 8 fixed bytes for datalen plus __padding, then the pointer.
    # numpy.intp is numpy's pointer-sized integer type and nbytes its size in bytes;
    # reading it from numpy instead of hard-coding 4 or 8 keeps the code independent
    # of whether we are on a 32-bit or 64-bit platform.
    mem_size = 8 + numpy.intp(0).nbytes

    def __init__(self, array, struct_arr_ptr):
        """array is a Python/numpy array; struct_arr_ptr is the device address the struct is written to."""
        self.data = cuda.to_device(array)
        self.shape, self.dtype = array.shape, array.dtype
        # copy datalen
        cuda.memcpy_htod(int(struct_arr_ptr), numpy.getbuffer(numpy.int32(array.size)))
        # copy the pointer to the array data
        cuda.memcpy_htod(int(struct_arr_ptr) + 8, numpy.getbuffer(numpy.intp(int(self.data))))

    def __str__(self):
        return str(cuda.from_device(self.data, self.shape, self.dtype))

# allocate memory for two structs and get a pointer to it
struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
# the pointer to the second struct is obtained by offsetting by one struct size
do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size
# build the two wrappers, holding [1, 2, 3] and [0, 4] respectively
array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
print("original arrays", array1, array2)
The code above uses the two functions pycuda.driver.to_device() and pycuda.driver.from_device() to allocate and copy values, and demonstrates how offsets into an already allocated block of memory are used.
The next piece of code doubles both arrays, and then doubles only the second one:
func = mod.get_function("double_array")
func(struct_arr, block = (32, 1, 1), grid=(2, 1))
print("doubled arrays", array1, array2)
func(numpy.intp(do2_ptr), block = (32, 1, 1), grid=(1, 1))
print("doubled second only", array1, array2, "\n")
Note that when both arrays are doubled the grid size is (2,1), while for a single array it is (1,1): each block handles one array, which is why doubling both arrays takes only one launch, since the grid contains two blocks. Also note that a pointer obtained by offsetting has to be wrapped with numpy.intp() before being passed in.
One slightly confusing point is the block size, which is (32,1,1) for both launches. My reading is that the 32 is fairly arbitrary: in the first example, doubling the 4*4 matrix used block=(4,4,1), i.e. 4*4 threads per block with each thread doubling exactly one element of the matrix. In this example the kernel contains a for loop, so the block size can be chosen freely; I tried (1,1,1) and it works just as well. If my understanding here is wrong, please leave a comment and let me know.
That concludes the official Tutorial. For further study it points to the Device Interface documentation and the official examples.
4. Device Interface
The device-level interface. I only keep brief notes here instead of a full translation; see the Device Interface documentation for details.
It lets you query versions, defines the exceptions and constants, defines devices and contexts, provides functions for concurrency and streams, memory-related functions, and the Module and Function objects needed to program on the device.
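A few of the calls mentioned above, as a hedged sketch of what the Device Interface exposes (the names are taken from the documentation):
import pycuda.autoinit
import pycuda.driver as cuda

print(cuda.get_version())           # CUDA version PyCUDA was compiled against
print(cuda.get_driver_version())    # installed driver version
dev = pycuda.autoinit.device        # the pycuda.driver.Device picked by autoinit
print(dev.compute_capability())     # e.g. (7, 5)
attrs = dev.get_attributes()        # dict keyed by pycuda.driver.device_attribute
print(attrs[cuda.device_attribute.MAX_THREADS_PER_BLOCK])
free, total = cuda.mem_get_info()   # free / total device memory in bytes
print(free, total)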
5. Metaprogramming
In traditional software development, people write programs to accomplish a task. In metaprogramming, people write programs that write programs to accomplish a task.
That sounds convoluted, so let's first look at why it is a good idea.
Why metaprogramming?
Automated Tuning
CUDA programmers spend a lot of their time tuning code:
- What is the optimal number of threads per block?
- How much data should I process at a time?
- Which data should be loaded into shared memory, and how large should the corresponding blocks be?
If you are lucky, you can settle on a pattern that lets you pick the fastest version quickly. But when the next generation of hardware comes out, that pattern stops working.
PyCUDA's answer is:
Forget heuristics. Benchmark at run time and use whatever works fastest.
This is also an important advantage of PyCUDA over plain CUDA: it lets you make these decisions while your code is running.
Data Types
At run time your code may have to deal with several data types, for example single- and double-precision floats. You could precompile two versions, but why bother? Just generate the code you need, when it is needed. A minimal sketch of the idea follows.
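In the sketch below (the scale kernel and its contents are invented for illustration), the C type name is picked from the numpy dtype at run time and the matching kernel is compiled on the spot:
import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

def make_scale_kernel(dtype):
    # map the numpy dtype to the matching C type and compile a kernel for it
    c_type = {numpy.float32: "float", numpy.float64: "double"}[numpy.dtype(dtype).type]
    src = """
    __global__ void scale(%(T)s *a, %(T)s factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] *= factor;
    }
    """ % {"T": c_type}
    return SourceModule(src).get_function("scale")

a = numpy.arange(256, dtype=numpy.float64)       # this run happens to use float64
scale = make_scale_kernel(a.dtype)
scale(cuda.InOut(a), numpy.float64(3.0), block=(256, 1, 1), grid=(1, 1))
assert numpy.allclose(a, numpy.arange(256) * 3.0)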
Specialize Code for the Given Problem
Suppose you are implementing a library that has to perform certain tasks. Imagine how nice it would be if you could generate code specialized to exactly the task being asked for, instead of keeping the code needlessly general and therefore slow. PyCUDA makes this a reality.
Constants are Faster than Variables
If the size of your problem changes a lot between runs, but you perform a large number of kernel invocations on data of one single size within a run, you may want to compile that size into your code as a constant. This can bring significant performance benefits, mainly from shorter fetch times and reduced register pressure; in addition, multiplication by a constant executes more efficiently than a variable-times-variable multiply. A small sketch of what this looks like in PyCUDA follows.
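The kernel below is invented purely for illustration; the point is only that the size n is substituted into the source string before compilation, so the generated code contains a literal rather than fetching an extra kernel argument:
import pycuda.autoinit
from pycuda.compiler import SourceModule

n = 4096  # problem size for this run (illustrative)
mod = SourceModule("""
__global__ void saxpy(float *x, float *y, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < %(n)d)              // the literal 4096 is compiled in, not read from a parameter
        y[i] = a * x[i] + y[i];
}
""" % {"n": n})
saxpy = mod.get_function("saxpy")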
Loop Unrolling
The CUDA programming guide describes nvcc and its loop unrolling. According to the PyCUDA author's experiments, however, this no longer holds after version 2.1. With metaprogramming, a loop can instead be unrolled automatically to whatever size is needed, as in the toy sketch below.
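A toy sketch (the kernel and sizes are invented): the loop body is generated as explicit statements in Python, so the compiled kernel contains no loop at all:
import pycuda.autoinit
from pycuda.compiler import SourceModule

unroll = 8  # number of elements each thread sums (illustrative)
body = "\n    ".join("acc += a[base + %d];" % j for j in range(unroll))
src = """
__global__ void sum_unrolled(float *a, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * %(unroll)d;
    float acc = 0.0f;
    %(body)s
    out[tid] = acc;
}
""" % {"unroll": unroll, "body": body}
sum_unrolled = SourceModule(src).get_function("sum_unrolled")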
Metaprogramming using a Templating Engine
If your metaprogramming needs are fairly simple, a templating engine is a good choice. Python has plenty of templating engines; Jinja 2 and Cheetah are both recommended.
The following code is a simple metaprogram that performs vector addition with a configurable block size:
from jinja2 import Template
from pycuda.compiler import SourceModule   # needed below; not shown in the docs snippet
import pycuda.autoinit
tpl = Template("""
__global__ void add(
{{ type_name }} *tgt,
{{ type_name }} *op1,
{{ type_name }} *op2)
{
int idx = threadIdx.x +
{{ thread_block_size }} * {{block_size}}
* blockIdx.x;
{% for i in range(block_size) %}
{% set offset = i*thread_block_size %}
tgt[idx + {{ offset }}] =
op1[idx + {{ offset }}]
+ op2[idx + {{ offset }}];
{% endfor %}
}""")
# block_size and thread_block_size are left to the caller in the docs; example values:
thread_block_size = 256
block_size = 4
rendered_tpl = tpl.render(
    type_name="float", block_size=block_size,
    thread_block_size=thread_block_size)
mod = SourceModule(rendered_tpl)
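To actually launch the rendered kernel (a hedged usage sketch continuing from mod above; the grid size and data are arbitrary example values):
import numpy
import pycuda.gpuarray as gpuarray

n_blocks = 16
total = thread_block_size * block_size * n_blocks

add = mod.get_function("add")
op1 = gpuarray.to_gpu(numpy.random.randn(total).astype(numpy.float32))
op2 = gpuarray.to_gpu(numpy.random.randn(total).astype(numpy.float32))
tgt = gpuarray.empty_like(op1)
add(tgt, op1, op2, block=(thread_block_size, 1, 1), grid=(n_blocks, 1))
assert numpy.allclose(tgt.get(), op1.get() + op2.get())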
Runnable code can be found in examples/demo_meta_template.py. There is also a template-metaprogrammed matrix multiplication using Cheetah: demo_meta_matrixmul_cheetah.py and demo_meta_matrixmul_cheetah.template.cu.
Metaprogramming using codepy
For more complicated metaprogramming, a templating engine is no longer sufficient. codepy offers a way to generate CUDA source code from Python data structures.
The example below does the same job as the previous one:
from codepy.cgen import FunctionBody, \
FunctionDeclaration, Typedef, POD, Value, \
Pointer, Module, Block, Initializer, Assign
from codepy.cgen.cuda import CudaGlobal
mod = Module([
FunctionBody(
CudaGlobal(FunctionDeclaration(
Value("void", "add"),
arg_decls=[Pointer(POD(dtype, name))
for name in ["tgt", "op1", "op2"]])),
Block([
Initializer(
POD(numpy.int32, "idx"),
"threadIdx.x + %d*blockIdx.x"
% (thread_block_size*block_size)),
]+[
Assign(
"tgt[idx+%d]" % (o*thread_block_size),
"op1[idx+%d] + op2[idx+%d]" % (
o*thread_block_size,
o*thread_block_size))
for o in range(block_size)]))])
mod = SourceModule(mod)
The complete code is in examples/demo_meta_codepy.py.
The official documentation goes on to cover the profiler, JIT compilation, OpenGL, GPU arrays, and a few other interfaces. I won't go through them one by one; next, let's work through some of the examples.
6. Examples
The project ships 42 demos in total; this post only discusses the ones I felt were worth studying.
hello_gpu.py
This demo uses CUDA to compute the element-wise product of two random vectors of length 400, then subtracts the product computed in plain Python; the result should be the zero vector.
from __future__ import print_function
from __future__ import absolute_import
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1))
print(dest-a*b)
demo.py
The code from the Tutorial: doubling a (4*4) matrix in three ways, namely the original version, the shorthand version via InOut, and the kernel-free version using gpuarray.
# Sample source code from the Tutorial Introduction in the documentation.
from __future__ import print_function
from __future__ import absolute_import
import pycuda.driver as cuda
import pycuda.autoinit # noqa
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doublify(float *a)
{
int idx = threadIdx.x + threadIdx.y*4;
a[idx] *= 2;
}
""")
func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1), grid=(1, 1), shared=0)
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print("original array:")
print(a)
print("doubled with kernel:")
print(a_doubled)
# alternate kernel invocation -------------------------------------------------
func(cuda.InOut(a), block=(4, 4, 1))
print("doubled with InOut:")
print(a)
# part 2 ----------------------------------------------------------------------
import pycuda.gpuarray as gpuarray
a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print("original array:")
print(a_gpu)
print("doubled with gpuarray:")
print(a_doubled)
demo_struct.py
The code from the Advanced Topics part of the Tutorial.
# prepared invocations and structures -----------------------------------------
from __future__ import print_function
from __future__ import absolute_import
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
class DoubleOpStruct:
mem_size = 8 + numpy.uintp(0).nbytes
def __init__(self, array, struct_arr_ptr):
self.data = cuda.to_device(array)
self.shape, self.dtype = array.shape, array.dtype
"""
numpy.getbuffer() needed due to lack of new-style buffer interface for
scalar numpy arrays as of numpy version 1.9.1
see: https://github.com/inducer/pycuda/pull/60
"""
cuda.memcpy_htod(int(struct_arr_ptr),
numpy.getbuffer(numpy.int32(array.size)))
cuda.memcpy_htod(int(struct_arr_ptr) + 8,
numpy.getbuffer(numpy.uintp(int(self.data))))
def __str__(self):
return str(cuda.from_device(self.data, self.shape, self.dtype))
struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size
array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
print("original arrays")
print(array1)
print(array2)
mod = SourceModule("""
struct DoubleOperation {
int datalen, __padding; // so 64-bit ptrs can be aligned
float *ptr;
};
__global__ void double_array(DoubleOperation *a)
{
a = a + blockIdx.x;
for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x)
{
float *a_ptr = a->ptr;
a_ptr[idx] *= 2;
}
}
""")
func = mod.get_function("double_array")
func(struct_arr, block=(32, 1, 1), grid=(2, 1))
print("doubled arrays")
print(array1)
print(array2)
func(numpy.uintp(do2_ptr), block=(32, 1, 1), grid=(1, 1))
print("doubled second only")
print(array1)
print(array2)
if cuda.get_version() < (4, ):
func.prepare("P", block=(32, 1, 1))
func.prepared_call((2, 1), struct_arr)
else:
func.prepare("P")
block = (32, 1, 1)
func.prepared_call((2, 1), block, struct_arr)
print("doubled again")
print(array1)
print(array2)
if cuda.get_version() < (4, ):
func.prepared_call((1, 1), do2_ptr)
else:
func.prepared_call((1, 1), block, do2_ptr)
print("doubled second only again")
print(array1)
print(array2)
Note the comment in the code: because of changes in newer numpy versions, the data has to be transferred with numpy.getbuffer(numpy.uintp(int(self.data))).
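A side note from me: numpy.getbuffer only exists on Python 2, so this demo fails on Python 3 as written. One workaround I believe works is to pass the scalars' raw bytes instead; inside __init__ the two copies would become (hypothetical drop-in replacement):
# hypothetical Python 3 replacement for the two numpy.getbuffer() calls above
cuda.memcpy_htod(int(struct_arr_ptr), numpy.int32(array.size).tobytes())
cuda.memcpy_htod(int(struct_arr_ptr) + 8, numpy.uintp(int(self.data)).tobytes())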
demo_elementwise.py
A weighted linear combination of two vectors, implemented with pycuda.elementwise.ElementwiseKernel. All of ElementwiseKernel's arguments are strings: the first declares the kernel's parameters; the second, called operation, defines how each output element is computed, here using a device function declared in the fourth argument; the third is the kernel name; and the fourth (preamble) can include other files or define functions used by the operation.
from __future__ import absolute_import
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))
from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
"float a, float *x, float b, float *y, float *z",
"z[i] = my_f(a*x[i], b*y[i])",
"linear_combination",
preamble="""
__device__ float my_f(float x, float y)
{
return sin(x*y);
}
""")
c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
import numpy.linalg as la
assert la.norm(c_gpu.get() - numpy.sin((5*a_gpu*6*b_gpu).get())) < 1e-5
print(c_gpu)
SimpleSpeedTest.py
This compares the speed of GPU and CPU computation. Both in the results the author quotes in the comments and in my own tests, the GPU is far ahead of the CPU. Among the GPU variants, SourceModule performs best and GPUArray worst; with ElementwiseKernel, looping in C is much faster than looping in Python. The takeaway is that implementing the kernel entirely in CUDA C source is still the best option.
# Very simple speed testing code
# Shows you how to run a loop over sin() using different methods
# with a note of the time each method takes
# For the GPU this uses SourceModule, ElementwiseKernel, GPUArray
# For the CPU this uses numpy
# Ian@IanOzsvald.com
# Using a WinXP Intel Core2 Duo 2.66GHz CPU (1 CPU used)
# with a 9800GT GPU I get the following timings (smaller is better):
#
# Using nbr_values == 8192
# Calculating 100000 iterations
# SourceModule time and first three results:
# 0.166590s, [ 0.005477 0.005477 0.005477]
# Elementwise time and first three results:
# 0.171657s, [ 0.005477 0.005477 0.005477]
# Elementwise Python looping time and first three results:
# 1.487470s, [ 0.005477 0.005477 0.005477]
# GPUArray time and first three results:
# 4.740007s, [ 0.005477 0.005477 0.005477]
# CPU time and first three results:
# 32.933660s, [ 0.005477 0.005477 0.005477]
#
#
# Using Win 7 x64, GTX 470 GPU, X5650 Xeon,
# Driver v301.42, CUDA 4.2, Python 2.7 x64,
# PyCuda 2012.1 gave the following results:
#
# Using nbr_values == 8192
# Calculating 100000 iterations
# SourceModule time and first three results:
# 0.058321s, [ 0.005477 0.005477 0.005477]
# Elementwise time and first three results:
# 0.102110s, [ 0.005477 0.005477 0.005477]
# Elementwise Python looping time and first three results:
# 2.428810s, [ 0.005477 0.005477 0.005477]
# GPUArray time and first three results:
# 8.421861s, [ 0.005477 0.005477 0.005477]
# CPU time measured using :
# 5.905661s, [ 0.005477 0.005477 0.005477]
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import pycuda.cumath
from pycuda.elementwise import ElementwiseKernel
blocks = 64
block_size = 128
nbr_values = blocks * block_size
print("Using nbr_values ==", nbr_values)
# Number of iterations for the calculations,
# 100 is very quick, 2000000 will take a while
n_iter = 100000
print("Calculating %d iterations" % (n_iter))
# create two timers so we can speed-test each approach
start = drv.Event()
end = drv.Event()
######################
# SourceModule SECTION
# We write the C code and the indexing and we have lots of control
mod = SourceModule("""
__global__ void gpusin(float *dest, float *a, int n_iter)
{
const int i = blockDim.x*blockIdx.x + threadIdx.x;
for(int n = 0; n < n_iter; n++) {
a[i] = sin(a[i]);
}
dest[i] = a[i];
}
""")
gpusin = mod.get_function("gpusin")
# create an array of 1s
a = numpy.ones(nbr_values).astype(numpy.float32)
# create a destination array that will receive the result
dest = numpy.zeros_like(a)
start.record() # start timing
gpusin(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1), block=(block_size,1,1) )
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("SourceModule time and first three results:")
print("%fs, %s" % (secs, str(dest[:3])))
#####################
# Elementwise SECTION
# use an ElementwiseKernel with sin in a for loop all in C call from Python
kernel = ElementwiseKernel(
"float *a, int n_iter",
"for(int n = 0; n < n_iter; n++) { a[i] = sin(a[i]);}",
"gpusin")
a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
kernel(a_gpu, numpy.int32(n_iter))
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("Elementwise time and first three results:")
print("%fs, %s" % (secs, str(a_gpu.get()[:3])))
####################################
# Elementwise Python looping SECTION
# as Elementwise but the for loop is in Python, not in C
kernel = ElementwiseKernel(
"float *a",
"a[i] = sin(a[i]);",
"gpusin")
a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
for i in range(n_iter):
kernel(a_gpu)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("Elementwise Python looping time and first three results:")
print("%fs, %s" % (secs, str(a_gpu.get()[:3])))
##################
# GPUArray SECTION
# The result is copied back to main memory on each iteration, this is a bottleneck
a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
for i in range(n_iter):
a_gpu = pycuda.cumath.sin(a_gpu)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("GPUArray time and first three results:")
print("%fs, %s" % (secs, str(a_gpu.get()[:3])))
#############
# CPU SECTION
# use numpy the calculate the result on the CPU for reference
a = numpy.ones(nbr_values).astype(numpy.float32)
start.record() # start timing
start.synchronize()
for i in range(n_iter):
a = numpy.sin(a)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("CPU time and first three results:")
print("%fs, %s" % (secs, str(a[:3])))
ThreadsAndBlocks.py
This demo illustrates the relationship between grid, block, and thread.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
mod = SourceModule("""
#include <stdio.h>
__global__ void say_hi()
{
printf("I am %dth thread in threadIdx.x:%d.threadIdx.y:%d blockIdx.x:%d blockIdx.y:%d blockDim.x:%d blockDim.y:%d\\n",(threadIdx.x+threadIdx.y*blockDim.x+(blockIdx.x*blockDim.x*blockDim.y)+(blockIdx.y*blockDim.x*blockDim.y)),threadIdx.x, threadIdx.y,blockIdx.x,blockIdx.y,blockDim.x,blockDim.y);
}
""")
func = mod.get_function("say_hi")
func(block=(4,4,1),grid=(2,2,1))
The kernel is a single hard-to-read line, so here it is laid out separately:
#include <stdio.h>
__global__ void say_hi()
{
    printf("I am %dth thread"
           " in threadIdx.x:%d"
           " threadIdx.y:%d"
           " blockIdx.x:%d"
           " blockIdx.y:%d"
           " blockDim.x:%d"
           " blockDim.y:%d\n",
           (threadIdx.x + threadIdx.y*blockDim.x
              + (blockIdx.x*blockDim.x*blockDim.y)
              + (blockIdx.y*blockDim.x*blockDim.y)),
           threadIdx.x,
           threadIdx.y,
           blockIdx.x,
           blockIdx.y,
           blockDim.x,
           blockDim.y);
}
The following built-in variables are available inside a kernel for dividing up the work:
blockDim.x
blockDim.y
blockIdx.x
blockIdx.y
threadIdx.x
threadIdx.y
Here blockDim is the dimension of each block, controlled by the parameter block=(4,4,1); threadIdx can never exceed it, so each block contains 16 threads. blockIdx is the block's coordinate within the grid, whose dimensions are set by grid=(2,2,1), so the grid contains 4 blocks.
In this demo the thread id is computed as threadIdx.x + threadIdx.y * blockDim.x + (blockIdx.x * blockDim.x * blockDim.y) + (blockIdx.y * blockDim.x * blockDim.y). With this formula, however, threads in blocks that sit symmetrically across the grid's diagonal end up with duplicate ids. A sketch of one common way to compute a globally unique id follows.
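My own aside (not part of the demo): linearize the block index with gridDim.x first, then scale by the block size, so every thread in the grid gets a distinct id.
import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <stdio.h>
__global__ void say_hi_unique()
{
    int block_id        = blockIdx.x + blockIdx.y * gridDim.x;
    int thread_in_block = threadIdx.x + threadIdx.y * blockDim.x;
    int global_id       = block_id * blockDim.x * blockDim.y + thread_in_block;
    printf("global id %d (block %d, thread %d)\\n", global_id, block_id, thread_in_block);
}
""")
func = mod.get_function("say_hi_unique")
func(block=(4, 4, 1), grid=(2, 2, 1))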
MatrixmulSimple.py
This computes the product of two matrices with a single block of threads: each thread computes the element of the result matrix at its own coordinates.
The original demo multiplies two square matrices; here it has been adapted to [M*N] * [N*M].
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Multiplies an M x N matrix by an N x M matrix using a *single* block of threads and
global memory only. Each thread computes one element of the resulting matrix.
"""
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
# -- initialize the device
import pycuda.autoinit
# [M*N] * [N*M] matrix multiply
kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
// 2D Thread ID (assuming that only *one* block will be executed)
// y responsible for the left matrix's row
// x responsible for the right matrix's column
// calculate the result matrix[x,y]'s element's value
int tx = threadIdx.x;
int ty = threadIdx.y;
// Pvalue is used to store the element of the matrix
// that is computed by the thread
float Pvalue = 0;
// Each thread loads one row of M and one column of N,
// to produce one element of P.
for (int i = 0; i < %(N)s; ++i) {
float Aelement = a[ty * %(N)s + i];
float Belement = b[i * %(M)s + tx];
Pvalue += Aelement * Belement;
}
// Write the matrix to device memory;
// each thread writes one element
c[ty * %(M)s + tx] = Pvalue;
}
"""
# define the matrix dimensions (the result C = A * B will be M x M)
# note that we'll only use *one* block of threads here
# as a consequence M * M can't exceed max_threads,
# see http://documen.tician.de/pycuda/util.html#pycuda.tools.DeviceData
# for more information on how to get this number for your device
M = 2
N = 3
# create two random matrices
a_cpu = np.random.randn(M, N).astype(np.float32)
b_cpu = np.random.randn(N, M).astype(np.float32)
# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)
# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((M, M), np.float32)
# get the kernel code from the template
# by specifying the constants M and N
kernel_code = kernel_code_template % {
'M': M,
'N': N
}
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")
# call the kernel on the card
matrixmul(
# inputs
a_gpu, b_gpu,
# output
c_gpu,
    # (only one) block of M x M threads
block = (M, M, 1),
)
# print the results
print("-" * 80)
print("Matrix A (GPU):")
print(a_gpu.get())
print("-" * 80)
print("Matrix B (GPU):")
print(b_gpu.get())
print("-" * 80)
print("Matrix C (GPU):")
print(c_gpu.get())
print("-" * 80)
print("Matrix C (CPU):")
print(c_cpu)
print("-" * 80)
print("CPU-GPU difference:")
print(c_cpu - c_gpu.get())
print(np.allclose(c_cpu, c_gpu.get()))
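As a final aside, a hedged sketch (not one of the official examples) of the same per-element idea extended to a grid of blocks, so the result size is no longer limited by one block's max_threads. The row and column of each result element come from both the block and thread indices, with a bounds check for the edge blocks; the sizes M, N and TILE are arbitrary example values.
import numpy as np
import pycuda.autoinit
from pycuda import compiler, gpuarray

M, N = 100, 30     # result will be M x M; example sizes
TILE = 16          # threads per block edge; example value

kernel_multiblock = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;   // column of the result
    int ty = blockIdx.y * blockDim.y + threadIdx.y;   // row of the result
    if (tx >= %(M)s || ty >= %(M)s)
        return;
    float Pvalue = 0;
    for (int i = 0; i < %(N)s; ++i)
        Pvalue += a[ty * %(N)s + i] * b[i * %(M)s + tx];
    c[ty * %(M)s + tx] = Pvalue;
}
""" % {'M': M, 'N': N}

a_gpu = gpuarray.to_gpu(np.random.randn(M, N).astype(np.float32))
b_gpu = gpuarray.to_gpu(np.random.randn(N, M).astype(np.float32))
c_gpu = gpuarray.empty((M, M), np.float32)

func = compiler.SourceModule(kernel_multiblock).get_function("MatrixMulKernel")
grid = ((M + TILE - 1) // TILE, (M + TILE - 1) // TILE)
func(a_gpu, b_gpu, c_gpu, block=(TILE, TILE, 1), grid=grid)
print(np.allclose(np.dot(a_gpu.get(), b_gpu.get()), c_gpu.get(), atol=1e-4))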