报价宝 > 科技 >

手把手教你如何用Julia做GPU编程（附代码）

消息来源:baojiabao.com 作者: 发布时间：2026-05-23

报价宝综合消息手把手教你如何用Julia做GPU编程（附代码）

原标题：手把手教你如何用Julia做GPU编程（附代码）

来源：nextjournal

编辑：肖琴、三石

【新智元导读】本文旨在快速介绍GPU的工作原理，详细介绍当前的Julia GPU生态系统，并让读者了解简单的GPU编程是多么的容易。

GPU是如何工作的？

首先，什么是GPU？

GPU是一个大规模并行处理器，具有几千个并行处理单元。例如，本文中使用的Tesla k80提供4992个并行CUDA内核。 GPU在频率，延迟和硬件功能方面与CPU完全不同，但有点类似于拥有4992个内核的慢速CPU！

“Tesla K80”

可启用并行线程的数量可以大幅提高GPU速度，但也让它的使用性变得更加困难。让我们来详细看看在使用这种原始动力时，你会遇到哪些缺点：

GPU是一个独立的硬件，具有自己的内存空间和不同的架构。因此，从RAM到GPU存储器（VRAM）的传输时间很长。即使在GPU上启动内核（换句话说，调度函数调用）也会带来较大的延迟。 GPU的时间约为10us，而CPU的时间则为几纳秒。
在没有高级包装器的情况下，设置内核会很快变得复杂
较低的精度是默认值，而较高的精度计算可以轻松地消除所有性能增益
GPU函数(内核)本质上是并行的，所以编写GPU内核至少和编写并行CPU代码一样困难，但是硬件上的差异增加了相当多的复杂性
与上述相关，许多算法都不能很好地移植到GPU上。
内核通常是用C/ C++编写的，这并不是写算法的最佳语言。
CUDA和OpenCL之间存在分歧，OpenCL是用于编写低级GPU代码的主要框架。虽然CUDA只支持英伟达硬件，但OpenCL支持所有硬件，但有些粗糙。

Julia的诞生是个好消息！它是一种高级脚本语言，允许你在Julia本身编写内核和周围的代码，同时在大多数GPU硬件上运行！

GPUArrays

大多数高度并行的算法需要通过相当多的数据来克服所有线程和延迟开销。因此，大多数算法都需要数组来管理所有数据，这需要一个好的GPU数组库（array library）作为关键基础。

GPUArrays.jl是Julia的基础。它提供了一个抽象数组实现，专门用于使用高度并行硬件的原始功能。它包含设置GPU所需的所有功能，启动Julia GPU函数并提供一些基本的数组算法。

抽象意味着它需要以CuArrays和CLArrays形式的具体实现。由于继承了GPUArrays的所有功能，它们都提供完全相同的界面。唯一的区别出现在分配数组时，这会强制你决定数组是否位于CUDA或OpenCL设备上。关于这一点的更多信息，请参阅内存部分。

GPUArrays有助于减少代码重复，因为它允许编写独立于硬件的GPU内核，可以通过CuArrays或CLArrays将其编译为本机GPU代码。因此，许多通用内核可以在继承自GPUArrays的所有packages之间共享。

一点选择建议：CuArrays仅适用于Nvidia GPU，而CLArrays适用于大多数可用的GPU。CuArrays比CLArrays更稳定，并且已经可以在Julia 0.7上运行。速度上差异不明显。我建议两者都试一下，看看哪个效果最好。

对于本文，我将选择CuArrays，因为本文是为Julia 0.7 / 1.0而写的，CLArrays仍然不支持。

性能

让我们用一个简单的互动式代码示例来快速说明为什么要将计算转移到GPU上，这个示例计算julia set：

1using CuArrays, FileIO, Colors, GPUArrays, BenchmarkTools

2using CuArrays: CuArray

3"""

4The function calculating the Julia set

5"""

6function juliaset(z0, maxiter)

7c = ComplexF32(-0.5, 0.75)

8z = z0

9fori in1:maxiter

10abs2(z) > 4f0 && return(i - 1) % UInt8

11z = z * z + c

12end

13returnmaxiter % UInt8 # % is used to convert without overflow check

14end

15range = 100:50:2^12

16cutimes, jltimes = Float64[], Float64[]

17function run_bench(in, out)

18# use dot syntax to apply `juliaset` to each elemt of q_converted

19# and write the output to result

20out .= juliaset.(in, 16)

21# all calls to the GPU are scheduled asynchronous,

22# so we need to synchronize

23GPUArrays.synchronize(out)

24end

25# store a reference to the last results for plotting

26last_jl, last_cu = nothing, nothing

27forN inrange

28w, h = N, N

29q = [ComplexF32(r, i) fori=12.0/w):-1, r=-1.53.0/h):1.5]

30for(times, Typ) in((cutimes, CuArray), (jltimes, Array))

31# convert to Array or CuArray - moving the calculation to CPU/GPU

32q_converted = Typ(q)

33result = Typ(zeros(UInt8, size(q)))

34fori in1:10# 5 samples per size

35# benchmarking macro, all variables need to be prefixed with $

36t = Base.@elapsed begin

37run_bench(q_converted, result)

38end

39globallast_jl, last_cu # we"re in local scope

40ifresult isa CuArray

41last_cu = result

42else

43last_jl = result

44end

45push!(times, t)

46end

47end

48end

50cu_jl = hcat(Array(last_cu), last_jl)

51cmap = colormap("Blues", 16+ 1)

52color_lookup(val, cmap) = cmap[val + 1]

53save("results/juliaset.png", color_lookup.(cu_jl, (cmap,)))

1using Plots; plotly()

2x = repeat(range, inner = 10)

3speedup = jltimes ./ cutimes

4Plots.scatter(

5log2.(x), [speedup, fill(1.0, length(speedup))],

6label = ["cuda""cpu"], markersize = 2, markerstrokewidth = 0,

7legend = :right, xlabel = "2^N", ylabel = "speedup"

如你所见，对于大型数组，通过将计算移动到GPU可以获得稳定的60-80倍的加速。而且非常简单，只需将Julia array转换为GPUArray。

有人可能认为GPU的性能受到像Julia这样的动态语言的影响，但Julia的GPU性能应该与CUDA或OpenCL的原始性能相当。Tim Besard在集成LLVM Nvidia编译pipeline方面做得非常出色，达到了与纯CUDA C代码相同（有时甚至更好）的性能。Tim发表了一篇非常详细的博文，里面进一步解释了这一点[1]。CLArrays方法有点不同，它直接从Julia生成OpenCL C代码，具有与OpenCL C相同的性能！

为了更好地了解性能并查看与多线程CPU代码的比较，我收集了一些基准测试[2]。

内存（Memory）

GPU具有自己的存储空间，包括视频存储器（VRAM），不同的高速缓存和寄存器。无论你做什么，任何Julia对象都必须先转移到GPU才能使用。并非Julia中的所有类型都可以在GPU上工作。

首先让我们看一下Julia的类型：

1struct Test # an immutable struct

2# that only contains other immutable, which makes

3# isbitstype(Test) == true

4x::Float32

5end

7# the isbits property is important, since those types can be used

8# without constraints on the GPU!

9@assert isbitstype(Test) == true

10x = (2, 2)

11isa(x, Tuple{Int, Int}) # tuples are also immutable

12mutable struct Test2 #-> mutable, isbits(Test2) == false

13x::Float32

14end

15struct Test3

16# contains a heap allocation/ reference, not isbits

17x::Vector{Float32}

18y::Test2 # Test2 is mutable and also heap allocated / a reference

19end

20Vector{Test} # <- An Array with isbits elements is contigious in memory

21Vector{Test2} # <- An Array with mutable elements is basically an array of heap pointers. Since it just contains cpu heap pointers, it won"t work on the GPU.

"Array{Test2,1}"

所有这些Julia类型在转移到GPU或在GPU上创建时表现都不同。下表概述了预期结果：

创建位置描述了对象是否在CPU上创建然后传输到GPU内核，或者是否在内核的GPU上创建。这个表显示了是否可以创建类型的实例，并且对于从CPU到GPU的传输，该表还指示对象是否通过引用复制或传递。

Garbage Collection

使用GPU时的一个很大的区别是GPU上没有垃圾回收（ garbage collector, GC）。这不是什么大问题，因为为GPU编写的高性能内核不应该一开始就创建任何GC-tracked memory。

为GPU实现GC是可能的，但请记住，每个执行的内核都是大规模并行的。在~1000 GPU线程中的每一个线程创建和跟踪大量堆内存将很快破坏性能增益，因此这实际上是不值得的。

作为内核中堆分配数组的替代方法，你可以使用GPUArrays。GPUArray构造函数将创建GPU缓冲区并将数据传输到VRAM。如果调用Array(gpu_array)，数组将被转移回RAM，表示为普通的Julia数组。这些GPU数组的Julia句柄由Julia的GC跟踪，如果它不再使用，GPU内存将被释放。

因此，只能在设备上使用堆栈分配，并且对其余的预先分配的GPU缓冲区使用。由于传输非常昂贵的，因此在编程GPU时尽可能多地重用和预分配是很常见的。

The GPUArray Constructors

1using CuArrays, LinearAlgebra

3# GPU Arrays can be constructed from all Julia arrays containing isbits types!

4A1D = cu([1, 2, 3]) # cl for CLArrays

5A1D = fill(CuArray{Int}, 0, (100,)) # CLArray for CLArrays

6# Float32 array - Float32 is usually preferred and can be up to 30x faster on most GPUs than Float64

7diagonal_matrix = CuArray{Float32}(I, 100, 100)

8filled = fill(CuArray, 77f0, (4, 4, 4)) # 3D array filled with Float32 77

9randy = rand(CuArray, Float32, 42, 42) # random numbers generated on the GPU

10# The array constructor also accepts isbits iterators with a known size

11# Note, that since you can also pass isbits types to a gpu kernel directly, in most cases you won"t need to materialize them as an gpu array

12from_iter = CuArray(1:10)

13# let"s create a point type to further illustrate what can be done:

14struct Point

15x::Float32

16y::Float32

17end

18Base.convert(::Type{Point}, x::NTuple{2, Any}) = Point(x[1], x[2])

19# because we defined the above convert from a tuple to a point

20# [Point(2, 2)] can be written as Point[(2,2)] since all array

21# elements will get converted to Point

22custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])

23typeof(custom_types)

"CuArray{Point, 1}"

Array Operations

许多操作是已经定义好的。最重要的是，GPUArrays支持Julia的fusing dot broadcasting notation。这种标记法允许你将函数应用于数组的每个元素，并使用f的返回值创建一个新数组。这个功能通常称为映射（map）。 broadcast 指的是具有不同形状的数组被散布到相同的形状。

它的工作方式如下：

1x = zeros(4, 4) # 4x4 array of zeros

2y = zeros(4) # 4 element array

3z = 2# a scalar

4# y"s 1st dimension gets repeated for the 2nd dimension in x

5# and the scalar z get"s repeated for all dimensions

6# the below is equal to `broadcast(+, broadcast(+, xx, y), z)`

7x .+ y .+ z

关于broadcasting如何工作的更多解释，可以看看这个指南：

julia.guide/broadcasting

这意味着在不分配堆内存（仅创建isbits类型）的情况下运行的任何Julia函数都可以应用于GPUArray的每个元素，并且多个dot调用将融合到一个内核调用中。由于内核调用延迟很高，这种融合是一个非常重要的优化。

1using CuArrays

2A = cu([1, 2, 3])

3B = cu([1, 2, 3])

4C = rand(CuArray, Float32, 3)

5result = A .+ B .- C

6test(a::T) where T = a * convert(T, 2) # convert to same type as `a`

8# inplace broadcast, writes directly into `result`

9result .= test.(A) # custom function work

11# The cool thing is that this composes well with custom types and custom functions.

12# Let"s go back to our Point type and define addition for it

13Base.:(+)(p1::Point, p2::Point) = Point(p1.x + p2.x, p1.y + p2.y)

15# now this works:

16custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])

18# This particular example also shows the power of broadcasting:

19# Non array types are broadcasted and repeated for the whole length

20result = custom_types .+ Ref(Point(2, 2))

22# So the above is equal to (minus all the allocations):

23# this allocates a new array on the gpu, which we can avoid with the above broadcast

24broadcasted = fill(CuArray, Point(2, 2), (3,))

26result == custom_types .+ broadcasted

ture

现实世界中的GPUArrays

让我们直接看看一些很酷的用例。

如下面的视频所示，这个GPU加速烟雾模拟是使用GPUArrays + CLArrays创建的，可在GPU或CPU上运行，GPU版本的速度提高了15倍：

还有更多的用例，包括求解微分方程，有限元模拟和求解偏微分方程。

让我们来看一个简单的机器学习示例，看看如何使用GPUArrays：

1using Flux, Flux.Data.MNIST, Statistics

2using Flux: onehotbatch, onecold, crossentropy, throttle

3using Base.Iterators: repeated, partition

4using CuArrays

6# Classify MNIST digits with a convolutional network

8imgs = MNIST.images()

10labels = onehotbatch(MNIST.labels(), 0:9)

12# Partition into batches of size 1,000

13train = [(cat(float.(imgs[i])..., dims = 4), labels[:,i])

14fori inpartition(1:60_000, 1000)]

16use_gpu = true # helper to easily switch between gpu/cpu

18todevice(x) = use_gpu ? gpu(x) : x

20train = todevice.(train)

22# Prepare test set (first 1,000 images)

23tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> todevice

24tY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> todevice

26m = Chain(

27Conv((2,2), 1=>16, relu),

28x -> maxpool(x, (2,2)),

29Conv((2,2), 16=>8, relu),

30x -> maxpool(x, (2,2)),

31x -> reshape(x, :, size(x, 4)),

32Dense(288, 10), softmax) |> todevice

34m(train[1][1])

36loss(x, y) = crossentropy(m(x), y)

38accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

40evalcb = throttle(() -> @show(accuracy(tX, tY)), 10)

41opt = ADAM(Flux.params(m));

1# train

2fori = 1:10

3Flux.train!(loss, train, opt, cb = evalcb)

4end

accuracy(tX, tY) = 0.101

accuracy(tX, tY) = 0.888

accuracy(tX, tY) = 0.919

1using Colors, FileIO, ImageShow

2N = 22

3img = tX[:, :, 1:1, N:N]

4println("Predicted: ", Flux.onecold(m(img)) .- 1)

5Gray.(collect(tX[:, :, 1, N]))

只需将数组转换为GPUArrays（使用gpu(array)），我们就可以将整个计算转移到GPU并获得相当不错的速度提升。这要归功于Julia复杂的AbstractArray基础架构，GPUArray可以无缝地集成到其中。接着，如果你省略了对转换为GPUArray，代码也将使用普通的Julia数组运行——但当然这是在CPU上运行。你可以通过将use_gpu = true更改为use_gpu = false并重试初始化和训练单元格来尝试这个操作。对比GPU和CPU，CPU运行时间为975秒，GPU运行时间为29秒 ——加速了约33倍！

另一个值得关注的好处是，GPUArrays不需显式地实现自动微分以有效地支持神经网络的反向传播。这是因为Julia的自动微分库适用于任意函数，并发出可在GPU上高效运行的代码。这有助于帮助Flux以最少的开发人员在GPU上工作，并使Flux GPU能够有效地支持用户定义的函数。在没有GPUArrays + Flux之间协调的情况下开箱即用是Julia的一个非常独特的特性，详细解释见[3].

编写GPU内核

只需使用GPUArrays的通用抽象数组界面，而不用编写任何GPU内核，就可以做很多事了。但是，在某些时候，可能需要实现一个需要在GPU上运行的算法，并且不能用通用数组算法的组合来表示。

好的一点是，GPUArrays通过一种分层方法减少了大量的工作，这种方法允许你从高级代码开始编写低级内核，类似于大多数OpenCL / CUDA示例里的。它还允许你在OpenCL或CUDA设备上执行内核，从而抽象出这些框架中的任何差异。

使这成为可能的函数名为gpu_call。它可以被称为 gpu_call(kernel, A::GPUArray, args)，并将在GPU上使用参数 (state, args...) 调用内核。State是一个后端特定对象，用于实现获取线程索引之类的功能。GPUArray需要作为第二个参数传递，一遍分派到正确的后端并提供启动参数的预设值。

让我们使用gpu_call来实现一个简单的map kernel：

1using GPUArrays, CuArrays

2# Overloading the Julia Base map! function for GPUArrays

3function Base.map!(f::Function, A::GPUArray, B::GPUArray)

4# our function that will run on the gpu

5function kernel(state, f, A, B)

6# If launch parameters aren"t specified, linear_index gets the index

7# into the Array passed as second argument to gpu_call (`A`)

8i = linear_index(state)

9ifi <= length(A)

10@inbounds A[i] = f(B[i])

11end

12return

13end

14# call kernel on the gpu

15gpu_call(kernel, A, (f, A, B))

16end

简单来说，上面的代码将在GPU上并行调用julia函数内核length(A) 次。内核的每个并行调用都有一个线程索引，我们可以使用它来安全地索引到数组A和B。如果我们计算自己的索引，而不是使用linear_index，我们需要确保没有多个线程读写同一个数组位置。因此，如果我们使用线程在纯Julia中编写，其对应版本如下：

1using BenchmarkTools

2function threadded_map!(f::Function, A::Array, B::Array)

3Threads.@threads fori in1:length(A)

4A[i] = f(B[i])

5end

7end

8x, y = rand(10^7), rand(10^7)

9kernel(y) = (y / 33f0) * (732.f0/y)

10# on the cpu without threads:

11single_t = @belapsed map!($kernel, $x, $y)

13# "on the CPU with 4 threads (2 real cores):

14thread_t = @belapsed threadded_map!($kernel, $x, $y)