Commit daff36d

Add SIMD docs & re-include installation tips (#807)
1 parent 67c1800 commit daff36d

5 files changed: +84 -18 lines changed

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 name = "AMDGPU"
 uuid = "21141c5a-9bdb-4563-92ae-f87d6854732e"
 authors = ["Julian P Samaroo <[email protected]>", "Valentin Churavy <[email protected]>", "Anton Smirnov <[email protected]>"]
-version = "1.3.6"
+version = "2.0.0"
 
 [deps]
 AbstractFFTs = "621f4979-c628-5d54-868e-fcf4e3e8185c"

docs/Project.toml

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 DocumenterVitepress = "4710194d-e776-4893-9690-8d956a29c365"
 GPUArrays = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
+SIMD = "fdea26ae-647d-5447-a871-4b548cad5224"
 
 [compat]
 Documenter = "1"

docs/make.jl

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ function main()
 "Quick Start" => "tutorials/quickstart.md",
 "Performance Tips" => "tutorials/perf.md",
 "Profiling" => "tutorials/profiling.md",
+"Installation Tips" => "install_tips.md",
 ],
 "API" => [
 "Devices" => "api/devices.md",

docs/src/install_tips.md

Lines changed: 22 additions & 17 deletions
@@ -1,20 +1,5 @@
 # Installation Info
 
-## Windows OS missing functionality
-
-Windows **does not** yet support [Hostcall](@ref), which means that
-some of the functionality does not work, like:
-
-- device printing;
-- dynamic memory allocation (from kernels).
-
-These hostcalls are sometimes launched when AMDGPU detects that a
-kernel might throw an exception, specifically during conversions, like:
-`Int32(1f0)`.
-
-To avoid this, use 'unsafe' conversion option:
-`unsafe_trunc(Int32, 1f0)`.
-
 ## ROCm system libraries
 
 On Linux, AMDGPU.jl queries the location of ROCm libraries through `rocminfo` by default.
@@ -104,10 +89,30 @@ hard_memory_limit = "none"
 # hard_memory_limit = "80 %"
 ```
 
+## Windows OS missing functionality
+
+Windows **does not** yet support [Hostcall](@ref), which means that
+some of the functionality does not work, like:
+
+- device printing;
+- dynamic memory allocation (from kernels).
+
+These hostcalls are sometimes launched when AMDGPU detects that a
+kernel might throw an exception, specifically during conversions, like:
+`Int32(1f0)`.
+
+To avoid this, use 'unsafe' conversion option:
+`unsafe_trunc(Int32, 1f0)`.
+
 ## Frequently-Asked-Questions
 
 ### Archlinux
 
-For the last few ROCM releases we have seen folks run into issue with the distro-provided builds of ROCM and associated tools. [#770](https://github.com/JuliaGPU/AMDGPU.jl/issues/770), [#696](https://github.com/JuliaGPU/AMDGPU.jl/issues/696), [#767](https://github.com/JuliaGPU/AMDGPU.jl/issues/767)
+For the last few ROCM releases we have seen folks run into
+issue with the distro-provided builds of ROCM and associated tools.
+[#770](https://github.com/JuliaGPU/AMDGPU.jl/issues/770),
+[#696](https://github.com/JuliaGPU/AMDGPU.jl/issues/696),
+[#767](https://github.com/JuliaGPU/AMDGPU.jl/issues/767)
 
-Some users have reported success with using the [`opencl-amd-dev`](https://aur.archlinux.org/packages/opencl-amd-dev) AUR package.
+Some users have reported success with using the
+[`opencl-amd-dev`](https://aur.archlinux.org/packages/opencl-amd-dev) AUR package.
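
The Windows note in `install_tips.md` contrasts a checked conversion, which can throw, with `unsafe_trunc`, which cannot. The following is a minimal CPU-only sketch in plain Julia of that difference; the variable names are illustrative, and it does not show how AMDGPU.jl decides whether to emit a hostcall:

```julia
# CPU-only illustration using Julia Base; no GPU or AMDGPU.jl is required.
x = 1.5f0

# A checked conversion must be able to throw when the value is not exactly
# representable in the target type, so it carries an exception path.
checked_failed = try
    Int32(x)
    false
catch err
    err isa InexactError
end

# `unsafe_trunc` truncates toward zero and has no exception path.
truncated = unsafe_trunc(Int32, x)

println("checked conversion threw InexactError: ", checked_failed)
println("unsafe_trunc(Int32, 1.5f0) = ", truncated)
```

Note that `unsafe_trunc` returns an arbitrary value when the input does not fit in the target type, so it is only a safe replacement when the value is known to be in range.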

docs/src/tutorials/perf.md

Lines changed: 59 additions & 0 deletions
@@ -44,3 +44,62 @@ julia> GPUArrays.unsafe_free!(cache)
 For a more sophisticated real-world example, see how
 [GaussianSplatting.jl](https://github.com/JuliaNeuralGraphics/GaussianSplatting.jl/blob/e4ef1324c187371e336bef875b053023afe7fb2c/src/training.jl#L183)
 handles it.
+
+## Using SIMD
+
+Vectorized load/store instructions can improve kernel performance.
+Let's see how to use them in a simple vector addition example.
+
+We define two helper functions:
+- `vload`, which loads a SIMD tile given a pointer into the array;
+- `vstore!`, which writes a SIMD tile into an array given a pointer into it.
+
+```@example vadd-simd
+using AMDGPU, SIMD
+
+@inline function vload(::Type{SIMD.Vec{N, T}}, ptr::Core.LLVMPtr{T, AS}) where {N, T, AS}
+    alignment = sizeof(T) * N
+    vec_ptr = Base.bitcast(Core.LLVMPtr{SIMD.Vec{N, T}, AS}, ptr)
+    return unsafe_load(vec_ptr, 1, Val(alignment))
+end
+
+@inline function vstore!(ptr::Core.LLVMPtr{T, AS}, x::SIMD.Vec{N, T}) where {N, T, AS}
+    alignment = sizeof(T) * N
+    vec_ptr = Base.bitcast(Core.LLVMPtr{SIMD.Vec{N, T}, AS}, ptr)
+    unsafe_store!(vec_ptr, x, 1, Val(alignment))
+    return
+end
+
+function vadd_simd!(c::AbstractVector{T}, a, b, ::Val{tile_size}) where {T, tile_size}
+    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
+    tile_idx = (i - 1) * tile_size + 1
+
+    a_ptr = pointer(a, tile_idx)
+    b_ptr = pointer(b, tile_idx)
+    c_ptr = pointer(c, tile_idx)
+
+    a_tile = vload(SIMD.Vec{tile_size, T}, a_ptr)
+    b_tile = vload(SIMD.Vec{tile_size, T}, b_ptr)
+    vstore!(c_ptr, a_tile + b_tile)
+    return
+end
+
+n = 1024
+tile_size = Val(4)
+
+a = ROCArray(ones(Int, n))
+b = ROCArray(ones(Int, n))
+c = ROCArray(zeros(Int, n))
+
+groupsize = 256
+gridsize = cld(length(c), groupsize * 4) # one tile of 4 elements per work-item
+@roc groupsize=groupsize gridsize=gridsize vadd_simd!(c, a, b, tile_size)
+@assert c == (a .+ b)
+```
+
+Examining the LLVM IR, we can see vectorized `load <4 x i64>`, `add <4 x i64>`
+and `store <4 x i64>` instructions:
+
+```@example vadd-simd
+AMDGPU.@device_code_llvm @roc launch=false vadd_simd!(c, a, b, tile_size);
+```
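
In the `vadd_simd!` kernel, work-item `i` owns the elements starting at `(i - 1) * tile_size + 1`, and the pointer-based loads and stores are not bounds-checked. A launch that covers exactly `n` elements therefore needs `n / tile_size` work-items. The sketch below checks that arithmetic on the CPU. It assumes `gridsize` counts workgroups, and the helper names `tiled_gridsize` and `tile_start` are hypothetical, not part of AMDGPU.jl:

```julia
# Hypothetical helper, not part of AMDGPU.jl: number of workgroups needed when
# each work-item processes one tile of `tile_width` contiguous elements.
function tiled_gridsize(n::Integer, groupsize::Integer, tile_width::Integer)
    n % tile_width == 0 || throw(ArgumentError("n must be a multiple of tile_width"))
    n_workitems = div(n, tile_width)
    return cld(n_workitems, groupsize)
end

# First element index (1-based) of the tile owned by work-item `i`,
# mirroring the `tile_idx` computation in the kernel.
tile_start(i, tile_width) = (i - 1) * tile_width + 1

n, groupsize, tile_width = 1024, 256, 4
gridsize = tiled_gridsize(n, groupsize, tile_width)

# Highest element index touched by the last launched work-item.
last_item = gridsize * groupsize
last_index = tile_start(last_item, tile_width) + tile_width - 1

println("gridsize = ", gridsize, ", last index touched = ", last_index)
```

When the number of work-items is not a multiple of `groupsize`, the last workgroup contains surplus work-items, so a kernel used that way would also need an index guard before loading.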
