Performance of naive transpose
Created by: mwarusz
I was curious about the performance of the naive transpose kernel example, so I did some benchmarking. The goal was to see how it compares to the naive CUDAnative variant:
```julia
function transpose_cuda!(b, a)
    # 1-based global indices for the 2D thread grid
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    @inbounds b[i, j] = a[j, i]
    nothing
end
```
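For reference, here is roughly how this kernel can be launched; this is only a sketch, the matrix size and the `(32, 32)` thread block are assumptions on my part, and the exact setup is in the code linked below. Each kernel was launched 10 times under the profiler, as the table below shows.

```julia
using CUDAnative, CuArrays

n = 4096                              # assumed problem size, must divide the block size
a = CuArrays.rand(Float32, n, n)
b = similar(a)

threads = (32, 32)                    # one 2D thread block
blocks  = (cld(n, 32), cld(n, 32))    # enough blocks to cover the matrix
@cuda threads=threads blocks=blocks transpose_cuda!(b, a)
```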
I ended up writing a bunch of kernels; all of them can be found here. The results are as follows:
```
Type             Time(%)   Time      Calls   Avg        Min        Max        Name
GPU activities:  39.41%    2.43411s  10      243.41ms   243.35ms   243.49ms   ptxcall___gpu_transpose_kernel_naive_ldg__429_2
                 39.40%    2.43323s  10      243.32ms   242.84ms   243.50ms   ptxcall___gpu_transpose_kernel_naive__426_1
                  5.32%    328.54ms  10      32.854ms   31.695ms   33.516ms   ptxcall___gpu_transpose_kernel__432_3
                  5.08%    313.76ms  10      31.376ms   30.469ms   31.947ms   ptxcall_transpose_cuda__5
                  2.82%    174.19ms  10      17.419ms   16.405ms   19.173ms   ptxcall___gpu_transpose_kernel_ldg__435_4
                  2.79%    172.54ms  10      17.254ms   16.641ms   18.460ms   ptxcall_transpose_cuda_ldg__6
                  2.59%    159.81ms  10      15.981ms   15.975ms   15.988ms   ptxcall_transpose_gpuify_shared__8
                  2.59%    159.76ms  10      15.976ms   15.970ms   15.984ms   ptxcall_transpose_cuda_shared__7
```
The main takeaway is that the naive CUDAnative kernel performs better than the naive KernelAbstractions kernel. I attribute this to cache effects, which can also be seen in the performance of the variants that use `ldg`. It is possible to write an equivalent kernel with KernelAbstractions (see `transpose_kernel!`), but it is a bit painful. If my analysis is correct, maybe it is worth adding functionality to make N-d blocking a bit easier. For comparison, a sketch of the naive KernelAbstractions kernel is included below.
@leios
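For context, this is roughly what the naive KernelAbstractions variant looks like; it is only a sketch (the kernel name is mine), and the actual kernels, including the `ldg` and shared-memory variants, are in the code linked above. The `_ldg` variants would additionally mark `a` with `@Const` so the backend can issue read-only (cached) loads.

```julia
using KernelAbstractions

# Sketch of the naive KernelAbstractions transpose; @index(Global, NTuple)
# yields the 2D global index of the current work item.
@kernel function transpose_kernel_naive!(b, a)
    i, j = @index(Global, NTuple)
    @inbounds b[i, j] = a[j, i]
end
```

Launching goes through an instantiated kernel object, something like `wait(transpose_kernel_naive!(device, (32, 32))(b, a, ndrange=size(b)))`, with `device` being the KernelAbstractions device object for the GPU. How that workgroup layout gets mapped onto the global range is what I suspect interacts badly with the cache compared to the hand-written CUDAnative indexing.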