Performance of naive transpose
Created by: mwarusz
I was curious about the performance of the naive transpose kernel example, so I did some benchmarking. The goal was to see how it compares to the naive CUDAnative variant:
```julia
function transpose_cuda!(b, a)
    # 1-based global indices for the 2D thread grid
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    @inbounds b[i, j] = a[j, i]
    nothing
end
```
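For reference, here is roughly how this kernel can be launched; this is only a sketch, the matrix size and the `(32, 32)` thread block are assumptions on my part, and the exact setup is in the code linked below. Each kernel was launched 10 times under the profiler, as the table below shows.

```julia
using CUDAnative, CuArrays

n = 4096                              # assumed problem size, must divide the block size
a = CuArrays.rand(Float32, n, n)
b = similar(a)

threads = (32, 32)                    # one 2D thread block
blocks  = (cld(n, 32), cld(n, 32))    # enough blocks to cover the matrix
@cuda threads=threads blocks=blocks transpose_cuda!(b, a)
```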
I ended up writing a bunch of kernels; all of them can be found here. The results are as follows:
```
Type             Time(%)   Time      Calls   Avg        Min        Max        Name
GPU activities:  39.41%    2.43411s  10      243.41ms   243.35ms   243.49ms   ptxcall___gpu_transpose_kernel_naive_ldg__429_2
                 39.40%    2.43323s  10      243.32ms   242.84ms   243.50ms   ptxcall___gpu_transpose_kernel_naive__426_1
                  5.32%    328.54ms  10      32.854ms   31.695ms   33.516ms   ptxcall___gpu_transpose_kernel__432_3
                  5.08%    313.76ms  10      31.376ms   30.469ms   31.947ms   ptxcall_transpose_cuda__5
                  2.82%    174.19ms  10      17.419ms   16.405ms   19.173ms   ptxcall___gpu_transpose_kernel_ldg__435_4
                  2.79%    172.54ms  10      17.254ms   16.641ms   18.460ms   ptxcall_transpose_cuda_ldg__6
                  2.59%    159.81ms  10      15.981ms   15.975ms   15.988ms   ptxcall_transpose_gpuify_shared__8
                  2.59%    159.76ms  10      15.976ms   15.970ms   15.984ms   ptxcall_transpose_cuda_shared__7
```
The main takeaway is that the naive CUDAnative kernel performs better than the naive KernelAbstractions kernel. I attribute this to cache effects, which can also be seen in the performance of the variants that use `ldg`. It is possible to write an equivalent kernel with KernelAbstractions (see `transpose_kernel!`), but it is a bit painful. If my analysis is correct, maybe it is worth adding functionality to make N-d blocking a bit easier. For comparison, a sketch of the naive KernelAbstractions kernel is included below.
@leios
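For context, this is roughly what the naive KernelAbstractions variant looks like; it is only a sketch (the kernel name is mine), and the actual kernels, including the `ldg` and shared-memory variants, are in the code linked above. The `_ldg` variants would additionally mark `a` with `@Const` so the backend can issue read-only (cached) loads.

```julia
using KernelAbstractions

# Sketch of the naive KernelAbstractions transpose; @index(Global, NTuple)
# yields the 2D global index of the current work item.
@kernel function transpose_kernel_naive!(b, a)
    i, j = @index(Global, NTuple)
    @inbounds b[i, j] = a[j, i]
end
```

Launching goes through an instantiated kernel object, something like `wait(transpose_kernel_naive!(device, (32, 32))(b, a, ndrange=size(b)))`, with `device` being the KernelAbstractions device object for the GPU. How that workgroup layout gets mapped onto the global range is what I suspect interacts badly with the cache compared to the hand-written CUDAnative indexing.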