Simple Block Reduce Fails when using `while` loops
Created by: anicusan
Hi, thank you for developing this library. I would like to write optimised kernels for common GPU algorithms such as reduce, scan, radix sort, etc., similar to CUB but available on all KernelAbstractions platforms. The resulting "KA standard library" (KALib? Caleb?) could be used as a benchmark for future KA development and optimisation, and I can use the lessons learned along the way to populate the "Writing Kernels" section in the documentation. Big plans, but...
I'm implementing the block-wise reduce following this tutorial with this simple-looking code:

    using KernelAbstractions
    using CUDA
    using CUDAKernels

    @kernel function block_reduce(out, in)
        # Get block / workgroup size
        bs = @uniform @groupsize()[1]

        # Block / group index, thread index within block, global thread index
        bi = @index(Group, Linear)
        ti = @index(Local, Linear)
        gi = @index(Global, Linear)

        # Copy each thread's corresponding item from global to shared memory
        cache = @localmem eltype(out) (bs,)
        cache[ti] = in[gi]
        @synchronize

        # Reduce elements in shared memory using sequential addressing following
        # https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
        @private s = bs ÷ 2
        while s > 0
            if ti < s
                cache[ti] += cache[ti + s]
            end
            @synchronize
            s = s >> 1
        end

        # Copy result back to global memory
        if ti == 1
            out[bi] = cache[1]
        end
    end

    num_blocks = 10
    block_size = 32
    num_elements = num_blocks * block_size

    in = rand(1:10, num_elements) |> CuArray
    out = zeros(num_blocks) |> CuArray

    kernel_reduce = block_reduce(CUDADevice(), block_size)
    ev = kernel_reduce(out, in, ndrange=num_elements)
    wait(ev)
    println(out)

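For comparison, here is a minimal plain-Julia sketch of what each block is meant to compute, with no KernelAbstractions involved. The helper name `block_reduce_reference` is hypothetical, and the inner `for` loop stands in for the threads of one workgroup. Note that with Julia's 1-based indices the sketch lets `ti` run over `1:s` (that is, `ti <= s`); a guard of `ti < s` would leave the last pair of each step unadded.

```julia
# Plain-Julia model of the sequential-addressing reduction, one block at a time.
# `block_reduce_reference` is a hypothetical name, not part of any library.
function block_reduce_reference(input::AbstractVector, block_size::Integer)
    @assert ispow2(block_size) "this halving scheme assumes a power-of-two block size"
    @assert length(input) % block_size == 0
    num_blocks = length(input) ÷ block_size
    out = zeros(eltype(input), num_blocks)
    for bi in 1:num_blocks
        # Stand-in for the shared-memory cache of block `bi` (range indexing copies)
        cache = input[(bi - 1) * block_size + 1 : bi * block_size]
        s = block_size ÷ 2
        while s > 0
            # Each "thread" ti in 1:s adds its partner element ti + s
            for ti in 1:s
                cache[ti] += cache[ti + s]
            end
            s >>= 1
        end
        out[bi] = cache[1]
    end
    return out
end

input = collect(1:320)
out = block_reduce_reference(input, 32)
println(out)
```

Each entry of `out` should equal the plain `sum` over the corresponding 32-element slice, which gives a CPU-side check for the GPU kernel once it compiles.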
The kernel shouldn't be more exotic than the example code in the docs. However, these two lines:

    @private s = bs ÷ 2
    while s > 0

produce the following errors:

    Reason: unsupported use of an undefined name (use of 'bs')
    Stacktrace:
     [1] macro expansion
       @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
     [2] gpu_block_reduce
       @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
     [3] gpu_block_reduce
       @ ./none:0

    Reason: unsupported dynamic function invocation (call to div)
    Stacktrace:
     [1] macro expansion
       @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
     [2] gpu_block_reduce
       @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
     [3] gpu_block_reduce
       @ ./none:0

    Reason: unsupported use of an undefined name (use of 's')
    Stacktrace:
     [1] macro expansion
       @ ~/Prog/Julia/KALib/prototype/reduce.jl:32
     [2] gpu_block_reduce
       @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
     [3] gpu_block_reduce
       @ ./none:0

    Reason: unsupported dynamic function invocation (call to >)
    [...Stacktrace...]

I tried following the code using Cthulhu.jl, and the errors appear simple: it's calling `div(::Any, ::Int64)` and `>(::Any, ::Int64)`, so I assume that `bs = @uniform @groupsize()[1]` and `@private s = bs ÷ 2` are not inferred as integers.
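The symptom can be illustrated outside of any kernel. The sketch below uses a hypothetical helper `half` that mirrors `s = bs ÷ 2`, and asks the compiler what it infers when the argument type is concrete versus unknown. It only shows what a call like `div(::Any, ::Int64)` looks like to inference; it does not explain why the kernel macros lose the type of `bs`.

```julia
# Hypothetical helper mirroring `s = bs ÷ 2` from the kernel body.
half(bs) = bs ÷ 2

# With a concrete argument type, the call resolves statically to an Int result.
concrete = only(Base.return_types(half, (Int,)))

# With an argument of unknown type, `div` cannot be resolved at compile time,
# so the result is not inferred as a concrete type. On the GPU this kind of
# call is what gets reported as a dynamic function invocation.
unknown = only(Base.return_types(half, (Any,)))

println("bs::Int -> ", concrete)
println("bs::Any -> ", unknown)
```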
If I switch the arrays and device to `CPU()`, I get the following error:

    nested task error: MethodError: no method matching isless(::Int64, ::NTuple{32, Int64})

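The signature in that message suggests that, on the CPU path, `s` is held as a tuple of 32 `Int64` values (one per work-item in the group) rather than as a scalar, so `s > 0` compares a tuple against an integer. That reading is an assumption taken from the error text, not from the KernelAbstractions source. The same `MethodError` can be reproduced in plain Julia:

```julia
# Stand-in for what `s` appears to be on the CPU path: one Int64 per work-item
# in a 32-wide group. This shape is an assumption read off the error message.
s = ntuple(_ -> 16, 32)

# `s > 0` is evaluated as `0 < s`, which falls back to `isless(0, s)`.
# There is no `isless` method for an Int64 and a tuple, so this throws.
err = try
    s > 0
catch e
    e
end

println(err isa MethodError)
println(err.f, " called with ", typeof.(err.args))
```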
Would you know why these errors appear, or how I could investigate (and fix) them?
Thanks, Leonard