Tune CUDA kernels automatically
Created by: simeonschaub
This is still quite rough around the edges, but I am putting it up for feedback. It automatically splits the threads across the leading dimensions of the ndrange, which should give better performance when the first dimension is small.
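To make the idea concrete, here is a minimal sketch of one way such a split could work. It is written in Python purely for illustration and is not the code in this PR: the function name `split_threads`, the `max_threads` parameter, and the greedy strategy are all assumptions.

```python
def split_threads(ndrange, max_threads):
    """Hypothetical helper: greedily spread a thread budget over leading dims.

    ndrange     -- tuple of extents, first dimension first
    max_threads -- thread budget per block (assumed to come from the tuner)
    """
    remaining = max_threads
    groupsize = []
    for extent in ndrange:
        # Use as many threads as this dimension can absorb, but at least one.
        take = max(1, min(extent, remaining))
        groupsize.append(take)
        # Whatever budget is left over spills into the next dimension.
        remaining //= take
    return tuple(groupsize)


# A small first dimension no longer caps the block at 4 threads:
print(split_threads((4, 1000), 256))     # (4, 64)
# A large first dimension still takes the whole budget:
print(split_threads((1000, 1000), 256))  # (256, 1)
```

With a budget of 256 threads and an ndrange of `(4, 1000)`, putting all threads on the first dimension would leave only 4 threads per block, while the split above uses 4 × 64 = 256.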