Default CPU workgroup size can be inadequate for higher-dimensional kernels
Created by: vchuravy
As observed in https://github.com/WaterLily-jl/WaterLily.jl/pull/133#issuecomment-2263500896
When we have a grid like (64, 64, 64), using a WG size of (1024, 1, 1) is inefficient, since the first dimension only has 64 elements. Instead, WL uses (64, 1, 1),
and maybe KA should "spread" the 1024 across dimensions, e.g. to (64, 64, 1).
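
A rough sketch of what "spreading" could look like, written in Python only to show the arithmetic. This is not KA's API: `spread_workgroupsize` and its `budget` parameter are hypothetical names, and the greedy left-to-right split is just one possible policy. It assumes the budget is a hard cap on the total number of work items per group.

```python
def spread_workgroupsize(ndrange, budget=1024):
    """Greedily distribute `budget` work items across the dimensions of `ndrange`.

    Hypothetical helper for illustration; not part of KernelAbstractions.
    """
    total = 1
    for s in ndrange:
        total *= s

    if total == 0:
        # Empty range: fall back to a workgroup of ones.
        return tuple(1 for _ in ndrange)
    if total <= budget:
        # Small kernel: a single workgroup covers the whole range.
        return tuple(ndrange)

    available = budget
    wg = []
    for s in ndrange:
        # Take as much of this dimension as the remaining budget allows.
        v = min(available, s)
        wg.append(v)
        # Integer-divide what is left for the following dimensions.
        available //= v
    return tuple(wg)


print(spread_workgroupsize((64, 64, 64)))               # (64, 16, 1)
print(spread_workgroupsize((64, 64, 64), budget=4096))  # (64, 64, 1)
```

Note that with a strict cap of 1024 this policy gives (64, 16, 1) for the grid above; getting (64, 64, 1) would need a budget of 4096 work items per group.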