[CUDAKernels] add an implicit sync to kernels with no dependencies
Created by: vchuravy
Fixes #221 (closed) by synchronizing against the task-local stream for kernels that are launched without dependencies. Maybe we should generally synchronize all KA launches against the task-local stream?
Also uses the task-local stream instead of the CuDefaultStream.