Use hostcall for wait and stream GC
Created by: vchuravy
While looking at a profile from @lcw, I noticed that we spend a lot of CPU cycles on `cuEventQuery`.
The second change is more questionable, since it might make GPU launches slower (I have yet to measure the cost).