Only create as many tasks as threads, and add more inference barriers
Created by: vchuravy
Improves performance on the CPU from:
BenchmarkTools.Trial:
memory estimate: 12.63 GiB
allocs estimate: 117440577
--------------
minimum time: 55.920 s (27.54% GC)
median time: 55.920 s (27.54% GC)
mean time: 55.920 s (27.54% GC)
maximum time: 55.920 s (27.54% GC)
--------------
To:
minimum time: 1.593 s (0.00% GC)
median time: 1.649 s (0.00% GC)
mean time: 1.649 s (0.00% GC)
maximum time: 1.703 s (0.00% GC)
Benchmark: simple_transpose
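
For illustration, the "only as many tasks as threads" idea from the title might look roughly like the sketch below. This is not the code from this change: `threaded_foreach` and its `ntasks` keyword are hypothetical names, and splitting the index range into contiguous chunks is an assumption about how the work is divided. The point is only that the number of spawned tasks is bounded by `Threads.nthreads()` rather than by the number of work items.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper (not from this PR): call f(i) for i in 1:n,
# spawning at most `ntasks` tasks instead of one task per index.
function threaded_foreach(f, n::Integer; ntasks::Integer = nthreads())
    nchunks = min(ntasks, n)          # never more tasks than work items
    nchunks <= 0 && return nothing    # nothing to do
    chunk = cld(n, nchunks)           # indices per task, rounded up
    @sync for start in 1:chunk:n
        stop = min(start + chunk - 1, n)
        # One task per contiguous chunk; @sync waits for all of them.
        @spawn for i in start:stop
            f(i)
        end
    end
    return nothing
end

# Usage: fill an array in parallel; each index is written by exactly one task.
out = zeros(Int, 10)
threaded_foreach(i -> (out[i] = i^2), 10)
```

Fewer, larger tasks keep per-task allocation and scheduling overhead proportional to the thread count, which is in line with the drop in GC time reported above.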