Feature request: performant matmul! example
Created by: piever
I've noticed that `transpose` and `copy` are implemented in both a naive and a performant way here. On the other hand, for `matmul!`, I could only find the naive implementation here.
I understand that a performant Julia `matmul!` on the GPU is a whole research project in its own right over at GemmKernels.jl, but I was curious whether there is some "middle ground" that loses some performance compared to cuBLAS but stays easy and general. IMO, a good end goal would be to be able to generate that code from a DSL like Tullio (see https://github.com/mcabbott/Tullio.jl/issues/80#issuecomment-770614929).