Feature request: performant matmul! example
Created by: piever
I've noticed that `transpose` and `copy` are implemented in both a naive and a performant way here. On the other hand, for `matmul!`, I could only find the naive implementation here.
I understand that a performant Julia `matmul!` on the GPU is a whole research project in its own right over at GemmKernels.jl, but I was curious whether there is some "middle ground" that loses some performance compared to cuBLAS but stays easy and general. IMO, a good end goal would be to be able to generate that code from a DSL like Tullio (see https://github.com/mcabbott/Tullio.jl/issues/80#issuecomment-770614929).