I was looking around for a language to write my own versions of convolution layers, LSTMs, and various other ideas I have.
I thought I would have to learn C++ and CUDA, which from what I hear would take a lot of time.
Would this be difficult in Julia if I went through some courses and learned the basics of the language?
This would really give me some incentive to learn the language.
You could just use LoopVectorization.jl on the CPU side. It's been shown to match well-tuned C++ BLAS implementations, for example in the pure-Julia Gaius.jl (https://github.com/MasonProtter/Gaius.jl), so you can follow that as an example of writing BLAS-speed CPU kernels. For the GPU side, there's CUDAnative.jl and KernelAbstractions.jl, and benchmarks from NVIDIA show that this approach at least rivals writing CUDA directly (https://devblogs.nvidia.com/gpu-computing-julia-programming-...). So you won't be missing anything by learning Julia and sticking to it for researching new kernel implementations.
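To give a flavor of what that looks like: here's a minimal matrix-multiply kernel in the style of Gaius.jl, using LoopVectorization's `@avx` macro to vectorize the loop nest (the function name `mygemm!` is my own; the macro and loop pattern are taken from the LoopVectorization.jl README):

```julia
using LoopVectorization

# Plain triple loop; @avx handles SIMD vectorization and unrolling.
function mygemm!(C, A, B)
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```

The point is that you write the textbook loops and the macro does the low-level work a hand-tuned C++ kernel would do by hand.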
In that benchmark, was Julia tested against CuDNN accelerated neural network CUDA code? If not, is it possible (and beneficial) to call CuDNN functions from Julia?
That wasn't a benchmark against CuDNN, since it was a benchmark about writing such kernels yourself. However, Julia libraries call into optimized kernels whenever they exist: libraries like NNlib.jl (the backbone of Flux.jl) and Knet.jl expose operations like `conv` that dispatch on CuArrays to use CuDNN automatically.
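Concretely, the dispatch looks something like this sketch (array sizes are my own illustration; `conv` is NNlib's actual exported function):

```julia
using NNlib: conv

x = rand(Float32, 28, 28, 1, 16)   # W × H × channels × batch input
w = rand(Float32, 3, 3, 1, 8)      # 3×3 kernel, 1 input → 8 output channels
y = conv(x, w)                     # CPU path here; output is 26 × 26 × 8 × 16
# With CuArrays loaded, `conv(cu(x), cu(w))` hits the CuDNN-backed method
# for the same call, with no change to user code.
```

So you get CuDNN speed where NVIDIA has already done the tuning, and you only drop down to writing your own kernels for the operations CuDNN doesn't cover.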