The current stub implementation is totally impractical; the fastest GPU depthwise conv for Torch was cuDNN's grouped convolution with `self.groups == self.nInputPlane`. Still, even with cuDNN, Google's MobileNets, for example, run about 2x slower than ResNet-34.

I've tried many methods to efficiently reduce this to cuBLAS routines: using `gemmBatched`, grouping channels into heavier `gemm` loads, etc. Unfortunately, with those I only managed to roughly match cuDNN's performance in the backward pass and get a 1.5x speedup in the MobileNet forward pass. Surprisingly, the fastest option by far turned out to be... the super dumb `for` loop. Here it is (with a somewhat smarter option for `accGradParameters`, though). The forward/backward passes are now at least 45x/8x faster than the original implementation, respectively. Default MobileNet inference enjoys a speedup over the cuDNN case of 3.57x on Maxwell and 5.18x on Pascal.

Tested all outputs and gradients with a large batch size and `nInputPlane`, and various `nOutputPlane`.
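For illustration, here is a pure-Python sketch of the per-channel loop structure (stride 1, no padding, the plain depthwise case `nOutputPlane == nInputPlane`); the actual implementation in this PR is a CUDA kernel, and the function name here is hypothetical:

```python
# Pure-Python sketch of the "dumb for loop" depthwise convolution:
# each input plane is convolved independently with its own kH x kW filter.
# input:  [nInputPlane][iH][iW] nested lists
# weight: [nInputPlane][kH][kW] nested lists
def depthwise_forward(input, weight):
    n_planes = len(input)
    kH, kW = len(weight[0]), len(weight[0][0])
    iH, iW = len(input[0]), len(input[0][0])
    oH, oW = iH - kH + 1, iW - kW + 1  # stride 1, no padding
    output = [[[0.0] * oW for _ in range(oH)] for _ in range(n_planes)]
    for c in range(n_planes):  # channels never mix in depthwise conv
        for y in range(oH):
            for x in range(oW):
                acc = 0.0
                for ky in range(kH):
                    for kx in range(kW):
                        acc += input[c][y + ky][x + kx] * weight[c][ky][kx]
                output[c][y][x] = acc
    return output
```

The point is that, unlike a gemm-based reduction, each output element only ever touches one channel, so the naive loop wastes no work on the block-diagonal structure.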
Although the `weight` shape is `(nOutputPlane) x (nInputPlane) x (kH) x (kW)`, which perfectly corresponds to the cuDNN bindings, I don't like it much, since the weight tensor has to be transposed back and forth for almost any kind of matmul/matvec. I don't know if it's critical, but I left it as is just to be safe.
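To illustrate the layout issue (a hypothetical sketch, not code from this PR): with the cuDNN ordering the output-plane axis comes first, so grouping weights by input plane for a gemm/gemv means permuting the leading two axes and flattening the taps on every call.

```python
# Hypothetical illustration only: weight is stored cuDNN-style as
# [nOutputPlane][nInputPlane][kH][kW], but a per-input-plane matmul wants
# [nInputPlane][nOutputPlane][kH*kW], so the first two axes get transposed
# and the kH x kW taps flattened.
def to_matmul_layout(weight):
    nOut = len(weight)
    nIn = len(weight[0])
    return [
        [
            [tap for row in weight[o][i] for tap in row]  # flatten kH x kW
            for o in range(nOut)
        ]
        for i in range(nIn)
    ]
```

With real tensors this permute is a copy (or a non-contiguous view that gemm can't consume directly), which is the back-and-forth transposition cost mentioned above.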