The current stub implementation is totally impractical; the fastest GPU depthwise conv for Torch was cuDNN's grouped convolution with `self.groups == self.nInputPlane`. Still, even with cuDNN, Google's MobileNets, for example, run about 2x slower than ResNet-34.

I've tried many methods to efficiently reduce this to cuBLAS routines: using `gemmBatched`, grouping channels into heavier `gemm` loads, etc. Unfortunately, with those I only managed to roughly match cuDNN's performance in the backward pass and get a 1.5x speedup in the MobileNet forward pass. Surprisingly, the fastest option by far turned out to be... the super dumb `for` loop. Here it is (with a somewhat smarter option for `accGradParameters`, though). The forward/backward passes are now at least 45x/8x faster than the original implementation, respectively. Default MobileNet inference enjoys a speedup over the cuDNN case of 3.57x on Maxwell and 5.18x on Pascal.

Tested all outputs and gradients with a large batch size and `nInputPlane`, and various `nOutputPlane`.
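For illustration, here is a pure-Python sketch of the per-channel loop structure (stride 1, no padding, the plain depthwise case `nOutputPlane == nInputPlane`); the actual implementation in this PR is a CUDA kernel, and the function name here is hypothetical:

```python
# Pure-Python sketch of the "dumb for loop" depthwise convolution:
# each input plane is convolved independently with its own kH x kW filter.
# input:  [nInputPlane][iH][iW] nested lists
# weight: [nInputPlane][kH][kW] nested lists
def depthwise_forward(input, weight):
    n_planes = len(input)
    kH, kW = len(weight[0]), len(weight[0][0])
    iH, iW = len(input[0]), len(input[0][0])
    oH, oW = iH - kH + 1, iW - kW + 1  # stride 1, no padding
    output = [[[0.0] * oW for _ in range(oH)] for _ in range(n_planes)]
    for c in range(n_planes):  # channels never mix in depthwise conv
        for y in range(oH):
            for x in range(oW):
                acc = 0.0
                for ky in range(kH):
                    for kx in range(kW):
                        acc += input[c][y + ky][x + kx] * weight[c][ky][kx]
                output[c][y][x] = acc
    return output
```

The point is that, unlike a gemm-based reduction, each output element only ever touches one channel, so the naive loop wastes no work on the block-diagonal structure.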
Although the `weight` shape is `(nOutputPlane) x (nInputPlane) x (kH) x (kW)`, which perfectly corresponds to the cuDNN bindings, I don't like it much, since the weight tensor has to be transposed back and forth for almost any kind of matmul/matvec. I don't know if it's critical, but I left it as is just to be safe.
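To illustrate the layout issue (a hypothetical sketch, not code from this PR): with the cuDNN ordering the output-plane axis comes first, so grouping weights by input plane for a gemm/gemv means permuting the leading two axes and flattening the taps on every call.

```python
# Hypothetical illustration only: weight is stored cuDNN-style as
# [nOutputPlane][nInputPlane][kH][kW], but a per-input-plane matmul wants
# [nInputPlane][nOutputPlane][kH*kW], so the first two axes get transposed
# and the kH x kW taps flattened.
def to_matmul_layout(weight):
    nOut = len(weight)
    nIn = len(weight[0])
    return [
        [
            [tap for row in weight[o][i] for tap in row]  # flatten kH x kW
            for o in range(nOut)
        ]
        for i in range(nIn)
    ]
```

With real tensors this permute is a copy (or a non-contiguous view that gemm can't consume directly), which is the back-and-forth transposition cost mentioned above.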