suggestions for performance #18
Thanks Peter, it all makes sense, considering that the equal-size hidden layers allow for fewer allocations and copies to be made. What kind of performance improvement are you getting? I'm interested in both GNU and Intel compilers.
Elapsed time on output: 0.732999980

This is on gfortran-7, an Intel processor, and a model with (50,50) hidden neurons, 21 inputs, and 256 outputs. I don't have the Intel compilers on this computer, but I do at home; I can get back to you.
On ifort the differences are much bigger:
Elapsed time on output: 0.8096000

Here the last routine processes all the input data at once, as mentioned (for the other ones, I used a loop).
Thanks. Do you know if any of these could be implemented in the existing class, which doesn't assume equal sizes of hidden layers? The only opportunity I see is to allocate the activation arrays once rather than re-allocate on assignment in every iteration, in which case the subroutine to calculate activations in place may be effective (see the sketch below). Btw, the 10-output-batch branch introduces a function to output per batch (contributed by @jvdp1), although it's just a wrapper around the output for a single set of inputs, so it doesn't have any of the performance improvements that you propose.
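A minimal sketch of the idea discussed here, not the library's actual API (the module and routine names are illustrative): the caller allocates the activation array once and the subroutine updates it in place, so no re-allocation on assignment happens inside the inference loop.

```fortran
module mod_activation_inplace
  implicit none
contains
  ! Apply the sigmoid in place, so the caller's pre-allocated array
  ! is reused across iterations instead of being re-allocated by
  ! allocation-on-assignment on every forward pass.
  subroutine sigmoid_inplace(x)
    real, intent(inout) :: x(:)
    x = 1. / (1. + exp(-x))
  end subroutine sigmoid_inplace
end module mod_activation_inplace
```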
That might be a good idea, although based on the previous result the allocation overhead isn't that big. I'm by no means an expert on Fortran, by the way; this has been an interesting learning experience!
OK, great, thanks, I overlooked that. In general, I don't want to break the high-level functional API. However, I'm all for tweaking the internal implementation if the performance benefits are clear. You've given clear ideas and examples of how to do this. I will play with it and let you know.
You're welcome! I have now implemented these changes in a more general manner here: https://github.com/peterukk/rte-rrtmgp/blob/master/neural/ Most notably, activation functions have been replaced with subroutines, and these procedures have a 2D array variant which is used in, e.g., output_sgemm.
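A hypothetical sketch of what such a 1D/2D subroutine pair might look like (the names relu_1d and relu_2d are illustrative, not taken from the linked code); the 2D variant is the one a batched routine like output_sgemm would call:

```fortran
module mod_activation_subs
  implicit none
contains
  ! 1D variant: activations for a single sample.
  subroutine relu_1d(x)
    real, intent(inout) :: x(:)
    x = max(0., x)
  end subroutine relu_1d

  ! 2D variant: activations for a whole batch at once,
  ! with x shaped (neurons, samples).
  subroutine relu_2d(x)
    real, intent(inout) :: x(:,:)
    x = max(0., x)
  end subroutine relu_2d
end module mod_activation_subs
```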
FYI, I have now tested using elemental subroutines as activation functions. This is considerably faster (and works with both 1D and 2D arrays), but unfortunately, pointers do not work with elemental procedures :/ I'm disappointed that object-oriented programming leads to a performance loss in this instance.
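For illustration, a minimal sketch of an elemental activation subroutine (names are assumptions, not the linked code): one scalar definition that the compiler applies element-wise, so the same routine handles 1D and 2D arrays. The catch mentioned above is a language rule: the Fortran standard forbids procedure pointers to elemental procedures, so an object-oriented design that stores the activation in a pointer component cannot use them.

```fortran
module mod_elemental_activation
  implicit none
contains
  ! Elemental: defined on scalars, applied element-wise when called
  ! with array arguments of any rank.
  elemental subroutine sigmoid(x, y)
    real, intent(in)  :: x
    real, intent(out) :: y
    y = 1. / (1. + exp(-x))
  end subroutine sigmoid
end module mod_elemental_activation
```

Both `call sigmoid(z_1d, a_1d)` and `call sigmoid(z_2d, a_2d)` then resolve to the same element-wise application, with no 2D copy of the routine needed.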
I'm using this package in a very performance-critical setting. I managed to make the inference quite a bit faster by making some changes (some of these are custom to my model). Below is the inference for a "flat" feed-forward model which takes a 1D array of inputs. I also experimented with inputting multiple samples at a time (2D array), which replaces matrix-vector multiplications with matrix-matrix multiplications. This may be faster on some platforms/models (for me it was a wash), but in that case make sure to use SGEMM/DGEMM to replace the matmul call.
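The code attached to the original comment is not reproduced in this thread; below is a hedged sketch of the batched idea only, with illustrative names and shapes (w, x, y are assumptions). With inputs packed as a 2D array of shape (features, samples), each layer's matrix-vector products collapse into one matrix-matrix product, which can be dispatched to BLAS SGEMM instead of matmul:

```fortran
module mod_batched_layer
  implicit none
contains
  ! Forward pass of one layer for a whole batch: y = w * x, computed
  ! with single-precision BLAS GEMM (link against a BLAS library,
  ! e.g. -lblas or MKL). Use DGEMM for double precision.
  subroutine layer_forward_sgemm(w, x, y)
    real, intent(in)  :: w(:,:)  ! weights, shape (n_out, n_in)
    real, intent(in)  :: x(:,:)  ! input batch, shape (n_in, n_samples)
    real, intent(out) :: y(:,:)  ! output batch, shape (n_out, n_samples)
    integer :: m, n, k
    m = size(w, 1)
    k = size(w, 2)
    n = size(x, 2)
    ! y := 1.0 * matmul(w, x) + 0.0 * y
    call sgemm('N', 'N', m, n, k, 1.0, w, m, x, k, 0.0, y, m)
  end subroutine layer_forward_sgemm
end module mod_batched_layer
```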