
Support reading and writing bitpacked activations in C++ kernels. #305

Merged
merged 19 commits into master from read-write-bitpacked-x86
Apr 7, 2020

Conversation

AdamHillier
Contributor

@AdamHillier AdamHillier commented Mar 25, 2020

What do these changes do?

This feature is disabled by default and so does not change any behaviour for new or existing converted models; enabling this feature will require changes to the converter to set the correct op flags.

Currently, binary convolutions expect full-precision input, bitpack that input, perform the computation, and then write full-precision output. This feature allows a binary convolution to instead perform bitpacking inside the kernel and write (8-bit or 32-bit) bitpacked output, and allows the subsequent binary convolution to skip the usual initial bitpacking step and read the already-bitpacked input directly. Doing this significantly reduces the number of reads and writes to memory, as well as the overall memory footprint of model inference.
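To illustrate the idea (a minimal sketch only, not the actual LCE kernel code; the bit convention and the 32-bit word size here are assumptions for illustration):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: pack a buffer of float activations into 32-bit
// words, one bit per value. Here a bit is set when the activation is
// negative; the real kernels may use a different convention and word
// size (e.g. 8-bit words in the Ruy C++ kernel).
std::vector<std::uint32_t> PackBits(const std::vector<float>& activations) {
  constexpr int kBitsPerWord = 32;
  const int num_words =
      (static_cast<int>(activations.size()) + kBitsPerWord - 1) / kBitsPerWord;
  std::vector<std::uint32_t> packed(num_words, 0);
  for (std::size_t i = 0; i < activations.size(); ++i) {
    if (activations[i] < 0.0f) {
      packed[i / kBitsPerWord] |= std::uint32_t{1} << (i % kBitsPerWord);
    }
  }
  return packed;
}
```

A subsequent bitpacked-input convolution can then consume `packed` directly instead of re-packing full-precision activations.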

Support is added only for the C++ kernels (x86 and Arm32), including the reference kernel. Support for the optimised assembly kernels (Arm64) requires additional future work.

How have these changes been tested?

The existing bconv2d_test.cc tests have been extended to support testing bitpacked input/output tensors with the Ruy C++ kernel (8-bit bitpacking) and the reference kernel (32-bit bitpacking).

@AdamHillier AdamHillier added the feature New feature or request label Mar 25, 2020
@AdamHillier AdamHillier requested a review from a team March 25, 2020 12:29
@arashb arashb requested review from Tombana and removed request for a team March 25, 2020 12:37
@AdamHillier
Contributor Author

AdamHillier commented Apr 1, 2020

Thank you very much for reviewing, @Tombana, and for your suggestions. I've now merged in changes from master, and the PR is extended to work with the reference kernel as well (which took far, far, far longer than I expected it would).

A few remarks:

  • In the reference kernel, 32-bit bitpacking is used; in the C++ Ruy kernel, 8-bit bitpacking is used. This means that the reference and optimised ops can no longer be used simultaneously in the same interpreter, because they won't interact correctly when using bitpacked activations. This shouldn't ever actually be a problem, because mixing the two wouldn't make sense, but it's not a restriction that we enforce anywhere, and it did technically work previously.

  • One problem I encountered with 32-bit bitpacking in the reference kernel was being forced to use the kTfLiteInt32 type for bitpacked tensors, as TFLite doesn't have an unsigned 32-bit integer type. Unfortunately, this conflicted with using std::uint8_t and std::uint32_t as destination scalars throughout my implementation to template the bitpacked versions of the kernels. TFLite didn't like me populating a kTfLiteInt32 tensor with std::uint32_t data, or extracting std::uint32_t data from the tensor. Rather than searching for a cumbersome workaround, I elected to change all use of std::uint8_t and std::uint32_t as destination scalars to the signed equivalents std::int8_t and std::int32_t.

    This means there is now a bit of a mismatch between TBitpacked and DstScalar, because for the former we still use the unsigned types (although I don't think there's a technical restriction, given that the bitpacking functions use std::make_unsigned and so would also work with the signed scalars; see the sketch after this list).

    How do you feel about switching to the signed bitpacking types everywhere?

  • It will probably be easiest to review the new changes commit-by-commit.
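To illustrate the std::make_unsigned point above (a hedged sketch, not the actual bitpacking code in this PR; the function name and signature are made up for illustration):

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative sketch: the bit manipulation is done on the unsigned
// counterpart of TBitpacked, so instantiating with a signed type such
// as std::int32_t behaves the same as with std::uint32_t.
template <typename TBitpacked>
TBitpacked PackWord(const float* values, int count) {
  using TUnsigned = typename std::make_unsigned<TBitpacked>::type;
  TUnsigned word = 0;
  const int bits = static_cast<int>(8 * sizeof(TBitpacked));
  for (int i = 0; i < count && i < bits; ++i) {
    if (values[i] < 0.0f) {
      word |= TUnsigned{1} << i;
    }
  }
  return static_cast<TBitpacked>(word);
}
```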

@Tombana
Collaborator

Tombana commented Apr 3, 2020

  • In the reference kernel, 32-bit bitpacking is used; in the C++ Ruy kernel, 8-bit bitpacking is used. This means that the reference and optimised ops can no longer be used simultaneously in the same interpreter, because they won't interact correctly when using bitpacked activations. This shouldn't ever actually be a problem, because mixing the two wouldn't make sense, but it's not a restriction that we enforce anywhere, and it did technically work previously.

It should indeed not be a problem. In the long term, however, I think we have to try to get the 32-bit (or even 64-bit) Ruy version working (it already does work, just not multithreaded), because it will be faster due to memory alignment. I think we can do that in a separate PR, later.

  • One problem I encountered with 32-bit bitpacking in the reference kernel was being forced to use the kTfLiteInt32 type for bitpacked tensors, as TFLite doesn't have an unsigned 32-bit integer type. Unfortunately, this conflicted with using std::uint8_t and std::uint32_t as destination scalars throughout my implementation to template the bitpacked versions of the kernels. TFLite didn't like me populating a kTfLiteInt32 tensor with std::uint32_t data, or extracting std::uint32_t data from the tensor. Rather than searching for a cumbersome workaround, I elected to change all use of std::uint8_t and std::uint32_t as destination scalars to the signed equivalents std::int8_t and std::int32_t.
    This means there is now a bit of a mismatch between TBitpacked and DstScalar, because for the former we still use the unsigned types (although I don't think there's a technical restriction, given that the bitpacking functions use std::make_unsigned and so would also work with the signed scalars).

    How do you feel about switching to the signed bitpacking types everywhere?

I've run into the same problems; it's very annoying that you can't put uint32 data into a TFLite tensor without reinterpret-casting it to int32.
Given that one could forget about std::make_unsigned (which can cause hard-to-debug problems), I vote for an entirely new datatype, now that we are getting int8 quantized types as well. The RFC is here: #325
This does not solve the issue of the TFLite tensor type enum, but at least our code will be much clearer and more readable.
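For concreteness, the kind of reinterpret-cast workaround being referred to looks roughly like this (a sketch only, assuming the TfLiteTensor C struct from tensorflow/lite/c/common.h; not code from this PR):

```cpp
#include <cstdint>
#include <cstring>

#include "tensorflow/lite/c/common.h"

// Illustrative sketch: a kTfLiteInt32 tensor exposes its buffer as
// int32_t, so storing bitpacked std::uint32_t words means
// reinterpreting the pointer.
void WriteBitpackedWords(TfLiteTensor* tensor, const std::uint32_t* words,
                         int num_words) {
  auto* dst = reinterpret_cast<std::uint32_t*>(tensor->data.i32);
  std::memcpy(dst, words, num_words * sizeof(std::uint32_t));
}
```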

  • It will probably be easiest to review the new changes commit-by-commit.

Thanks!

Tombana
Tombana previously approved these changes Apr 7, 2020
@AdamHillier
Contributor Author

GitHub won't let me approve my own PR, even though a substantial part of it was written by someone else, but thank you very much @Tombana for all of this 🎉

@Tombana Tombana merged commit 24df15a into master Apr 7, 2020
@Tombana Tombana deleted the read-write-bitpacked-x86 branch April 7, 2020 11:11
Tombana pushed a commit that referenced this pull request Apr 6, 2021
This 'test' is now split into two parts:
1) an actual test that takes an example Keras model and runs it
through the 'generate_ikva_network' function.
2) a debug utility that can be run as a stand-alone tool, now
including a proper argument parser.

And also:
* Move Ikva end2end tests to its own target
* Add new Ikva cocotb smoke test to CI