imlib/filter: Vectorize morph() kernel. #2415
Open
Depends on #2417.
Benchmark results here: https://docs.google.com/spreadsheets/d/1-FNVKCEr8-6UYs8MUm6wgsOt2c8ihJ2mg9QXKkG91os/edit?gid=452211341#gid=452211341
With Helium, the AE3 is 4.2x faster than the RT1062.
Note, however, that this PR reduces the performance of the morph kernel by 50% for grayscale 3x3 kernels in exchange for making the code generic and vectorizable. The previous code provided the best possible speed on M4/M7 architectures, but it could not be vectorized and only worked for 3x3 kernels. The new code offers vectorized processing for any kernel size.
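For readers unfamiliar with the shape of a size-generic kernel loop, here is a minimal scalar sketch in C. It is not the code from this PR: the function name `morph_gray_sketch`, the helper `clamp_int`, the edge-replicating border handling, and the 32-bit accumulator are all assumptions made for illustration. It also assumes `ksize` is the kernel half-width, so the kernel has `(2 * ksize + 1)` rows and columns.

```c
#include <stdint.h>

/* Hypothetical helper (not from the PR): clamp v into [lo, hi]. */
static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Illustrative size-generic convolution over an 8-bit grayscale image.
 * krn holds (2 * ksize + 1)^2 signed coefficients in row-major order.
 * Out-of-range coordinates are clamped to the nearest edge pixel
 * (an assumption; the real code may treat borders differently).
 */
void morph_gray_sketch(const uint8_t *src, uint8_t *dst, int w, int h,
                       int ksize, const int8_t *krn) {
    int kdim = 2 * ksize + 1;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* 32-bit accumulator here for safety; the PR uses 16-bit ones. */
            int32_t acc = 0;
            for (int j = -ksize; j <= ksize; j++) {
                int yy = clamp_int(y + j, 0, h - 1);
                for (int i = -ksize; i <= ksize; i++) {
                    int xx = clamp_int(x + i, 0, w - 1);
                    acc += krn[(j + ksize) * kdim + (i + ksize)] * src[yy * w + xx];
                }
            }
            /* Saturate the result to the 8-bit pixel range. */
            dst[y * w + x] = (uint8_t) clamp_int(acc, 0, 255);
        }
    }
}
```

The inner two loops have no 3x3-specific unrolling, which is what lets the same structure serve any kernel size and, in principle, lets the inner row be processed several pixels at a time by a vector unit.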
Given the massive performance gain Helium has over the scalar code, this tradeoff makes sense.
The mul and add arguments were dropped because they are impossible to handle without complicating the default loop case. They can also easily overflow the 16-bit accumulators being used.
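To make the overflow concern concrete, here is a small hedged check in C. The function name `fits_in_i16` is hypothetical, and `mul` is treated as an integer scale purely for illustration; the actual argument types in imlib may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: computes acc * mul + add in 32 bits and reports
 * whether the result would still fit in a signed 16-bit accumulator.
 */
bool fits_in_i16(int32_t acc, int32_t mul, int32_t add) {
    int32_t scaled = acc * mul + add;
    return scaled >= INT16_MIN && scaled <= INT16_MAX;
}
```

For example, a 3x3 kernel of all ones over saturated pixels accumulates 9 * 255 = 2295, which fits comfortably. Scaling that by 16 gives 36720, which exceeds `INT16_MAX` (32767), so `fits_in_i16(2295, 16, 0)` returns false.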