Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ARM Neon and scalar implementations of SIMD functions #359

Merged
merged 4 commits into from
Aug 31, 2022

Conversation

jmarshall
Copy link
Contributor

@jmarshall jmarshall commented Jun 26, 2022

Building BWA has long been limited to x86-64 due to the unconditional usage of SSE2 SIMD intrinsics in ksw.c's ksw_u8() and ksw_i16() functions. This PR adds Neon implementations of these functions, allowing compilation on Aarch64 machines, and also implementations using ordinary scalar C code that at least provide a working (albeit inefficient) build on other (little endian, at least) platforms.

This is done without duplicating the code in ksw.c: in most cases, by defining SSE2 intrinsics as static inline functions calling Neon intrinsics where there is a direct equivalent; in other cases, by analysing what the SSE2 code is doing semantically and using a non-directly-equivalent Neon intrinsic that has the same effect in the context of the code. (Similarly with the scalar implementation using a small inline C function for each SSE2 intrinsic.)

There are already BWA PRs #283 and #344, which port the SIMD code to other platforms using SIMDe and sse2neon respectively. The approach used in this PR is superior for several reasons:

  1. It adds ~200 lines of code, rather than a ~800,000-line submodule (SIMDe) or an external ~9000-line header file (sse2neon).

  2. The sse2neon approach adds support for ARM platforms only; this approach (and the SIMDe approach) add support for other platforms too — little endian platforms anyway; the code may need further porting to other endiannesses.

    (This is mostly of interest to distribution packagers who wish to have at least minimal support for other platforms; I am unaware of substantial communities of even potential BWA users on platforms other than x86 and ARM.)

  3. This approach produces a more efficient ARM implementation than naive use of a direct SSE2 intrinsic emulation layer:

    • The __max_16/__max_8 macros return the maximum entry in their vector argument.

      SSE2 does not implement this operation directly, so they are coded in terms of other SSE2 intrinsics. The _mm_srli_si128 operation used does not have a direct equivalent in Neon; sse2neon implements it using a sequence of three Neon intrinsics involving memory I/O. Implemented thus, __max_8 requires 3*(1+3)+1 = 13 Neon operations.

      However by understanding the semantics of the code (instead of simply translating individual operations), the entire __max_8 computation can be done using a single Neon vmaxvq_s16 intrinsic (involving only registers).

    • Similarly sse2neon implements __mm_movemask_epi8 using a sequence of six Neon intrinsics. However analysing the code semantics (see allzero_16 and allzero_0f_8 below) shows that an identical effect can be achieved using a single vmaxvq_u8/vmaxvq_u16 Neon intrinsic.

The previous code implicitly caused a load; change it so the load
intrinsic is explicitly invoked, as the others are. (This in fact
makes no difference to the generated code.)
Many Intel intrinsics have a corresponding Neon equivalent.
Other cases are more interesting:

* Neon's vmaxvq directly selects the maximum entry in a vector,
  so can be used to implement both the __max_16/__max_8 macros
  and the _mm_movemask_epi8 early loop exit. Introduce additional
  helper macros alongside __max_16/__max_8 so that the early loop
  exit can similarly be implemented differently on the two platforms.

* Full-width shifts can be done via vextq. This is defined close to
  the ksw_u8()/ksw_i16() functions (rather than in neon_sse.h) as it
  implicitly uses one of their local variables.

* ksw_i16() uses saturating *signed* 16-bit operations apart from
  _mm_subs_epu16; presumably the data is effectively still signed but
  we wish to keep it non-negative. The ARM intrinsics are more careful
  about type checking, so this requires an extra U16() helper macro.
@jmarshall
Copy link
Contributor Author

This has been tested on x86-64 macOS and on aarch64 Debian GNU/Linux 11 (bullseye) (on a Raspberry Pi). The scalar version has been tested on both of those platforms. Thanks to @nh13 for assistance with test data and testing.

Make the native SSE2 code conditional on __SSE2__, which is defined
by GCC/Clang/etc on x86-64 by default and on i386 with -msse2 etc.
Copy link
Contributor

@martin-g martin-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @jmarshall !
I've successfully verified your changes on Linux openEuler 20.03 SP3. CPU: Kunpeng-920 aarch64

@julien-faye
Copy link

Awesome work, @jmarshall !
I have also tested successfully the changes on Oracle Cloud's Ampere VM!
I hope it is merged soon!

@jmarshall
Copy link
Contributor Author

This has now been tested on a big-endian host (namely s390x), where (for this PR's test set of 10,000 reads) it produces the same output as the PR on other hosts and current bwa master on x86-64.

Note that bwa's index files (in particular, the .sa and .bwt files) are not portable between endiannesses, so bwa index must be run on the same machine — or endianness, at least — as bwa mem.

@julien-faye
Copy link

@lh3 Is there a chance this PR to be reviewed and merged soon ?

@lh3 lh3 merged commit 2e603e4 into lh3:master Aug 31, 2022
@lh3
Copy link
Owner

lh3 commented Aug 31, 2022

Thank you so much for this, @jmarshall. I got a M1 macbook yesterday. I also need the arm support now.

@jmarshall jmarshall deleted the neon branch August 31, 2022 14:46
@jmarshall
Copy link
Contributor Author

Glad to be of service! 😄

If you were going to make a new release, it would be vital to also merge either #345 or #348 (the latter would be my preference…), which fix a bug on current master whereby the SAM @HD header can be printed in the wrong order, after some of the other headers.

@nh13
Copy link
Contributor

nh13 commented Aug 31, 2022

@lh3 thank-you! I'd echo John's call for a new release so we can bump it on conda for the ARM folks of the world to use. I don't think it's to onerous to also add #348 and/or a few other PRs before you release too.

@stsquad
Copy link

stsquad commented Sep 1, 2022

Do you know if anyone is considering enabling SVE/SVE2 for these functions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants