Add ARM Neon and scalar implementations of SIMD functions #359

jmarshall · 2022-06-26T23:44:22Z

Building BWA has long been limited to x86-64 due to the unconditional usage of SSE2 SIMD intrinsics in ksw.c's ksw_u8() and ksw_i16() functions. This PR adds Neon implementations of these functions, allowing compilation on Aarch64 machines, and also implementations using ordinary scalar C code that at least provide a working (albeit inefficient) build on other (little endian, at least) platforms.

This is done without duplicating the code in ksw.c: in most cases, by defining SSE2 intrinsics as static inline functions calling Neon intrinsics where there is a direct equivalent; in other cases, by analysing what the SSE2 code is doing semantically and using a non-directly-equivalent Neon intrinsic that has the same effect in the context of the code. (Similarly with the scalar implementation using a small inline C function for each SSE2 intrinsic.)

There are already BWA PRs #283 and #344, which port the SIMD code to other platforms using SIMDe and sse2neon respectively. The approach used in this PR is superior for several reasons:

It adds ~200 lines of code, rather than a ~800,000-line submodule (SIMDe) or an external ~9000-line header file (sse2neon).
The sse2neon approach adds support for ARM platforms only; this approach (and the SIMDe approach) add support for other platforms too — little endian platforms anyway; the code may need further porting to other endiannesses.

(This is mostly of interest to distribution packagers who wish to have at least minimal support for other platforms; I am unaware of substantial communities of even potential BWA users on platforms other than x86 and ARM.)
This approach produces a more efficient ARM implementation than naive use of a direct SSE2 intrinsic emulation layer:
- The __max_16/__max_8 macros return the maximum entry in their vector argument.
  
  SSE2 does not implement this operation directly, so they are coded in terms of other SSE2 intrinsics. The _mm_srli_si128 operation used does not have a direct equivalent in Neon; sse2neon implements it using a sequence of three Neon intrinsics involving memory I/O. Implemented thus, __max_8 requires 3*(1+3)+1 = 13 Neon operations.
  
  However by understanding the semantics of the code (instead of simply translating individual operations), the entire __max_8 computation can be done using a single Neon vmaxvq_s16 intrinsic (involving only registers).
- Similarly sse2neon implements __mm_movemask_epi8 using a sequence of six Neon intrinsics. However analysing the code semantics (see allzero_16 and allzero_0f_8 below) shows that an identical effect can be achieved using a single vmaxvq_u8/vmaxvq_u16 Neon intrinsic.

The previous code implicitly caused a load; change it so the load intrinsic is explicitly invoked, as the others are. (This in fact makes no difference to the generated code.)

Many Intel intrinsics have a corresponding Neon equivalent. Other cases are more interesting: * Neon's vmaxvq directly selects the maximum entry in a vector, so can be used to implement both the __max_16/__max_8 macros and the _mm_movemask_epi8 early loop exit. Introduce additional helper macros alongside __max_16/__max_8 so that the early loop exit can similarly be implemented differently on the two platforms. * Full-width shifts can be done via vextq. This is defined close to the ksw_u8()/ksw_i16() functions (rather than in neon_sse.h) as it implicitly uses one of their local variables. * ksw_i16() uses saturating *signed* 16-bit operations apart from _mm_subs_epu16; presumably the data is effectively still signed but we wish to keep it non-negative. The ARM intrinsics are more careful about type checking, so this requires an extra U16() helper macro.

jmarshall · 2022-06-26T23:49:00Z

This has been tested on x86-64 macOS and on aarch64 Debian GNU/Linux 11 (bullseye) (on a Raspberry Pi). The scalar version has been tested on both of those platforms. Thanks to @nh13 for assistance with test data and testing.

Make the native SSE2 code conditional on __SSE2__, which is defined by GCC/Clang/etc on x86-64 by default and on i386 with -msse2 etc.

martin-g

Thank you, @jmarshall !
I've successfully verified your changes on Linux openEuler 20.03 SP3. CPU: Kunpeng-920 aarch64

julien-faye · 2022-07-13T07:38:38Z

Awesome work, @jmarshall !
I have also tested successfully the changes on Oracle Cloud's Ampere VM!
I hope it is merged soon!

jmarshall · 2022-08-04T13:49:55Z

This has now been tested on a big-endian host (namely s390x), where (for this PR's test set of 10,000 reads) it produces the same output as the PR on other hosts and current bwa master on x86-64.

Note that bwa's index files (in particular, the .sa and .bwt files) are not portable between endiannesses, so bwa index must be run on the same machine — or endianness, at least — as bwa mem.

julien-faye · 2022-08-25T10:57:49Z

@lh3 Is there a chance this PR to be reviewed and merged soon ?

lh3 · 2022-08-31T13:55:03Z

Thank you so much for this, @jmarshall. I got a M1 macbook yesterday. I also need the arm support now.

jmarshall · 2022-08-31T14:52:13Z

Glad to be of service! 😄

If you were going to make a new release, it would be vital to also merge either #345 or #348 (the latter would be my preference…), which fix a bug on current master whereby the SAM @HD header can be printed in the wrong order, after some of the other headers.

nh13 · 2022-08-31T17:31:20Z

@lh3 thank-you! I'd echo John's call for a new release so we can bump it on conda for the ARM folks of the world to use. I don't think it's to onerous to also add #348 and/or a few other PRs before you release too.

stsquad · 2022-09-01T13:34:56Z

Do you know if anyone is considering enabling SVE/SVE2 for these functions?

Apply https://patch-diff.githubusercontent.com/raw/lh3/bwa/pull/359.patch from lh3/bwa#359 Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

jmarshall added 3 commits June 17, 2022 18:42

Make _mm_load_si128() explicit

ab01ab4

The previous code implicitly caused a load; change it so the load intrinsic is explicitly invoked, as the others are. (This in fact makes no difference to the generated code.)

On other platforms, emulate SSE2 SIMD calls using scalar code

ac612b6

This was referenced Jun 26, 2022

travis: added arm64, ppc64le and s390x tests with simde #283

Open

Does not compile for ARM #300

Open

Use native SSE2 intrinsics on i386 as well as x86-64

c77ace7

Make the native SSE2 code conditional on __SSE2__, which is defined by GCC/Clang/etc on x86-64 by default and on i386 with -msse2 etc.

martin-g approved these changes Jun 27, 2022

View reviewed changes

martin-g mentioned this pull request Aug 10, 2022

Add CI job for Ubuntu 20.04 aarch64 #367

Merged

lh3 merged commit 2e603e4 into lh3:master Aug 31, 2022

jmarshall deleted the neon branch August 31, 2022 14:46

nh13 mentioned this pull request Aug 31, 2022

Add support for Arm/Aaarch64. #344

Closed

Yikun mentioned this pull request Feb 10, 2023

Upgrade bwa to latest to support linux arm64/aarch64 crisprVerse/Rbwa#1

Closed

martin-g added a commit to martin-g/Rbwa that referenced this pull request Feb 17, 2023

Add support for Linux ARM64

2c16f6e

Apply https://patch-diff.githubusercontent.com/raw/lh3/bwa/pull/359.patch from lh3/bwa#359 Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

martin-g mentioned this pull request Feb 17, 2023

Add support for Linux ARM64 crisprVerse/Rbwa#2

Merged

Yikun mentioned this pull request Sep 6, 2023

Time to release new version? #401

Closed

yangxingyu588 mentioned this pull request Sep 11, 2023

ask about compiling on arm. bio-arm/bio-arm.github.io#12

Open

jlost mentioned this pull request Nov 19, 2023

Fails compile on POWER8: SSE2 intrinsics -> VSX/AltiVec #77

Closed

bedroge mentioned this pull request Nov 21, 2023

{2023.06}[GCCcore/11.3.0] BWA-0.7.17-GCCcore-11.3.0 EESSI/software-layer#369

Closed

jmarshall mentioned this pull request Mar 15, 2024

update bwa-0.7.17: enable linux-aarch64 build bioconda/bioconda-recipes#46461

Closed

chenrui333 mentioned this pull request Apr 23, 2024

bwa 0.7.18 Homebrew/homebrew-core#169815

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ARM Neon and scalar implementations of SIMD functions #359

Add ARM Neon and scalar implementations of SIMD functions #359

jmarshall commented Jun 26, 2022 •

edited

Loading

jmarshall commented Jun 26, 2022

martin-g left a comment

julien-faye commented Jul 13, 2022

jmarshall commented Aug 4, 2022

julien-faye commented Aug 25, 2022

lh3 commented Aug 31, 2022

jmarshall commented Aug 31, 2022

nh13 commented Aug 31, 2022

stsquad commented Sep 1, 2022

Add ARM Neon and scalar implementations of SIMD functions #359

Add ARM Neon and scalar implementations of SIMD functions #359

Conversation

jmarshall commented Jun 26, 2022 • edited Loading

jmarshall commented Jun 26, 2022

martin-g left a comment

Choose a reason for hiding this comment

julien-faye commented Jul 13, 2022

jmarshall commented Aug 4, 2022

julien-faye commented Aug 25, 2022

lh3 commented Aug 31, 2022

jmarshall commented Aug 31, 2022

nh13 commented Aug 31, 2022

stsquad commented Sep 1, 2022

jmarshall commented Jun 26, 2022 •

edited

Loading