Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add first NEON SIMD opcode implementations to fast interp #3859

Conversation

jammar1
Copy link

@jammar1 jammar1 commented Oct 15, 2024

Add some implementations of SIMD opcodes using NEON instructions.
Tested using:

(module
  (import "wasi_snapshot_preview1" "proc_exit" (func $proc_exit (param i32)))
  
  (memory (export "memory") 1)

  (func $assert_true (param v128)
    local.get 0
    v128.any_true 
    i32.eqz
    if
      unreachable
    end
  )

  (func $main (export "_start")

    i32.const 0
    i32.const 32
    memory.grow
    drop

    i32.const 0
    v128.const i8x16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    v128.store

    i32.const 0
    v128.load

    v128.const i8x16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    i8x16.eq
    call $assert_true

    i32.const 16
    v128.const i8x16 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
    v128.store

    i32.const 16
    v128.load
    v128.const i8x16 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
    i8x16.eq
    call $assert_true

    i32.const 0
    v128.load
    v128.const i8x16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    i8x16.eq
    call $assert_true
    drop

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    
    i32.const 0
    i32.const 1
    memory.grow
    drop

    i32.const 0
    i64.const 0x7F80FF017E02FE80
    i64.store

    i32.const 0
    v128.load8x8_s

    v128.const i16x8 127 -128 -1 1 126 2 -2 -128

    i16x8.eq
    call $assert_true

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    i32.const 0
    i64.const 0x80FE027E01FF807F
    i64.store

    i32.const 0
    v128.load8x8_u

    v128.const i16x8 128 254 2 126 1 255 128 127

    i16x8.eq
    call $assert_true

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    i32.const 0
    i64.const 0x8000FFFE7FFF0001
    i64.store

    i32.const 0
    v128.load16x4_s

    v128.const i32x4 -32768 -2 32767 1

    i32x4.eq
    call $assert_true

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    i32.const 0
    i64.const 0x8000FFFE7FFF0001 
    i64.store

    i32.const 0
    v128.load16x4_u

    v128.const i32x4 32768 65534 32767 1   

    i32x4.eq
    call $assert_true

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    i32.const 0
    i64.const 0x8000000000000001
    i64.store

    i32.const 0
    v128.load32x2_s

    v128.const i64x2 -2147483648 1 

    i64x2.eq
    call $assert_true

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    i32.const 0
    i64.const 0x8000000000000001
    i64.store

    i32.const 0
    v128.load32x2_u

    v128.const i64x2 2147483648 1

    i64x2.eq
    call $assert_true

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    call $proc_exit

  )
)

@@ -597,3 +597,6 @@ endif()
if (NOT WAMR_BUILD_SANITIZER STREQUAL "")
message (" Sanitizer ${WAMR_BUILD_SANITIZER} enabled")
endif ()
if (ENABLE_NEON EQUAL 1)
add_definitions(-DENABLE_NEON=1)
endif()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the devices we target are ARMv7 rather than ARMv8 (and even the v8 devices don't usually run in AArch64 mode. Most, but not all of the v7 devices have NEON support.

So I want to add a new macro for this, but not sure if this is the right thing to do?

I was thinking I could do something like a WAMR_BUILD_ARM_NEON style option and then #ifdef WAMR_BUILD_ARM_NEON != 0 || BUILD_TARGET_AARCH64 or something like that?

Copy link
Collaborator

@loganek loganek Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new build flag makes sense if really NEON support is non-deterministic only based on the architecture. I'd however suggest a few changes here:

  • use WAMR_BUILD_* prefix for consistency (e.g. WAMR_BUILD_ARM_NEON)
  • consider enabling it by default for some common combinations
  • for the #ifdef WAMR_BUILD_ARM_NEON != 0 || BUILD_TARGET_AARCH64 condition, I'm not sure if the build target is needed; we should just use the NEON ifdef, otherwise it should not be defined if not desired.

case SIMD_v128_load8x8_s:
#if ENABLE_NEON == 1
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I do something similar to the invokeNative files where I define some function/symbol that is defined for each architecture in a separate file and put the opcode implementation there, or is it better inline here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my preference would be to not keep it here as it might grow a lot if we start supporting more architectures.

@jammar1
Copy link
Author

jammar1 commented Oct 15, 2024

Also regarding testing, I think I will need to add some qemu tests for ARMv7 + NEON in this PR. Is that okay?

@@ -597,3 +597,6 @@ endif()
if (NOT WAMR_BUILD_SANITIZER STREQUAL "")
message (" Sanitizer ${WAMR_BUILD_SANITIZER} enabled")
endif ()
if (ENABLE_NEON EQUAL 1)
add_definitions(-DENABLE_NEON=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
add_definitions(-DENABLE_NEON=1)
add_definitions(-DWASM_ENABLE_NEON=1)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, you should add this to the config.h file and set it to 0 by default

Comment on lines 5680 to 5686

int8x8_t v8 = vld1_s8((const int8_t *)maddr);
v8 = vrev64_s8(v8);
int16x8_t v16 = vmovl_s8(v8);

V128 result;
vst1q_s16((int16_t *)&result, v16);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess only this part is architecture specific so the other code can go outside of the ifdef (and possibly have an inefficient, plain C implementation in the else block?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same for all the other opcodes below.

core/iwasm/common/wasm_runtime_common.h Show resolved Hide resolved

int8x8_t v8 = vld1_s8((const int8_t *)maddr);
v8 = vrev64_s8(v8);
int16x8_t v16 = vmovl_s8(v8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had better not use arch/os specified API here, could we wrap them to common APIs? And are there common implementations which doesn't depend on specified arch/os?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

using arch specific things directly like this PR should be the last resort after considering other options.

have you already considered libraries like simde? or compiler intrinsics? or even pure C implementation?
cf. https://github.com/yamt/toywasm/blob/e9226fae3e610d37742aa7afc8b0b7a7cf04f941/lib/insn_impl_simd.h#L8-L12

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to use simde - I'll update this PR adding that and implementing a couple of opcodes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added the library and implemented the swizzle opcode using it. Should be quite quick to add the rest of them now.

@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from 8a4d665 to 2de1515 Compare October 21, 2024 14:57
@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from 2de1515 to 71d129f Compare October 21, 2024 15:21
@@ -288,6 +288,9 @@ endif ()
if (WAMR_BUILD_LIB_RATS EQUAL 1)
message (" Lib rats enabled")
endif()
if (WAMR_BUILD_SIMDE EQUAL 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since now it only supports fast-interp, how about if ((WAMR_BUILD_SIMDE EQUAL 1) AND (WAMR_BUILD_FAST_INTERP EQUAL 1))?

@@ -142,6 +142,10 @@ if (WAMR_BUILD_LIB_RATS EQUAL 1)
include (${IWASM_DIR}/libraries/lib-rats/lib_rats.cmake)
endif ()

if (WAMR_BUILD_SIMDE EQUAL 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -19,6 +19,10 @@ else ()
set (LOADER "wasm_loader.c")
endif ()

if (WAMR_BUILD_SIMD)
set (WAMR_BUILD_SIMDE 1)
endif()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add these lines since WAMR_BUILD_SIMD is already larger than 0, had better remove these lines?

core/iwasm/common/wasm_runtime_common.h Outdated Show resolved Hide resolved
core/iwasm/common/wasm_runtime_common.h Show resolved Hide resolved
wasm_set_exception(module, "unsupported SIMD opcode");
uint32 offset, addr;
offset = read_uint32(frame_ip);
addr = GET_OPERAND(uint32, I32, 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it should be addr = GET_OPERAND(uint32, I32, 0); frame_ip += 2;?


V128 data;
data = POP_V128();
frame_ip += 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should remove this line?

@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from f101096 to b00c2e6 Compare October 31, 2024 15:03
@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from b00c2e6 to d92cfd9 Compare October 31, 2024 15:23
Copy link
Collaborator

@loganek loganek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch 3 times, most recently from ae858b2 to c669aaf Compare November 1, 2024 17:09
@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from c669aaf to d35ed4d Compare November 1, 2024 17:34
HANDLE_OP(WASM_OP_SIMD_PREFIX)
{
GET_OPCODE();

switch (opcode) {
/* Memory */
case SIMD_v128_load:
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my view, wouldn't it be better to move all the new ~1000 lines to a new, separate file? The original core/iwasm/interpreter/wasm_interp_fast.c is already large. let's not make it any bigger.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - I'm happy to follow up with a PR that moves things around later, but I have a few other branches that I'd need to rebase etc so would be good to merge this? This is just a feature branch so I think it's okay if it's a bit messy for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jammar1 I added several comments, could you have a look? I am fine to merge this PR to the branch dev/simd_for_interp first but we should address the issues then. Do you want to merge this PR first, or merge it after you address the new comments?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenyongh I've addressed the comments and CI/ the checks have passed. Could you merge it if you're happy with the changes? Thanks

@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from d8042b4 to 2f651f1 Compare November 4, 2024 15:42
@jammar1
Copy link
Author

jammar1 commented Nov 4, 2024

@wenyongh I think I've addressed your comments, would you be able to review this? Thanks

core/iwasm/common/wasm_runtime_common.h Show resolved Hide resolved
core/iwasm/libraries/simde/simde.cmake Outdated Show resolved Hide resolved
core/iwasm/libraries/simde/simde.cmake Outdated Show resolved Hide resolved
core/iwasm/interpreter/wasm_opcode.h Outdated Show resolved Hide resolved
core/iwasm/interpreter/wasm_interp_fast.c Outdated Show resolved Hide resolved
@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from e3e45c6 to 6b90955 Compare November 5, 2024 14:00
@jammar1 jammar1 force-pushed the jamamr1/first_neon_intrinsic branch from 6b90955 to b20c300 Compare November 5, 2024 14:04
Copy link
Collaborator

@lum1n0us lum1n0us left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@wenyongh wenyongh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wenyongh wenyongh merged commit c930c4d into bytecodealliance:dev/simd_for_interp Nov 5, 2024
383 checks passed
@wenyongh
Copy link
Contributor

wenyongh commented Nov 6, 2024

@wenyongh I think I've addressed your comments, would you be able to review this? Thanks

Sure, merged. A suggestion is to change WASM_ENABLE_SIMDE to WASM_ENABLE_LIB_SIMDE, one is the former makes confusing with WASM_ENABLE_SIMD, the other is that we use cmake variable WAMR_BUILD_LIB_SIMDE, it is better to make it consistent. How do you think?

@jammar1 jammar1 deleted the jamamr1/first_neon_intrinsic branch November 6, 2024 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants