Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Option to reduce code size of the generated AOT file when stack trace is enabled #3758

Open
4 of 8 tasks
loganek opened this issue Aug 28, 2024 · 7 comments
Open
4 of 8 tasks
Labels

Comments

@loganek
Copy link
Collaborator

loganek commented Aug 28, 2024

Feature

We'd like to propose an option in the WAMR AOT compiler that will reduce the size of the generated code while still keeping track of some information about the existing callstack.

Currently, when compiling our file (~6 MB, over 100k call instructions and ~15k functions) to AOT with call stack tracking enabled (--enable-dump-call-stack option), the resulting file is 25MB (and is even more for ARM). When the call stack tracking is disabled, the size is ~18 MB, but we can't turn it off in production so we can analyze the crashes. Our experiments show that with our settings (no GC, no EH which is not supported yet but will be in the future) we could generate ~20MB bundle while keeping the basic information (just function call stacks).

We're not targeting GC, JIT and EH (once it's ready) modes although the solution potentially can be extended to those features too (didn't dig deep into the implementation).

Benefit

The main benefit of the feature is a reduced AOT code size (perhaps at the cost of some of the additional features) therefore a lower memory usage and disk usage (or network usage if the bundle is downloaded).

Implementation

We considered two main approaches to implement this functionality. We're leaning towards option 1 given the concerns around the other option, but we'd like to open the discussion for the community and also hear some of the ideas.

Option 1: keep wrapping call instructions

Currently stack traces are implemented by adding extra code around call instruction for saving the frame on the stack before the instruction, and popping it after the instruction is executed. This approach assumes that we'll re-use the existing mechanism, but the code for pushing the stack trace will be simplified and configurable.

Right now the --enable-dump-call-stack generates a code for:

  • keeping track of the callstack, including:
  • checking if there's enough space to allocate a new frame
  • previous frame
  • function index
  • instruction offset
  • imported function's parameters
  • operand stack (GC only)
  • frame ref flags (GC only)
  • performance info (only when profiling is enabled)

We'd like to introduce a new flag (the name is yet to be decided, perhaps something like --dump-call-stack-detail-level but it's yet to be decided) which will allows to define what's being tracked as part of each call and the level of validation. Each frame will include the following information:

  • function number (always present)
  • instruction offset (optional, configured through wamrc flag)

For now we'll completely disable validation for stack overflows, but if required, this can be added as an option too in the future.

This feature will not allow to include any other information - the --enable-dump-call-stack flag should be used instead if additional data is required.

To avoid additional instructions, we'll do the following:

  • we'll not populate field for the previous frame; instead, we'll keep each frame will have the known size, so prev/next frame can be calculated by just subtracting or adding the size of the current frame
  • code for checking bound checks will not be added
  • populating instruction offset will be optionally enabled

The benefit of this approach is simplicity of the implementation as we can rely on the existing mechanism; we already implemented a proof of concept and tested it on a few examples. The major disadvantage (compare to option 2) is a need for instrumenting every single call instruction; based on a few programs we've noticed that there's usually 5-10x more call instructions than function definitions, so option 2) might result in a better code. Another disadvantage is that we'll need to make changes in the runtime (as the runtime, when generating stack trace, relies on the prev field from the AOTFrame now - we'll use AOT feature flags to detect which option is used), so users would have to update both AOT code and the runtime itself.

Option 2

In this option, instead of instrumenting every call instruction, we'll add a code for pushing the frame at the beginning of the function, and code for popping frame will be added on termination. Given that based on our experiments there's usually significantly more calls than function definitions, there should be a visible code size reduction too, which is what we're trying to achieve here. Another benefit of this approach is that the code compiled with a new compiler can work out of the box with the older runtimes (at least 2.x due to abi compatibility). There's a few concerns though:

  • This option will not work for calls to imported functions, so for those we'll need to instrument the call instruction just like we do today.
  • This option won't allow us to track instruction offset, which might be a blocker for a lot of the users (our team currently only uses function indexes, but we anticipate using instruction offsets in the future too for crash analysis)
  • There can be multiple terminators within the function, which can negatively affect the code size. It's possible to detect all those terminators but then either all of them would have to include the code for popping the frame, or they'd have to have a jump instruction to a shared code that does this (I haven't explored this too much though, maybe it's enough to add the code before the last end in the function and before each return).

Tasks

@wenyongh wenyongh added enhancement New feature or request aot compiler RFC labels Aug 29, 2024
@wenyongh
Copy link
Contributor

Thanks @loganek for bringing the great idea, it is really an issue that the AOT code size increases a lot when dump call stack, performance profiling or GC feature is enabled. As you mentioned, there may be many fields to commit to the stack frame, and I think there had better be an option of whether to check the stack overflow or not. And this issue introduces a new stack frame structure so there will be two stack frame structures, different option combinations may use different stack frame structure. I think you mean --dump-call-stack-detail-level=n? And what will each n denote?

Another way may be to add some wamrc options for developer to select which elements to commit? For example:

--enable-commit-func-idx/--disable-commit-func-idx, or --commit-func-idx=1/0:  commit function index
--enable-commit-ip:            commit bytecode offset
--enable-commit-sp            commit stack pointer
--enable-commit-lp             commit local variables

And together with the current --enable-perf-profiling and --enable-gc options, developer can select which elements to commit flexibly. But it introduces more options to wamrc. Which way do you prefer?

And had better add an option to enable/disable stack frame boundary check, e.g. --enable-stack-frame-bounds-checks or --stack-frame-bounds-checks=1/0?

@loganek
Copy link
Collaborator Author

loganek commented Aug 29, 2024

And this issue introduces a new stack frame structure so there will be two stack frame structures, different option combinations may use different stack frame structure.

I'll need to do a bit more experiments; in my proof of concept I indeed created a new frame structure (AOTMiniFrame) for simplicity although, I think we could possibly re-use the same data structure (AOTFrame`) and keep some of the fields (e.g. pointer to a previous frame) not populated.

I think you mean --dump-call-stack-detail-level=n? And what will each n denote?

This was just a conceptual idea to highlight the fact there'll be a few options. I think I agree that just having n might not be enough as users will need different combinations. We could either have multiple options like you suggested to enable individual features, or have a single multi-value option, e.g. --dump-call-stack-features which can either by specified multiple times (--dump-call-stack-features=func-idx --dump-call-stack-features=local-variables) or as a comma-separated list (--dump-call-stack-features=func-idx,local-variables). I personally don't have strong opinion on either of the user interface, although personally would keep a list of distinct parameters as small as possible.

I'm more interested in your thoughts about option 1 and option 2 in my initial post, but given that option 2 won't give us all information, we should perhaps abandon it.

@wenyongh
Copy link
Contributor

wenyongh commented Sep 2, 2024

And this issue introduces a new stack frame structure so there will be two stack frame structures, different option combinations may use different stack frame structure.

I'll need to do a bit more experiments; in my proof of concept I indeed created a new frame structure (AOTMiniFrame) for simplicity although, I think we could possibly re-use the same data structure (AOTFrame`) and keep some of the fields (e.g. pointer to a previous frame) not populated.

Re-using AOTFrame structure is also good, but had better use fixed offsets to get the field func_index and ip_offset and treat both of them as uint32 for the tiny frame. Defining a structure AOTMiniFrame or AOTTinyFrame with the two fields may be clearer.

I think you mean --dump-call-stack-detail-level=n? And what will each n denote?

This was just a conceptual idea to highlight the fact there'll be a few options. I think I agree that just having n might not be enough as users will need different combinations. We could either have multiple options like you suggested to enable individual features, or have a single multi-value option, e.g. --dump-call-stack-features which can either by specified multiple times (--dump-call-stack-features=func-idx --dump-call-stack-features=local-variables) or as a comma-separated list (--dump-call-stack-features=func-idx,local-variables). I personally don't have strong opinion on either of the user interface, although personally would keep a list of distinct parameters as small as possible.

I also like the style of --dump-call-stack-features option, in fact we already use that way in many other options, like --emit-custom-sections and --enable-segue. I ever discussed it with @lum1n0us, he thought that it may be not so convenient for the user to type the feature strings. I think maybe we can simplify the feature strings? e.g.: "func-idx", "ip", "sp" and "locals"? It should be short enough, e.g. --dump-call-stack-features=func-idx,ip.

I'm more interested in your thoughts about option 1 and option 2 in my initial post, but given that option 2 won't give us all information, we should perhaps abandon it.

For option 2, I think we don't need to store the instruction offset when the aot stack frame is created/pushed, in fact the ip offset is stored only when exception is to be thrown (see calling commit_ip in aot_emit_exception.c) and some places that may need to commit the frame data and switch to runtime (see callings of aot_gen_commit_sp_ip in aot_emit_function.c and aot_emit_control.c). The function aot_alloc_frame in aot_runtime.c and alloc_frame_for_aot_func in aot_emit_function.c also don't store ip offset to the frame.

And the places to free the the stack frame in the AOT code also seem not an issue: the aot compiler concentrates the handlings of function return to be only in several places: (1) func_ret labe, (2) got_exception label, (3) func_end label. Below is a piece of LLVM IR generated by coremark workload:

func_end:                                         ; preds = %block28_end
  %phi = phi i32 [ 0, %block28_end ]
  ret i32 %phi

func_ret:                                         ; preds = %block30_end, %br_if_else1021, %check_exce_succ1010, %check_exce_succ992, %br_if_else952, %br_if_else942, %loop5_begin, %loop4_begin, %loop3_begin, %loop2_begin, %check_exce_succ740, %check_exce_succ723, %check_exce_succ706, %check_exce_succ689, %block23_end, %br_if_else654, %br_if_else617, %check_exce_succ592, %check_exce_succ574, %block12_end, %br_if_else501, %br_if_else439, %br_if_else377, %block13_end
  br label %got_exception

got_exception:                                    ; preds = %func_ret, %block10_end, %check_nan_succ, %br_if_else268, %block0_end
  %exception_id_phi = phi i32 [ 4, %block0_end ], [ 5, %br_if_else268 ], [ 3, %check_nan_succ ], [ 4, %block10_end ], [ 33, %func_ret ]
  call void @aot_set_exception_with_id(ptr %aot_inst, i32 %exception_id_phi)
  ret i32 0

I think the main issue of option 2 is: AOT runtime also allocates/frees frame before/after calling aot function, see aot_call_function(), if changing it, there may be backward compatibility issue for old AOT file - no frame is allocated for the first call for the old AOT file.

If you just wants to save the AOT code size for tiny frame, maybe we can define a function to allocate the frame, call if before calling the wasm function by passing it with func idx (and ip offset if needed)?

@wenyongh
Copy link
Contributor

wenyongh commented Sep 2, 2024

If to use the option 2, maybe we can use two bits in the feature flag of AOT file: a bit to indicate whether the tiny frame is used and a bit to indicate whether option 2 is used. And in the runtime: (1) when bit for option 2 is 0, keep the current behavior, (2) else, don't allocate/free aot frame before/after calling a function, and when traversing the frames, treat the frame as full frame or tiny frame according to tiny frame flag. By this way, aot runtime can keep the backward compatibility with the old AOT file in which the two bits are 0.

@loganek
Copy link
Collaborator Author

loganek commented Sep 2, 2024

Re-using AOTFrame structure is also good, but had better use fixed offsets to get the field func_index and ip_offset and treat both of them as uint32 for the tiny frame. Defining a structure AOTMiniFrame or AOTTinyFrame with the two fields may be clearer.

I'll review the code and see what option would be more suitable; in my current PoC I indeed have two distinct data structures so I might keep it like that going forward.

I also like the style of --dump-call-stack-features option, in fact we already use that way in many other options, like --emit-custom-sections and --enable-segue. I ever discussed it with @lum1n0us, he thought that it may be not so convenient for the user to type the feature strings. I think maybe we can simplify the feature strings? e.g.: "func-idx", "ip", "sp" and "locals"? It should be short enough, e.g. --dump-call-stack-features=func-idx,ip.

Yes, I think the names can be short and explained in the doc. I also think usually those are not often handwritten but rather part of the CI/build script so I hope this won't be an issue.

I'm more interested in your thoughts about option 1 and option 2 in my initial post, but given that option 2 won't give us all information, we should perhaps abandon it.

For option 2, I think we don't need to store the instruction offset when the aot stack frame is created/pushed, in fact the ip offset is stored only when exception is to be thrown (see calling commit_ip in aot_emit_exception.c) and some places that may need to commit the frame data and switch to runtime (see callings of aot_gen_commit_sp_ip in aot_emit_function.c and aot_emit_control.c). The function aot_alloc_frame in aot_runtime.c and alloc_frame_for_aot_func in aot_emit_function.c also don't store ip offset to the frame.

I think pushing ip offset must happen on every call/call_indirect, otherwise the stacks are not that helpful. And that's what happens today, see for example aot_gen_commit_sp_ip in aot_compile_op_call. So this part must be kept as is, and I don't think we can remove it if we want to have a full trace of instruction pointers.

And the places to free the the stack frame in the AOT code also seem not an issue: the aot compiler concentrates the handlings of function return to be only in several places: (1) func_ret labe, (2) got_exception label, (3) func_end label.

Yes, those are the places where the code needs to be generated. I'd need to run more experiments if putting extra instructions there will not cause the code size to increase significantly, but hopefully not.

I think the main issue of option 2 is: AOT runtime also allocates/frees frame before/after calling aot function, see aot_call_function(), if changing it, there may be backward compatibility issue for old AOT file - no frame is allocated for the first call for the old AOT file.

As discussed in the meeting today, I think we can workaround it by having AOT feature flags so the runtime knows what to do.

If you just wants to save the AOT code size for tiny frame, maybe we can define a function to allocate the frame, call if before calling the wasm function by passing it with func idx (and ip offset if needed)?

I thought about it too, but I think calling the function is going to generate not significantly less code than having instructions for allocating frame, and I think it might affect a performance significantly so I think I'd rather generate the extra code at the beginning of each function and in its termination stage.

After thinking more about it and speaking to @wenyongh offline, I think option 2 might give us better results in terms of code size optimization, so I'll build another prototype now and share some results, based on that we'll pick the final approach (although given that in our case we have 10x more calls than function definitions, I think option 2 will likely give better results).

@wenyongh
Copy link
Contributor

wenyongh commented Sep 2, 2024

For option 2, I think we don't need to store the instruction offset when the aot stack frame is created/pushed, in fact the ip offset is stored only when exception is to be thrown (see calling commit_ip in aot_emit_exception.c) and some places that may need to commit the frame data and switch to runtime (see callings of aot_gen_commit_sp_ip in aot_emit_function.c and aot_emit_control.c). The function aot_alloc_frame in aot_runtime.c and alloc_frame_for_aot_func in aot_emit_function.c also don't store ip offset to the frame.

I think pushing ip offset must happen on every call/call_indirect, otherwise the stacks are not that helpful. And that's what happens today, see for example aot_gen_commit_sp_ip in aot_compile_op_call. So this part must be kept as is, and I don't think we can remove it if we want to have a full trace of instruction pointers.

Yes, that should be kept, but note that here it stores the ip offset to the caller's frame which has been created, but not the callee frame: the callee frame is to be created next by the aot code generated by alloc_frame_for_aot_func (in opcode call), or by calling runtime API aot_alloc_frame (in opcode call_indirect), and both of them don't store ip offset to the frame. So it doesn't matter that we change the current full trace implementation to option 2.

And the places to free the the stack frame in the AOT code also seem not an issue: the aot compiler concentrates the handlings of function return to be only in several places: (1) func_ret labe, (2) got_exception label, (3) func_end label.

Yes, those are the places where the code needs to be generated. I'd need to run more experiments if putting extra instructions there will not cause the code size to increase significantly, but hopefully not.

Yes, in fact func_ret label goes to got_exception label, so we can just free the frame in func_end label and got_exception. I am not sure whether there will be many func_end labels in a function, but I think we can optimize it to make a function have only one func_end label if needed.

If you just wants to save the AOT code size for tiny frame, maybe we can define a function to allocate the frame, call if before calling the wasm function by passing it with func idx (and ip offset if needed)?

I thought about it too, but I think calling the function is going to generate not significantly less code than having instructions for allocating frame, and I think it might affect a performance significantly so I think I'd rather generate the extra code at the beginning of each function and in its termination stage.

Got it, thanks.

After thinking more about it and speaking to @wenyongh offline, I think option 2 might give us better results in terms of code size optimization, so I'll build another prototype now and share some results, based on that we'll pick the final approach (although given that in our case we have 10x more calls than function definitions, I think option 2 will likely give better results).

Agree, in fact fast-jit also uses option 2, there is only one frame allocation and frame freeing in a function's jitted code.

wenyongh pushed a commit that referenced this issue Sep 5, 2024
Those parameters can be used to reduce the size of the AOT code.

There's going to be more changes related to AOT code size reduction,
this is just the initial step.

p.s. #3758
@loganek
Copy link
Collaborator Author

loganek commented Sep 9, 2024

Initially I thought that func_idx will always be needed. However, just having another thought on that, when the user is interested in both function index and instruction pointer, they could just save function pointer to the frame and skip writing the function index at all to save some generated code. We can then provide a simple script that using wasm-objdump "fixes" the stack trace by deducing function index based on the instruction pointer of that frame. I'll create a task to add func-idx as yet another parameter to the --call-stack-features option.

wenyongh pushed a commit that referenced this issue Sep 10, 2024
- Implement TINY / STANDARD frame modes - tiny mode is only able to keep track on the IP
  and func idx, STANDARD mode provides more capabilities (parameters, stack pointer etc.).
- Implement FRAME_PER_FUNCTION / FRAME_PER_CALL modes - frame per function adds
  code at the beginning and at the end of each function for allocating / deallocating stack frame,
  whereas in per-call mode the frame is allocated before each call. The exception is call to
  the imported function, where frame-per-function mode also allocates the stack before the
  `call` instruction (as it can't instrument the imported function).

At the moment TINY + FRAME_PER_FUNCTION is automatically enabled in case GC and perf
profiling are disabled and `values` call stack feature is not requested. In all the other cases
STANDARD + FRAME_PER_CALL is used.

STANDARD + FRAME_PER_FUNCTION and TINY + FRAME_PER_CALL are currently not
implemented but possible, and might be enabled in the future.

ps. #3758
loganek added a commit to loganek/wasm-micro-runtime that referenced this issue Sep 10, 2024
Also add a script that converts instruction pointers to function indexes (or function names)

bytecodealliance#3758
loganek added a commit to loganek/wasm-micro-runtime that referenced this issue Sep 10, 2024
Also add a script that converts instruction pointers to function indexes (or function names)

bytecodealliance#3758
wenyongh pushed a commit that referenced this issue Sep 11, 2024
Also add a script that converts instruction pointers to function indexes (or function names).

#3758
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants