[PAL/Linux-SGX] Add AEX-Notify flows in exception handling #2037
base: dimakuv/aex-notify-part4
This commit adds the AEX-Notify flows inside the enclave.
The stage-1 signal handler is augmented as follows when AEX-Notify is enabled: manually restore SSA[0] context, invoke the EDECCSSA instruction instead of EEXIT (to go from SSA[1] to SSA[0] without exiting the enclave) and finally jump to SSA[0].GPRSGX.RIP to resume enclave execution (it will resume in stage-2 signal handler).
The stage-2 signal handler is augmented as follows: set bit 0 of SSA[0].GPRSGX.AEXNOTIFY (so that AEX-Notify starts working again for this thread), then apply AEX-Notify mitigations and finally restore regular enclave execution.
This commit does not add any real AEX-Notify mitigations. Instead, we count the number of AEX events reported inside the SGX enclave and print this number on enclave termination (if log level is at least "warning").
Note that the current implementation of AEX-Notify does not use the checkpoint mechanism described in the official AEX-Notify whitepaper. That checkpoint mechanism allows coalescing multiple AEX events that occur during the execution of mitigations. This saves some CPU cycles and some signal-handling stack space, but we leave implementing this optimization as future work.
Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
a discussion (no related file):
Debug failure with GDB support (try LibOS regression tests that use GDB).
Reviewable status: 0 of 6 files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)
a discussion (no related file):
Debug some failures with EDMM (try LibOS regression tests).
Fixed GDB issue. Fixed a SIGSEGV data race on thread termination (ERESUME morphs into EENTER but then performs EEXIT). Added AEXNOTIFY envvar to LibOS regression tests (but only to a subset from `manifest.template`, simply because changing all manifest template files would be a huge git diff).
Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Debug failure with GDB support (try LibOS regression tests that use GDB).
Done. AEX-Notify is not really compatible with GDB, see e.g. the official whitepaper: https://cdrdv2-public.intel.com/736463/aex-notify-white-paper-public.pdf, Section 8.
libos/test/regression/manifest.template
line 27 at r2 (raw file):
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.use_exinfo = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.experimental_enable_aex_notify = {{ 'true' if env.get('AEXNOTIFY', '0') == '1' else 'false' }}
For now added only in this manifest file, but technically it should be added in all files (similar to `sgx.edmm_enable`), and at least one CI pipeline should run with the `AEXNOTIFY=1` envvar.
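As a rough illustration of how the three template expressions above behave, the plain-Python sketch below mirrors the conditional used in the manifest. The helper names `manifest_flag` and `render_sgx_options` are hypothetical and not part of Gramine; the real rendering happens in the manifest templating step, which this sketch does not reproduce.

```python
def manifest_flag(env, var):
    """Mirror the template expression: 'true' only if the variable is exactly '1'."""
    return 'true' if env.get(var, '0') == '1' else 'false'


def render_sgx_options(env):
    """Return the three SGX options from the excerpt for a given environment dict."""
    return {
        # Both EDMM-related options are driven by the same EDMM envvar.
        'sgx.edmm_enable': manifest_flag(env, 'EDMM'),
        'sgx.use_exinfo': manifest_flag(env, 'EDMM'),
        # AEX-Notify is opt-in and stays disabled unless AEXNOTIFY=1.
        'sgx.experimental_enable_aex_notify': manifest_flag(env, 'AEXNOTIFY'),
    }


if __name__ == '__main__':
    for name, value in render_sgx_options({'EDMM': '1', 'AEXNOTIFY': '1'}).items():
        print(f'{name} = {value}')
```

Note that any value other than the string `'1'` (including an unset variable) leaves the option at `false`, which is why the default test run is unaffected.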
pal/src/host/linux-sgx/host_exception.c
line 397 at r2 (raw file):
noreturn void fail_on_morphed_eresume(void) {
    log_error("Bug in AEX-Notify flows: ERESUME morphed into EENTER but then the enclave performed "
              "EEXIT instead of EDECCSSA. Please debug.");
This particular bug (data race) made my brain boil for two days. I definitely want to keep this diagnostic for the future, in case we ever hit more bugs like this.
Explaining this data race is hard, but basically:
- AEX-Notify now allows ERESUME to morph into EENTER.
- EENTER may be exited via EDECCSSA (assumed in AEX-Notify) or via EEXIT (legacy non-AEX-Notify flows).
- The above implies that ERESUME can end up in EEXIT, in which case the enclave jumps out to the "exit target" that, by Gramine's convention for EEXIT, is specified in the RDX register.
- But Gramine also assumes that ERESUME never returns, and this assumption is now broken -> data race! Note that previously the RDX register held random garbage upon ERESUME (which made sense, as ERESUME would never use RDX and would never return).
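The chain of implications above can be put into a toy model. This is only a sketch of the reasoning, not Gramine code: `Outcome`, `simulate_eresume` and its parameters are hypothetical names, and the idea that the host loads a known diagnostic target (such as `fail_on_morphed_eresume`) into RDX before ERESUME is an assumption inferred from this comment rather than something the excerpt shows.

```python
from enum import Enum


class Outcome(Enum):
    RESUMED_IN_ENCLAVE = "execution continues inside the enclave"
    HOST_AT_KNOWN_TARGET = "host continues at a known exit target"
    HOST_AT_GARBAGE = "host jumps to a garbage address"


def simulate_eresume(morphs_into_eenter, enclave_leaves_via, rdx_is_valid_target):
    """Toy model of where control ends up after the host executes ERESUME."""
    if not morphs_into_eenter:
        # Legacy behavior: ERESUME restores the interrupted context and never returns.
        return Outcome.RESUMED_IN_ENCLAVE
    # With AEX-Notify, ERESUME may behave like EENTER, so a handler runs first.
    if enclave_leaves_via == "EDECCSSA":
        # Expected AEX-Notify flow: SSA[1] -> SSA[0] without exiting the enclave.
        return Outcome.RESUMED_IN_ENCLAVE
    if enclave_leaves_via == "EEXIT":
        # Legacy flow: the enclave exits to the target held in RDX.
        if rdx_is_valid_target:
            return Outcome.HOST_AT_KNOWN_TARGET
        return Outcome.HOST_AT_GARBAGE
    raise ValueError(f"unknown exit instruction: {enclave_leaves_via}")


if __name__ == "__main__":
    # The problematic case: morphed ERESUME, legacy EEXIT, RDX never initialized.
    print(simulate_eresume(True, "EEXIT", rdx_is_valid_target=False).value)
```

In this model the only bad outcome is the combination of a morphed ERESUME, an EEXIT, and an uninitialized RDX, which matches the case the diagnostic message is meant to catch.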
pal/src/host/linux-sgx/pal_exception.c
line 61 at r2 (raw file):
MB();
SET_ENCLAVE_TCB(ready_for_aex_notify, 0UL);
MB();
I feel like this could be implemented in a cleaner way, maybe re-using some of the existing variables... I am definitely not proud of this `stopping_aex_notify` helper variable, but this seemed easy.
Fixed EDMM issue. Turned out to be a case of too many nested signal handlers inside Gramine's SGX PAL, which overflowed the SGX enclave signal stack.
Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
Reviewable status: 0 of 11 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Debug some failures with EDMM (try LibOS regression tests).
Done
This commit adds conditional AEX-Notify enablement to all Gramine tests. Run tests e.g. like this (on a machine that supports AEX-Notify both in hardware and in Linux kernel):
$ EDMM=1 AEXNOTIFY=1 SGX=1 gramine-test pytest
Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
Reviewable status: 0 of 61 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners
a discussion (no related file):
In my local tests, everything seems to work. I stress-tested all our Gramine LibOS and PAL tests, with and without EDMM.
To merge this PR, our CI should have AEX-Notify-supporting workers, and the `AEXNOTIFY=1` envvar must be set in at least one Jenkins pipeline. This enablement must be done similarly to the `EDMM=1` one.
I am putting a blocking comment on this so that we do not forget to update the CI.
libos/test/regression/manifest.template
line 27 at r2 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
For now added only in this manifest file, but technically it should be added in all files (similar to `sgx.edmm_enable`), and at least one CI pipeline should run with the `AEXNOTIFY=1` envvar.
Done, now added everywhere. The enablement is similar to EDMM (with its `EDMM=1` envvar).
Reviewable status: 0 of 61 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners
a discussion (no related file):
Quick perf numbers for Gramine built in Release mode on Ubuntu 24.04 with Linux v6.11.
Not sure if they are useful, just wanted to post here. They show that the current AEX-Notify implementation (with a dummy mitigation) has a small overhead.
make clean; AEXNOTIFY=0 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 200.36s
make clean; AEXNOTIFY=0 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 138.00s
make clean; AEXNOTIFY=1 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 208.25s
make clean; AEXNOTIFY=1 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 141.50s
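To put the four timings above in relative terms, here is a small sketch that computes the slowdown of `AEXNOTIFY=1` against the matching `AEXNOTIFY=0` run. The numbers are copied from the runs above; the dictionary layout and the `aex_notify_overhead` helper are hypothetical. These are single runs of a test suite, so the result should be read as a rough indication and not as a benchmark.

```python
# Wall-clock times in seconds, keyed by (AEXNOTIFY, EDMM), copied from the runs above.
timings = {
    (0, 0): 200.36,
    (0, 1): 138.00,
    (1, 0): 208.25,
    (1, 1): 141.50,
}


def aex_notify_overhead(timings, edmm):
    """Relative slowdown of AEXNOTIFY=1 versus AEXNOTIFY=0 for a given EDMM setting."""
    baseline = timings[(0, edmm)]
    with_aex_notify = timings[(1, edmm)]
    return (with_aex_notify - baseline) / baseline


for edmm in (0, 1):
    overhead = aex_notify_overhead(timings, edmm)
    print(f"EDMM={edmm}: {overhead:.1%} slower with AEXNOTIFY=1")
```

With these inputs the slowdown comes out at roughly 3.9% without EDMM and roughly 2.5% with EDMM.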
Description of the changes
Part 5 in AEX-Notify series.
This PR adds the AEX-Notify flows inside the enclave.
The stage-1 signal handler is augmented as follows when AEX-Notify is enabled: manually restore SSA[0] context, invoke the EDECCSSA instruction instead of EEXIT (to go from SSA[1] to SSA[0] without exiting the enclave) and finally jump to SSA[0].GPRSGX.RIP to resume enclave execution (it will resume in stage-2 signal handler).
The stage-2 signal handler is augmented as follows: set bit 0 of SSA[0].GPRSGX.AEXNOTIFY (so that AEX-Notify starts working again for this thread), then apply AEX-Notify mitigations and finally restore regular enclave execution.
This PR does not add any real AEX-Notify mitigations. Instead, we count the number of AEX events reported inside the SGX enclave and print this number on enclave termination (if log level is at least "warning").
Note that the current implementation of AEX-Notify does not use the checkpoint mechanism described in the official AEX-Notify whitepaper. That checkpoint mechanism allows coalescing multiple AEX events that occur during the execution of mitigations. This saves some CPU cycles and some signal-handling stack space, but we leave implementing this optimization as future work.
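The order of steps in the two handlers, together with the dummy "count AEX events" mitigation, can be sketched as a toy model. This is not the enclave code (which is C and assembly in the SGX PAL); `ToyThread`, `stage1_handler`, `stage2_handler` and `handle_aex` are hypothetical names, and the model simply assumes that bit 0 of SSA[0].GPRSGX.AEXNOTIFY is clear when the handlers start, since the description only states that stage 2 sets it again.

```python
AEXNOTIFY_ENABLE_BIT = 1 << 0  # bit 0 of SSA[0].GPRSGX.AEXNOTIFY


class ToyThread:
    """Minimal per-thread state for the model (not Gramine's real TCB layout)."""

    def __init__(self):
        self.ssa0_aexnotify = 0  # assumption: bit is clear while the handlers run
        self.aex_events = 0      # counter used by the dummy mitigation
        self.trace = []          # human-readable record of the steps taken


def stage1_handler(thread):
    # Stage 1 runs on SSA[1] and must get back to SSA[0] without leaving the enclave.
    thread.trace.append("restore SSA[0] context")
    thread.trace.append("EDECCSSA (SSA[1] -> SSA[0], no enclave exit)")
    thread.trace.append("jump to SSA[0].GPRSGX.RIP")


def stage2_handler(thread):
    # Re-arm AEX-Notify for this thread, then run the (dummy) mitigation.
    thread.ssa0_aexnotify |= AEXNOTIFY_ENABLE_BIT
    thread.aex_events += 1
    thread.trace.append("restore regular enclave execution")


def handle_aex(thread):
    stage1_handler(thread)
    stage2_handler(thread)


if __name__ == "__main__":
    thread = ToyThread()
    handle_aex(thread)
    handle_aex(thread)
    print(f"AEX events reported inside the enclave: {thread.aex_events}")
```

The model handles each AEX event separately, which corresponds to the non-coalescing behavior described above; the checkpoint mechanism from the whitepaper is deliberately left out.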
See also related PRs and discussions:
Related documentation:
Closes #1530
Closes #1531
How to test this PR?
AEX-Notify is enabled in all LibOS/PAL test manifests if the `AEXNOTIFY=1` environment variable is set.