[PAL/Linux-SGX] Add AEX-Notify flows in exception handling #2037

Open
wants to merge 4 commits into
base: dimakuv/aex-notify-part4

@dimakuv dimakuv commented Oct 16, 2024

Description of the changes

Part 5 in AEX-Notify series.

This PR adds the AEX-Notify flows inside the enclave.

The stage-1 signal handler is augmented as follows when AEX-Notify is enabled: manually restore SSA[0] context, invoke the EDECCSSA instruction instead of EEXIT (to go from SSA[1] to SSA[0] without exiting the enclave) and finally jump to SSA[0].GPRSGX.RIP to resume enclave execution (it will resume in stage-2 signal handler).

The stage-2 signal handler is augmented as follows: set bit 0 of SSA[0].GPRSGX.AEXNOTIFY (so that AEX-Notify starts working again for this thread), then apply AEX-Notify mitigations and finally restore regular enclave execution.

This PR does not add any real AEX-Notify mitigations. Instead, we count the number of AEX events reported inside the SGX enclave and print this number on enclave termination (if log level is at least "warning").

Note that the current implementation of AEX-Notify does not use the checkpoint mechanism described in the official AEX-Notify whitepaper. That mechanism allows coalescing multiple AEX events that occur while mitigations are executing. Coalescing would save some CPU cycles and some signal-handling stack space, but we leave this optimization as future work.
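
The stage-1/stage-2 flow described above can be sketched as a small state machine. This is only a toy model, not Gramine code: all `toy_*` names are hypothetical, real SSA frames and instructions are replaced by plain fields, and the model assumes that notifications stay masked between stage 1 and the point where stage 2 sets bit 0 of SSA[0].GPRSGX.AEXNOTIFY again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one enclave thread. All names here are hypothetical. */
struct toy_thread {
    unsigned cssa;       /* index of the current SSA frame */
    bool aexnotify_bit;  /* models bit 0 of SSA[0].GPRSGX.AEXNOTIFY */
    uint64_t aex_count;  /* AEX events counted inside the enclave */
};

/* An AEX saves the interrupted context into SSA[cssa] and increments CSSA. */
void toy_aex(struct toy_thread* t) {
    t->cssa++;
}

/* Host-side ERESUME. Returns true when it "morphs" into an EENTER-like entry
 * (the enclave handler runs and CSSA is left unchanged); otherwise it pops
 * the SSA frame as in the legacy flow. */
bool toy_eresume(struct toy_thread* t) {
    assert(t->cssa > 0);
    if (t->aexnotify_bit)
        return true;
    t->cssa--;
    return false;
}

/* Stage 1: go from SSA[1] back to SSA[0] without leaving the enclave (the
 * role of EDECCSSA). Assumption of this model: notifications are masked
 * until stage 2 re-arms them. */
void toy_stage1(struct toy_thread* t) {
    assert(t->cssa == 1);
    t->aexnotify_bit = false;
    t->cssa--;
}

/* Stage 2: re-arm AEX-Notify, then run the dummy "mitigation", which only
 * counts the event. */
void toy_stage2(struct toy_thread* t) {
    assert(t->cssa == 0);
    t->aexnotify_bit = true;
    t->aex_count++;
}

/* One full AEX round trip; returns true if the notify path was taken. */
bool toy_handle_one_aex(struct toy_thread* t) {
    toy_aex(t);
    if (!toy_eresume(t))
        return false;
    toy_stage1(t);
    toy_stage2(t);
    return true;
}
```

Calling `toy_handle_one_aex()` on a thread whose `aexnotify_bit` is set takes the notify path and increments `aex_count`; on a thread with the bit cleared it follows the legacy ERESUME path and counts nothing.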

See also related PRs and discussions:

Related documentation:

Closes #1530
Closes #1531

How to test this PR?

AEX-Notify is enabled in all LibOS/PAL test manifests if AEXNOTIFY=1 environment variable is set.



This commit adds the AEX-Notify flows inside the enclave.

The stage-1 signal handler is augmented as follows when AEX-Notify is
enabled: manually restore SSA[0] context, invoke the EDECCSSA
instruction instead of EEXIT (to go from SSA[1] to SSA[0] without
exiting the enclave) and finally jump to SSA[0].GPRSGX.RIP to resume
enclave execution (it will resume in stage-2 signal handler).

The stage-2 signal handler is augmented as follows: set bit 0 of
SSA[0].GPRSGX.AEXNOTIFY (so that AEX-Notify starts working again for
this thread), then apply AEX-Notify mitigations and finally restore
regular enclave execution.

This commit does not add any real AEX-Notify mitigations. Instead, we
count the number of AEX events reported inside the SGX enclave and print
this number on enclave termination (if log level is at least "warning").

Note that the current implementation of AEX-Notify does not use the
checkpoint mechanism described in the official AEX-Notify whitepaper.
That checkpoint mechanism allows coalescing multiple AEX events that
occur during the execution of mitigations. This would save some CPU
cycles and some signal-handling stack space, but we leave implementing
this optimization as future work.

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>

@dimakuv dimakuv left a comment


Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)

a discussion (no related file):
Debug failure with GDB support (try LibOS regression tests that use GDB).



@dimakuv dimakuv left a comment


Reviewable status: 0 of 6 files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)

a discussion (no related file):
Debug some failures with EDMM (try LibOS regression tests).


Fixed GDB issue. Fixed a SIGSEGV data race on thread termination
(ERESUME morphs into EENTER but then performs EEXIT). Added AEXNOTIFY
envvar to LibOS regression tests (but only to a subset from
`manifest.template`, simply because changing all manifest template files
would be a huge git diff).

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>

@dimakuv dimakuv left a comment


Reviewable status: 0 of 11 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Debug failure with GDB support (try LibOS regression tests that use GDB).

Done. AEX-Notify is not really compatible with GDB, see e.g. the official whitepaper: https://cdrdv2-public.intel.com/736463/aex-notify-white-paper-public.pdf, Section 8.



libos/test/regression/manifest.template line 27 at r2 (raw file):

sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.use_exinfo = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.experimental_enable_aex_notify = {{ 'true' if env.get('AEXNOTIFY', '0') == '1' else 'false' }}

For now added only in this manifest file, but technically should add in all files (similar to sgx.edmm_enable) and have at least one CI pipeline with AEXNOTIFY=1 envvar.


pal/src/host/linux-sgx/host_exception.c line 397 at r2 (raw file):

noreturn void fail_on_morphed_eresume(void) {
    log_error("Bug in AEX-Notify flows: ERESUME morphed into EENTER but then the enclave performed "
              "EEXIT instead of EDECCSSA. Please debug.");

This particular bug (a data race) made my brain boil for two days. I definitely want to keep these diagnostics for the future, in case we ever have more bugs like this.

Explaining this data race is hard, but basically:

  • AEX-Notify now allows ERESUME to morph into EENTER.
  • EENTER may be exited via EDECCSSA (assumed in AEX-Notify) or via EEXIT (legacy non-AEX-Notify flows).
  • The above implies that ERESUME can end up in EEXIT, and the enclave should jump out to the "exit target" that by our Gramine convention is specified in RDX reg (I mean the Gramine convention for EEXIT).
  • But Gramine also assumes that ERESUME never returns, and this assumption is now broken -> data race! Note that previously the RDX register held random garbage upon ERESUME (which was fine, as ERESUME would never use RDX and would never return).
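
The failure mode in the list above can be sketched with a toy model. It assumes that the remedy is to load a well-defined diagnostic exit target before ERESUME, which the comment implies but does not spell out. All `toy_*` names are hypothetical; the function-pointer parameter plays the role of RDX, and the diagnostic returns an error code here, whereas the real `fail_on_morphed_eresume()` is `noreturn`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names are Gramine's. */
enum toy_leave { TOY_LEAVE_EDECCSSA, TOY_LEAVE_EEXIT };
enum toy_result { TOY_RESUMED_IN_ENCLAVE, TOY_BUG_DETECTED };

typedef enum toy_result (*toy_exit_target_t)(void);

/* Diagnostic landing pad, playing the role of fail_on_morphed_eresume(). */
enum toy_result toy_fail_on_morphed_eresume(void) {
    return TOY_BUG_DETECTED;
}

/* Models an ERESUME that morphed into EENTER. 'exit_target' stands for the
 * value the host left in RDX before issuing ERESUME. */
enum toy_result toy_morphed_eresume(toy_exit_target_t exit_target, enum toy_leave how) {
    if (how == TOY_LEAVE_EDECCSSA) {
        /* Expected AEX-Notify path: execution continues inside the enclave
         * and the exit target is never used. */
        return TOY_RESUMED_IN_ENCLAVE;
    }
    /* Legacy EEXIT path: control jumps to the exit target. With a
     * well-defined target the bug is reported instead of jumping to
     * whatever garbage happened to be in the register. */
    assert(exit_target != NULL);
    return exit_target();
}
```

With the diagnostic installed as the exit target, an unexpected EEXIT after a morphed ERESUME is reported deterministically instead of depending on stale register contents.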

pal/src/host/linux-sgx/pal_exception.c line 61 at r2 (raw file):

    MB();
    SET_ENCLAVE_TCB(ready_for_aex_notify, 0UL);
    MB();

I feel like this could be implemented in a cleaner way, maybe re-using some of the existing variables... I am definitely not proud of this stopping_aex_notify helper variable, but this seemed easy.

Fixed EDMM issue. Turned out to be a case of too many nested signal
handlers inside Gramine's SGX PAL, which overflowed the SGX enclave
signal stack.

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>

@dimakuv dimakuv left a comment


Reviewable status: 0 of 11 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Debug some failures with EDMM (try LibOS regression tests).

Done


This commit adds conditional AEX-Notify enablement to all Gramine tests.

Run tests e.g. like this (on a machine that supports AEX-Notify both in
hardware and in Linux kernel):

    $ EDMM=1 AEXNOTIFY=1 SGX=1 gramine-test pytest

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>

@dimakuv dimakuv left a comment


Reviewable status: 0 of 61 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

a discussion (no related file):
In my local tests, everything seems to work. I stress-tested all our Gramine LibOS and PAL tests, with and without EDMM.

To merge this PR, our CI should have AEX-Notify-supporting workers, and AEXNOTIFY=1 envvar must be set in at least one Jenkins pipeline. This enablement must be done similarly to the EDMM=1 one.

I have put a blocking comment on this so that we don't forget to update the CI.



libos/test/regression/manifest.template line 27 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

For now added only in this manifest file, but technically should add in all files (similar to sgx.edmm_enable) and have at least one CI pipeline with AEXNOTIFY=1 envvar.

Done, now added everywhere. The enablement is similar to EDMM (with its EDMM=1 envvar).


@dimakuv dimakuv left a comment


Reviewable status: 0 of 61 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

a discussion (no related file):
Quick perf numbers for Gramine built in Release mode on Ubuntu 24.04 with Linux v6.11.

Not sure if they are useful, just wanted to post them here. They show that the current AEX-Notify implementation (with the dummy mitigation) has a small overhead: about 4% without EDMM and about 2.5% with EDMM in the runs below.

  • make clean; AEXNOTIFY=0 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 200.36s
  • make clean; AEXNOTIFY=0 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 138.00s
  • make clean; AEXNOTIFY=1 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 208.25s
  • make clean; AEXNOTIFY=1 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 141.50s
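
For reference, the relative overhead can be derived from such timings with a small helper. This is a sketch only; `overhead_percent` is a hypothetical name and not part of Gramine or its test tooling.

```c
#include <assert.h>

/* Relative slowdown of 'with_s' over 'base_s', in percent.
 * Both arguments are wall-clock durations in seconds. */
double overhead_percent(double base_s, double with_s) {
    assert(base_s > 0.0);
    return (with_s - base_s) / base_s * 100.0;
}
```

Calling it with the AEXNOTIFY=0 and AEXNOTIFY=1 timings of the same EDMM setting gives the relative cost of AEX-Notify for that configuration. These are single runs, so the result is indicative rather than a benchmark.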
