[RFC] virtio-blk storage via mmio using virtio-drivers (Qemu) #343
I'll rebase and retest everything, just wanted to put this out here already.
Wow, cool! We might also think about using IRQs once the COCONUT kernel has the infrastructure for it.
I haven't gone too deep in the codebase, but can't this be done by implementing |
Unfortunately not, because the devices (virtio-blk for example) also use MMIO accesses directly (to the configuration space), without going through the transport.
To be clear, there is an MMIO transport in the virtio-drivers crate. But it simply uses volatile reads and writes by default. My change here was to add MMIO access functions to the hardware abstraction trait, to be able to use explicit vmgexits instead of "plain" volatile accesses.
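A minimal sketch of the idea described above: a hardware abstraction trait whose MMIO accessors default to plain volatile accesses but can be overridden by a platform that has to trap to the hypervisor. The names `MmioHal`, `PlainHal`, `TrappingHal`, `trapped_count` and `read_status` are hypothetical and are not the virtio-drivers API; the override here only counts accesses as a stand-in for a real vmgexit.

```rust
use core::ptr;
use core::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical HAL trait (illustrative only, not the virtio-drivers trait).
pub trait MmioHal {
    /// Read a register. The default is a plain volatile load.
    ///
    /// # Safety
    /// `src` must be valid for a volatile read of `T`.
    unsafe fn mmio_read<T: Copy>(src: *const T) -> T {
        unsafe { ptr::read_volatile(src) }
    }

    /// Write a register. The default is a plain volatile store.
    ///
    /// # Safety
    /// `dst` must be valid for a volatile write of `T`.
    unsafe fn mmio_write<T: Copy>(dst: *mut T, value: T) {
        unsafe { ptr::write_volatile(dst, value) }
    }
}

/// Platform where MMIO is directly accessible: keeps the defaults.
pub struct PlainHal;
impl MmioHal for PlainHal {}

/// Platform that must route every access through an explicit exit.
pub struct TrappingHal;

static TRAPPED: AtomicUsize = AtomicUsize::new(0);

/// Number of accesses that went through the overridden path.
pub fn trapped_count() -> usize {
    TRAPPED.load(Ordering::Relaxed)
}

impl MmioHal for TrappingHal {
    unsafe fn mmio_read<T: Copy>(src: *const T) -> T {
        // Stand-in for the hypervisor call: count it, then do the load.
        TRAPPED.fetch_add(1, Ordering::Relaxed);
        unsafe { ptr::read_volatile(src) }
    }

    unsafe fn mmio_write<T: Copy>(dst: *mut T, value: T) {
        TRAPPED.fetch_add(1, Ordering::Relaxed);
        unsafe { ptr::write_volatile(dst, value) }
    }
}

/// Driver-side code stays generic over the HAL and never touches the
/// register directly.
pub fn read_status<H: MmioHal>(reg: &u32) -> u32 {
    // SAFETY: `reg` is a valid reference, hence valid for a read.
    unsafe { H::mmio_read(reg as *const u32) }
}
```

With this shape, both the transport and the device code can call `H::mmio_read` / `H::mmio_write`, so a single override covers configuration-space accesses too.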
3983c9d
to
89212e9
Compare
Not convinced this makes sense, so my qemu patch doesn't assign interrupts to these virtio-mmio transport slots. When using interrupts: What would that look like? Would SVSM (partly) emulate lapic and ioapic then, to make sure the guest OS can't tamper with the IRQ lines owned by SVSM?
Yes, the SVSM would own the HV-emulated APIC (or vAPIC in the TDX case) and emulate another X(2)APIC for the guest OS. The IOAPIC emulation remains on the host side. APIC support in SVSM is needed anyway for TDX to implement IPIs for TLB flushes. On the SNP side IRQ support is needed to mitigate the AHOI group of attacks.
kernel/src/sev/ghcb.rs
Outdated
fn read_buffer_as<T>(&mut self, offset: isize) -> Result<T, GhcbError>
where
    T: Sized + Copy,
{
    let size: isize = mem::size_of::<T>() as isize;

    if offset < 0 || offset + size > (GHCB_BUFFER_SIZE as isize) {
        return Err(GhcbError::InvalidOffset);
    }

    unsafe {
        let src = self
            .buffer
            .as_mut_ptr()
            .cast::<u8>()
            .offset(offset)
            .cast::<T>();
        Ok(*src)
    }
}
We do not need to use isize here for the offset. I understand this was perhaps done to mirror write_buffer() below, but we can do the right thing here (and in the caller), i.e., use usize, and thus add() instead of offset().
Also, there are some safety issues:
- We need to check that offset + size does not overflow (otherwise add()/offset() might cause UB).
- We need to check that src meets the alignment requirements for T. We should either add a check or use read_unaligned(). Otherwise this might cause UB.
Something like this:
fn read_buffer_as<T>(&mut self, offset: usize) -> Result<T, GhcbError>
where
    T: Sized + Copy,
{
    offset
        .checked_add(mem::size_of::<T>())
        .filter(|end| *end <= GHCB_BUFFER_SIZE)
        .ok_or(GhcbError::InvalidOffset)?;
    // SAFETY: we have verified that offset is within bounds and does not
    // overflow
    let src = unsafe { self.buffer.as_ptr().add(offset) };
    if src.align_offset(mem::align_of::<T>()) != 0 {
        return Err(GhcbError::InvalidOffset);
    }
    // SAFETY: we have verified the pointer is aligned, as well as within
    // bounds.
    unsafe { Ok(src.cast::<T>().read()) }
}
I believe these additional checks should not cause a noticeable performance hit. As a last resort, we could make this function (as well as write_buffer()) unsafe and have the caller ensure the safety requirements are met.
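For comparison, a minimal standalone sketch of the other option mentioned above, using read_unaligned() instead of an alignment check. `Buffer`, `BufError`, `BUFFER_SIZE` and the `AnyBitPattern` marker trait are hypothetical stand-ins for illustration, not the GHCB types; the marker trait is an added assumption so that the safe function only accepts types for which every byte pattern is a valid value.

```rust
use core::mem;

/// Hypothetical buffer size, standing in for the real constant.
pub const BUFFER_SIZE: usize = 64;

#[derive(Debug, PartialEq, Eq)]
pub enum BufError {
    InvalidOffset,
}

/// Marker for types that are valid for any bit pattern.
///
/// # Safety
/// Implementors must have no invalid bit patterns and no padding.
pub unsafe trait AnyBitPattern: Copy {}
unsafe impl AnyBitPattern for u8 {}
unsafe impl AnyBitPattern for u16 {}
unsafe impl AnyBitPattern for u32 {}
unsafe impl AnyBitPattern for u64 {}

pub struct Buffer {
    bytes: [u8; BUFFER_SIZE],
}

impl Buffer {
    pub fn new(bytes: [u8; BUFFER_SIZE]) -> Self {
        Self { bytes }
    }

    /// Read a `T` at `offset`, with no alignment requirement on `offset`.
    pub fn read_as<T: AnyBitPattern>(&self, offset: usize) -> Result<T, BufError> {
        // Reject offsets whose end overflows or lies outside the buffer.
        offset
            .checked_add(mem::size_of::<T>())
            .filter(|end| *end <= BUFFER_SIZE)
            .ok_or(BufError::InvalidOffset)?;
        // SAFETY: offset + size_of::<T>() <= BUFFER_SIZE, so the whole read
        // is in bounds; read_unaligned() does not require alignment, and
        // `T: AnyBitPattern` makes any byte content a valid `T`.
        Ok(unsafe { self.bytes.as_ptr().add(offset).cast::<T>().read_unaligned() })
    }
}
```

The trade-off is that an unaligned offset is accepted and read byte-wise instead of being reported as an error, which may or may not be the desired behaviour for a caller that expects aligned fields.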
Applied the suggestion, with one change:
unsafe { Ok(src.cast::<T>().read()) }
does not work. Coconut hangs.
This works:
unsafe { Ok(*(src.cast::<T>())) }
This sounds like UB. Can you try with read_volatile() instead? The compiler probably cannot infer that the buffer changes after a VMGEXIT.
Tried it. Still the same: hangs.
Also very strange: The last thing I see is "Validating memory", which happens way before
the virtio code (which is the only place calling this) runs.
One last try: could you update write_buffer so that it uses write_volatile() as well? E.g. something like:
diff --git a/kernel/src/sev/ghcb.rs b/kernel/src/sev/ghcb.rs
index 9a5fe44..682d2cf 100644
--- a/kernel/src/sev/ghcb.rs
+++ b/kernel/src/sev/ghcb.rs
@@ -386,7 +386,7 @@ impl GHCB {
fn write_buffer<T>(&mut self, data: &T, offset: isize) -> Result<(), GhcbError>
where
- T: Sized,
+ T: Copy,
{
let size: isize = mem::size_of::<T>() as isize;
@@ -401,9 +401,8 @@ impl GHCB {
.cast::<u8>()
.offset(offset)
.cast::<T>();
- let src = data as *const T;
- ptr::copy_nonoverlapping(src, dst, 1);
+ dst.write_volatile(*data);
}
Ok(())
Perhaps with both of them being volatile the compiler can make the right assumptions.
This is strange. I retried the approaches mentioned above (read, read_volatile, read_volatile + write_volatile, and the plain "cast" line) and now they all work (the virtio-blk test code runs).
Before, Coconut would hang reliably on all approaches except the one I mentioned. 🤷♂️
So, shall we use read_volatile and write_volatile to be safe? Or what would you recommend?
Have you tried both debug and release modes? Anyhow, yes, we should probably use the volatile versions to be safe.
I tried both, debug and release.
Using volatile accesses now.
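A minimal standalone sketch of the outcome of this thread: a write/read pair that both go through volatile accesses, with overflow, bounds and alignment checks on a usize offset. `SharedBuffer`, `BUFFER_SIZE`, `write_u64` and `read_u64` are hypothetical names for illustration, not the GHCB code; the sketch fixes the value type to u64 to keep it short.

```rust
use core::mem;

/// Hypothetical buffer size, standing in for the real constant.
pub const BUFFER_SIZE: usize = 64;

/// The 8-byte alignment of the struct makes "offset is a multiple of
/// align_of::<u64>()" equivalent to "the resulting pointer is aligned".
#[repr(C, align(8))]
pub struct SharedBuffer {
    bytes: [u8; BUFFER_SIZE],
}

impl SharedBuffer {
    pub const fn new() -> Self {
        Self { bytes: [0; BUFFER_SIZE] }
    }

    /// Return the offset if a u64 at that offset is in bounds and aligned.
    fn check(offset: usize) -> Option<usize> {
        let end = offset.checked_add(mem::size_of::<u64>())?;
        (end <= BUFFER_SIZE && offset % mem::align_of::<u64>() == 0).then_some(offset)
    }

    pub fn write_u64(&mut self, offset: usize, value: u64) -> Option<()> {
        let offset = Self::check(offset)?;
        // SAFETY: check() verified bounds and alignment; the pointer comes
        // from a unique borrow of the buffer.
        unsafe {
            self.bytes
                .as_mut_ptr()
                .add(offset)
                .cast::<u64>()
                .write_volatile(value);
        }
        Some(())
    }

    pub fn read_u64(&self, offset: usize) -> Option<u64> {
        let offset = Self::check(offset)?;
        // SAFETY: check() verified bounds and alignment.
        Some(unsafe { self.bytes.as_ptr().add(offset).cast::<u64>().read_volatile() })
    }
}
```

Volatile accesses keep the compiler from eliding the store or reusing a stale load; they do not by themselves say anything about what the hypervisor does with the buffer in between.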
kernel/src/virtio/mod.rs
Outdated
    }

    unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T {
        let paddr = PhysAddr::from((src as *const T) as u64);
Not sure I understand this, why are we not going through virt_to_phys()?
Since we can't access MMIO locations directly and have to go through the GHCB MMIO functions anyway, the MMIO area is not mapped.
Instead, the physical MMIO base address is cast to the VirtIOHeader struct. This only works because the struct members are never actually accessed directly. It only serves as a fancy way to calculate the correct offsets of the individual registers. (This is just how the crate does it.)
It should be possible to map the MMIO range, cast that as VirtIOHeader and then translate back to the paddr when doing the MMIO calls.
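A minimal sketch of the "struct as offset calculator" idea described above: a #[repr(C)] layout is used only to derive register offsets, and the resulting physical address is what would be handed to the MMIO call. `DemoHeader`, its fields and the two helper functions are hypothetical and do not reproduce the crate's VirtIOHeader; `offset_of!` requires Rust 1.77 or later.

```rust
use core::mem::offset_of;

/// Illustrative register block. It is never instantiated or dereferenced;
/// it exists only so the compiler can compute field offsets.
#[repr(C)]
pub struct DemoHeader {
    pub magic: u32,
    pub version: u32,
    pub device_id: u32,
    pub vendor_id: u32,
}

/// Physical address of a register, given the physical base of the block
/// and the field's offset within the layout struct.
pub fn register_addr(base: u64, field_offset: usize) -> u64 {
    base + field_offset as u64
}

/// Convenience helper for one specific register.
pub fn device_id_addr(base: u64) -> u64 {
    register_addr(base, offset_of!(DemoHeader, device_id))
}
```

Computing offsets this way avoids forming a reference to unmapped memory; the approach in the PR instead keeps the crate's struct-pointer style and translates the address when the MMIO call is made.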
Trying this now... virt_to_phys fails with "Invalid physical address" (Copy&paste error btw, it is printing the vaddr there...)
What is the correct way to create the mapping, btw?
For a temporary mapping use PerCPUPageMappingGuard. For a permanent mapping you would probably need to call into this_cpu().get_pgtable().map_region(..).
Mapping the MMIO range now works as well:
diff --git a/kernel/src/virtio/mod.rs b/kernel/src/virtio/mod.rs
index 0f9d8b8..63da985 100644
--- a/kernel/src/virtio/mod.rs
+++ b/kernel/src/virtio/mod.rs
@@ -4,7 +4,7 @@
//
// Author: Oliver Steffen <[email protected]>
-use core::ptr::NonNull;
+use core::ptr::{addr_of, NonNull};
use virtio_drivers::{
device::blk::{VirtIOBlk, SECTOR_SIZE},
@@ -15,9 +15,7 @@ use virtio_drivers::{
};
use crate::{
- address::PhysAddr,
- cpu,
- mm::{alloc::*, page_visibility::*, *},
+ address::{PhysAddr, VirtAddr}, cpu::{self, percpu::this_cpu}, mm::{alloc::*, page_visibility::*, *}
};
struct SvsmHal;
@@ -107,7 +105,7 @@ unsafe impl virtio_drivers::Hal for SvsmHal {
}
unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T {
- let paddr = PhysAddr::from((src as *const T) as u64);
+ let paddr = this_cpu().get_pgtable().phys_addr(VirtAddr::from(addr_of!(*src))).unwrap();
cpu::ghcb::current_ghcb()
.mmio_read::<T>(paddr)
@@ -115,7 +113,7 @@ unsafe impl virtio_drivers::Hal for SvsmHal {
}
unsafe fn mmio_write<T: Sized + Copy>(dst: &mut T, v: T) {
- let paddr = PhysAddr::from((dst as *mut T) as u64);
+ let paddr = this_cpu().get_pgtable().phys_addr(VirtAddr::from(addr_of!(*dst))).unwrap();
cpu::ghcb::current_ghcb()
.mmio_write::<T>(paddr, &v)
@@ -127,8 +125,12 @@ unsafe impl virtio_drivers::Hal for SvsmHal {
pub fn test_mmio() {
static MMIO_BASE: u64 = 0xfef03000; // Hard-coded in Qemu
+ let paddr = PhysAddr::from(MMIO_BASE);
+ let mem = PerCPUPageMappingGuard::create_4k(paddr).expect("Error mapping MMIO region");
+
+ log::info!("mapped MMIO range {:016x} to vaddr {:016x}", MMIO_BASE, mem.virt_addr());
// Test code below taken from virtio-drivers aarch64 example.
- let header = NonNull::new(MMIO_BASE as *mut VirtIOHeader).unwrap();
+ let header = NonNull::new(mem.virt_addr().as_mut_ptr() as *mut VirtIOHeader).unwrap();
match unsafe { MmioTransport::<SvsmHal>::new(header) } {
Err(e) => log::warn!(
"Error creating VirtIO MMIO transport at {:016x}: {}",
kernel/src/virtio/mod.rs
Outdated
    }

    unsafe fn mmio_write<T: Sized>(dst: &mut T, v: T) {
        let paddr = PhysAddr::from((dst as *mut T) as u64);
Same here, why are we converting a regular reference to a PhysAddr directly?
see coconut-svsm#343 (review) Signed-off-by: Oliver Steffen <[email protected]>
see coconut-svsm#343 (review) Signed-off-by: Oliver Steffen <[email protected]> rd-vol wr_vol
135aba9
to
9455857
Compare
@@ -384,9 +386,31 @@ impl GHCB {
        Ok(())
    }

    fn read_buffer_as<T>(&mut self, offset: usize) -> Result<T, GhcbError>
    where
        T: Sized + Copy,
Sized is opt-out, so it is always implied unless something like ?Sized is used.
Furthermore, Copy implies Sized so this is doubly-redundant :)
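A small illustration of the two remarks above, using hypothetical function names: the bound `T: Copy` already behaves like `T: Sized + Copy`, and only an explicit `?Sized` lifts the implicit Sized bound.

```rust
use core::mem;

/// `T: Copy` is enough: type parameters are `Sized` by default, and
/// `Copy` requires `Sized` through its `Clone` supertrait anyway.
pub fn size_with_copy<T: Copy>(_value: &T) -> usize {
    mem::size_of::<T>()
}

/// Spelling out `Sized` compiles to exactly the same bound set.
pub fn size_with_redundant<T: Sized + Copy>(_value: &T) -> usize {
    mem::size_of::<T>()
}

/// Opting out with `?Sized` allows unsized types such as `str` or `[u16]`;
/// size_of::<T>() is then unavailable, so size_of_val() is used instead.
pub fn size_maybe_unsized<T: ?Sized>(value: &T) -> usize {
    mem::size_of_val(value)
}
```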
// TODO: allow more than one page.
// This currently works, because in "modern" virtio mode the crate only allocates
// one page at a time.
assert!(pages == 1);
I think just using allocate_pages() with some extra logic to zero them and make them all shared should work, right?
b257b38
to
fa4dffa
Compare
Use volatile access to ensure writes are not elided or reordered and data actually reaches the buffer. Signed-off-by: Oliver Steffen <[email protected]>
Add GHCB::mmio_read() and GHCB::mmio_write() to perform MMIO accesses via vmgexit. Signed-off-by: Oliver Steffen <[email protected]>
Demonstrate the use of the virtio-drivers crate, accessing a virtio-blk device via the virtio-mmio transport. Signed-off-by: Oliver Steffen <[email protected]>
Rebased onto f23151f from Sept 16. This now uses the latest virtio-drivers crate, see osteffenrh/virtio-drivers#1
Feel free to let me know if you need any help with that.
Here is some discussion on how this could be done: google/zerocopy#1919
Add support for virtio-blk storage devices using the virtio-mmio transport.
MMIO accesses are done via explicit vmgexits. Interrupts are not used (-> polling).
The virtio-drivers crate from the rcore-os project is used. It required some modifications to support custom MMIO access functions.
Crate License: MIT.
This PR requires a patched version of Qemu, adding virtio-mmio slots to the Q35 machine model.
ToDo
For persistent storage to be usable we still need to:
VMPL!=0
Build & Run
Building & Preparations
Build the patched Qemu as usual with IGVM support.
Build Coconut as usual.
Create an empty disk image for Coconut to use
Launching Qemu
Add
and
to your Qemu command line.
Full example:
A short write-read test will be run on the virtio-blk device, followed by a
simple write speed benchmark. This requires external time measurement.
Example output (timestamps added externally):
Write Speed
Write speed depends on the block size used.
From the output above we get: