[RFC] virtio-blk storage via mmio using virtio-drivers (Qemu) #343

osteffenrh · 2024-05-14T14:15:24Z

Add support for virtio-blk storage devices using the virtio-mmio transport.
MMIO accesses are done via explicit vmgexits. Interrupts are not used (-> polling).

The virtio-drivers crate from the rcore-os project is used. It required some modifications to support custom MMIO access functions.
Crate License: MIT.

This PR requires a patched version of Qemu , adding virtio-mmio slots to the Q35 machine
model.

ToDo

For persistent storage to be usable we still need to:

Copy virtio-drivers into the repo
Set up some basic Fuzzing
Improve memory management:
- Allow buffers > one page
- Allocate buffers in a smarter/more compact way?
Add block device layer to Coconut and integrate vitio-blk into it
Add encryption layer
Add a simple file system on top of the block layer. Choices?
Block access to the MMIO ranges for VMPL!=0
Supply MMIO base address via IGVM file

Build & Run

Building & Preparations

Build the patched Qemu as usual with IGVM support.

git clone https://github.com/osteffenrh/qemu.git --branch virtio-pr --single-branch --depth=1
cd qemu
$ ./configure --prefix=$HOME/bin/qemu-svsm/ --target-list=x86_64-softmmu --enable-igvm
ninja -C build/
make install

Build Coconut as usual.

Create an empty disk image for Coconut to use

truncate -s16M mmio.raw

Launching Qemu

Add

-machine x-svsm-virtio-mmio=on

and

-global virtio-mmio.force-legacy=false \
-drive file="mmio.raw",format=raw,if=none,id=mmio \
-device virtio-blk-device,drive=mmio

to your Qemu command line.

Full example:

$HOME/bin/qemu-svsm/bin/qemu-system-x86_64 \
  -enable-kvm \
  -cpu EPYC-v4 \
  -smp 4 \
  -machine q35,confidential-guest-support=sev0,memory-backend=mem0 \
  -machine x-svsm-virtio-mmio=on \
  -object memory-backend-memfd,id=mem0,size=4G,share=true,prealloc=false,reserve=false \
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,init-flags=1,igvm-file="bin/coconut-qemu.igvm" \
  -no-reboot \
  -drive file="disk.qcow2",if=none,id=disk0,format=qcow2,snapshot=on \
  -device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true \
  -device scsi-hd,drive=disk0 \
  -nographic \
  -serial stdio \
  -monitor none \
  -global virtio-mmio.force-legacy=false \
  -drive file="mmio.raw",format=raw,if=none,id=mmio \
  -device virtio-blk-device,drive=mmio

A short write-read test will be run on the virtio-blk device, followed by a
simple write speed benchmark. This requires external time measurement.

Example output (timestamps added externally):

   0.0011s   [SVSM] Virtio test trace
   0.0004s   [SVSM] Detected virtio MMIO device with vendor id 0x554D4551, device type Block, version Modern
   0.0006s   [SVSM] config: 0xfef03100
   0.0007s   [SVSM] found a block device of size 131072KB
   0.0173s   [SVSM] Write+Read Test Start
   0.0002s   [SVSM] Write+Read Test End
   0.0005s   [SVSM] Write Benchmark Start
  65.2324s   [SVSM] virtio-blk start write. Block size = 512
   0.0005s   [SVSM] virtio-blk end write. Block size = 512, Total bytes = 134217728
   8.1706s   [SVSM] virtio-blk start write. Block size = 4096
   0.0003s   [SVSM] virtio-blk end write. Block size = 4096, Total bytes = 134217728
   0.0008s   [SVSM] Write Benchmark End
   0.0005s   [SVSM] Virtio test end

Write Speed

Write speed depends on the block size used.
From the output above we get:

512Byte writes: 1.96MB/s
4096Byte writes: 15.7MB/s

osteffenrh · 2024-05-14T14:36:33Z

I'll rebase and retest everything, just wanted to put this out here already.

joergroedel · 2024-05-14T15:58:01Z

Wow, cool! We might also think about using IRQs once the COCONUT kernel has the infrastructure for it.

00xc · 2024-05-15T10:35:08Z

Add support for virtio-blk storage devices using the virtio-mmio transport. MMIO accesses are done via explicit vmgexits. Interrupts are not used (-> polling).

The virtio-drivers crate from the rcore-os project is used. It required some modifications to support custom MMIO access functions. Crate License: MIT.

I haven't gone too deep in the codebase, but can't this be done by implementing Transport for a new MmioTransport-like type?

osteffenrh · 2024-05-15T12:30:45Z

Add support for virtio-blk storage devices using the virtio-mmio transport. MMIO accesses are done via explicit vmgexits. Interrupts are not used (-> polling).
The virtio-drivers crate from the rcore-os project is used. It required some modifications to support custom MMIO access functions. Crate License: MIT.

I haven't gone too deep in the codebase, but can't this be done by implementing Transport for a new MmioTransport-like type?

Unfortunately not, because the devices (virtio-blk for example) use mmio accesses directly too (to the configuration space), without going through the transport.

osteffenrh · 2024-05-15T12:35:12Z

To be clear, there is a MMIO transport in the virtio-drivers crate. But it simply uses volatile reads and writes by default. My change here was to add mmio access functions to the hardware abstraction trait, to be able to use explicit vmgextis instead of "plain" volatile accesses.
Since we don't have a #VC handler yet that supports MMIO, that seemed like the way to go.

kraxel · 2024-05-21T13:31:03Z

Wow, cool! We might also think about using IRQs once the COCONUT kernel has the infrastructure for it.

Not convinced this makes sense, so my qemu patch doesn't assign interrupts to these virtio-mmio transport slots.
Do we want run virtio devices asynchronously? Is the complexity worth the performance benefit?

When using interrupts: How would that look like? Would SVSM (partly) emulate lapic and ioapic then, to make sure the guest os can't tamper with the IRQ lines owned by SVSM?

joergroedel · 2024-05-22T09:01:43Z

Wow, cool! We might also think about using IRQs once the COCONUT kernel has the infrastructure for it.

Not convinced this makes sense, so my qemu patch doesn't assign interrupts to these virtio-mmio transport slots. Do we want run virtio devices asynchronously? Is the complexity worth the performance benefit?

When using interrupts: How would that look like? Would SVSM (partly) emulate lapic and ioapic then, to make sure the guest os can't tamper with the IRQ lines owned by SVSM?

Yes, The SVSM would own the HV-emulated APIC (of vAPIC in TDX case) and emulate another X(2)APIC for the guest OS. The IOAPIC emulation remains on the host side. APIC support in SVSM is needed anyway for TDX to implement IPIs for TLB flushes. On the SNP side IRQ support is needed to mitigate the AHOI group of attacks.

00xc · 2024-05-24T11:29:17Z

kernel/src/sev/ghcb.rs

+    fn read_buffer_as<T>(&mut self, offset: isize) -> Result<T, GhcbError>
+    where
+        T: Sized + Copy,
+    {
+        let size: isize = mem::size_of::<T>() as isize;
+
+        if offset < 0 || offset + size > (GHCB_BUFFER_SIZE as isize) {
+            return Err(GhcbError::InvalidOffset);
+        }
+
+        unsafe {
+            let src = self
+                .buffer
+                .as_mut_ptr()
+                .cast::<u8>()
+                .offset(offset)
+                .cast::<T>();
+            Ok(*src)
+        }
+    }


We do not need to use isize here for the offset. I understand this was perhaps done to mirror write_buffer() below, but we can do the right thing here (and the caller), i.e., use usize, and thus add() instead of offset()

Also, there are some safety issues:

We need to check that offset + size does not overflow (otherwise add() / offset() might cause UB.

We need to check that src has the alignment requirements for T. We should either add a check or use read_unaligned(). Otherwise this might cause UB.

Something like this:

Suggested change

fn read_buffer_as<T>(&mut self, offset: isize) -> Result<T, GhcbError>

where

T: Sized + Copy,

{

let size: isize = mem::size_of::<T>() as isize;

if offset < 0 || offset + size > (GHCB_BUFFER_SIZE as isize) {

return Err(GhcbError::InvalidOffset);

}

unsafe {

let src = self

.buffer

.as_mut_ptr()

.cast::<u8>()

.offset(offset)

.cast::<T>();

Ok(*src)

}

}

fn read_buffer_as<T>(&mut self, offset: usize) -> Result<T, GhcbError>

where

T: Sized + Copy,

{

offset

.checked_add(mem::size_of::<T>())

.filter(|end| *end <= GHCB_BUFFER_SIZE)

.ok_or(GhcbError::InvalidOffset)?;

// SAFETY: we have verified that offset is within bounds and does not

// overflow

let src = unsafe { self.buffer.as_ptr().add(offset) };

if src.align_offset(mem::align_of::<T>()) != 0 {

return Err(GhcbError::InvalidOffset);

}

// SAFETY: we have verified the pointer is aligned, as well as within

// bounds.

unsafe { Ok(src.cast::<T>().read()) }

}

I believe these additional checks should not cause a noticeable performance hit. As a last resort, we could make this function (as well as write_buffer()) unsafe and have the caller ensure the safety requirements are met.

Applied the suggestion, with one change:

unsafe { Ok(src.cast::<T>().read()) }

does not work. Coconut hangs.

This works:

unsafe { Ok(*(src.cast::<T>())) }

This sounds like UB. Can you try with read_volatile() instead? The compiler probably cannot infer that the buffer changes after a VMGEXIT.

Tried it. Still the same: hangs.

Also very strange: The last thing I see is "Validating memory", which happens way before
the virtio code (which is the only place calling this) runs.

One last try: could you update write_buffer so that it uses write_volatile() as well? E.g. something like:

diff --git a/kernel/src/sev/ghcb.rs b/kernel/src/sev/ghcb.rs index 9a5fe44..682d2cf 100644 --- a/kernel/src/sev/ghcb.rs +++ b/kernel/src/sev/ghcb.rs @@ -386,7 +386,7 @@ impl GHCB { fn write_buffer<T>(&mut self, data: &T, offset: isize) -> Result<(), GhcbError> where - T: Sized, + T: Copy, { let size: isize = mem::size_of::<T>() as isize; @@ -401,9 +401,8 @@ impl GHCB { .cast::<u8>() .offset(offset) .cast::<T>(); - let src = data as *const T; - ptr::copy_nonoverlapping(src, dst, 1); + dst.write_volatile(*data); } Ok(())

Perhaps with both of them being volatile the compiler can make the right assumptions.

This is strange. I retried the approaches mentioned above (read, read_volatile, read_volatile + write_volatile, and the plain "cast" line) and now they all work (virtio-blk tests code runs).
Before Coconut would hang reliably on all approaches except the one I mentioned. 🤷‍♂️

So, shall we use read_volatile and write_volatile to be safe? Or what would you recommend?

Have you tried both debug and release modes? Anyhow, yes, we should probably use the volatile versions to be safe.

I tried both, debug and release.

Using volatile accesses now.

00xc · 2024-05-24T11:35:07Z

kernel/src/virtio/mod.rs

+    }
+
+    unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T {
+        let paddr = PhysAddr::from((src as *const T) as u64);


Not sure I understand this, why are we not going through virt_to_phys()?

Since we can't access mmio locations directly and have to go through the ghcb mmio functions anyway, the mmio area is not mapped.

Instead, the physical MMIO base address is cast to the VirtIOHeader struct. This only works because the struct members are never actually accessed directly. It only serves as a fancy way to calculate the correct offsets of the individual registers. (This is just how the crate does it).

It should be possible to map the mmio range, cast that as VirtIOHeader and then translate back to the paddr when doing the mmio calls.

Trying this now... virt_to_phys fails with "Invalid physical address" (Copy&paste error btw, it is printing the vaddr there...)

What is the correct way to create the mapping, btw?

For a temporary mapping use PerCPUPageMappingGuard. For a permanent mapping you would probably need to call into this_cpu().get_pgtable().map_region(..).

Mapping the MMIO range works also now:

diff --git a/kernel/src/virtio/mod.rs b/kernel/src/virtio/mod.rs index 0f9d8b8..63da985 100644 --- a/kernel/src/virtio/mod.rs +++ b/kernel/src/virtio/mod.rs @@ -4,7 +4,7 @@ // // Author: Oliver Steffen <[email protected]> -use core::ptr::NonNull; +use core::ptr::{addr_of, NonNull}; use virtio_drivers::{ device::blk::{VirtIOBlk, SECTOR_SIZE}, @@ -15,9 +15,7 @@ use virtio_drivers::{ }; use crate::{ - address::PhysAddr, - cpu, - mm::{alloc::*, page_visibility::*, *}, + address::{PhysAddr, VirtAddr}, cpu::{self, percpu::this_cpu}, mm::{alloc::*, page_visibility::*, *} }; struct SvsmHal; @@ -107,7 +105,7 @@ unsafe impl virtio_drivers::Hal for SvsmHal { } unsafe fn mmio_read<T: Sized + Copy>(src: &T) -> T { - let paddr = PhysAddr::from((src as *const T) as u64); + let paddr = this_cpu().get_pgtable().phys_addr(VirtAddr::from(addr_of!(*src))).unwrap(); cpu::ghcb::current_ghcb() .mmio_read::<T>(paddr) @@ -115,7 +113,7 @@ unsafe impl virtio_drivers::Hal for SvsmHal { } unsafe fn mmio_write<T: Sized + Copy>(dst: &mut T, v: T) { - let paddr = PhysAddr::from((dst as *mut T) as u64); + let paddr = this_cpu().get_pgtable().phys_addr(VirtAddr::from(addr_of!(*dst))).unwrap(); cpu::ghcb::current_ghcb() .mmio_write::<T>(paddr, &v) @@ -127,8 +125,12 @@ unsafe impl virtio_drivers::Hal for SvsmHal { pub fn test_mmio() { static MMIO_BASE: u64 = 0xfef03000; // Hard-coded in Qemu + let paddr = PhysAddr::from(MMIO_BASE); + let mem = PerCPUPageMappingGuard::create_4k(paddr).expect("Error mapping MMIO region"); + + log::info!("mapped MMIO range {:016x} to vaddr {:016x}", MMIO_BASE, mem.virt_addr()); // Test code below taken from virtio-drivers aarch64 example. - let header = NonNull::new(MMIO_BASE as *mut VirtIOHeader).unwrap(); + let header = NonNull::new(mem.virt_addr().as_mut_ptr() as *mut VirtIOHeader).unwrap(); match unsafe { MmioTransport::<SvsmHal>::new(header) } { Err(e) => log::warn!( "Error creating VirtIO MMIO transport at {:016x}: {}",

00xc · 2024-05-24T11:35:46Z

kernel/src/virtio/mod.rs

+    }
+
+    unsafe fn mmio_write<T: Sized>(dst: &mut T, v: T) {
+        let paddr = PhysAddr::from((dst as *mut T) as u64);


Same here, why are we converting a regular reference to a PhysAddr directly?

see coconut-svsm#343 (review) Signed-off-by: Oliver Steffen <[email protected]>

see coconut-svsm#343 (review) Signed-off-by: Oliver Steffen <[email protected]> rd-vol wr_vol

00xc · 2024-06-04T10:55:53Z

kernel/src/sev/ghcb.rs

@@ -384,9 +386,31 @@ impl GHCB {
        Ok(())
    }

+    fn read_buffer_as<T>(&mut self, offset: usize) -> Result<T, GhcbError>
+    where
+        T: Sized + Copy,


Sized is opt-out, so it is always implied unless something like ?Sized is used.

Furthermore, Copy implies Sized so this is doubly-redundant :)

00xc · 2024-06-04T11:00:14Z

kernel/src/virtio/mod.rs

+        // TODO: allow more than one page.
+        //       This currently works, becasue in "modern" virtio mode the crate only allocates
+        //       one page at a time.
+        assert!(pages == 1);


I think just using allocate_pages() with some extra logic to zero them and make them all shared should work. right?

Use volatile access to ensue writes are not elided or reordered and data actually reaches the buffer. Signed-off-by: Oliver Steffen <[email protected]>

Add GHCB::mmio_read() and GHCB::mmio_write() to perform MMIO accesses via vmgexit. Signed-off-by: Oliver Steffen <[email protected]>

Demonstrate the use of the virtio-drivers crate, accessing a virtio-blk device via the virtio-mmio transport. Signed-off-by: Oliver Steffen <[email protected]>

osteffenrh · 2024-10-16T08:24:31Z

Rebased onto f23151f from Sept 16.
Going to current main branch requires more work due to changes to ghcb::write_buffer()
(now uses traits from zerocopy).

This now uses the latest virtio-drivers crate, see osteffenrh/virtio-drivers#1

Freax13 · 2024-10-16T08:49:54Z

Going to current main branch requires more work due to changes to ghcb::write_buffer() (now uses traits from zerocopy).

Feel free to let me know if you need any help with that.

osteffenrh · 2024-10-17T10:16:39Z

Going to current main branch requires more work due to changes to ghcb::write_buffer() (now uses traits from zerocopy).

Feel free to let me know if you need any help with that.

Here is some discussion on how this could be done: google/zerocopy#1919

osteffenrh force-pushed the virtio-pr branch 4 times, most recently from 3983c9d to 89212e9 Compare May 20, 2024 19:31

00xc reviewed May 24, 2024

View reviewed changes

osteffenrh added a commit to osteffenrh/coconut-svsm that referenced this pull request May 28, 2024

(Comment) improve read_buffer_as<T>()

c4070cb

see coconut-svsm#343 (review) Signed-off-by: Oliver Steffen <[email protected]>

osteffenrh added a commit to osteffenrh/coconut-svsm that referenced this pull request Jun 1, 2024

(Comment) improve ghcb buffer read/writes

90ffb70

see coconut-svsm#343 (review) Signed-off-by: Oliver Steffen <[email protected]> rd-vol wr_vol

osteffenrh force-pushed the virtio-pr branch 3 times, most recently from 135aba9 to 9455857 Compare June 1, 2024 09:10

00xc reviewed Jun 4, 2024

View reviewed changes

osteffenrh force-pushed the virtio-pr branch 2 times, most recently from b257b38 to fa4dffa Compare July 29, 2024 19:37

osteffenrh added 2 commits October 1, 2024 13:42

ghcb: Use volatile writes for GHCB buffer

3f9ceb5

Use volatile access to ensue writes are not elided or reordered and data actually reaches the buffer. Signed-off-by: Oliver Steffen <[email protected]>

kernel/sev/ghcb: add MMIO read/write functionality

4741463

Add GHCB::mmio_read() and GHCB::mmio_write() to perform MMIO accesses via vmgexit. Signed-off-by: Oliver Steffen <[email protected]>

osteffenrh force-pushed the virtio-pr branch from fa4dffa to ccae07c Compare October 1, 2024 11:58

Add virtio-drivers POC

31e1adc

Demonstrate the use of the virtio-drivers crate, accessing a virtio-blk device via the virtio-mmio transport. Signed-off-by: Oliver Steffen <[email protected]>

osteffenrh force-pushed the virtio-pr branch from ccae07c to 31e1adc Compare October 16, 2024 08:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] virtio-blk storage via mmio using virtio-drivers (Qemu) #343

[RFC] virtio-blk storage via mmio using virtio-drivers (Qemu) #343

osteffenrh commented May 14, 2024 •

edited

Loading

osteffenrh commented May 14, 2024

joergroedel commented May 14, 2024

00xc commented May 15, 2024

osteffenrh commented May 15, 2024

osteffenrh commented May 15, 2024

kraxel commented May 21, 2024

joergroedel commented May 22, 2024

00xc May 24, 2024 •

edited

Loading

osteffenrh May 28, 2024

00xc May 28, 2024

osteffenrh May 28, 2024

00xc May 29, 2024

osteffenrh Jun 1, 2024

00xc Jun 1, 2024

osteffenrh Jun 1, 2024

00xc May 24, 2024

osteffenrh May 27, 2024

osteffenrh May 28, 2024

00xc May 29, 2024

osteffenrh Jun 2, 2024

00xc May 24, 2024

00xc Jun 4, 2024

Freax13 Oct 17, 2024

00xc Jun 4, 2024

osteffenrh commented Oct 16, 2024

Freax13 commented Oct 16, 2024

osteffenrh commented Oct 17, 2024

[RFC] virtio-blk storage via mmio using virtio-drivers (Qemu) #343

Are you sure you want to change the base?

[RFC] virtio-blk storage via mmio using virtio-drivers (Qemu) #343

Conversation

osteffenrh commented May 14, 2024 • edited Loading

ToDo

Build & Run

Building & Preparations

Launching Qemu

Write Speed

osteffenrh commented May 14, 2024

joergroedel commented May 14, 2024

00xc commented May 15, 2024

osteffenrh commented May 15, 2024

osteffenrh commented May 15, 2024

kraxel commented May 21, 2024

joergroedel commented May 22, 2024

00xc May 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

osteffenrh commented Oct 16, 2024

Freax13 commented Oct 16, 2024

osteffenrh commented Oct 17, 2024

osteffenrh commented May 14, 2024 •

edited

Loading

00xc May 24, 2024 •

edited

Loading