
Conversation

@iximeow (Member) commented Dec 11, 2025

As described in the comment on `MAPPING_IO_LIMIT_BYTES`, this is a gross hack: if VM memory is used as the iovec for `p{read,write}v` to a block device or zvol, we'll end up contending on `svmd->svmd_lock` pretty heavily in higher-IOPS situations.

This might arise when doing fewer, larger I/Os to zvols or block devices as well, since we'll be forwarding each of the 4 KiB entries in guest PRP lists as a distinct iovec, but we haven't tested that.

Either way, operating from Propolis heap avoids the issue for now. One could imagine this "should" be a configurable behavior on file block backends, since many files (like disk images in propolis-standalone) would not need this. The hope is that we can fix this in the kernel so the hack is no longer needed for zvols or block devices, rather than leaning further into properly supporting this copy hack.

For good measure, I did some testing with propolis-standalone locally. Of the ~1% of time spent in user code when doing large I/Os, a `profile-1003` profile only turned up three stacks here, all in the malloc/free for the buffers.
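As a rough sketch of the approach described above (an editorial reconstruction, not the PR's actual code, using the libc crate as the surrounding excerpts do): copy the guest data into a single heap buffer, then issue the write from that buffer instead of one iovec per guest region.

// Hypothetical reconstruction: rather than handing the kernel one iovec
// per guest PRP entry, copy the guest data into one Propolis-heap buffer
// and pwrite() from there, so the kernel never locks guest segments.
unsafe fn write_via_bounce(
    fd: libc::c_int,
    guest_regions: &[(*const u8, usize)],
    offset: libc::off_t,
) -> std::io::Result<usize> {
    let total: usize = guest_regions.iter().map(|&(_, len)| len).sum();
    let mut buf = vec![0u8; total]; // zero-filled; see discussion below
    let mut at = 0;
    for &(ptr, len) in guest_regions {
        unsafe { std::ptr::copy_nonoverlapping(ptr, buf.as_mut_ptr().add(at), len) };
        at += len;
    }
    let wrote = unsafe { libc::pwrite(fd, buf.as_ptr().cast(), total, offset) };
    if wrote == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(wrote as usize)
}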

// worker threads for all file-backed disks, times this threshold. It works out
// to up to 128 MiB (8 worker threads) of buffers per disk by default as of
// writing.
const MAPPING_IO_LIMIT_BYTES: usize = 16 * crate::common::MB;
@iximeow (Member Author) Dec 11, 2025


with a Linux guest, dtrace says that this is a wildly large limit: when I have fio doing 16M I/Os, that turns into `block_begin_write` firing with 256 KiB of data to move (even though the controller tells the guest `mdts=0` -> "no limit" on data transfer size). presumably Linux is breaking up the I/Os before making NVMe commands, but I don't plan on looking too closely.

@iximeow added the "local storage" label Dec 11, 2025
@iximeow added this to the 18 milestone Dec 11, 2025
@hawkw (Member) commented Dec 11, 2025

Interesting, this sure seems like it should be worse than actually writing straight into the guest address space, but the contention making that the worse option makes sense. I agree that fixing the kernel seems like a better long-term solution.

Comment on lines 768 to 769
// Beyond this, either fall back to using iovecs directly (and trust the kernel
// to handle bizarre vecs) or error. This is an especially gross hack because we
Member

"bizarre" here just meaning "in excess of 16 megabytes"?

Member Author

worse: imagine a list of iovec where the ptr/len pair is just repeated 10000 times. if we did range coalescing in Propolis (i should file an issue about that, actually), an evil driver could submit a write using 1MB (256 pages) but repeat that several hundred times to produce GiB+ writes kinda easily
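To illustrate the amplification (an editorial sketch with hypothetical numbers, not Propolis code): the total transfer size is the sum of the iovec lengths, regardless of how many distinct pages they actually cover.

// Hypothetical: an iovec list naming the same 1 MiB guest range 1000
// times describes ~1 GiB of I/O while pinning only 256 distinct pages.
fn total_transfer_bytes(iovs: &[(usize, usize)]) -> usize {
    // (ptr, len) pairs; lengths sum whether or not the ranges overlap.
    iovs.iter().map(|&(_ptr, len)| len).sum()
}

fn main() {
    let one_mib = 1024 * 1024;
    let evil = vec![(0x1000usize, one_mib); 1000];
    assert_eq!(total_transfer_bytes(&evil), 1000 * one_mib);
}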

Member Author

actually since IOV_MAX is just 1024 this is less of a problem than i'd thought (... so we really ought to set MDTS to 10 lol)
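For context (an editorial gloss; the thread itself only gives the numbers): NVMe's MDTS field expresses the maximum transfer as a power-of-two multiple of the minimum page size, so MDTS=10 with 4 KiB pages caps a transfer at 1024 pages, which is exactly IOV_MAX one-page PRP entries.

// Sketch of the arithmetic, assuming 4 KiB pages (CAP.MPSMIN).
const PAGE_SIZE: usize = 4096;
const IOV_MAX: usize = 1024;

// Max transfer in bytes is 2^mdts pages.
fn max_transfer_bytes(mdts: u32) -> usize {
    PAGE_SIZE << mdts
}

fn main() {
    // MDTS=10 -> 4 MiB, i.e. 1024 pages: one iovec per PRP entry fits.
    assert_eq!(max_transfer_bytes(10) / PAGE_SIZE, IOV_MAX);
}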

Member

> worse: imagine a list of iovec where the ptr/len pair is just repeated 10000 times. if we did range coalescing in Propolis (i should file an issue about that, actually), an evil driver could submit a write using 1MB (256 pages) but repeat that several hundred times to produce GiB+ writes kinda easily

oh, that's ghastly. ew. okay.

Comment on lines +799 to +802
// If we're motivated to avoid the zero-fill via
// `Layout::with_size_align` + `GlobalAlloc::alloc`, we should
// probably avoid this gross hack entirely (see comment on
// MAPPING_IO_LIMIT_BYTES).
Member

fair!

Member Author

so it has since turned out that the zero-filling doesn't take a remarkable amount of CPU time (compared to the copying), but the malloc and frees themselves do, in the form of locking in `umem_cache_{alloc,free}`, particularly when the I/O sizes vary. I'm going to address some of the other comments and land this as-is, but probably just move the buffer out to be reused in a followup. I just want to measure whether it actually changes much in the mixed-size-I/O case first.
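For reference, a minimal sketch of the zero-fill-avoiding allocation the code comment gestures at (the standard library spells it `Layout::from_size_align`; this is an editorial sketch, not the PR's code):

use std::alloc::{alloc, dealloc, Layout};

// Allocate an uninitialized bounce buffer, skipping the zero-fill that
// `vec![0u8; len]` performs. Sound only if every byte is written (e.g.
// via copy_nonoverlapping) before any byte is read.
fn with_uninit_buffer<R>(len: usize, f: impl FnOnce(*mut u8) -> R) -> R {
    assert!(len > 0, "zero-size alloc is undefined behavior");
    let layout = Layout::from_size_align(len, 4096).expect("valid layout");
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    let out = f(ptr);
    unsafe { dealloc(ptr, layout) };
    out
}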

Comment on lines +869 to +875
let read = unsafe {
libc::preadv(fd, iov.as_ptr(), iov.len() as libc::c_int, offset)
};
if read == -1 {
return Err(Error::last_os_error());
}
let read: usize = read.try_into().expect("read is positive");
Member

i thought we had someplace a thingy for doing a libc call and checking errno if the return value is -1, but after looking closer, it looks like this is just for ioctl; i wonder if it's worth having something like that for other libc functions like preadv (since we've got a bunch of similar things).

i'm imagining a thingy like

fn syscall_result(x: isize) -> Result<usize, io::Error> {
    usize::try_from(x).map_err(|_| io::Error::last_os_error())
}

not a blocker, naturally!

Comment on lines +757 to +762
// Gross hack alert: since the mappings below are memory regions backed by
// segvmm_ops, `zvol_{read,write}` and similar will end up contending on
// `svmd->svmd_lock`. Instead, as long as the I/O is small enough we'll tolerate
// it, copy from guest memory to Propolis heap. The segment backing Propolis'
// heap has an `as_page{,un}lock` impl that avoids the more
// expensive/contentious `as_fault()` fallback.
Member

wow, i hate this a lot, but your rationale makes sense. i'll let you have it. :)

@hawkw (Member) left a comment

great, yeah, i think the use of `copy_nonoverlapping` is legit here. couple little nits but no blocking worries on my end.

Comment on lines 819 to 822
if read == -1 {
return Err(Error::last_os_error());
}
let read: usize = read.try_into().expect("read is positive");
Member

nit, take it or leave it: we could, perhaps, avoid the panic here if we instead wrote:

Suggested change
-if read == -1 {
-    return Err(Error::last_os_error());
-}
-let read: usize = read.try_into().expect("read is positive");
+let read = match usize::try_from(read) {
+    Ok(read) => read,
+    Err(_) => return Err(Error::last_os_error()),
+};

this way, we don't branch on the value of `read` twice, which...is maybe nice. it does mean we don't panic if preadv decides to return -420 or some shit, but...if we care about that, we could `assert_eq!(read, -1)` in the error path, i guess?

Member

UPDATE: oh i also said a similarish thing last week (https://github.com/oxidecomputer/propolis/pull/985/files#r2612098032), time is a flat circle.

Contributor

I like that suggestion, with a caveat: this is basically `let read = usize::try_from(read).map_err(|_| Error::last_os_error())?;`, isn't it?

Contributor

Actually...thinking about this a bit more.... Perhaps we should avoid direct use of libc to whatever extent we can?

nix has a wrapper for preadv that does more sensible error handling and returns a Result over usize, which is exactly what we want. Similarly, I saw @iliana recommend rustix the other day, which does much the same.

On a meta note, I'm not sure which crate is best here, but company-wide perhaps we should at least present a soft preference for something that's slightly higher-level than libc?

Member Author

there are kind of a few things. first, yeah, I do care about panicking if preadv() returned a bogus error value, but that's easy enough to assert. I don't super like usize::try_from here because it seems weird to read this later as "if preadv() returned a number that couldn't be a usize, then it must have errored", versus "if preadv() returned -1 that is an error, otherwise it must be a usize". I was hoping that maybe there's a less annoying construction of this, but at the end of the day I consider this whole patch kinda temporary so I'm not sweating it so much.

mechanically, from "before", I'd agree that both rustix and nix wouldn't hurt here. but particularly as this is "data plane", I'm very, very cautious about the system behaving exactly as we expect at all times. nix returns Error::last_os_error() if the returned value is a -1-like sentinel, and otherwise casts whatever the bits are into the expected type. rustix in this case has a debug_assert!(). both, IMO, are insufficiently paranoid that the syscall completed as expected, particularly when we're handling guest data.

Member Author

anyway, my paranoia aside, `SubMapping::pread` and `SubMapping::pwrite` do the nix-like "check for -1, otherwise cast", so I'll keep it like that here as well and take a pass at the libc stuff separately later.
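A minimal sketch of that "check for -1, otherwise cast" convention, with the extra assertion on bogus return values discussed above (a hypothetical helper, not the SubMapping code itself):

use std::io::Error;

// Hypothetical helper: -1 means consult errno; anything else is taken as
// a byte count, with a panic if the kernel returns another negative value.
fn check_ret(ret: isize) -> std::io::Result<usize> {
    if ret == -1 {
        return Err(Error::last_os_error());
    }
    Ok(usize::try_from(ret).expect("syscall return is non-negative"))
}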

Comment on lines +835 to +836
// example), but `copy_nonoverlapping` has no compunctions about
// concurrent mutation.
Member

perhaps worth noting explicitly that if a guest does decide it wants to screw up its own iovec for some reason, it would be pretty damn hard for LLVM to figure out a way to miscompile us in that case?

Contributor

We could also drop down to something like the volatile copy intrinsic, which is sadly not exposed in the standard library (it would presumably live in core), as intrinsics are unstable, so that would require either nightly or shenanigans. I proposed something in rust-lang/rfcs#2728 a long while back, but encountered a lot of resistance and ran out of steam to push that change forward.

Member Author

> if a guest does decide it wants to screw up its own iovec

I'm definitely thinking more about "same page submitted in another I/O concurrently", where we're working on the same range on a different thread in Propolis, more than the guest happening to scribble on memory at the same time.

before, I'd broken the rules but hadn't seen a better option and really just needed to justify to myself why it might be OK. now I just want to emphasize that things can be happening (more than with a normal copy_nonoverlapping, since I expect that someone looking at this file wouldn't immediately think of the page-submitted-in-other-I/Os case), but the contract of copy_nonoverlapping doesn't care about those things.

but generally "operate on SubMapping exclusively through pointer operations, never create a reference to or slice of the referenced data" probably should be a comment on SubMapping (the Safety comments on raw_readable and raw_writable substantiate this: "The caller must never create a reference to the underlying memory region.")
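As an editorial sketch of that pointer discipline (the accessor shape is assumed, not the actual SubMapping API beyond what's quoted):

use std::ptr;

// Copy out of a guest mapping purely via raw pointers. No &[u8] is ever
// formed over guest memory, so concurrent guest (or other-thread) writes
// can't invalidate a Rust reference; `copy_nonoverlapping` only requires
// valid, non-overlapping source and destination regions.
unsafe fn copy_from_mapping(src: *const u8, dst: &mut [u8]) {
    unsafe { ptr::copy_nonoverlapping(src, dst.as_mut_ptr(), dst.len()) };
}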

> volatile copy intrinsic

would be a bit too limiting, I think! that intrinsic would demand a byte-wise copy loop rather than memcpy() and the widen-to-u64-if-not-AVX paths there, and the memory here isn't volatile anyway.

also we *do* set MDTS in practice, so we can limit buffer size accordingly.
@iximeow merged commit 17d02e1 into master Jan 6, 2026
11 checks passed
@iximeow deleted the ixi/file-copy-hack branch January 6, 2026 03:01
iximeow added a commit that referenced this pull request Jan 6, 2026
Along the way I realized the copying in #985 is somewhat more obviously
correct when written in terms of `SubMapping` operations, so move that
along with the docs.