
mvp vm attestation #1091

Open
jordanhendricks wants to merge 27 commits into master from jhendricks/rfd-605

Conversation

@jordanhendricks
Contributor

@jordanhendricks jordanhendricks commented Mar 27, 2026

closes #1067

TODO:

Cargo.toml Outdated
# Attestation
#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }
Member


Most of the Cargo.lock weirdness comes from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency issues we saw in Omicron come up when building the TUF repo. sled-agent-client re-exports items out of propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's weird.

I'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! We're just using one small part of the API for the RoT calls. But sled-agent and propolis are (I think?) updated in the same deployment unit, so the cyclic dependency is fine.

@jordanhendricks jordanhendricks marked this pull request as ready for review April 2, 2026 00:08
@jordanhendricks
Contributor Author

I want to add some comments in the attestation module, but from a code-structure perspective @iximeow and I are happy with this. Ready for review!

@jordanhendricks jordanhendricks requested a review from hawkw April 2, 2026 00:41
@jordanhendricks jordanhendricks self-assigned this Apr 2, 2026
api_runtime.block_on(async { vnc.halt().await });
}

// TODO: clean up attestation server.
Contributor


This can be removed now?

Member

@hawkw hawkw left a comment


Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?

Comment on lines +499 to 500
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
Member


fwiw, i think the TODO is as easy as changing this to

Suggested change
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
// TODO: early return if none?
let Some(vsock) = &self.spec.vsock else { return; };

and then un-indenting everything else in the function basically.
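For readers unfamiliar with the pattern, here is a minimal standalone sketch of the let-else early return being suggested. The `Spec` type is a hypothetical stand-in for illustration, not the real propolis spec type:

```rust
// Hypothetical stand-in for the real instance spec type.
struct Spec {
    vsock: Option<String>,
}

fn describe(spec: &Spec) -> String {
    // Early return when no vsock device is configured; the `else`
    // branch must diverge (return/break/continue/panic), which is what
    // lets `vsock` bind at the function's top indentation level.
    let Some(vsock) = &spec.vsock else {
        return String::from("no vsock");
    };
    // Everything below runs without an extra level of nesting.
    format!("vsock: {vsock}")
}

fn main() {
    println!("{}", describe(&Spec { vsock: None }));
    println!("{}", describe(&Spec { vsock: Some("cid=3".to_string()) }));
}
```

The payoff is exactly what the comment says: the remainder of the function body no longer needs to be indented inside an `if let` block.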

)];

let guest_cid = GuestCid::try_from(vsock.spec.guest_cid)
.context("guest cid")?;
Member


not super important but this string could be better probably

let guest_cid = GuestCid::try_from(vsock.spec.guest_cid)
.context("guest cid")?;
// While the spec does not recommend how large the virtio descriptor
// table should be we sized this appropriately in testing so
Member


turbo nit:

Suggested change
// table should be we sized this appropriately in testing, so

Comment on lines +707 to +728
// In a rack we only configure propolis-server with zero or
// one boot disks. It's possible to provide a fuller list,
// and in the future the product may actually expose such a
// capability. At that time, we'll need to have a reckoning
// for what "boot disk measurement" from the RoT actually
// means; it probably "should" be "the measurement of the
// disk that EDK2 decided to boot into", but that
// communication to and from the guest is a little more
// complicated than we want or need to build out today.
//
// Since as the system exists we either have no specific
// boot disk (and don't know where the guest is expected to
// end up), or one boot disk (and can determine which disk
// to collect a measurement of before even running guest
// firmware), we encode this expectation up front. If the
// product has changed such that this assert is reached,
// "that's exciting!" and "sorry for crashing your
// Propolis".
panic!(
"Unsupported VM RoT configuration: \
more than one boot disk"
);
Member


what is the rationale for making this a panic rather than a MachineInitError? would that be easier to debug if this was hit someday later?
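For context, a hedged sketch of the error-returning alternative being asked about; `MachineInitError` and its variant here are stand-ins for illustration, not the actual Propolis types:

```rust
// Hypothetical stand-in for Propolis's machine-init error type.
#[derive(Debug, PartialEq)]
enum MachineInitError {
    TooManyBootDisks(usize),
}

// Returning an error lets the caller fail VM initialization cleanly
// (and log the details) rather than aborting the whole
// propolis-server process with a panic.
fn check_boot_disks(count: usize) -> Result<(), MachineInitError> {
    if count > 1 {
        return Err(MachineInitError::TooManyBootDisks(count));
    }
    Ok(())
}

fn main() {
    assert!(check_boot_disks(1).is_ok());
    println!("{:?}", check_boot_disks(2));
}
```

The trade-off is the usual one: a panic pins the crash to the exact unsupported configuration, while an error value keeps the process up and leaves the decision to the caller.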

Some(backend.clone_volume())
} else {
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only");
Member


maybe this should explicitly state that this means it will not be attested?

Comment on lines +42 to +118
#[derive(Debug)]
enum AttestationInitState {
Preparing {
vm_conf_send: oneshot::Sender<VmInstanceConf>,
},
/// A transient state while we're getting the initializer ready, having
/// taken `Preparing` and its `vm_conf_send`, but before we've got a
/// `JoinHandle` to track as running.
Initializing,
Running {
init_task: JoinHandle<()>,
},
}

/// This struct manages providing the requisite data for a corresponding
/// `AttestationSock` to become fully functional.
pub struct AttestationSockInit {
log: slog::Logger,
vm_conf_send: oneshot::Sender<VmInstanceConf>,
uuid: uuid::Uuid,
volume_ref: Option<crucible::Volume>,
}

impl AttestationSockInit {
/// Do any remaining work of collecting VM RoT measurements in support
/// of this VM's attestation server.
pub async fn run(self) {
let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;

let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };

if let Some(volume) = volume_ref {
// TODO(jph): make propolis issue, link to #1078 and add a log line
// TODO: load-bearing sleep: we have a Crucible volume, but we can
// be here and chomping at the bit to get a digest calculation
// started well before the volume has been activated; in
// `propolis-server` we need to wait for at least a subsequent
// instance start. Similar to the scrub task for Crucible disks,
// delay some number of seconds in the hopes that activation is done
// promptly.
//
// This should be replaced by awaiting for some kind of actual
// "activated" signal.
tokio::time::sleep(std::time::Duration::from_secs(10)).await;

let boot_digest =
match crate::attestation::boot_digest::boot_disk_digest(
volume, &log,
)
.await
{
Ok(digest) => digest,
Err(e) => {
// a panic here is unfortunate, but helps us debug for
// now; if the digest calculation fails it may be some
// retryable issue that a guest OS would survive. but
// panicking here means we've stopped Propolis at the
// actual error, rather than noticing the
// `vm_conf_sender` having dropped elsewhere.
panic!("failed to compute boot disk digest: {e:?}");
}
};

vm_conf.boot_digest = Some(boot_digest);
} else {
slog::warn!(log, "not computing boot disk digest");
}

let send_res = vm_conf_send.send(vm_conf);
if send_res.is_err() {
slog::error!(
log,
"attestation server is not listening for its config?"
);
}
}
}
Member


Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?

I'd be happy to throw together a patch that does that refactoring if it's too annoying.
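A rough illustration of the shape being proposed, using `std::thread::JoinHandle` as a stand-in for Tokio's (the real code would use `tokio::spawn` and `.await` the handle; the types here are hypothetical simplifications):

```rust
use std::thread;

// Hypothetical simplification of the PR's config type.
#[derive(Debug, PartialEq)]
struct VmInstanceConf {
    boot_digest: Option<Vec<u8>>,
}

// Instead of sending the result over a oneshot channel and keeping a
// JoinHandle<()>, the init task returns the value directly, so the
// spawner holds a JoinHandle<VmInstanceConf> and joins on it.
fn spawn_init() -> thread::JoinHandle<VmInstanceConf> {
    thread::spawn(|| {
        // ... collect measurements / compute the boot disk digest ...
        VmInstanceConf { boot_digest: None }
    })
}

fn main() {
    let conf = spawn_init().join().expect("init task panicked");
    println!("{conf:?}");
}
```

With this shape, the "sender dropped" failure mode disappears: a panicking init task surfaces as a join error instead of a silently closed channel.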

Comment on lines +255 to +261
let response = vm_attest::Response::Error(
"VmInstanceConf not ready".to_string(),
);
//let mut response =
//serde_json::to_string(&response)?;
//response.push('\n');
response
Member


unimportant nit:

Suggested change
let response = vm_attest::Response::Error(
"VmInstanceConf not ready".to_string(),
);
//let mut response =
//serde_json::to_string(&response)?;
//response.push('\n');
response
vm_attest::Response::Error(
"VmInstanceConf not ready".to_string(),
)

Comment on lines +322 to +339
let vm_conf = Arc::new(Mutex::new(None));

let log_ref = log.clone();
let vm_conf_cloned = vm_conf.clone();
tokio::spawn(async move {
match vm_conf_recv.await {
Ok(conf) => {
*vm_conf_cloned.lock().unwrap() = Some(conf);
}
Err(_e) => {
slog::warn!(
log_ref,
"lost boot digest sender, \
hopefully Propolis is stopping"
);
}
}
});
Member


actually, upon looking slightly closer at things, it kinda feels like all of this stuff with the vm_conf_recv oneshot and the task that waits for that and then stuffs it into a mutex could all be avoided if vm_conf was a tokio::sync::watch channel...AFAICT, once it gets stuck in there once, it's never mutated again, so does it really need to be a Mutex for all time?
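To illustrate the set-once/read-many shape being described, here is a std-only sketch using `OnceLock` as a stand-in (in the actual async code, `tokio::sync::watch` would additionally let readers await the value's arrival rather than poll; the types here are hypothetical):

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical simplification of the PR's config type.
#[derive(Debug, PartialEq)]
struct VmInstanceConf {
    uuid: u128,
}

fn demo() -> Option<u128> {
    // The config is written exactly once and then only ever read, so a
    // Mutex<Option<T>> held for the lifetime of the server is overkill.
    let vm_conf: Arc<OnceLock<VmInstanceConf>> = Arc::new(OnceLock::new());

    // Writer side: stores the config exactly once; a second `set`
    // would return Err with the rejected value.
    let writer = vm_conf.clone();
    writer.set(VmInstanceConf { uuid: 42 }).expect("set only once");

    // Reader side: no lock to acquire, no Option behind a Mutex.
    vm_conf.get().map(|c| c.uuid)
}

fn main() {
    println!("{:?}", demo());
}
```

The difference from `watch` is purely about waiting: `OnceLock::get` returns `None` until the write happens, whereas a `watch::Receiver` can `.changed().await` the arrival.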

let mut buffer =
Buffer::new(this_block_count as usize, block_size as usize);

// TODO(jph): We don't want to panic in the case of a failed read. How
Contributor Author


I still need to do this and test on dublin.

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! TODO: block comment
Contributor Author


in progress



Development

Successfully merging this pull request may close these issues.

mvp vm attestation support in propolis-server (rfd 605)

4 participants