
mvp vm attestation #1091

Open
jordanhendricks wants to merge 27 commits into master from jhendricks/rfd-605

Conversation

@jordanhendricks
Contributor

@jordanhendricks jordanhendricks commented Mar 27, 2026

closes #1067

TODO:

Cargo.toml Outdated
# Attestation
#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }
Member


Most of the Cargo.lock weirdness comes from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency issues we saw in Omicron come up when building the TUF repo. sled-agent-client re-exports items out of propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's weird.

I'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! We're just using one small part of the API for the RoT calls. But sled-agent and propolis are (I think?) updated in the same deployment unit, so the cyclic dependency is fine.

@jordanhendricks jordanhendricks marked this pull request as ready for review April 2, 2026 00:08
@jordanhendricks
Contributor Author

I want to add some comments in the attestation module, but from a code-structure perspective @iximeow and I are happy with this. Ready for review!

@jordanhendricks jordanhendricks requested a review from hawkw April 2, 2026 00:41
@jordanhendricks jordanhendricks self-assigned this Apr 2, 2026
api_runtime.block_on(async { vnc.halt().await });
}

// TODO: clean up attestation server.
Contributor


This can be removed now?

Member

@hawkw hawkw left a comment


Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?

Comment on lines +499 to 500
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
Member


fwiw, i think the TODO is as easy as changing this to

Suggested change
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
// TODO: early return if none?
let Some(vsock) = &self.spec.vsock else { return; };

and then un-indenting everything else in the function basically.
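For readers unfamiliar with the pattern, here is a minimal standalone sketch of the let-else early return being suggested. The `Spec` type is a hypothetical stand-in for illustration, not the real propolis spec type:

```rust
// Hypothetical stand-in for the real instance spec type.
struct Spec {
    vsock: Option<String>,
}

fn describe(spec: &Spec) -> String {
    // Early return when no vsock device is configured; the `else`
    // branch must diverge (return/break/continue/panic), which is what
    // lets `vsock` bind at the function's top indentation level.
    let Some(vsock) = &spec.vsock else {
        return String::from("no vsock");
    };
    // Everything below runs without an extra level of nesting.
    format!("vsock: {vsock}")
}

fn main() {
    println!("{}", describe(&Spec { vsock: None }));
    println!("{}", describe(&Spec { vsock: Some("cid=3".to_string()) }));
}
```

The payoff is exactly what the comment says: the remainder of the function body no longer needs to be indented inside an `if let` block.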

)];

let guest_cid = GuestCid::try_from(vsock.spec.guest_cid)
.context("guest cid")?;
Member


not super important but this string could be better probably

let guest_cid = GuestCid::try_from(vsock.spec.guest_cid)
.context("guest cid")?;
// While the spec does not recommend how large the virtio descriptor
// table should be we sized this appropriately in testing so
Member


turbo nit:

Suggested change
// table should be we sized this appropriately in testing, so

Comment on lines +707 to +728
// In a rack we only configure propolis-server with zero or
// one boot disks. It's possible to provide a fuller list,
// and in the future the product may actually expose such a
// capability. At that time, we'll need to have a reckoning
// for what "boot disk measurement" from the RoT actually
// means; it probably "should" be "the measurement of the
// disk that EDK2 decided to boot into", but that
// communication to and from the guest is a little more
// complicated than we want or need to build out today.
//
// Since as the system exists we either have no specific
// boot disk (and don't know where the guest is expected to
// end up), or one boot disk (and can determine which disk
// to collect a measurement of before even running guest
// firmware), we encode this expectation up front. If the
// product has changed such that this assert is reached,
// "that's exciting!" and "sorry for crashing your
// Propolis".
panic!(
"Unsupported VM RoT configuration: \
more than one boot disk"
);
Member


what is the rationale for making this a panic rather than a MachineInitError? would that be easier to debug if this was hit someday later?
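For context, a hedged sketch of the error-returning alternative being asked about; `MachineInitError` and its variant here are stand-ins for illustration, not the actual Propolis types:

```rust
// Hypothetical stand-in for Propolis's machine-init error type.
#[derive(Debug, PartialEq)]
enum MachineInitError {
    TooManyBootDisks(usize),
}

// Returning an error lets the caller fail VM initialization cleanly
// (and log the details) rather than aborting the whole
// propolis-server process with a panic.
fn check_boot_disks(count: usize) -> Result<(), MachineInitError> {
    if count > 1 {
        return Err(MachineInitError::TooManyBootDisks(count));
    }
    Ok(())
}

fn main() {
    assert!(check_boot_disks(1).is_ok());
    println!("{:?}", check_boot_disks(2));
}
```

The trade-off is the usual one: a panic pins the crash to the exact unsupported configuration, while an error value keeps the process up and leaves the decision to the caller.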

Some(backend.clone_volume())
} else {
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only");
Member


maybe this should explicitly state that this means it will not be attested?

Comment on lines +42 to +118
#[derive(Debug)]
enum AttestationInitState {
Preparing {
vm_conf_send: oneshot::Sender<VmInstanceConf>,
},
/// A transient state while we're getting the initializer ready, having
/// taken `Preparing` and its `vm_conf_send`, but before we've got a
/// `JoinHandle` to track as running.
Initializing,
Running {
init_task: JoinHandle<()>,
},
}

/// This struct manages providing the requisite data for a corresponding
/// `AttestationSock` to become fully functional.
pub struct AttestationSockInit {
log: slog::Logger,
vm_conf_send: oneshot::Sender<VmInstanceConf>,
uuid: uuid::Uuid,
volume_ref: Option<crucible::Volume>,
}

impl AttestationSockInit {
/// Do any remaining work of collecting VM RoT measurements in support
/// of this VM's attestation server.
pub async fn run(self) {
let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;

let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };

if let Some(volume) = volume_ref {
// TODO(jph): make propolis issue, link to #1078 and add a log line
// TODO: load-bearing sleep: we have a Crucible volume, but we can
// be here and chomping at the bit to get a digest calculation
// started well before the volume has been activated; in
// `propolis-server` we need to wait for at least a subsequent
// instance start. Similar to the scrub task for Crucible disks,
// delay some number of seconds in the hopes that activation is done
// promptly.
//
// This should be replaced by awaiting for some kind of actual
// "activated" signal.
tokio::time::sleep(std::time::Duration::from_secs(10)).await;

let boot_digest =
match crate::attestation::boot_digest::boot_disk_digest(
volume, &log,
)
.await
{
Ok(digest) => digest,
Err(e) => {
// a panic here is unfortunate, but helps us debug for
// now; if the digest calculation fails it may be some
// retryable issue that a guest OS would survive. but
// panicking here means we've stopped Propolis at the
// actual error, rather than noticing the
// `vm_conf_sender` having dropped elsewhere.
panic!("failed to compute boot disk digest: {e:?}");
}
};

vm_conf.boot_digest = Some(boot_digest);
} else {
slog::warn!(log, "not computing boot disk digest");
}

let send_res = vm_conf_send.send(vm_conf);
if send_res.is_err() {
slog::error!(
log,
"attestation server is not listening for its config?"
);
}
}
}
Member


Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?

I'd be happy to throw together a patch that does that refactoring if it's too annoying.
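A rough illustration of the shape being proposed, using `std::thread::JoinHandle` as a stand-in for Tokio's (the real code would use `tokio::spawn` and `.await` the handle; the types here are hypothetical simplifications):

```rust
use std::thread;

// Hypothetical simplification of the PR's config type.
#[derive(Debug, PartialEq)]
struct VmInstanceConf {
    boot_digest: Option<Vec<u8>>,
}

// Instead of sending the result over a oneshot channel and keeping a
// JoinHandle<()>, the init task returns the value directly, so the
// spawner holds a JoinHandle<VmInstanceConf> and joins on it.
fn spawn_init() -> thread::JoinHandle<VmInstanceConf> {
    thread::spawn(|| {
        // ... collect measurements / compute the boot disk digest ...
        VmInstanceConf { boot_digest: None }
    })
}

fn main() {
    let conf = spawn_init().join().expect("init task panicked");
    println!("{conf:?}");
}
```

With this shape, the "sender dropped" failure mode disappears: a panicking init task surfaces as a join error instead of a silently closed channel.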

Comment on lines +255 to +261
let response = vm_attest::Response::Error(
"VmInstanceConf not ready".to_string(),
);
//let mut response =
//serde_json::to_string(&response)?;
//response.push('\n');
response
Member


unimportant nit:

Suggested change
let response = vm_attest::Response::Error(
"VmInstanceConf not ready".to_string(),
);
//let mut response =
//serde_json::to_string(&response)?;
//response.push('\n');
response
vm_attest::Response::Error(
"VmInstanceConf not ready".to_string(),
)

Comment on lines +322 to +339
let vm_conf = Arc::new(Mutex::new(None));

let log_ref = log.clone();
let vm_conf_cloned = vm_conf.clone();
tokio::spawn(async move {
match vm_conf_recv.await {
Ok(conf) => {
*vm_conf_cloned.lock().unwrap() = Some(conf);
}
Err(_e) => {
slog::warn!(
log_ref,
"lost boot digest sender, \
hopefully Propolis is stopping"
);
}
}
});
Member


actually, upon looking slightly closer at things, it kinda feels like all of this stuff with the vm_conf_recv oneshot and the task that waits for that and then stuffs it into a mutex could all be avoided if vm_conf was a tokio::sync::watch channel...AFAICT, once it gets stuck in there once, it's never mutated again, so does it really need to be a Mutex for all time?
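To illustrate the set-once/read-many shape being described, here is a std-only sketch using `OnceLock` as a stand-in (in the actual async code, `tokio::sync::watch` would additionally let readers await the value's arrival rather than poll; the types here are hypothetical):

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical simplification of the PR's config type.
#[derive(Debug, PartialEq)]
struct VmInstanceConf {
    uuid: u128,
}

fn demo() -> Option<u128> {
    // The config is written exactly once and then only ever read, so a
    // Mutex<Option<T>> held for the lifetime of the server is overkill.
    let vm_conf: Arc<OnceLock<VmInstanceConf>> = Arc::new(OnceLock::new());

    // Writer side: stores the config exactly once; a second `set`
    // would return Err with the rejected value.
    let writer = vm_conf.clone();
    writer.set(VmInstanceConf { uuid: 42 }).expect("set only once");

    // Reader side: no lock to acquire, no Option behind a Mutex.
    vm_conf.get().map(|c| c.uuid)
}

fn main() {
    println!("{:?}", demo());
}
```

The difference from `watch` is purely about waiting: `OnceLock::get` returns `None` until the write happens, whereas a `watch::Receiver` can `.changed().await` the arrival.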

let mut buffer =
Buffer::new(this_block_count as usize, block_size as usize);

// TODO(jph): We don't want to panic in the case of a failed read. How
Contributor Author


I still need to do this and test on dublin.

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! TODO: block comment
Contributor Author


in progress



Development

Successfully merging this pull request may close these issues.

mvp vm attestation support in propolis-server (rfd 605)

4 participants