Cargo.toml
| # Attestation |
| #dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] } |
| dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] } |
| vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false } |
most of the Cargo.lock weirdness comes from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency stuff we saw in Omicron comes up when building the TUF repo. sled-agent-client re-exports items out of propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's Weird.
i'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! we're just using one small part of the API for the RoT calls. but sled-agent and propolis are (i think?) updated in the same deployment unit, so the cyclic dependency is fine.
this fixes issues (read: panics) related to `AttestSledAgent`'s internal `rt`, `block_on`, and dropping.
actually stop the `AttestationSock` when we stop other Propolis devices/backends, and along the way rename `tcp_attest` to `attest_handle`.
I want to add some comments in the attestation module, but from a code-structure perspective @iximeow and I are happy with this. Ready for review!
| api_runtime.block_on(async { vnc.halt().await }); |
| } |
| |
| // TODO: clean up attestation server. |
This can be removed now?
hawkw left a comment:
Some of the Tokio stuff felt a bit awkward here; I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?
| // TODO: early return if none? |
| if let Some(vsock) = &self.spec.vsock { |
fwiw, i think the TODO is as easy as changing this to
| // TODO: early return if none? |
| if let Some(vsock) = &self.spec.vsock { |
| // TODO: early return if none? |
| let Some(vsock) = &self.spec.vsock else { return; }; |
and then un-indenting everything else in the function basically.
| )]; |
| |
| let guest_cid = GuestCid::try_from(vsock.spec.guest_cid) |
| .context("guest cid")?; |
not super important, but this context string could probably be more descriptive
| let guest_cid = GuestCid::try_from(vsock.spec.guest_cid) |
| .context("guest cid")?; |
| // While the spec does not recommend how large the virtio descriptor |
| // table should be we sized this appropriately in testing so |
turbo nit:
| // table should be we sized this appropriately in testing, so |
| // In a rack we only configure propolis-server with zero or |
| // one boot disks. It's possible to provide a fuller list, |
| // and in the future the product may actually expose such a |
| // capability. At that time, we'll need to have a reckoning |
| // for what "boot disk measurement" from the RoT actually |
| // means; it probably "should" be "the measurement of the |
| // disk that EDK2 decided to boot into", but that |
| // communication to and from the guest is a little more |
| // complicated than we want or need to build out today. |
| // |
| // Since as the system exists we either have no specific |
| // boot disk (and don't know where the guest is expected to |
| // end up), or one boot disk (and can determine which disk |
| // to collect a measurement of before even running guest |
| // firmware), we encode this expectation up front. If the |
| // product has changed such that this assert is reached, |
| // "that's exciting!" and "sorry for crashing your |
| // Propolis". |
| panic!( |
| "Unsupported VM RoT configuration: \ |
| more than one boot disk" |
| ); |
what is the rationale for making this a panic rather than a MachineInitError? would that be easier to debug if this was hit someday later?
| Some(backend.clone_volume()) |
| } else { |
| // Disk must be read-only to be used for attestation. |
| slog::info!(self.log, "boot disk is not read-only"); |
maybe this should explicitly state that this means it will not be attested?
| #[derive(Debug)] |
| enum AttestationInitState { |
| Preparing { |
| vm_conf_send: oneshot::Sender<VmInstanceConf>, |
| }, |
| /// A transient state while we're getting the initializer ready, having |
| /// taken `Preparing` and its `vm_conf_send`, but before we've got a |
| /// `JoinHandle` to track as running. |
| Initializing, |
| Running { |
| init_task: JoinHandle<()>, |
| }, |
| } |
| |
| /// This struct manages providing the requisite data for a corresponding |
| /// `AttestationSock` to become fully functional. |
| pub struct AttestationSockInit { |
| log: slog::Logger, |
| vm_conf_send: oneshot::Sender<VmInstanceConf>, |
| uuid: uuid::Uuid, |
| volume_ref: Option<crucible::Volume>, |
| } |
| |
| impl AttestationSockInit { |
| /// Do any any remaining work of collecting VM RoT measurements in support |
| /// of this VM's attestation server. |
| pub async fn run(self) { |
| let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self; |
| |
| let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None }; |
| |
| if let Some(volume) = volume_ref { |
| // TODO(jph): make propolis issue, link to #1078 and add a log line |
| // TODO: load-bearing sleep: we have a Crucible volume, but we can |
| // be here and chomping at the bit to get a digest calculation |
| // started well before the volume has been activated; in |
| // `propolis-server` we need to wait for at least a subsequent |
| // instance start. Similar to the scrub task for Crucible disks, |
| // delay some number of seconds in the hopes that activation is done |
| // promptly. |
| // |
| // This should be replaced by awaiting for some kind of actual |
| // "activated" signal. |
| tokio::time::sleep(std::time::Duration::from_secs(10)).await; |
| |
| let boot_digest = |
| match crate::attestation::boot_digest::boot_disk_digest( |
| volume, &log, |
| ) |
| .await |
| { |
| Ok(digest) => digest, |
| Err(e) => { |
| // a panic here is unfortunate, but helps us debug for |
| // now; if the digest calculation fails it may be some |
| // retryable issue that a guest OS would survive. but |
| // panicking here means we've stopped Propolis at the |
| // actual error, rather than noticing the |
| // `vm_conf_sender` having dropped elsewhere. |
| panic!("failed to compute boot disk digest: {e:?}"); |
| } |
| }; |
| |
| vm_conf.boot_digest = Some(boot_digest); |
| } else { |
| slog::warn!(log, "not computing boot disk digest"); |
| } |
| |
| let send_res = vm_conf_send.send(vm_conf); |
| if let Err(_) = send_res { |
| slog::error!( |
| log, |
| "attestation server is not listening for its config?" |
| ); |
| } |
| } |
| } |
Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?
I'd be happy to throw together a patch that does that refactoring if it's too annoying.
| let response = vm_attest::Response::Error( |
| "VmInstanceConf not ready".to_string(), |
| ); |
| //let mut response = |
| //serde_json::to_string(&response)?; |
| //response.push('\n'); |
| response |
unimportant nit:
| let response = vm_attest::Response::Error( |
| "VmInstanceConf not ready".to_string(), |
| ); |
| //let mut response = |
| //serde_json::to_string(&response)?; |
| //response.push('\n'); |
| response |
| vm_attest::Response::Error( |
| "VmInstanceConf not ready".to_string(), |
| ) |
| let vm_conf = Arc::new(Mutex::new(None)); |
| |
| let log_ref = log.clone(); |
| let vm_conf_cloned = vm_conf.clone(); |
| tokio::spawn(async move { |
| match vm_conf_recv.await { |
| Ok(conf) => { |
| *vm_conf_cloned.lock().unwrap() = Some(conf); |
| } |
| Err(_e) => { |
| slog::warn!( |
| log_ref, |
| "lost boot digest sender, \ |
| hopefully Propolis is stopping" |
| ); |
| } |
| } |
| }); |
actually, upon looking slightly closer at things, it kinda feels like all of this stuff with the vm_conf_recv oneshot and the task that waits for that and then stuffs it into a mutex could all be avoided if vm_conf was a tokio::sync::watch channel...AFAICT, once it gets stuck in there once, it's never mutated again, so does it really need to be a Mutex for all time?
| let mut buffer = |
| Buffer::new(this_block_count as usize, block_size as usize); |
| |
| // TODO(jph): We don't want to panic in the case of a failed read. How |
I still need to do this and test on dublin.
| // License, v. 2.0. If a copy of the MPL was not distributed with this |
| // file, You can obtain one at https://mozilla.org/MPL/2.0/. |
| |
| //! TODO: block comment |
closes #1067
TODO: