[fm] Add simple disk diagnoser based on zpool health #10460

Open
smklein wants to merge 20 commits into main from fm-disk-diagnoser


smklein (Collaborator) commented May 19, 2026

The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

  • DiagnosisEngineKind::Disk variant (Rust + DB enum)
  • fm_case_fact child table for per-engine state (one case has 0..N
    immutable facts; stable UUIDs across sitreps; participates in
    copy-forward + GC like other sitrep child tables)
  • CaseBuilder::{add_fact, remove_fact, facts} API
  • InServiceDisk nexus-types projection consumed by FM, populated from
    the existing zpool_list_all_external_batched datastore method with
    policy filtering done in the background task

pub(super) fn analyze(
    input: &Input,
    builder: &mut SitrepBuilder<'_>,
) -> anyhow::Result<()> {
Collaborator Author

So, the whole point of this PR is "to be able to build something here, and re-use it", but ironically the contents of this particular DE are especially prone to change.

The "short version" of what we're doing:

  • Look at inventory, DB state, old sitreps
  • Make sure a case exists for each unhealthy zpool, with a corresponding "DiskFact"
  • Close old cases if their zpool is now healthy (or expunged)

We're doing this with a jumble of indices, iterations, etc. I think those will change. I think this DE will grow to track other state about these disks. I think each of these cases will potentially grow to have different facts.
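
A minimal sketch of the "short version" above, using only the standard library. The `ZpoolId`, `Health`, `Action`, and `reconcile` names are hypothetical stand-ins for illustration and are not the types or functions in this PR; the real DE works through `Input` and `SitrepBuilder` and records a `DiskFact` on each case.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-in for the real zpool UUID type.
type ZpoolId = u32;

// Hypothetical, simplified health enum; only `Online` counts as healthy.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Health {
    Online,
    Degraded,
    Faulted,
}

// What the sketch decides to do for a given zpool.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Open(ZpoolId, Health),
    Close(ZpoolId),
}

/// Decide which cases to open or close for one analysis pass.
///
/// * `observed`: zpool health from the latest inventory (may be lossy).
/// * `in_service`: zpools whose backing disk is in service per the database.
/// * `open_cases`: zpools that already have an open case in the parent sitrep.
fn reconcile(
    observed: &BTreeMap<ZpoolId, Health>,
    in_service: &BTreeSet<ZpoolId>,
    open_cases: &BTreeSet<ZpoolId>,
) -> Vec<Action> {
    let mut actions = Vec::new();

    // Open a case for each in-service, non-Online zpool that lacks one.
    for (id, health) in observed {
        if *health != Health::Online
            && in_service.contains(id)
            && !open_cases.contains(id)
        {
            actions.push(Action::Open(*id, *health));
        }
    }

    // Close a case when its zpool is Online again or no longer in service.
    // A zpool that is merely missing from inventory is not treated as
    // recovered.
    for id in open_cases {
        let recovered = observed.get(id) == Some(&Health::Online);
        let expunged = !in_service.contains(id);
        if recovered || expunged {
            actions.push(Action::Close(*id));
        }
    }

    actions
}
```

In this sketch a zpool with an open case that is still in service but absent from `observed` produces no action, so its case stays open until inventory reports it `Online` or the disk leaves the in-service set.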

Comment thread nexus/db-queries/src/db/datastore/fm.rs Outdated

/// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep,
/// grouped by `case_id`.
async fn fm_case_facts_read_on_conn(
Collaborator Author

By reading facts alongside cases, there isn't really a need to mark "DE" on the fact table, so I removed it. It's redundant data anyway.

(Figured I'd mention this because it diverges slightly from the DB structure we talked about - but still sorts facts into case-specific buckets, so we can still "parse by the case DE type").

Member

Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.

Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...

Collaborator Author

Couldn't each of those DEs just make a duplicate copy of that fact in their own cases? That seems like it helps keep fact lifecycle scoped "per-case", which is what we want.

I really hesitate to include this data "just to have it" because then it means we need to handle the case where "fact.de != fact.case.de", which is an impossible data corruption case we could just avoid by omitting the column.

Collaborator Author (smklein, May 19, 2026)

Specifically with your data-loss case: my main argument is that "facts are associated with cases", regardless of how they're generated.

So: in the case where we have "DE 1 which does something, but wants to write down a fact for a case managed by DE 2" - I think we can make this happen in-memory during sitrep construction, but on-disk, this could look like:

  • DE1 has a case C1, queries for data
  • (next sitrep) DE1 sees new data for C1, decides to open a case C2 for analysis by a different DE (DE2). It can also pass along a fact for C2
  • On-disk: That fact is associated with C2. We could have a "comment" about how it was originally noticed by DE1/C1? But that origination doesn't really matter

Member

Yeah, what I am getting at here is the question of, how do we expect the "handing off" of data from one DE to analysis by another DE to work if we are expecting that the two DEs will never try to read facts from cases that they don't own.

I think the idea of having DE1 open a new case for DE2, with facts whose schemas come from DE2's fact schemas (as you described in #10460 (comment)), seems like a reasonable approach. That was precisely the kind of thing I was hoping to work out a design for, and I feel like this is a reasonable one.

@smklein smklein force-pushed the fm-disk-diagnoser branch 2 times, most recently from a3cddcc to 26f2ade Compare May 19, 2026 01:28
andrewjstone (Contributor) left a comment

It's really exciting to see this coming together!

I think it makes sense to use JSON for payloads in the DB due to the explosion in types as discussed in chat. I wonder about the versioning strategy though. The DEs in Nexus are the only things that need to interpret payloads, but they are essentially client-side versioned. During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DEs just ignore payloads they can't understand?

smklein (Collaborator Author) commented May 19, 2026

> During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DEs just ignore payloads they can't understand?

Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Regardless, there are a bunch of strategies we could use for doing "fact payload" schema migration:

  • We could use the existing DB migration tools, to perform "data-only" migrations (look at all fm_case_fact rows, where diagnosis_engine = x, and where payload->variant = y, and re-write the payload).
  • We could rely on the re-generation of sitreps to have a phase where we load "old facts, and update them to new fact format". e.g. CaseFact::VariantFoo1 could be read, and in-memory updated to CaseFact::VariantFoo2, which gets written out in the next sitrep.
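
The second strategy could look roughly like the sketch below. It reuses the `CaseFact::VariantFoo1` and `CaseFact::VariantFoo2` names from the bullet above, but the fields, the `upgrade` method, and the `upgrade_all` helper are invented for illustration; the real facts are JSON payloads and nothing here reflects their actual schema.

```rust
// Hypothetical fact payloads. The variant names echo the discussion above;
// the fields are made up for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
enum CaseFact {
    VariantFoo1 { zpool: String },
    VariantFoo2 { zpool: String, health: Option<String> },
}

impl CaseFact {
    /// Upgrade an old payload to the newest variant. Payloads that are
    /// already in the newest format pass through unchanged.
    fn upgrade(self) -> CaseFact {
        match self {
            CaseFact::VariantFoo1 { zpool } => {
                // The old format had no health field, so it is left unset.
                CaseFact::VariantFoo2 { zpool, health: None }
            }
            newest @ CaseFact::VariantFoo2 { .. } => newest,
        }
    }
}

/// Upgrade every fact read from the parent sitrep, so that the next sitrep
/// is written out using only the newest format.
fn upgrade_all(facts: Vec<CaseFact>) -> Vec<CaseFact> {
    facts.into_iter().map(CaseFact::upgrade).collect()
}
```

One property of this approach, under the assumptions above, is that old variants only need to stay readable until every open case has been carried through at least one sitrep generation by the new Nexus.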

@smklein smklein force-pushed the fm-disk-diagnoser branch from 26f2ade to 67b661f Compare May 19, 2026 16:28
The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

- DiagnosisEngineKind::Disk variant (Rust + DB enum)
- fm_case_fact child table for per-engine state (one case has 0..N
  immutable facts; stable UUIDs across sitreps; participates in
  copy-forward + GC like other sitrep child tables)
- CaseBuilder::{add_fact, remove_fact, facts} API
- InServiceDisk nexus-types projection consumed by FM, populated from
  the existing zpool_list_all_external_batched datastore method with
  policy filtering done in the background task

Schema migration: add-disk-de-and-facts (version 260) adds the 'disk'
enum value and creates fm_case_fact.
@smklein smklein force-pushed the fm-disk-diagnoser branch from 67b661f to 793b1ec Compare May 19, 2026 17:12
@hawkw hawkw self-requested a review May 19, 2026 17:31
andrewjstone (Contributor) commented May 19, 2026

> Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Ah, I must be misunderstanding how payloads get populated. I was presuming that it's possible for the ingester of the payload to write to the database without actually knowing the format of the payload. But if we limit ingestion of new payloads until Nexus is updated, then I agree there is no problem.

hawkw (Member) left a comment

Here's an incomplete review focusing on the database models and domain types; I haven't actually gotten as far as the actual diagnosis engine yet. I figured it would be more useful to leave a smaller review sooner rather than waiting to get to the "other half" of this PR.

Comment thread nexus/fm/src/builder/case.rs Outdated
Comment thread nexus/fm/src/builder/case.rs Outdated
Comment thread nexus/fm/src/builder/case.rs
Comment thread nexus/fm/src/builder/case.rs Outdated
Comment thread nexus/types/src/fm/case.rs Outdated
Comment thread nexus/types/src/fm/case.rs Outdated
Comment thread schema/crdb/fm-disk-de-and-facts/up2.sql
let mut support_bundles_requested = Vec::new();
let mut bundle_data_selections_requested = Vec::new();
let mut case_ereports = Vec::new();
let mut case_facts = Vec::new();
Member

would be nice to be able to with_capacity this to be as long as the case's facts map...but i also notice we are not doing this for any of the other ones so it's kinda fine i guess...

Comment thread nexus/fm/src/analysis_input.rs
smklein added a commit that referenced this pull request May 20, 2026
Split out from #10460 per review feedback.

Renames the `Input::cases()` accessor to `Input::open_cases()`. The
struct already tracked open and closed-copied-forward cases separately
in private fields; this just makes the public accessor name reflect
that, and adds a short doc comment pointing at the (crate-private)
`closed_cases_copied_forward()` accessor for the other half.
name: &str,
pattern: &str,
) -> &mut Self {
pub fn variable_regex(&mut self, name: &str, pattern: &str) -> &mut Self {
Member

huh, i guess rustfmt just decided it wanted to change this for...some reason?

Collaborator Author

¯_(ツ)_/¯

Comment thread nexus/fm/src/builder/case.rs
Comment thread nexus/types/src/fm/case.rs Outdated
Comment thread nexus/db-model/src/fm/case.rs Outdated
Comment thread nexus/db-model/src/fm/case.rs Outdated
hawkw (Member) left a comment

i'm starting to try and wrap my head around what the diagnosis engine is actually doing. overall, i think this seems good so far, but i want to spend some more time thinking through how it currently works, especially the way it intersects with some of our design decisions around how facts work...

Comment thread nexus/fm/src/diagnosis/physical_disk.rs
Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
/// recorded zpool. There can be multiple facts in pathological cases
/// (e.g., two zpool ids on the same case after a hand-edit); the
/// diagnoser keeps all of them in its accounting.
zpool_unhealthy: BTreeMap<ZpoolUuid, Vec<(FactUuid, ZpoolHealth)>>,
Member

personally i kind of feel like i might bite the bullet and make this iddqd-y, but i suppose that requires making a lot more structs

Member

also, nitpickily, i might consider calling it "unhealthy_zpools" or something, since it is an index of zpools by UUID...

Collaborator Author

Updated to use iddqd. Not sure if I like this more or less, honestly?

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
) -> anyhow::Result<()> {
// The disk DE's primary key today is `zpool_id`, so we build a local
// index keyed by zpool. Future variants of `DiskFact` are welcome to
// derive their own secondary indices (e.g., by `sled_id` for FMD).
Member

the sentence "the disk DE's primary key" feels a bit weird to me, i feel like when i see "primary key" i take that to mean we are talking about a database table..

Collaborator Author

Changed the wording here; I really do care about "key" but PK is definitely DB terminology we don't need.

Comment thread nexus/fm/src/diagnosis/physical_disk.rs Outdated
Comment on lines +46 to +65
// Index every zpool we observed in this inventory by ID, so we can
// distinguish "saw it, it's Online" from "didn't see it at all" below.
let observed: BTreeMap<ZpoolUuid, ZpoolHealth> = input
    .inventory()
    .sled_agents
    .iter()
    .flat_map(|sa| sa.zpools.iter())
    .map(|z| (z.id, z.health))
    .collect();

// Currently-faulty, control-plane-managed zpools.
//
// Out-of-service zpools are intentionally ignored: a non-`Online` zpool
// whose disk has been expunged is no longer the control plane's concern.
let faulty: BTreeMap<ZpoolUuid, ZpoolHealth> = observed
    .iter()
    .filter(|(id, _)| in_service_by_zpool.contains_key(*id))
    .filter(|(_, h)| **h != ZpoolHealth::Online)
    .map(|(id, h)| (*id, *h))
    .collect();
Member

hm, okay, this code all makes me feel like ZpoolHealth ought to be a struct with a ZpoolUuid in it, and use iddqd::IdOrdMap for these. Or, perhaps IdMap; I don't think we care about ordering here as we are not printing these out or serializing them.

Collaborator Author

Done in 6ae7bc6

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
Comment on lines +67 to +71
// Inspect parent-forwarded Disk cases from the input (i.e., the state
// copied from the parent sitrep — *not* the in-progress builder, which
// we will mutate below). Each case's facts are JSON blobs owned by this
// engine; deserialize each one as DiskFact. Skip (with a warning) any
// fact we can't read.
Member

i don't think it's necessary to restate the explanation of what facts are in this comment, though maybe that's somewhat valuable if this DE is intended as a sort of prototypical example DE. This feels a bit like "claude decided to restate his prompt again" to me though, which always rubs me wrong...but maybe that's just because I'm an old man yelling at claudes.

Collaborator Author

Dropping it.

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
Comment on lines +109 to +110
// points at zpools that are now Online or expunged. Closed cases are not
// copied forward, so their facts naturally drop with them.
Member

comment about "closed cases" feels unnecessary to me, it's restating the semantics of sitreps which are documented at a higher level. not a big deal but yet again feels claudey in a way that makes me irrationally irritated 🤷‍♀️

Collaborator Author

Removed

Member

j'adore!

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
Comment on lines +128 to +131
.close(
"all ZpoolUnhealthy facts have resolved (zpool back to \
Online, or disk no longer in service)",
);
Member

is it possible to do this in a way where the comment closing the case could actually state the cause (either "zpool back to online" or "disk no longer in service"?)
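
One way this could be approached is sketched below, assuming a case tracks a single zpool. `CloseReason`, `comment`, and `close_reason` are hypothetical names for illustration only; the code under review passes a fixed string to `.close(...)`, and a case carrying several facts would need to combine reasons in some way not shown here.

```rust
/// Hypothetical enum naming the two ways a disk case can resolve.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CloseReason {
    ZpoolBackOnline,
    DiskNoLongerInService,
}

impl CloseReason {
    /// Human-readable comment to record when closing the case.
    fn comment(self) -> &'static str {
        match self {
            CloseReason::ZpoolBackOnline => "zpool is back to Online",
            CloseReason::DiskNoLongerInService => {
                "disk is no longer in service"
            }
        }
    }
}

/// Returns why the case can close, or `None` if it must stay open.
///
/// `observed_online` is `None` when the zpool was absent from inventory;
/// absence alone never closes a case in this sketch.
fn close_reason(
    in_service: bool,
    observed_online: Option<bool>,
) -> Option<CloseReason> {
    if !in_service {
        return Some(CloseReason::DiskNoLongerInService);
    }
    match observed_online {
        Some(true) => Some(CloseReason::ZpoolBackOnline),
        _ => None,
    }
}
```

The sketch checks the in-service set first, so a disk that is both expunged and reported `Online` closes with the expungement reason; that ordering is an arbitrary choice made here, not something the PR specifies.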

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
}
let any_still_unhealthy =
summary.zpool_unhealthy.keys().any(|zpool_id| {
in_service_by_zpool.contains_key(zpool_id)
Member

just making sure I understand this correctly: if a zpool ID is not present in in_service_by_zpool, that is an EXPLICIT, POSITIVE INDICATION that the disk or the sled it is in has been expunged, correct? And if the disk has not been expunged, it will be present in in_service_by_zpool no matter what bad thing happens to it?

I might want a comment here explaining that (maybe even instead of the above comment that "absence is not a recovery signal"). To the unfamiliar reader, this looks like it's checking for "absence from inventory" rather than "explicit signal of expungement", but my sense is that this is not actually what it's doing?

Member

looking at the input code, i am increasingly uncomfy about this and am starting to feel like i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent? but that might be because i don't know the semantics of the physical_disk/zpool tables well enough to know if what we're doing now is safe or not...

Collaborator Author

Yes, I am relying on the contents of in_service_by_zpool - really, the contents of in_service_disks, which we read during the preparation phase - to be the full set of in-service disks.

I explained in the comment above that we ignore "whether or not the thing is in inventory" as a signal about whether to update the disk - inventory is lossy, could have transient issues, etc.

Disk lifecycle goes through the following phases:

  1. Blueprint adds disk in-service
  2. Execution creates a PhysicalDisk row (with policy marked as "in-service", and state as "active")
  3. (Disk lifetime goes here)
  4. Blueprint later marks a disk as expunged
  5. Execution marks that PhysicalDisk row (policy is now "expunged")
  6. Expungement happens, and the "disk state" eventually is updated to "decommissioned"
  7. The PhysicalDisk row could presumably be deleted here. It isn't today, but it probably will be in the future. However, the decommissioned_disk_cleaner is already deleting the Zpool rows! So it's basically just a matter of "CRDB rows are effectively deleted here, more cleanup will happen".

The current PR already finds pools/disks of interest by joining on zpool, and filters by "disks that are in-service". This means:

If a physical disk is in steps (2)-(3), we'll treat it as "alive and observable". Otherwise, it is either being expunged or has already been expunged.
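
The filtering described above can be sketched as follows. `Policy`, `State`, `DiskRow`, and both helper functions are simplified, hypothetical stand-ins (the PR's own code checks `PhysicalDiskPolicy::InService`); the point is only that a row with an expunged policy and a row that has been deleted outright look the same to the DE.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified mirrors of the disk policy and state columns.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Policy {
    InService,
    Expunged,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Active,
    Decommissioned,
}

// Hypothetical joined zpool + physical disk row.
#[derive(Clone, Copy, Debug)]
struct DiskRow {
    zpool_id: u32,
    policy: Policy,
    // Carried only to show it plays no part in the filter below.
    state: State,
}

/// Build the "set of all in-service stuff": only rows whose policy is
/// in-service are kept, which corresponds to lifecycle steps (2)-(3).
fn in_service_zpools(rows: &[DiskRow]) -> BTreeSet<u32> {
    rows.iter()
        .filter(|row| row.policy == Policy::InService)
        .map(|row| row.zpool_id)
        .collect()
}

/// A zpool is treated as gone when it is not in the in-service set,
/// whether its row is marked expunged or no longer exists at all.
fn is_out_of_service(zpool_id: u32, in_service: &BTreeSet<u32>) -> bool {
    !in_service.contains(&zpool_id)
}
```

Because membership in the in-service set is the only signal, no case has to wait on a row that cleanup may already have removed.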

> i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent?

Yeah this would be sorta problematic, because now we can't delete zpools / physical disk rows until we have confirmed that all their associated cases have "finished up"! Otherwise, if we do expungement before the case gets to close, it'll be stuck open forever, waiting to see a "positive signal of decommission" that will never arrive.

We hit a bunch of these backward dependencies when we tried going through expungement, and it's a real pain in the ass. I have a preference for - when we can - stating: "this is the set of all in-service stuff", which is what we're currently doing here.

Member

okay, thanks for the explanation. i think this makes sense then!

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
.expect("unreadable case should be copied forward");
assert!(
unreadable.is_open(),
"unreadable case must not be closed by the diagnoser",
Member

Suggested change
- "unreadable case must not be closed by the diagnoser",
+ "unreadable case must not be closed by the diagnosis engine",

Collaborator Author

updated in 6ae7bc6, here and elsewhere

hawkw (Member) left a comment

still trying to grok the de stuff

Comment thread nexus/fm/src/analysis_input.rs
Comment on lines +242 to +244
if disk.disk_policy != PhysicalDiskPolicy::InService {
    continue;
}
Member

hm. so i am a bit worried here that we are not able to differentiate between disks which have been decommissioned and disks which have just been deleted from the table for "some reason" in the DE code that checks if an unhealthy disk is still in service? but maybe that's fine and we rely on the rest of the system to not mess with this in a way that will make us sad.

Collaborator Author

I'm not sure how to reconcile this. The database state is our representation of "what disks we consider to be in-service or not". Even the blueprint is just intent; the database rows actually enact that policy.

Coping with "a disk that gets deleted from CRDB for some reason" is akin to coping with arbitrary database corruption IMO. I am not sure I can reasonably accept inputs from CRDB for this DE if we are trying to model things in a byzantine way.

Member

yeah, your answer to my subsequent comment made me feel a bit less sketchy about this. i wasn't super familiar with the lifecycle of the disk tables, so walking through it was helpful. i think this is fine, thank you for clarifying stuff!

Comment on lines +235 to +239
let zpools_and_disks = self
    .datastore
    .zpool_list_all_external_batched(opctx)
    .await
    .context("failed to load in-service control plane disks")?;
Member

zpool_list_all_external_batched has a comment on it saying, essentially, that this can take a while. i kinda wonder if, since this and the ereport loading code both could load a lot of data and are basically completely isolated from each other, might we want to spawn separate tokio tasks to do the collection of those different inputs in parallel?

Member

(to be clear, this is not a blocker for this PR, just a thought)

Collaborator Author

I think this is probably a good idea. I suspect basically all the "preparation" logic for reading from DB, potentially reading from clickhouse, etc, etc, etc, before we start analysis can be parallelized.

Collaborator Author

Also, to be clear, not doing it in this PR

Comment on lines +43 to +44
/// All control plane managed disks
in_service_disks: Arc<IdOrdMap<InServiceDisk>>,
Member

it looks like the list of in service disks never makes it into the AnalysisInputReport. do we want it to, or is that too much data to spit out in every Status object from every activation of the analysis task? might we want to at least summarize it with say, the number of in-service disks? that might help to spot some obviously weird things such as an analysis pass that loaded 0 in service disks...?

Collaborator Author

Added in 6ae7bc6 (printing some basic UUID info)

smklein (Collaborator Author) commented May 20, 2026

Couple thoughts on the DE in particular:

  • Since "cases" have no concept of identity aside from their facts - basically, "which disk are you working on" - theoretically, a single case could be full of facts about different disks. This would be... bad? Probably?
  • So there is kinda a concept of like, "what is the identity of the resource we are building a case about" that is kinda implied by facts today. Perhaps that's okay? It's flexible? but it also allows the data to model impossible situations.

This may be justification for a "sitrep version of blippy/clippy". Slippy. Which validates "this case has facts which are relevant, parseable, and coherent".
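
A rough sketch of what such a check might cover, assuming every fact in a disk case should be parseable and refer to the same zpool. `ParsedFact`, `Lint`, and `lint_case` are hypothetical names and do not exist in this PR.

```rust
use std::collections::BTreeSet;

// Hypothetical result of trying to parse one fact payload.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ParsedFact {
    ZpoolUnhealthy { zpool_id: u32 },
    Unreadable,
}

// Hypothetical findings a "slippy" pass could report for one case.
#[derive(Debug, PartialEq, Eq)]
enum Lint {
    /// The fact at this position could not be parsed.
    UnreadableFact { index: usize },
    /// The case holds facts about more than one zpool.
    MultipleSubjects { zpool_ids: Vec<u32> },
}

/// Check that a case's facts are parseable and all about the same zpool.
fn lint_case(facts: &[ParsedFact]) -> Vec<Lint> {
    let mut lints = Vec::new();
    let mut subjects = BTreeSet::new();

    for (index, fact) in facts.iter().enumerate() {
        match fact {
            ParsedFact::ZpoolUnhealthy { zpool_id } => {
                subjects.insert(*zpool_id);
            }
            ParsedFact::Unreadable => {
                lints.push(Lint::UnreadableFact { index });
            }
        }
    }

    // More than one distinct zpool means the case has no single identity.
    if subjects.len() > 1 {
        lints.push(Lint::MultipleSubjects {
            zpool_ids: subjects.into_iter().collect(),
        });
    }

    lints
}
```

A pass like this only reports problems; whether an incoherent case should be split, closed, or left alone is a separate decision not addressed here.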

@AlejandroME AlejandroME added the fault-management Everything related to the fault-management initiative (RFD480 and others) label May 21, 2026
@AlejandroME AlejandroME added this to the 21 milestone May 27, 2026
Comment thread nexus/src/lib.rs Outdated
Comment on lines +531 to +534
Arc<(
    nexus_types::fm::SitrepVersion,
    nexus_types::fm::Sitrep,
)>,
Member

don't we have a type alias for this someplace? can we use that here so that the type is a little bit less horrid?

Collaborator Author

We do; I had to reshuffle things a bit to use it, but I'm using that alias in 2accb87

Comment on lines +41 to +45
/// Inventory collection that produced this observation. Recorded for
/// provenance: if multiple `ZpoolUnhealthy` facts ever end up on the
/// same case, this lets a human reader see which inventory each came
/// from.
pub observed_in_inv: CollectionUuid,
Member

perhaps this ought to have an explicit note stating that we don't expect the DE to ever actually look back at past inventory collections, because they may have been GC'd?

on the other hand i suppose we might want to use this to check if a fact was created with a previous iteration of the DE that had the same input inventory as the current one?

Collaborator Author

> perhaps this ought to have an explicit note stating that we don't expect the DE to ever actually look back at past inventory collections, because they may have been GC'd?

Sure, I can add this

> on the other hand i suppose we might want to use this to check if a fact was created with a previous iteration of the DE that had the same input inventory as the current one?

I don't understand when we'd want to do this sorta check - I was adding this really for debugging/visibility only

Comment on lines +46 to +47
/// `time_done` of `observed_in_inv`.
pub time_observed: DateTime<Utc>,
Member

does the DE expect to use this field for e.g. ordering facts? or is it just for debugging?

Collaborator Author

I am currently expecting the only ordering we'll care about is "last recorded fact" vs "what is currently recorded in inventory" - with the implication that inventory is more recent.

So: it's really for debugging?

Member

great, maybe worth a comment?

Comment thread nexus/fm/src/diagnosis/physical_disk.rs Outdated
Comment on lines +67 to +69
/// One in-service disk paired with its current observed health.
/// `health` is `None` when the disk's zpool was not seen in the current
/// inventory (e.g., sled down, lossy collection).
Member

naming nit (sorry): this is really a zpool health snapshot paired with an in-service disk. it's not necessarily a snapshot of information about the health of the physical disk, which we are not currently collecting; it's indicative of the health of the zpool on the disk.

Member

tbh that might just mean changing the name of the health field to zpool_health, since i imagine the index we are building with this type might eventually have other observations about the disk's health from other sources also stored on this type...

Collaborator Author

Sure, happy to rename it to "zpool_health", I think that's more specific
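
To make the shape under discussion concrete, here is a sketch of an observation type with a `zpool_health` field and the open/close decision the PR description outlines. `ZpoolHealth`, `DiskObservation`, `CaseAction`, and `decide` are hypothetical stand-ins, not the PR's types, and treating an unseen zpool (`None`) as "leave things as they are" is an assumption made for this example.

```rust
/// Stand-in for the zpool health reported by inventory (variants assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZpoolHealth {
    Online,
    Degraded,
    Faulted,
}

/// One in-service disk paired with the observed health of its zpool.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DiskObservation {
    pub physical_disk_id: u64,
    /// `None` when the zpool was not seen in the current inventory.
    pub zpool_health: Option<ZpoolHealth>,
}

/// What the engine would do with the case for this disk.
#[derive(Debug, PartialEq, Eq)]
pub enum CaseAction {
    Open,
    KeepOpen,
    Close,
    Nothing,
}

/// Decides the case action from the current observation and prior state.
pub fn decide(obs: &DiskObservation, has_open_case: bool) -> CaseAction {
    match (obs.zpool_health, has_open_case) {
        // Recovery: the zpool is Online again, so an open case can close.
        (Some(ZpoolHealth::Online), true) => CaseAction::Close,
        (Some(ZpoolHealth::Online), false) => CaseAction::Nothing,
        // Any non-Online health keeps or opens a case.
        (Some(_), true) => CaseAction::KeepOpen,
        (Some(_), false) => CaseAction::Open,
        // Assumption: missing data is not evidence of recovery.
        (None, true) => CaseAction::KeepOpen,
        (None, false) => CaseAction::Nothing,
    }
}
```

Naming the field `zpool_health` leaves room for sibling fields carrying other per-disk observations without renaming the struct.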

Comment on lines +149 to +157
slog::warn!(
    &builder.log,
    "skipping Disk case: facts reference \
     different physical disks (1 expected)";
    "case_id" => %case_id,
    "expected_physical_disk_id" => %topic,
    "fact_physical_disk_id" =>
        %payload.physical_disk_id,
);
Member

hm, these warnings feel like the type of thing that it would also be nice to include in the AnalysisReport we construct for this sitrep, so that they can be attached to the sitrep in the DB...eventually i would like to extend the sitrep analysis report log thingy so that it will also slog log things for you, so you can get both in one place.

don't worry about fixing this for now, it's just something i would like to figure out a way to improve eventually.
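
One possible shape for that "record it in the report and log it in one call" idea, sketched with std only. `ReportLog` is a hypothetical type, and the closure sink stands in for a real logger; this is not how `AnalysisReport` works today.

```rust
/// Hypothetical report log: every warning is both stored for the report
/// and forwarded to a logging sink.
pub struct ReportLog<F: FnMut(&str)> {
    entries: Vec<String>,
    sink: F,
}

impl<F: FnMut(&str)> ReportLog<F> {
    /// Creates an empty report log that forwards messages to `sink`.
    pub fn new(sink: F) -> Self {
        Self { entries: Vec::new(), sink }
    }

    /// Logs the warning and keeps a copy for the stored report.
    pub fn warn(&mut self, msg: impl Into<String>) {
        let msg = msg.into();
        (self.sink)(&msg);
        self.entries.push(msg);
    }

    /// Warnings recorded so far, in insertion order.
    pub fn entries(&self) -> &[String] {
        &self.entries
    }
}
```

With a wrapper like this, a call site would only need `report.warn(...)`, and the two destinations could not drift apart.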

Comment thread nexus/fm/src/diagnosis/physical_disk.rs Outdated
did not deserialize as DiskFact";
"case_id" => %case_id,
"fact_id" => %fact.id,
"error" => InlineErrorChain::new(&*e).to_string(),
Member

i thought the point of InlineErrorChain was that it implemented slog's serialize trait and didn't have to be .to_string()ed?

Suggested change
- "error" => InlineErrorChain::new(&*e).to_string(),
+ "error" => InlineErrorChain::new(&*e),

Collaborator Author

Yeah, good call, dropping this
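
For readers unfamiliar with the pattern, here is a std-only sketch of a wrapper that formats an error and its `source()` chain lazily, so a caller can write it straight into an output buffer instead of building a `String` first. `ErrorChain` and `Layer` are hypothetical stand-ins; the real `InlineErrorChain` and its `slog` integration are not reproduced here, and the `": "` separator is an assumption.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for an inline error chain: borrows an error and
/// formats it together with its causes only when displayed.
pub struct ErrorChain<'a>(pub &'a (dyn Error + 'static));

impl fmt::Display for ErrorChain<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)?;
        // Walk the chain of causes, appending each one in order.
        let mut source = self.0.source();
        while let Some(cause) = source {
            write!(f, ": {}", cause)?;
            source = cause.source();
        }
        Ok(())
    }
}

/// Small example error with an optional underlying cause.
#[derive(Debug)]
pub struct Layer {
    pub msg: &'static str,
    pub cause: Option<Box<Layer>>,
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.msg)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.cause.as_deref().map(|c| c as &(dyn Error + 'static))
    }
}
```

Calling `.to_string()` on such a wrapper still works, but it forces an allocation that a logger able to consume the value directly does not need.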

smklein added 4 commits May 28, 2026 14:28
Addresses review comment: replace the spelled-out
watch::Receiver<Option<Arc<(SitrepVersion, Sitrep)>>> in the NexusServer
trait and its impl with the CurrentSitrep type alias, now hoisted into
nexus_types::fm so both the test-interface crate and omicron-nexus can
share it.

Addresses review comment: note explicitly that the disk diagnosis engine
never re-fetches the recorded inventory collection (it may have been GC'd);
the field exists purely for human/debug provenance.

Addresses review comment: the field reflects the health of the zpool on the
disk, not the physical disk itself. Renaming makes room for other per-disk
health observations on this type later, while keeping the struct name.

Addresses review comment: InlineErrorChain implements slog::Value, so it can
be logged without allocating an intermediate String.
hawkw added a commit that referenced this pull request May 29, 2026
`nexus_types::fm` has a little utility for formatting
`serde_json::Value`s as bulleted lists. Currently, this is only used for
formatting sitrep analysis reports. #10460 factored it out of the
`analysis_reports` module to use it for formatting case facts. While
working on #10505, I realized it would also be useful to have this for
use in `omdb` commands for reconstructing analysis reports from CRDB.
Since #10460 isn't going to merge until after R20, I figured it would be
helpful to just pull out the change to factor this out into its own
commit that both our branches could depend on.

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

4 participants