Clean up ereport data flow in bundle collection#10500
Conversation
Replaces `BundleCollection.bundle: SupportBundle` with a slim
`BundleInfo { id, reason_for_creation }`. Moves the sled-storage
chunked transfer (`store_bundle_on_sled`), zip helpers
(`bundle_to_zipfile`, `recursively_add_directory_to_zipfile`,
`sha2_hash`), the `CHUNK_SIZE` and `TEMPDIR` constants, and the
DB-polling cancellation (`check_for_cancellation`) out of the inner
`support_bundle/` module and into `support_bundle_collector.rs`.
After this change the inner layer is a pure mechanism: it never reads
the `support_bundle` DB row, never talks to a sled-agent's bundle
storage endpoints, and treats CRDB only as a source of facts about
sleds, ereports, and blueprints. The outer collector remains the
manager of the bundle lifecycle.
This is the first step toward a future shared crate that omdb can use
to collect bundles when Nexus is down.
Lifts `nexus/src/app/background/tasks/support_bundle/` (the mechanism layer) into a new top-level crate `support-bundle-collection` so that both Nexus and omdb can call it. No logic changes; pure relocation plus import rewriting.
Wires a new subcommand on omdb that calls into the `support-bundle-collection` crate to gather a bundle locally. Unlike the Nexus background task, this path does not register a row in the `support_bundle` table, does not transfer the bundle to a sled agent, and does not require Nexus to be up — it only needs CRDB, internal DNS, MGS, and sled-agents reachable on the underlay. This is intended for incident response: when Nexus is down (the most important time to gather a bundle), an operator can still produce one locally.
hawkw
left a comment
There was a problem hiding this comment.
this makes sense to me! i'm not sure if erebor is actually the formatter you want for displaying things in omdb though...
| "{}", | ||
| erebor::Displayer::new(details) | ||
| .with_initial_indent_spaces(8) | ||
| ); |
There was a problem hiding this comment.
n.b. that this will display all integers in hex, which may not be what you want here --- maybe use the other indented JSON display thingy we have lying around?
| warn!( | ||
| self.log, | ||
| "Failed to write report file"; | ||
| "error" => ?err |
There was a problem hiding this comment.
should this be an InlineErrorChain or something?
There was a problem hiding this comment.
also, does this error make it into the bg task status output someplace? should it?
| tokio::fs::create_dir_all(&meta_dir).await.with_context(|| { | ||
| format!("Failed to create meta directory {meta_dir}") | ||
| })?; | ||
|
|
||
| let report_path = meta_dir.join("report.json"); | ||
| let report_content = serde_json::to_string_pretty(report) | ||
| .context("Failed to serialize collection report")?; | ||
|
|
||
| tokio::fs::write(&report_path, report_content).await.with_context( | ||
| || format!("Failed to write report file to {report_path}"), | ||
| )?; |
There was a problem hiding this comment.
turbo nitpick: usually errors are lowercased:
| tokio::fs::create_dir_all(&meta_dir).await.with_context(|| { | |
| format!("Failed to create meta directory {meta_dir}") | |
| })?; | |
| let report_path = meta_dir.join("report.json"); | |
| let report_content = serde_json::to_string_pretty(report) | |
| .context("Failed to serialize collection report")?; | |
| tokio::fs::write(&report_path, report_content).await.with_context( | |
| || format!("Failed to write report file to {report_path}"), | |
| )?; | |
| tokio::fs::create_dir_all(&meta_dir).await.with_context(|| { | |
| format!("failed to create meta directory {meta_dir}") | |
| })?; | |
| let report_path = meta_dir.join("report.json"); | |
| let report_content = serde_json::to_string_pretty(report) | |
| .context("failed to serialize collection report")?; | |
| tokio::fs::write(&report_path, report_content).await.with_context( | |
| || format!("failed to write report file to {report_path}"), | |
| )?; |
| let step = collection | ||
| .steps | ||
| .iter() | ||
| .find(|s| s.name == SupportBundleCollectionStep::STEP_EREPORTS) | ||
| .expect("should have ereports step"); |
There was a problem hiding this comment.
this makes me wonder if steps should be an iddqd::IdOrdMap sorted by step names, so that we can look things up in it by key. this would also allow us to serialize it as an array while ensuring that it's always in lexicographic order, which seems maybe nice for making the omdb output consistent every time? dunno if this really matters though.
|
|
||
| Ok(CollectionStepOutput::Ereports(status)) | ||
| let details = serde_json::to_value(&status) | ||
| .context("failed to serialize ereport collection status")?; |
There was a problem hiding this comment.
this really shouldn't happen, right? not that i think we should panic here, but...
This PR addresses feedback from #10376 (comment)
It performs a few small-but-related changes that clean up the data flow for support bundle collection:
SupportBundleCollectionReportandSupportBundleActivationReport. The "collection report" is the sequence of steps that were involved in collecting the bundle itself. The activation report includes that, but also auxiliary information about "storing and activating the bundle" durably within Nexus. By distinguishing these two, we can (and now, do) store the "collection report" within the bundle itself. For context: Seewrite_report_file.CollectionStepOutputenum adds a variant calledDetailsthat allows any JSON value. We convertereportsto use this, and pretty-print that JSON. This allows parsers of bundle output - like omdb - to only act on the generic "bundle step" structures, rather than embedding knowledge about more step-specific output (like ereports were doing before).