[ntp] collect debugging information#10489

Draft
karencfv wants to merge 2 commits into oxidecomputer:main from karencfv:ntp-debug-api

@karencfv
Contributor

Closes: #10407

Comment thread ntp-admin/api/src/lib.rs
// which CRDB panics early during control plane startup because the clocks
// are not synchronized well-enough. We're adding this as part of a
// two-phase rollout to get around #9290 for now.
(3, ADD_DEBUG_ENDPOINT),
Contributor Author

I realise the text above this says not to add new versions. If anyone has a better idea of how to surface all of this information, I'm all ears.

@karencfv
Contributor Author

Chatted about this during an update watercooler.

The scope of this work will focus on connectivity diagnostics, since every NTP issue to date has essentially been a connectivity problem. NTP admin should run the following checks on an interval (written in Rust rather than shelling out):

  • Can I reach the DNS server?
  • Can I resolve the upstream NTP server's name via DNS?
  • Can I reach the upstream NTP server (e.g. ICMP ping)?
  • Is this a boundary NTP server? (We most likely want to do slightly different checks for the internal ones)

A couple of caveats: the DNS resolution check has to handle the case where the upstream is specified by IP rather than name (the SMF properties allow either), and the ping check should track "last successful" because some customer NTP servers don't respond to ping at all (so a "no" only means something if there was a previous "yes").
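To make the two caveats concrete, here is a minimal sketch using only the standard library. The names `Upstream`, `PingTracker`, and `PingStatus` are hypothetical and not taken from the PR. It assumes the upstream arrives as a plain string from the SMF properties and that the result of each ping attempt is already known as a boolean.

```rust
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// How the upstream NTP server was specified (hypothetical type).
#[derive(Debug, PartialEq)]
enum Upstream {
    /// Given as a literal IPv4 or IPv6 address: no DNS lookup applies.
    Ip(IpAddr),
    /// Given as a name: the DNS resolution check applies.
    Name(String),
}

impl Upstream {
    /// Anything that parses as an IP address is treated as one;
    /// everything else is assumed to be a DNS name.
    fn parse(s: &str) -> Upstream {
        match s.parse::<IpAddr>() {
            Ok(ip) => Upstream::Ip(ip),
            Err(_) => Upstream::Name(s.to_string()),
        }
    }

    fn needs_dns(&self) -> bool {
        matches!(self, Upstream::Name(_))
    }
}

/// Interpretation of a single ping attempt given the history so far.
#[derive(Debug, PartialEq)]
enum PingStatus {
    Reachable,
    /// The server answered before and does not now: a meaningful "no".
    Lost { since_last_success: Duration },
    /// The server has never answered; it may simply ignore ICMP.
    NeverAnswered,
}

/// Remembers the last successful ping (hypothetical type).
#[derive(Default)]
struct PingTracker {
    last_success: Option<Instant>,
}

impl PingTracker {
    /// Record the outcome of one ping attempt observed at `now`.
    fn record(&mut self, ok: bool, now: Instant) -> PingStatus {
        if ok {
            self.last_success = Some(now);
            return PingStatus::Reachable;
        }
        match self.last_success {
            Some(t) => PingStatus::Lost {
                since_last_success: now.saturating_duration_since(t),
            },
            None => PingStatus::NeverAnswered,
        }
    }
}
```

Keeping `NeverAnswered` distinct from `Lost` is one way to avoid raising an alarm for customer NTP servers that never respond to ping in the first place.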

Implementation:

  • Get NTP admin collecting the data and logging it.
  • Expose it via an API endpoint (with a comment noting the client-side-versioning situation, similar to what Ben did recently).
  • Add an OMDB command to surface it.
  • Inventory integration is explicitly out of scope for now.
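As a rough illustration of the first step (collecting the data and logging it), the sketch below gathers one round of check results into a single report that can be written as a log line. Everything here is an assumption for illustration: the `Probes` trait, `Report`, `Outcome`, and `collect` are hypothetical names, and the real probes (DNS reachability, resolution, ICMP) would sit behind the trait rather than being implemented here.

```rust
use std::fmt;

/// Result of one check in a collection round (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Ok,
    Failed,
    /// The check did not apply, e.g. upstream configured by IP.
    Skipped,
}

/// Hypothetical abstraction over the network probes, so the
/// collection logic can be exercised without real network I/O.
trait Probes {
    fn dns_server_reachable(&mut self) -> bool;
    /// `None` when the upstream is an IP address and needs no lookup.
    fn upstream_resolves(&mut self) -> Option<bool>;
    fn upstream_reachable(&mut self) -> bool;
}

/// One round of connectivity diagnostics.
#[derive(Debug, PartialEq)]
struct Report {
    boundary: bool,
    dns_server: Outcome,
    resolution: Outcome,
    upstream: Outcome,
}

fn outcome(ok: bool) -> Outcome {
    if ok { Outcome::Ok } else { Outcome::Failed }
}

/// Run every probe once and gather the results.
fn collect(boundary: bool, probes: &mut dyn Probes) -> Report {
    Report {
        boundary,
        dns_server: outcome(probes.dns_server_reachable()),
        resolution: probes
            .upstream_resolves()
            .map_or(Outcome::Skipped, outcome),
        upstream: outcome(probes.upstream_reachable()),
    }
}

impl fmt::Display for Report {
    /// Render the report as a single key=value log line.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "boundary={} dns_server={:?} resolution={:?} upstream={:?}",
            self.boundary, self.dns_server, self.resolution, self.upstream
        )
    }
}
```

A caller would invoke `collect` on whatever interval is chosen and log the result. The same `Report` value could later back the API endpoint and the OMDB command, though how those are wired up is not specified in this thread.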

To consider:

The chrony.conf file should probably just go into the support bundle rather than being logged on an interval.

Development

Successfully merging this pull request may close these issues.

Collect NTP zone debugging data
