On dogfood, we observed that two CRDB nodes restarted on the morning of 2026-05-20:
root@oxz_switch1:~# pilot host exec -c 'zlogin $(zoneadm list | grep cockroach) ps -ef -o comm,stime | grep "cockroach "' 9 10 14 16 17
9 BRM44220005 ok: /opt/oxide/cockroachdb/bin/cockroach 04:37:28
10 BRM42220009 ok: /opt/oxide/cockroachdb/bin/cockroach 04:41:33
14 BRM42220051 ok: /opt/oxide/cockroachdb/bin/cockroach May_16
16 BRM42220014 ok: /opt/oxide/cockroachdb/bin/cockroach May_16
17 BRM42220017 ok: /opt/oxide/cockroachdb/bin/cockroach May_16
The last few seconds of the logs prior to the restart are unfortunately not present in the archived cockroach.log files, but in the time leading up it we see warnings that the node may be overloaded:
I260520 04:37:14.565540 735 kv/kvserver/store_raft.go:522 ⋮ [n2,s2,r1105/1:‹/System/tsd/cr.node.{kv…-rp…}›,raft] 235892 raft ready handling: 3.69s [append=0.00s, apply=2.47s, commit-batch-sync=1.22s, other=0.00s], wrote 43 K
iB sync [append-ent=42 KiB (7), apply=42 KiB (7 in 2 batches)]; node might be overloaded
W260520 04:37:14.704457 77340944 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235893 slow heartbeat took 6.000507049s; err=context canceled
E260520 04:37:14.704524 77340944 kv/kvserver/replica_range_lease.go:427 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235894 failed to heartbeat own liveness record: context canceled
W260520 04:37:15.139560 899 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,liveness-hb] 235895 slow heartbeat took 4.500268177s; err=context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 failed node liveness heartbeat: <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)
<E2><80><BA>: context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +(1) <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)<E2><80><BA>
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Wraps: (2) context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +An inability to maintain liveness will prevent a node from participating in a
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +cluster. If this problem persists, it may be a sign of resource starvation or
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +of network connectivity problems. For help troubleshooting, visit:
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 + https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
The svcadm logs indicate the new process started at 04:37:26, 11 seconds after the last message logged.
BRM44220005 # head -3 /pool/ext/4eb2e4eb-41d8-496c-9a5a-687d7e004aa4/crypt/debug/oxz_cockroachdb_4c3ef132-ec83-4b1b-9574-7c7d3035f9e9/oxide-cockroachdb:default.log.1779251862
[ May 20 04:37:26 Stopping because all processes in service exited. ]
[ May 20 04:37:26 Executing stop method (:kill). ]
[ May 20 04:37:26 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]
There is no indication that unusually heavy queries were being executed by CRDB, but both of these nodes are hosted a U.2 drive that was initializing a local disk at the time of the crash.
On sled 9 CRDB is held on oxp_b358fb1e-f52a-4a63-9aab-170225509b37.
During this interval, sled-agent is logging that is it initializing a new local storage volume:
2026-05-20 04:37:01.176Z INFO SledAgent/658 (dropshot (SledAgent)) on BRM44220005: request completed
error_message_external = Service Unavailable
error_message_internal = Failed to ensure volume 'oxp_b358fb1e-f52a-4a63-9aab-170225509b37/local_storage_unencrypted/3ae6c3f2-f450-4924-9d3b-b9c96bab514a/vol': created but not ready yet: still zero initializing
file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.17.0/src/server.rs:862
latency_us = 6098673
local_addr = [fd00:1122:3344:105::1]:12345
method = POST
remote_addr = [fd00:1122:3344:101::89]:57137
req_id = 828a2954-e4dc-4eae-9f11-fbbcd14a24ff
response_code = 503
uri = /local-storage
The period these messages are logged matches the "overloaded" period in CRDB closely:
will@castle /staff/core/dogfood-crdb-crash-2026-05-20 $ grep 'T04:3' oxide-sled-agent:default.log.1779333004 | egrep 'oxp_b358fb1e-f52a-4a63-9aab-170225509b37.*still zero initializing' | jq -r .time
2026-05-20T04:35:43.632979477Z
2026-05-20T04:35:44.532078728Z
2026-05-20T04:35:45.901170989Z
2026-05-20T04:35:46.187545552Z
2026-05-20T04:35:47.160109624Z
2026-05-20T04:35:47.62813723Z
2026-05-20T04:35:48.431222544Z
2026-05-20T04:35:49.98001739Z
2026-05-20T04:35:50.786933625Z
2026-05-20T04:35:53.318926925Z
2026-05-20T04:35:55.137188354Z
2026-05-20T04:35:57.664169831Z
2026-05-20T04:36:05.842494794Z
2026-05-20T04:36:08.355253085Z
2026-05-20T04:36:26.300306902Z
2026-05-20T04:36:28.505755413Z
2026-05-20T04:37:01.17697614Z
2026-05-20T04:37:03.275917293Z
2026-05-20T04:38:00.457628507Z
2026-05-20T04:38:03.686246866Z
A similar pattern is seen on sled 10, hosted on oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae, we see a similar pattern.
New process starts at 04:41:32.
BRM42220009 # head -3 /pool/ext/0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/debug/oxz_cockroachdb_a3628a56-6f85-43b5-be50-71d8f0e04877/oxide-cockroachdb:default.log.1779252106
[ May 20 04:41:32 Stopping because all processes in service exited. ]
[ May 20 04:41:32 Executing stop method (:kill). ]
[ May 20 04:41:32 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]
In the period leading up to that time a local disk is being initialized on it:
BRM42220009 # grep 'T04:[3-4]' /pool/ext/b252b176-3974-436a-915b-60382b21eb76/crypt/debug/global/oxide-sled-agent:default.log.1779333005 | egrep 'oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae.*still zero initializing' | jq -r .time
2026-05-20T04:37:57.815494599Z
2026-05-20T04:37:58.773125217Z
2026-05-20T04:37:59.961247628Z
2026-05-20T04:38:01.468213757Z
2026-05-20T04:38:04.542124621Z
2026-05-20T04:38:08.710330616Z
2026-05-20T04:39:59.623875444Z
2026-05-20T04:40:00.726780259Z
2026-05-20T04:40:02.042064876Z
2026-05-20T04:40:03.847478895Z
2026-05-20T04:40:06.998500211Z
2026-05-20T04:40:14.603304114Z
2026-05-20T04:40:28.356296048Z
2026-05-20T04:40:46.729091263Z
2026-05-20T04:41:18.850718116Z
With some horrifying shell, we can see that the "node may be overloaded" messages correlate with local disk creation on that U.2:
BRM42220009 # cat /pool/ext/b252b176-3974-436a-915b-60382b21eb76/crypt/debug/global/oxide-sled-agent:default.log.1779333005 | egrep 'oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae.*still zero initializing' | jq -r .time | cut -d ':' -f -2 | uniq -c
6 2026-05-20T03:12
3 2026-05-20T03:14
5 2026-05-20T03:15
6 2026-05-20T03:51
7 2026-05-20T03:52
2 2026-05-20T03:53
3 2026-05-20T04:37
3 2026-05-20T04:38
1 2026-05-20T04:39
7 2026-05-20T04:40
1 2026-05-20T04:41
BRM42220009 # grep 'node might be overloaded' /pool/ext/0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/zone/oxz_cockroachdb_a3628a56-6f85-43b5-be50-71d8f0e04877/root/data/logs/cockroach.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2026-05-19T11_26_10Z.021989.log | awk '{print $2}' | cut -d ':' -f -2 | uniq -c
186 00:38
58 00:39
228 00:40
206 00:41
457 03:15
35 03:51
12 03:52
128 03:53
202 04:40
On dogfood, we observed that two CRDB nodes restarted on the morning of 2026-05-20:
The last few seconds of the logs prior to the restart are unfortunately not present in the archived
cockroach.logfiles, but in the time leading up it we see warnings that the node may be overloaded:The
svcadmlogs indicate the new process started at 04:37:26, 11 seconds after the last message logged.There is no indication that unusually heavy queries were being executed by CRDB, but both of these nodes are hosted a U.2 drive that was initializing a local disk at the time of the crash.
On sled 9 CRDB is held on
oxp_b358fb1e-f52a-4a63-9aab-170225509b37.During this interval,
sled-agentis logging that is it initializing a new local storage volume:The period these messages are logged matches the "overloaded" period in CRDB closely:
A similar pattern is seen on sled 10, hosted on
oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae, we see a similar pattern.New process starts at 04:41:32.
In the period leading up to that time a local disk is being initialized on it:
With some horrifying shell, we can see that the "node may be overloaded" messages correlate with local disk creation on that U.2: