Commit 9981b5b

Matthew Valancy and claude committed
feat: Integrate Tailscale health monitoring into setup
Adds automatic Tailscale health monitoring to GraphDone Core development and production environments.

Changes:
- scripts/setup-tailscale-monitor.sh - Standalone installer (96 lines)
  - Creates health check script with 5-second timeout detection
  - Installs systemd service and timer for 5-minute checks
  - Auto-restarts tailscaled when frozen/unresponsive
  - Logs all events to /var/log/tailscale-health.log
- tools/setup.sh - Auto-detection and installation
  - Detects if Tailscale is installed during environment setup
  - Automatically runs monitoring setup if Tailscale present
  - Gracefully skips if Tailscale not installed
  - Provides manual setup instructions if sudo unavailable
- docs/infrastructure/tailscale-monitoring.md - Complete documentation
  - Installation (automatic and manual)
  - Verification and troubleshooting
  - Configuration options
  - Integration with GraphDone architecture

This prevents the "zombie Tailscale" issue where daemon runs but hangs, blocking connectivity. Critical for demo instances, multi-server deployments, and developer experience.

Related: GraphDone-Devops/monitoring/tailscale

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 0814c7d commit 9981b5b

3 files changed

Lines changed: 348 additions & 0 deletions

docs/infrastructure/tailscale-monitoring.md

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
# Tailscale Health Monitoring

## Overview

GraphDone Core includes automatic Tailscale health monitoring to ensure reliable mesh networking across all instances.

## Problem

Tailscale can enter a "zombie" state where:
- The daemon process runs but doesn't respond
- Commands like `tailscale status` hang indefinitely
- DNS resolution fails (hostnames become unreachable)
- Network connectivity is lost across the mesh

This requires manual intervention to restart the service.

## Solution

Automated health monitoring that:
- ✅ Checks Tailscale every 5 minutes
- ✅ Detects hangs using timeout (not just process checks)
- ✅ Auto-restarts when unresponsive
- ✅ Logs all events for debugging
- ✅ Requires zero manual intervention

## Installation

### Automatic (Recommended)

The monitoring is automatically set up when you run:

```bash
./tools/setup.sh
```

If Tailscale is installed, the script detects it and runs the monitoring setup automatically. If sudo cannot run non-interactively, it prints the manual command instead.

### Manual Installation

If you need to install it manually:

```bash
sudo bash ./scripts/setup-tailscale-monitor.sh
```

## Verification

Check if monitoring is running:

```bash
sudo systemctl status tailscale-health-monitor.timer
```

View monitoring logs:

```bash
sudo cat /var/log/tailscale-health.log
```

Check next scheduled run:

```bash
systemctl list-timers tailscale-health-monitor.timer
```

## How It Works

### Health Check Logic

Every 5 minutes, the monitor:

1. Runs `timeout 5s tailscale status`
2. If successful → Logs "healthy" and exits
3. If the command times out or exits non-zero → Restarts the `tailscaled` service
4. Waits 3 seconds, then re-runs the check to verify the restart
5. Logs all actions with timestamps
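
The installed check treats every non-zero exit as a freeze. As a minimal sketch, and not the script this commit installs, a probe could separate a genuine hang from other failures by relying on GNU `timeout` exiting with status 124 when the limit is hit. The function name `probe_status` and its three labels are hypothetical.

```bash
# Hypothetical helper: classify a probe command as healthy, hung, or failed.
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit is hit.
probe_status() {
    local limit="$1"
    shift
    local rc=0
    timeout "$limit" "$@" > /dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "healthy" ;;
        124) echo "hung" ;;
        *)   echo "failed" ;;
    esac
}

# Intended use on a host with Tailscale installed:
#   probe_status 5 tailscale status
```

Only the `hung` result corresponds to the zombie state described in the Problem section; a `failed` result may point to a different condition that a restart would not necessarily fix.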

### Files Installed

```
/usr/local/bin/tailscale-health-monitor.sh            # Health check script
/etc/systemd/system/tailscale-health-monitor.service  # Systemd service
/etc/systemd/system/tailscale-health-monitor.timer    # 5-minute timer
/var/log/tailscale-health.log                         # Event log
```

### Systemd Timer Configuration

```ini
[Timer]
# First check 2 minutes after boot
OnBootSec=2min
# Then every 5 minutes
OnUnitActiveSec=5min
# Precise timing
AccuracySec=1s
```

## Log Examples

### Healthy Status
```
[2025-11-16 06:15:00] Starting Tailscale health check (GraphDone Core)...
[2025-11-16 06:15:01] ✓ Tailscale is healthy
```

### Auto-Recovery
```
[2025-11-16 06:20:00] Starting Tailscale health check (GraphDone Core)...
[2025-11-16 06:20:05] ✗ Tailscale is frozen or unresponsive
[2025-11-16 06:20:05] Restarting tailscaled service...
[2025-11-16 06:20:08] ✓ Tailscale successfully restarted
```
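
To see how often recovery has been triggered, the log can be summarized by counting the messages shown above. This is a rough sketch that assumes the message strings written by the installed monitor script; the function name `summarize_health_log` and its output format are hypothetical.

```bash
# Hypothetical helper: count healthy checks, restarts, and failed restarts
# in a log that uses the message strings shown in the examples above.
summarize_health_log() {
    local log_file="$1"
    local healthy restarts failed
    # grep -c prints 0 when nothing matches, so each count is always a number.
    healthy=$(grep -cF "Tailscale is healthy" "$log_file")
    restarts=$(grep -cF "Restarting tailscaled service" "$log_file")
    failed=$(grep -cF "restart failed" "$log_file")
    echo "healthy=$healthy restarts=$restarts failed=$failed"
}

# Possible use (reading the log may need sudo, depending on its permissions):
#   summarize_health_log /var/log/tailscale-health.log
```

A steadily growing `restarts` count suggests an underlying problem that the auto-restart is only masking.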

## Troubleshooting

### Timer not running

Enable and start the timer:

```bash
sudo systemctl enable tailscale-health-monitor.timer
sudo systemctl start tailscale-health-monitor.timer
```

### Check for errors

View the systemd journal:

```bash
sudo journalctl -u tailscale-health-monitor.service -n 50
```

### Manually trigger check

```bash
sudo systemctl start tailscale-health-monitor.service
```

### Disable monitoring

```bash
sudo systemctl stop tailscale-health-monitor.timer
sudo systemctl disable tailscale-health-monitor.timer
```

## Integration with GraphDone

This monitoring is critical for GraphDone Core because:

1. **Demo Instances**: Demo servers need reliable mesh connectivity
2. **Multi-Server Deployments**: Clusters rely on Tailscale for service discovery
3. **Developer Experience**: Frozen Tailscale blocks local development
4. **Production Reliability**: Auto-recovery removes the need for a manual restart

## Configuration

### Modify Check Interval

Edit the timer file:

```bash
sudo nano /etc/systemd/system/tailscale-health-monitor.timer
```

Change `OnUnitActiveSec=5min` to the desired interval, then:

```bash
sudo systemctl daemon-reload
sudo systemctl restart tailscale-health-monitor.timer
```
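
Because the installer rewrites the timer file each time it runs, an in-place edit is lost on reinstall. One alternative is a systemd drop-in, sketched below. The helper name `render_timer_override` and the drop-in file name `interval.conf` are assumptions; the `2min` boot delay is taken from the installed unit.

```bash
# Hypothetical helper: print a systemd drop-in that changes the check interval.
# Per systemd.timer(5), assigning an empty value resets the whole timer list,
# so OnBootSec is restated to keep the first check after boot.
render_timer_override() {
    local interval="$1"
    printf '[Timer]\nOnUnitActiveSec=\nOnBootSec=2min\nOnUnitActiveSec=%s\n' "$interval"
}

# Possible use (directory name follows the systemd drop-in convention):
#   sudo mkdir -p /etc/systemd/system/tailscale-health-monitor.timer.d
#   render_timer_override 10min | sudo tee /etc/systemd/system/tailscale-health-monitor.timer.d/interval.conf
#   sudo systemctl daemon-reload
#   sudo systemctl restart tailscale-health-monitor.timer
```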

### Modify Timeout

Edit the health check script:

```bash
sudo nano /usr/local/bin/tailscale-health-monitor.sh
```

Change `TIMEOUT=5` to the desired value (in seconds).
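
If the timeout is edited by hand, a small guard can keep a typo from silently breaking the check. The sketch below falls back to the default of 5 seconds when the value is not a positive integer. Both `resolve_timeout` and the `TAILSCALE_HEALTH_TIMEOUT` environment variable are hypothetical and not part of the installed script.

```bash
# Hypothetical helper: accept a positive integer number of seconds,
# otherwise fall back to the script's default of 5.
resolve_timeout() {
    case "${1:-}" in
        ''|*[!0-9]*|0*) echo 5 ;;   # empty, non-numeric, or leading zero
        *) echo "$1" ;;
    esac
}

# TAILSCALE_HEALTH_TIMEOUT is an assumed override; unset means "use default".
TIMEOUT="$(resolve_timeout "${TAILSCALE_HEALTH_TIMEOUT:-}")"
```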

## Related Documentation

- [GraphDone DevOps Monitoring](../../../GraphDone-Devops/monitoring/tailscale/README.md)
- [Tailscale Official Docs](https://tailscale.com/kb/)
- [Setup Script](../../tools/setup.sh)

## Change Log

### 2025-11-16 - Initial Integration
- Added Tailscale health monitoring to GraphDone Core
- Integrated into `tools/setup.sh`
- Created setup script and documentation
- Part of core infrastructure capabilities

---

**Status**: Production Ready
**Maintainer**: GraphDone DevOps
**Related**: GraphDone-Devops/monitoring/tailscale

scripts/setup-tailscale-monitor.sh

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
#!/bin/bash
# Setup Tailscale Health Monitoring for GraphDone Core
# Ensures Tailscale auto-recovers from freezes/hangs

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Colors
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${GREEN} GraphDone Core - Tailscale Health Monitor Setup${NC}"
echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""

# Step 1: Create health monitor script
echo -e "${YELLOW}Step 1/4: Creating Tailscale health monitor script...${NC}"

sudo tee /usr/local/bin/tailscale-health-monitor.sh > /dev/null << 'EOF'
#!/bin/bash
# Tailscale Health Monitor - Auto-restart on freeze
# Part of GraphDone Core infrastructure

LOG_FILE="/var/log/tailscale-health.log"
TIMEOUT=5

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

check_tailscale() {
    if timeout $TIMEOUT tailscale status > /dev/null 2>&1; then
        return 0
    else
        return 1
    fi
}

log "Starting Tailscale health check (GraphDone Core)..."

if check_tailscale; then
    log "✓ Tailscale is healthy"
    exit 0
else
    log "✗ Tailscale is frozen or unresponsive"
    log "Restarting tailscaled service..."
    systemctl restart tailscaled
    sleep 3
    if check_tailscale; then
        log "✓ Tailscale successfully restarted"
        exit 0
    else
        log "✗ Tailscale restart failed - manual intervention needed"
        exit 1
    fi
fi
EOF

sudo chmod +x /usr/local/bin/tailscale-health-monitor.sh
echo -e "${BLUE}${NC} Created health monitor script"

# Step 2: Create systemd service
echo -e "${YELLOW}Step 2/4: Creating systemd service...${NC}"

sudo tee /etc/systemd/system/tailscale-health-monitor.service > /dev/null << 'EOF'
[Unit]
Description=Tailscale Health Monitor (GraphDone Core)
After=tailscaled.service
Documentation=https://github.com/graphdone/graphdone-core

[Service]
Type=oneshot
ExecStart=/usr/local/bin/tailscale-health-monitor.sh
StandardOutput=journal
StandardError=journal
User=root

[Install]
WantedBy=multi-user.target
EOF

echo -e "${BLUE}${NC} Created systemd service"

# Step 3: Create systemd timer
echo -e "${YELLOW}Step 3/4: Creating systemd timer...${NC}"

sudo tee /etc/systemd/system/tailscale-health-monitor.timer > /dev/null << 'EOF'
[Unit]
Description=Tailscale Health Monitor Timer (GraphDone Core)
Requires=tailscale-health-monitor.service
Documentation=https://github.com/graphdone/graphdone-core

[Timer]
# Run 2 minutes after boot
OnBootSec=2min
# Then every 5 minutes
OnUnitActiveSec=5min
# Precise timing
AccuracySec=1s
Persistent=true

[Install]
WantedBy=timers.target
EOF

echo -e "${BLUE}${NC} Created systemd timer"

# Step 4: Enable and start timer
echo -e "${YELLOW}Step 4/4: Enabling and starting timer...${NC}"

sudo systemctl daemon-reload
sudo systemctl enable tailscale-health-monitor.timer
sudo systemctl start tailscale-health-monitor.timer

echo -e "${BLUE}${NC} Timer enabled and started"

# Verify installation
echo ""
echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${GREEN} Installation Complete${NC}"
echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo ""
echo "Verification:"
sudo systemctl status tailscale-health-monitor.timer --no-pager | head -10
echo ""
echo "Log file: /var/log/tailscale-health.log"
echo "Next check: $(systemctl list-timers tailscale-health-monitor.timer --no-pager | grep tailscale)"
echo ""
echo -e "${GREEN}✓ Tailscale health monitoring is now active for GraphDone Core${NC}"
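
The generated monitor verifies a restart with a fixed `sleep 3` followed by a single check, which can report failure if `tailscaled` needs longer to come back. A possible variation, sketched here under that assumption, polls until the check passes or an attempt budget is exhausted. The helper name `wait_until` is hypothetical and is not part of this commit.

```bash
# Hypothetical helper: run a command up to N times, pausing one second
# between attempts, and succeed as soon as the command succeeds.
wait_until() {
    local attempts="$1"
    shift
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Skip the pause after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep 1
        fi
    done
    return 1
}

# Inside the monitor script this could stand in for "sleep 3" plus one check:
#   systemctl restart tailscaled
#   wait_until 10 check_tailscale
```

The trade-off is a longer worst-case run time for the oneshot service in exchange for fewer false "restart failed" log entries.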

tools/setup.sh

Lines changed: 17 additions & 0 deletions
@@ -253,6 +253,23 @@ echo "📦 Building core package..."
 echo "🏗️ Building all packages..."
 npm run build

+# Setup Tailscale health monitoring (if Tailscale is installed)
+if command -v tailscale &> /dev/null; then
+    echo "🔧 Setting up Tailscale health monitoring..."
+    SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+    if [ -f "$SCRIPT_DIR/../scripts/setup-tailscale-monitor.sh" ]; then
+        if sudo -n true 2>/dev/null; then
+            bash "$SCRIPT_DIR/../scripts/setup-tailscale-monitor.sh"
+        else
+            echo "⚠️ Tailscale monitoring requires sudo. Run manually:"
+            echo " sudo bash $SCRIPT_DIR/../scripts/setup-tailscale-monitor.sh"
+        fi
+    fi
+else
+    echo "ℹ️ Tailscale not installed - skipping health monitoring setup"
+    echo " (Install Tailscale and run: sudo bash ./scripts/setup-tailscale-monitor.sh)"
+fi
+
 echo "✅ Setup complete!"
 echo ""
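
The hunk above has three visible outcomes (install, print manual instructions, skip) plus a silent fourth: nothing is printed when Tailscale is present but the installer file is missing. As an illustrative sketch only, the decision could be factored into a function that names every outcome. The function `monitor_setup_action` and its labels are hypothetical.

```bash
# Hypothetical helper: report which branch the setup logic would take.
#   $1 = command name to look for, $2 = installer path,
#   $3 = command that succeeds when sudo can run non-interactively
monitor_setup_action() {
    local tailscale_cmd="$1" installer="$2" sudo_check="$3"
    if ! command -v "$tailscale_cmd" > /dev/null 2>&1; then
        echo "skip"
    elif [ ! -f "$installer" ]; then
        echo "missing-installer"
    elif $sudo_check > /dev/null 2>&1; then
        echo "install"
    else
        echo "manual"
    fi
}

# Possible use from tools/setup.sh:
#   monitor_setup_action tailscale "$SCRIPT_DIR/../scripts/setup-tailscale-monitor.sh" "sudo -n true"
```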