This document describes the monitoring and status tracking architecture implemented across issues #7, #19, and #33.
The monitoring system uses a three-file architecture that separates monitoring data from incident tracking:
- Append-only monitoring data to eliminate Git history pollution
- GitHub Issue-based incidents for structured problem tracking
- Smart deployment triggers for critical vs non-critical updates
- Optimized for performance with hot files and compression
status-data/
├── current.json          # Time-series monitoring readings (updated every 5min)
├── daily-summary.json    # 90-day aggregated stats (v0.17.0+)
├── incidents.json        # Active and resolved incidents from GitHub Issues
├── maintenance.json      # Scheduled maintenance windows
└── archives/
    └── 2025/11/
        ├── history-2025-11-01.jsonl.gz  # Compressed daily archives
        ├── history-2025-11-02.jsonl.gz
        └── history-2025-11-03.jsonl     # Today (uncompressed)
Purpose: Real-time endpoint health checks and response time tracking
Updated By: monitor-systems.yml workflow (every 5 minutes)
Source: Automated HTTP checks to configured endpoints
Retention: Rolling 14-day window (~4,000 readings for 5-minute checks)
Format: Array of compact readings
[
{
"t": 1699123456789,
"svc": "api",
"state": "up",
"code": 200,
"lat": 145,
"err": null
},
{
"t": 1699123756789,
"svc": "website",
"state": "down",
"code": 500,
"lat": 2500,
"err": "Connection timeout"
}
]
Fields:
- `t` - Timestamp (milliseconds since epoch)
- `svc` - Service name (e.g., 'api', 'website', 'database')
- `state` - Status: 'up', 'down', 'degraded', or 'maintenance'
- `code` - HTTP status code
- `lat` - Latency in milliseconds
- `err` - Error message (optional, only present if request failed)
Commit Strategy:
- Committed with `[skip ci]` tag
- Does NOT trigger deployments (filtered by `paths-ignore` in `deploy.yml`)
- Creates critical GitHub Issues when services go down
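As a sketch of how a consumer might derive uptime from these readings (field names follow the format above; counting 'maintenance' as passed, per the daily-summary rules later in this document — function name is illustrative):

```javascript
// Sketch: compute per-service uptime from a current.json readings array.
function uptimeFor(readings, service) {
  const rel = readings.filter((r) => r.svc === service);
  if (rel.length === 0) return null; // no data for this service
  const ok = rel.filter((r) => r.state === 'up' || r.state === 'maintenance');
  return ok.length / rel.length;
}

// Example with inline data in the compact format:
const readings = [
  { t: 1699123456789, svc: 'api', state: 'up', code: 200, lat: 145, err: null },
  { t: 1699123756789, svc: 'api', state: 'down', code: 500, lat: 2500, err: 'Connection timeout' },
];
console.log(uptimeFor(readings, 'api')); // 0.5
```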
Purpose: Aggregated daily statistics for 90-day heatmap visualization
Updated By: monitor-systems.yml workflow (regenerated on every monitor run)
Source: Aggregation from current.json and archived JSONL files
Retention: Rolling 90-day window (~10 KB for 2 services)
The Problem: The 90-day heatmap was 84% empty because it only read from current.json (14-day window).
The Solution: Aggregate daily statistics from archives into a compact summary file.
Format: Object with per-service daily entries
{
"version": 1,
"lastUpdated": "2025-12-31T22:00:00Z",
"windowDays": 90,
"services": {
"api": [
{
"date": "2025-12-30",
"uptimePct": 0.998,
"avgLatencyMs": 145,
"p95LatencyMs": 320,
"checksTotal": 144,
"checksPassed": 143,
"incidentCount": 0
}
]
}
}
Fields:
- `version` - Schema version (currently 1)
- `lastUpdated` - ISO timestamp of last update
- `windowDays` - Number of days covered
- `services` - Object with service names as keys
- `date` - ISO date string (YYYY-MM-DD)
- `uptimePct` - Uptime percentage as decimal (0.0 to 1.0)
- `avgLatencyMs` - Average latency in ms (null if no successful checks)
- `p95LatencyMs` - 95th percentile latency in ms (null if no successful checks)
- `checksTotal` - Total number of checks performed
- `checksPassed` - Number of successful checks (up or maintenance)
- `incidentCount` - Number of up→down transitions
Design Principles (ADR-002):
- Store percentages, not strings: `uptimePct: 0.998`, not `status: "ok"`
- Include p95 latency: Averages hide spikes; p95 reveals bad days
- UTC dates: All date fields are ISO 8601 in UTC
- Schema versioning: Enables future migrations
- Raw counts: `checksTotal` and `checksPassed` allow recalculating with different rules
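The aggregation these fields imply can be sketched as follows; function and variable names are illustrative, not the plugin's actual implementation:

```javascript
// Sketch: fold one day's compact readings into a daily-summary entry.
// 'up' and 'maintenance' count as passed; incidentCount is up→down transitions.
function aggregateDay(readings, date) {
  const checksTotal = readings.length;
  const passed = readings.filter((r) => r.state === 'up' || r.state === 'maintenance');
  const lats = passed.map((r) => r.lat).sort((a, b) => a - b);
  const p95 = lats.length
    ? lats[Math.min(lats.length - 1, Math.floor(lats.length * 0.95))]
    : null;
  const avg = lats.length
    ? Math.round(lats.reduce((sum, l) => sum + l, 0) / lats.length)
    : null;
  let incidentCount = 0;
  for (let i = 1; i < readings.length; i++) {
    if (readings[i - 1].state === 'up' && readings[i].state === 'down') incidentCount++;
  }
  return {
    date,
    uptimePct: checksTotal ? passed.length / checksTotal : 0,
    avgLatencyMs: avg,
    p95LatencyMs: p95,
    checksTotal,
    checksPassed: passed.length,
    incidentCount,
  };
}
```

Because the raw counts are kept, a consumer can re-derive `uptimePct` under different rules (e.g., excluding maintenance) without re-reading the archives.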
Hybrid Read Pattern:
The frontend uses a hybrid approach to ensure today's data is live:
// Frontend loads both files in parallel
const [summary, current] = await Promise.all([
  fetch('/status-data/daily-summary.json').then((r) => r.json()),
  fetch('/status-data/current.json').then((r) => r.json()),
]);
// Merge: today from current.json + history from summary
const data = [aggregateToday(current), ...summary.services[serviceName]];
Bootstrap Existing Data:
npx bootstrap-summary --output-dir status-data --window 90
See ADR-002 for full architecture documentation.
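`aggregateToday` is not defined in this document; a minimal hypothetical sketch (taking the raw readings plus a service name, and omitting p95 and incident counting for brevity) might look like:

```javascript
// Hypothetical sketch of aggregateToday: fold today's compact readings
// from current.json into one daily-summary-shaped entry.
function aggregateToday(readings, serviceName) {
  const today = new Date().toISOString().slice(0, 10); // UTC YYYY-MM-DD
  const todays = readings.filter(
    (r) => r.svc === serviceName &&
      new Date(r.t).toISOString().slice(0, 10) === today
  );
  const passed = todays.filter((r) => r.state === 'up' || r.state === 'maintenance');
  return {
    date: today,
    uptimePct: todays.length ? passed.length / todays.length : 1,
    avgLatencyMs: passed.length
      ? Math.round(passed.reduce((sum, r) => sum + r.lat, 0) / passed.length)
      : null,
    p95LatencyMs: null,  // omitted in this sketch
    checksTotal: todays.length,
    checksPassed: passed.length,
    incidentCount: 0,    // transition counting omitted in this sketch
  };
}
```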
Purpose: Track incidents reported via GitHub Issues
Updated By: status-update.yml workflow (on issue events + hourly)
Source: GitHub Issues with status label
Retention: Active incidents + last 30 days of resolved incidents
Format: Array of incident objects
[
{
"id": 123,
"title": "API experiencing high latency",
"severity": "major",
"status": "open",
"systems": ["api", "database"],
"createdAt": "2025-11-03T10:00:00Z",
"updatedAt": "2025-11-03T12:30:00Z",
"closedAt": null,
"body": "Users reporting slow API responses...",
"url": "https://github.com/org/repo/issues/123",
"comments": [
{
"author": "devops-bot",
"createdAt": "2025-11-03T11:00:00Z",
"body": "Database query optimization in progress"
}
]
}
]
Severity Levels (from issue labels):
- `critical` - Complete service outage
- `major` - Significant degradation
- `minor` - Minor issues, partial impact
- `maintenance` - Planned maintenance
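A sketch of deriving severity from an issue's labels, assuming the most severe matching label wins and `minor` as a fallback (both assumptions for illustration, not confirmed by this document; the label objects mirror the GitHub REST API's `labels` array):

```javascript
// Sketch: map GitHub issue labels to a severity level.
// Ordered most to least severe; first match wins (assumption).
const SEVERITIES = ['critical', 'major', 'minor', 'maintenance'];

function severityFromLabels(labels) {
  // Labels may be plain strings or { name } objects, as in the REST API.
  const names = labels.map((l) => (typeof l === 'string' ? l : l.name));
  return SEVERITIES.find((s) => names.includes(s)) || 'minor'; // fallback is an assumption
}

console.log(severityFromLabels([{ name: 'status' }, { name: 'major' }])); // 'major'
```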
Commit Strategy:
- Committed with `[skip ci]` tag
- If it contains `critical` incidents → triggers a `repository_dispatch` event
- `repository_dispatch` triggers immediate deployment (~2 min)
Purpose: Track scheduled and completed maintenance windows
Updated By: status-update.yml workflow (on issue events + hourly)
Source: GitHub Issues with maintenance label and YAML frontmatter
Retention: Upcoming + in-progress + last 60 days of completed
Format: Array of maintenance objects
[
{
"id": 456,
"title": "Database upgrade to v2.0",
"status": "upcoming",
"systems": ["api", "database"],
"start": "2025-11-15T02:00:00Z",
"end": "2025-11-15T04:00:00Z",
"createdAt": "2025-11-01T10:00:00Z",
"body": "Scheduled database upgrade to improve performance...",
"url": "https://github.com/org/repo/issues/456"
}
]
Status Calculation:
- `upcoming` - Start time is in the future
- `in-progress` - Current time between start and end
- `completed` - End time has passed OR issue is closed
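The status calculation above maps directly to a small function; `now` is injected for testability (helper name is illustrative):

```javascript
// Sketch of the maintenance status rules: closed issue or elapsed end
// time means completed; otherwise compare now against the window.
function maintenanceStatus(start, end, issueClosed, now = Date.now()) {
  if (issueClosed || now > Date.parse(end)) return 'completed';
  if (now < Date.parse(start)) return 'upcoming';
  return 'in-progress';
}

console.log(
  maintenanceStatus('2025-11-15T02:00:00Z', '2025-11-15T04:00:00Z', false,
    Date.parse('2025-11-15T03:00:00Z'))
); // 'in-progress'
```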
Issue Format:
Create a GitHub issue with the maintenance label and YAML frontmatter:
---
start: 2025-11-15T02:00:00Z
end: 2025-11-15T04:00:00Z
systems:
- api
- database
---
Scheduled database upgrade to improve performance.
**Expected Impact:**
- API will be unavailable during the maintenance window
Commit Strategy:
- Committed with `[skip ci]` tag
- Does NOT trigger immediate deployment
- Picked up by hourly scheduled deployment
A rolling 14-day window of all monitoring readings in a compact format:
[
{
"t": 1699123456789,
"svc": "api",
"state": "up",
"code": 200,
"lat": 145,
"err": null
},
{
"t": 1699123756789,
"svc": "website",
"state": "down",
"code": 500,
"lat": 2500,
"err": "Connection timeout"
}
]
Fields:
- `t` - Timestamp (milliseconds since epoch)
- `svc` - Service name (e.g., 'api', 'website', 'database')
- `state` - Status: 'up', 'down', 'degraded', or 'maintenance'
- `code` - HTTP status code
- `lat` - Latency in milliseconds
- `err` - Error message (optional, only present if request failed)
Daily JSONL files containing all readings for that day. Each line is a complete JSON object:
{"t":1699123456789,"svc":"api","state":"up","code":200,"lat":145}
{"t":1699123756789,"svc":"website","state":"up","code":200,"lat":98}
{"t":1699124056789,"svc":"database","state":"degraded","code":200,"lat":850}
Archive Structure:
build/status-data/
├── current.json                  # Hot file (14-day rolling window)
└── archives/
    └── 2025/
        └── 11/
            ├── history-2025-11-01.jsonl.gz  # Compressed (yesterday and older)
            ├── history-2025-11-02.jsonl.gz  # Compressed
            └── history-2025-11-03.jsonl     # Uncompressed (today)
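A sketch of the path convention shown in the tree above (helper name is illustrative):

```javascript
// Sketch: build the archive path for a given UTC date, matching
// archives/YYYY/MM/history-YYYY-MM-DD.jsonl[.gz].
function archivePathFor(date, compressed = false) {
  const iso = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  const [year, month] = iso.split('-');
  return `archives/${year}/${month}/history-${iso}.jsonl${compressed ? '.gz' : ''}`;
}

console.log(archivePathFor(new Date('2025-11-03T12:00:00Z')));
// archives/2025/11/history-2025-11-03.jsonl
```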
The monitoring script (scripts/monitor.js) performs the following tasks:
- Check Endpoint - Make HTTP request and measure response time
- Determine Status - Calculate status (up/down/degraded) based on response
- Append to JSONL - Write one line to today's `history-YYYY-MM-DD.jsonl` file
- Rebuild current.json - Scan last 14 days and rebuild the hot file
- Generate Commit Message - Create emoji-decorated message for Git commit
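Step 2 (Determine Status) can be sketched using the CLI defaults documented below; this is an approximation of the logic, not the script's actual code:

```javascript
// Sketch: classify a check result as up/down/degraded.
// Defaults mirror the documented CLI defaults (assumed logic).
function determineStatus(code, latencyMs, opts = {}) {
  const expected = opts.expectedCodes || [200, 301, 302];
  const maxMs = opts.maxResponseTime || 30000;
  if (code === null || !expected.includes(code)) return 'down';      // failed or unexpected code
  if (latencyMs > maxMs) return 'degraded';                          // slow but responding
  return 'up';
}

console.log(determineStatus(200, 145));   // 'up'
console.log(determineStatus(200, 45000)); // 'degraded'
console.log(determineStatus(500, 120));   // 'down'
```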
Usage:
# Single system check
node scripts/monitor.js --system api --url https://api.example.com/health
# Multiple systems from config file
node scripts/monitor.js --config .monitorrc.json
# With custom options
node scripts/monitor.js \
--system website \
--url https://example.com \
--method GET \
--timeout 10000 \
--expected-codes 200,301,302 \
--max-response-time 30000 \
--output-dir status-data \
--verbose
Options:
- `--system <name>` - System name (e.g., 'api', 'website')
- `--url <url>` - URL to monitor
- `--method <method>` - HTTP method (default: GET)
- `--timeout <ms>` - Request timeout in milliseconds (default: 10000)
- `--expected-codes <codes>` - Comma-separated expected status codes (default: 200,301,302)
- `--max-response-time <ms>` - Maximum response time before degraded (default: 30000)
- `--output-dir <path>` - Output directory (default: status-data)
- `--config <file>` - JSON config file with system definitions
- `--verbose` - Enable verbose logging
The monitoring system uses three coordinated workflows:
Trigger: Every 5 minutes (cron: */5 * * * *)
Purpose: Check endpoint health and update monitoring data
Process:
- Checkout repository
- Setup Node.js 20
- Run monitoring script with `--config .monitorrc.json`
- Script monitors each system sequentially
- Append readings to `archives/YYYY/MM/history-YYYY-MM-DD.jsonl`
- Rebuild `current.json` from last 14 days
- Commit with `[skip ci]` tag
- If critical failure detected → Create GitHub Issue with `critical` + `status` labels
Configuration: .monitorrc.json in repository root
{
"systems": [
{
"system": "api",
"url": "https://api.example.com/health",
"expectedCodes": [200],
"maxResponseTime": 30000
},
{
"system": "website",
"url": "https://example.com"
}
]
}
Sequential Architecture (v0.4.10+):
- Single job monitors all systems in sequence
- Zero data loss (no race conditions)
- Single commit with all data
- Runtime: ~5 seconds per system
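The sequential loop can be sketched as follows; `checkSystem` is a stand-in for the real HTTP check, and function names are illustrative:

```javascript
// Sketch of the sequential architecture: check systems one at a time,
// collect all readings, and write them in a single pass so there is
// only one commit and no concurrent-write race.
async function monitorAll(systems, checkSystem) {
  const readings = [];
  for (const sys of systems) {
    readings.push(await checkSystem(sys)); // one at a time, in config order
  }
  return readings; // appended to JSONL and committed together
}
```

Trading ~5s of wall-clock per system for a single commit is what eliminates the data loss the parallel variant suffered.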
Commit Message: Update monitoring data [skip ci]
Critical Issue Creation:
- When service goes down → Automatically creates GitHub Issue
- Labels: `status`, `critical`, `<service-name>`
- Title: 🔴 <Service> is down
- Triggers `status-update.yml` via issue event
Trigger:
- GitHub Issue events (opened, closed, labeled, edited)
- Schedule: Every hour (cron: `0 * * * *`)
- Manual: `workflow_dispatch`
Purpose: Generate incidents.json and maintenance.json from GitHub Issues
Process:
- Checkout repository
- Setup Node.js 20
- Run `npx stentorosaur-update-status --write-incidents --write-maintenance`
- Fetch all GitHub Issues with `status` or `maintenance` labels
- Generate `incidents.json` (active + last 30 days resolved)
- Generate `maintenance.json` (upcoming + in-progress + last 60 days)
- Commit with `[skip ci]` tag
- Smart Deployment Trigger:
  - If `incidents.json` contains `critical` incidents → Dispatch `repository_dispatch` event
  - Event type: `status-updated`
  - Triggers immediate deployment via `deploy.yml`
CLI Command:
npx stentorosaur-update-status \
--write-incidents \
--write-maintenance \
--output-dir status-data \
--verbose
Commit Message: Update status data [skip ci]
Critical Dispatch Logic:
- name: Trigger deployment for critical incidents
if: contains(github.event.issue.labels.*.name, 'critical')
uses: peter-evans/repository-dispatch@v2
with:
event-type: status-updated
token: ${{ secrets.GITHUB_TOKEN }}
Two deployment workflows work together:
Triggers:
- `push` to `main` branch (code changes)
- `repository_dispatch` with type `status-updated` (critical incidents)
- `workflow_dispatch` (manual)
Path Filtering (v0.4.13+):
on:
push:
branches: [main]
paths-ignore:
- 'status-data/current.json'
- 'status-data/archives/**'
repository_dispatch:
types: [status-updated]
Result:
- Monitoring commits (every 5 min) → NOT deployed
- Code changes → Deployed immediately
- Critical incidents → Deployed immediately (~2 min)
Trigger: Schedule every hour (cron: 0 * * * *)
Purpose: Pick up non-critical status updates
Process:
- Checkout repository with all status files
- Build Docusaurus site (plugin reads 3 status files)
- Deploy to GitHub Pages
Result:
- Non-critical incidents → Deployed within 1 hour
- Maintenance updates → Deployed within 1 hour
Every 5 min:
monitor-systems.yml
↓ Check endpoints
↓ Update current.json
↓ Commit [skip ci]
↓ NO DEPLOYMENT
↓ If critical down
↓ Create Issue → Triggers status-update.yml
On Issue Events + Hourly:
status-update.yml
↓ Fetch GitHub Issues
↓ Generate incidents.json + maintenance.json
↓ Commit [skip ci]
↓ Check for critical
├─ Critical → repository_dispatch → deploy.yml (immediate)
└─ Non-critical → Wait for deploy-scheduled.yml (hourly)
Deployment:
deploy.yml (immediate)
↓ Triggered by: code push, repository_dispatch, manual
↓ Ignores: current.json, archives/**
↓ Reads: incidents.json, maintenance.json
↓ Build + Deploy
deploy-scheduled.yml (hourly)
↓ Triggered by: schedule
↓ Reads: current.json, incidents.json, maintenance.json
↓ Build + Deploy
| Event | Workflow | Files Updated | Deployment | Latency |
|---|---|---|---|---|
| Endpoint check (every 5m) | monitor-systems.yml | current.json | None | N/A |
| Critical endpoint down | monitor-systems.yml | current.json + creates Issue | Via status-update.yml → deploy.yml | ~2 min |
| Issue opened/closed | status-update.yml | incidents.json, maintenance.json | deploy.yml if critical, else hourly | 2 min / 1 hour |
| Hourly check | status-update.yml | incidents.json, maintenance.json | deploy-scheduled.yml | 1 hour |
| Code push to main | N/A | N/A | deploy.yml | ~5 min |
- 2 systems: ~10s total (vs 5s parallel with 50% data loss)
- 10 systems: ~50s total (vs 5s parallel with 90% data loss)
- Still completes within the 5-minute cron interval for most deployments
Commit Messages:
- Single commit contains summary: Update monitoring data [skip ci]
- Check output shows: 🟩 api is up (200 in 145 ms), 🟨 website degraded, etc.
Runs daily at 00:05 UTC to compress yesterday's JSONL files.
Steps:
- Checkout repository
- Find yesterday's uncompressed JSONL file
- Compress with gzip
- Commit with message: 📦 Compress archive for YYYY-MM-DD
File Lifecycle:
- Today - `history-2025-11-03.jsonl` (uncompressed, actively appending)
- Tomorrow - File becomes `history-2025-11-03.jsonl.gz` (compressed, read-only)
The plugin's loadHistoricalData() function has been updated to read from current.json:
// Fetch current.json (14-day rolling window)
const response = await fetch(`/${dataPath}/current.json`);
const readings: CompactReading[] = await response.json();
// Filter by service name
const serviceReadings = readings.filter(r => r.svc === 'api');
// Convert to legacy format for backward compatibility
const history = serviceReadings.map(r => ({
timestamp: new Date(r.t).toISOString(),
status: r.state,
code: r.code,
responseTime: r.lat,
}));
Fallback: If current.json doesn't exist, the function falls back to the legacy systems/*.json format.
- Each check appends one line to today's JSONL file
- Only one file changes per check (not entire JSON array)
- Git diffs are minimal and readable
- `current.json` contains only last 14 days (~4,000 readings for 5-minute checks)
- Small file size (~200-400 KB) loads quickly
- No need to parse years of historical data
- Daily JSONL files are compressed after 24 hours
- Gzip compression typically achieves 80-90% reduction
- Old data remains accessible but doesn't bloat repository
- No JSON parsing/stringifying for every check
- Just append one line: `echo '{"t":...}' >> file.jsonl`
- Works efficiently even with large files
- JSONL is line-oriented, perfect for streaming
- Can use `grep`, `jq`, or other CLI tools
- Easy to merge multiple days for analysis
The plugin automatically detects and supports both formats:
- New format (`current.json`) - Used if available
- Legacy format (`systems/*.json`) - Fallback for backward compatibility
No migration required - Start using the new monitoring script and it will create the new format. The plugin will automatically use it once current.json exists.
Create .monitorrc.json in your repository root:
{
"systems": [
{
"system": "api",
"url": "https://api.example.com/health",
"method": "GET",
"timeout": 10000,
"expectedCodes": [200, 301, 302],
"maxResponseTime": 30000
},
{
"system": "website",
"url": "https://example.com",
"method": "GET",
"timeout": 10000,
"expectedCodes": [200],
"maxResponseTime": 30000
}
]
}
Then update your workflow to use it:
- name: Monitor all systems
  run: node scripts/monitor.js --config .monitorrc.json
- Check that `status-data/archives/` directory exists
- Verify monitoring script is running (check GitHub Actions logs)
- Ensure workflow has `contents: write` permission
- Delete `status-data/systems/*.json` files
- Wait for next monitoring run to create `current.json`
- Plugin will automatically switch to new format
- Check that `compress-archives.yml` workflow is enabled
- Verify it's running daily (check workflow runs)
- Ensure yesterday's JSONL file exists before compression
Monitoring Script:
- HTTP request: ~100-500ms (depends on endpoint)
- JSONL append: ~1-5ms (simple file write)
- current.json rebuild: ~50-200ms (scan 14 days)
- Total time per check: ~200-700ms
Data Loading:
- current.json fetch: ~50-200ms (200-400 KB file)
- Parsing 4,000 readings: ~10-30ms
- Total load time: ~100-300ms
Storage:
- One day of 5-minute checks: ~50 KB uncompressed
- After gzip: ~5-10 KB compressed
- 14 days in current.json: ~200-400 KB
- One year of archives: ~2-4 MB (compressed)
Issue #7 proposes creating a reusable GitHub Action (similar to Upptime's upptime/uptime-monitor@v1.41.0). This would:
- Package the monitoring script as a Docker action
- Provide standardized configuration
- Make it easy to use in any repository
- Support advanced features (Globalping, custom checks, etc.)
Example usage:
- uses: amiable-dev/status-monitor-action@v1
with:
systems: |
api: https://api.example.com/health
    website: https://example.com
- Issue #7: Create reusable GitHub Action for endpoint monitoring
- Issue #19: Status Data Optimization with hot file + daily archives
- Issue #22: Add Globalping configuration options
- Issue #23: Implement ICMP ping support