
Monitoring System Documentation

This document describes the monitoring and status tracking architecture implemented across issues #7, #19, and #33.

Overview

The monitoring system uses a three-file architecture that separates monitoring data from incident tracking:

  • Append-only monitoring data to eliminate Git history pollution
  • GitHub Issue-based incidents for structured problem tracking
  • Smart deployment triggers for critical vs non-critical updates
  • Optimized for performance with hot files and compression

Three-File Architecture (v0.4.11+)

File Structure

status-data/
├── current.json          # Time-series monitoring readings (updated every 5min)
├── daily-summary.json    # 90-day aggregated stats (v0.17.0+)
├── incidents.json        # Active and resolved incidents from GitHub Issues
├── maintenance.json      # Scheduled maintenance windows
└── archives/
    └── 2025/11/
        ├── history-2025-11-01.jsonl.gz   # Compressed daily archives
        ├── history-2025-11-02.jsonl.gz
        └── history-2025-11-03.jsonl      # Today (uncompressed)

File Details

1. current.json (Monitoring Data)

Purpose: Real-time endpoint health checks and response time tracking
Updated By: monitor-systems.yml workflow (every 5 minutes)
Source: Automated HTTP checks to configured endpoints
Retention: Rolling 14-day window (~4,000 readings for 5-minute checks)

Format: Array of compact readings

[
  {
    "t": 1699123456789,
    "svc": "api",
    "state": "up",
    "code": 200,
    "lat": 145,
    "err": null
  },
  {
    "t": 1699123756789,
    "svc": "website",
    "state": "down",
    "code": 500,
    "lat": 2500,
    "err": "Connection timeout"
  }
]

Fields:

  • t - Timestamp (milliseconds since epoch)
  • svc - Service name (e.g., 'api', 'website', 'database')
  • state - Status: 'up', 'down', 'degraded', or 'maintenance'
  • code - HTTP status code
  • lat - Latency in milliseconds
  • err - Error message (optional, only present if request failed)
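
In TypeScript terms, a reading matches a small interface like this (a sketch derived from the fields above; the plugin's own CompactReading type may differ in detail):

interface CompactReading {
  t: number;                                          // timestamp, ms since epoch
  svc: string;                                        // service name
  state: 'up' | 'down' | 'degraded' | 'maintenance';  // check status
  code: number;                                       // HTTP status code
  lat: number;                                        // latency in ms
  err?: string | null;                                // error message, if the request failed
}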

Commit Strategy:

  • Committed with [skip ci] tag
  • Does NOT trigger deployments (filtered by paths-ignore in deploy.yml)
  • Creates critical GitHub Issues when services go down

1b. daily-summary.json (Historical Aggregation - v0.17.0+)

Purpose: Aggregated daily statistics for 90-day heatmap visualization
Updated By: monitor-systems.yml workflow (regenerated on every monitor run)
Source: Aggregation from current.json and archived JSONL files
Retention: Rolling 90-day window (~10 KB for 2 services)

The Problem: The 90-day heatmap was 84% empty because it only read from current.json (14-day window).

The Solution: Aggregate daily statistics from archives into a compact summary file.

Format: Object with per-service daily entries

{
  "version": 1,
  "lastUpdated": "2025-12-31T22:00:00Z",
  "windowDays": 90,
  "services": {
    "api": [
      {
        "date": "2025-12-30",
        "uptimePct": 0.998,
        "avgLatencyMs": 145,
        "p95LatencyMs": 320,
        "checksTotal": 144,
        "checksPassed": 143,
        "incidentCount": 0
      }
    ]
  }
}

Fields:

  • version - Schema version (currently 1)
  • lastUpdated - ISO timestamp of last update
  • windowDays - Number of days covered
  • services - Object with service names as keys
  • date - ISO date string (YYYY-MM-DD)
  • uptimePct - Uptime percentage as decimal (0.0 to 1.0)
  • avgLatencyMs - Average latency in ms (null if no successful checks)
  • p95LatencyMs - 95th percentile latency in ms (null if no successful checks)
  • checksTotal - Total number of checks performed
  • checksPassed - Number of successful checks (up or maintenance)
  • incidentCount - Number of up→down transitions

Design Principles (ADR-002):

  • Store percentages, not strings: uptimePct: 0.998 not status: "ok"
  • Include p95 latency: Averages hide spikes; p95 reveals bad days
  • UTC dates: All date fields are ISO 8601 in UTC
  • Schema versioning: Enables future migrations
  • Raw counts: checksTotal and checksPassed allow recalculating with different rules
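
As a rough sketch of how one day's readings reduce to a summary entry (illustrative only, using the CompactReading shape sketched earlier; names are not from the actual implementation):

// Reduce one UTC day of readings for a single service to a daily-summary entry.
function summarizeDay(date: string, readings: CompactReading[]) {
  const passed = readings.filter((r) => r.state === 'up' || r.state === 'maintenance');
  const latencies = passed.map((r) => r.lat).sort((a, b) => a - b);

  // incidentCount: number of up→down transitions within the day
  let incidents = 0;
  for (let i = 1; i < readings.length; i++) {
    if (readings[i - 1].state === 'up' && readings[i].state === 'down') incidents++;
  }

  return {
    date,
    uptimePct: readings.length ? passed.length / readings.length : 0,
    avgLatencyMs: latencies.length
      ? Math.round(latencies.reduce((a, b) => a + b, 0) / latencies.length)
      : null,
    p95LatencyMs: latencies.length ? latencies[Math.floor(latencies.length * 0.95)] : null,
    checksTotal: readings.length,
    checksPassed: passed.length,
    incidentCount: incidents,
  };
}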

Hybrid Read Pattern:

The frontend uses a hybrid approach to ensure today's data is live:

// Frontend loads both files in parallel and parses the JSON bodies
const [summary, current] = await Promise.all([
  fetch('/status-data/daily-summary.json').then((r) => r.json()),
  fetch('/status-data/current.json').then((r) => r.json())
]);

// Merge: today's live entry from current.json + history from the summary
const data = [aggregateToday(current), ...summary.services[serviceName]];
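
aggregateToday is not defined above; a minimal sketch, assuming it applies the same per-day reduction (summarizeDay from the earlier sketch) to today's readings for the selected service:

// Build today's summary entry from the live current.json readings.
// Assumes serviceName is in scope, as in the snippet above.
function aggregateToday(readings: CompactReading[]) {
  const today = new Date().toISOString().slice(0, 10); // UTC date, YYYY-MM-DD
  const todays = readings.filter(
    (r) => r.svc === serviceName && new Date(r.t).toISOString().slice(0, 10) === today
  );
  return summarizeDay(today, todays);
}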

Bootstrap Existing Data:

npx bootstrap-summary --output-dir status-data --window 90

See ADR-002 for full architecture documentation.

2. incidents.json (Issue-Based Incidents)

Purpose: Track incidents reported via GitHub Issues
Updated By: status-update.yml workflow (on issue events + hourly)
Source: GitHub Issues with status label
Retention: Active incidents + last 30 days of resolved incidents

Format: Array of incident objects

[
  {
    "id": 123,
    "title": "API experiencing high latency",
    "severity": "major",
    "status": "open",
    "systems": ["api", "database"],
    "createdAt": "2025-11-03T10:00:00Z",
    "updatedAt": "2025-11-03T12:30:00Z",
    "closedAt": null,
    "body": "Users reporting slow API responses...",
    "url": "https://github.com/org/repo/issues/123",
    "comments": [
      {
        "author": "devops-bot",
        "createdAt": "2025-11-03T11:00:00Z",
        "body": "Database query optimization in progress"
      }
    ]
  }
]

Severity Levels (from issue labels):

  • critical - Complete service outage
  • major - Significant degradation
  • minor - Minor issues, partial impact
  • maintenance - Planned maintenance

Commit Strategy:

  • Committed with [skip ci] tag
  • If incidents.json contains critical incidents → triggers a repository_dispatch event
  • repository_dispatch triggers immediate deployment (~2 min)

3. maintenance.json (Scheduled Maintenance)

Purpose: Track scheduled and completed maintenance windows
Updated By: status-update.yml workflow (on issue events + hourly)
Source: GitHub Issues with maintenance label and YAML frontmatter
Retention: Upcoming + in-progress + last 60 days of completed

Format: Array of maintenance objects

[
  {
    "id": 456,
    "title": "Database upgrade to v2.0",
    "status": "upcoming",
    "systems": ["api", "database"],
    "start": "2025-11-15T02:00:00Z",
    "end": "2025-11-15T04:00:00Z",
    "createdAt": "2025-11-01T10:00:00Z",
    "body": "Scheduled database upgrade to improve performance...",
    "url": "https://github.com/org/repo/issues/456"
  }
]

Status Calculation:

  • upcoming - Start time is in the future
  • in-progress - Current time between start and end
  • completed - End time has passed OR issue is closed
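
A minimal sketch of that calculation (field names follow maintenance.json above; the closed flag comes from the underlying issue):

// Derive a maintenance window's status from its times and issue state.
function maintenanceStatus(start: string, end: string, issueClosed: boolean) {
  const now = Date.now();
  if (issueClosed || now > Date.parse(end)) return 'completed';
  if (now < Date.parse(start)) return 'upcoming';
  return 'in-progress';
}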

Issue Format:

Create a GitHub issue with the maintenance label and YAML frontmatter:

---
start: 2025-11-15T02:00:00Z
end: 2025-11-15T04:00:00Z
systems:
  - api
  - database
---

Scheduled database upgrade to improve performance.

**Expected Impact:**
- API will be unavailable during the maintenance window
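
How the workflow extracts the frontmatter is not shown here; one plausible sketch, assuming a YAML parser such as the yaml npm package:

// Split YAML frontmatter from the issue body (parser choice is an assumption).
import { parse } from 'yaml';

function parseMaintenanceIssue(body: string) {
  const match = body.match(/^---\r?\n([\s\S]*?)\r?\n---\r?\n?/);
  if (!match) return null; // not a maintenance issue body
  const meta = parse(match[1]); // { start, end, systems }
  return { ...meta, description: body.slice(match[0].length).trim() };
}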

Commit Strategy:

  • Committed with [skip ci] tag
  • Does NOT trigger immediate deployment
  • Picked up by hourly scheduled deployment

Architecture

Data Format

Current Data (current.json)

A rolling 14-day window of all monitoring readings, in the compact format documented under current.json in File Details above.

Archive Files (archives/YYYY/MM/history-YYYY-MM-DD.jsonl)

Daily JSONL files containing all readings for that day. Each line is a complete JSON object:

{"t":1699123456789,"svc":"api","state":"up","code":200,"lat":145}
{"t":1699123756789,"svc":"website","state":"up","code":200,"lat":98}
{"t":1699124056789,"svc":"database","state":"degraded","code":200,"lat":850}

Archive Structure:

build/status-data/
├── current.json                           # Hot file (14-day rolling window)
└── archives/
    └── 2025/
        └── 11/
            ├── history-2025-11-01.jsonl.gz   # Compressed (yesterday and older)
            ├── history-2025-11-02.jsonl.gz   # Compressed
            └── history-2025-11-03.jsonl      # Uncompressed (today)

Monitoring Script

The monitoring script (scripts/monitor.js) performs the following tasks:

  1. Check Endpoint - Make HTTP request and measure response time
  2. Determine Status - Calculate status (up/down/degraded) based on response
  3. Append to JSONL - Write one line to today's history-YYYY-MM-DD.jsonl file
  4. Rebuild current.json - Scan last 14 days and rebuild the hot file
  5. Generate Commit Message - Create emoji-decorated message for Git commit
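
For step 2, the up/down/degraded decision presumably combines the expected status codes with the response-time ceiling; a hedged sketch (parameter names mirror the CLI options below):

// Decide a reading's state from the response code and latency.
function determineStatus(
  code: number | null,          // null if the request failed outright
  latencyMs: number,
  expectedCodes: number[],      // --expected-codes
  maxResponseTime: number       // --max-response-time
): 'up' | 'down' | 'degraded' {
  if (code === null || !expectedCodes.includes(code)) return 'down';
  if (latencyMs > maxResponseTime) return 'degraded';
  return 'up';
}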

Usage:

# Single system check
node scripts/monitor.js --system api --url https://api.example.com/health

# Multiple systems from config file
node scripts/monitor.js --config .monitorrc.json

# With custom options
node scripts/monitor.js \
  --system website \
  --url https://example.com \
  --method GET \
  --timeout 10000 \
  --expected-codes 200,301,302 \
  --max-response-time 30000 \
  --output-dir status-data \
  --verbose

Options:

  • --system <name> - System name (e.g., 'api', 'website')
  • --url <url> - URL to monitor
  • --method <method> - HTTP method (default: GET)
  • --timeout <ms> - Request timeout in milliseconds (default: 10000)
  • --expected-codes <codes> - Comma-separated expected status codes (default: 200,301,302)
  • --max-response-time <ms> - Maximum response time before degraded (default: 30000)
  • --output-dir <path> - Output directory (default: status-data)
  • --config <file> - JSON config file with system definitions
  • --verbose - Enable verbose logging

Workflows

The monitoring system uses three coordinated workflows:

1. Monitor Systems (monitor-systems.yml)

Trigger: Every 5 minutes (cron: */5 * * * *)

Purpose: Check endpoint health and update monitoring data

Process:

  1. Checkout repository
  2. Setup Node.js 20
  3. Run monitoring script with --config .monitorrc.json
  4. Script monitors each system sequentially
  5. Append readings to archives/YYYY/MM/history-YYYY-MM-DD.jsonl
  6. Rebuild current.json from last 14 days
  7. Commit with [skip ci] tag
  8. If critical failure detected → Create GitHub Issue with critical + status labels

Configuration: .monitorrc.json in repository root

{
  "systems": [
    {
      "system": "api",
      "url": "https://api.example.com/health",
      "expectedCodes": [200],
      "maxResponseTime": 30000
    },
    {
      "system": "website",
      "url": "https://example.com"
    }
  ]
}

Sequential Architecture (v0.4.10+):

  • Single job monitors all systems in sequence
  • Zero data loss (no race conditions)
  • Single commit with all data
  • Runtime: ~5 seconds per system
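
In outline, the sequential run looks something like this (helper names are illustrative, not the script's actual API):

// One writer, one pass over all systems, one commit's worth of data.
for (const system of config.systems) {
  const reading = await checkEndpoint(system);              // HTTP check + timing
  appendLine(todayArchivePath(), JSON.stringify(reading));  // append to today's JSONL
}
rebuildCurrentJson(); // scan the last 14 days once, after all checks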

Commit Message: Update monitoring data [skip ci]

Critical Issue Creation:

  • When service goes down → Automatically creates GitHub Issue
  • Labels: status, critical, <service-name>
  • Title: 🔴 <Service> is down
  • Triggers status-update.yml via issue event
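
A sketch of the issue-creation step, assuming it goes through the GitHub REST API (for example via @octokit/rest; the actual mechanism is not shown here):

// Open a critical issue with the documented title and labels.
import { Octokit } from '@octokit/rest';

async function openCriticalIssue(service: string, error: string) {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  await octokit.rest.issues.create({
    owner: 'org', // placeholders, as in the example URLs above
    repo: 'repo',
    title: `🔴 ${service} is down`,
    labels: ['status', 'critical', service],
    body: `Automated monitoring detected a failure: ${error}`,
  });
}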

2. Status Update (status-update.yml)

Trigger:

  • GitHub Issue events (opened, closed, labeled, edited)
  • Schedule: Every hour (cron: 0 * * * *)
  • Manual: workflow_dispatch

Purpose: Generate incidents.json and maintenance.json from GitHub Issues

Process:

  1. Checkout repository
  2. Setup Node.js 20
  3. Run npx stentorosaur-update-status --write-incidents --write-maintenance
  4. Fetch all GitHub Issues with status or maintenance labels
  5. Generate incidents.json (active + last 30 days resolved)
  6. Generate maintenance.json (upcoming + in-progress + last 60 days)
  7. Commit with [skip ci] tag
  8. Smart Deployment Trigger:
    • If incidents.json contains critical incidents → Dispatch repository_dispatch event
    • Event type: status-updated
    • Triggers immediate deployment via deploy.yml

CLI Command:

npx stentorosaur-update-status \
  --write-incidents \
  --write-maintenance \
  --output-dir status-data \
  --verbose

Commit Message: Update status data [skip ci]

Critical Dispatch Logic:

- name: Trigger deployment for critical incidents
  if: contains(github.event.issue.labels.*.name, 'critical')
  uses: peter-evans/repository-dispatch@v2
  with:
    event-type: status-updated
    token: ${{ secrets.GITHUB_TOKEN }}

3. Deploy Workflows

Two deployment workflows work together:

deploy.yml (Immediate Deployment)

Triggers:

  • push to main branch (code changes)
  • repository_dispatch with type status-updated (critical incidents)
  • workflow_dispatch (manual)

Path Filtering (v0.4.13+):

on:
  push:
    branches: [main]
    paths-ignore:
      - 'status-data/current.json'
      - 'status-data/archives/**'
  repository_dispatch:
    types: [status-updated]

Result:

  • Monitoring commits (every 5 min) → NOT deployed
  • Code changes → Deployed immediately
  • Critical incidents → Deployed immediately (~2 min)

deploy-scheduled.yml (Hourly Deployment)

Trigger: Schedule every hour (cron: 0 * * * *)

Purpose: Pick up non-critical status updates

Process:

  1. Checkout repository with all status files
  2. Build Docusaurus site (plugin reads 3 status files)
  3. Deploy to GitHub Pages

Result:

  • Non-critical incidents → Deployed within 1 hour
  • Maintenance updates → Deployed within 1 hour

Workflow Interaction Diagram

Every 5 min:
  monitor-systems.yml
    ↓ Check endpoints
    ↓ Update current.json
    ↓ Commit [skip ci]
    ↓ NO DEPLOYMENT
    ↓ If critical down
    ↓ Create Issue → Triggers status-update.yml

On Issue Events + Hourly:
  status-update.yml
    ↓ Fetch GitHub Issues
    ↓ Generate incidents.json + maintenance.json
    ↓ Commit [skip ci]
    ↓ Check for critical
    ├─ Critical → repository_dispatch → deploy.yml (immediate)
    └─ Non-critical → Wait for deploy-scheduled.yml (hourly)

Deployment:
  deploy.yml (immediate)
    ↓ Triggered by: code push, repository_dispatch, manual
    ↓ Ignores: current.json, archives/**
    ↓ Reads: incidents.json, maintenance.json
    ↓ Build + Deploy
  
  deploy-scheduled.yml (hourly)
    ↓ Triggered by: schedule
    ↓ Reads: current.json, incidents.json, maintenance.json
    ↓ Build + Deploy

Data Flow Summary

| Event | Workflow | Files Updated | Deployment | Latency |
| --- | --- | --- | --- | --- |
| Endpoint check (every 5 min) | monitor-systems.yml | current.json | None | N/A |
| Critical endpoint down | monitor-systems.yml | current.json + creates Issue | Via status-update.yml → deploy.yml | ~2 min |
| Issue opened/closed | status-update.yml | incidents.json, maintenance.json | deploy.yml if critical, else hourly | 2 min / 1 hour |
| Hourly check | status-update.yml | incidents.json, maintenance.json | deploy-scheduled.yml | 1 hour |
| Code push to main | N/A | N/A | deploy.yml | ~5 min |

Sequential Monitoring Performance

  • 2 systems: ~10 s total (vs ~5 s parallel with 50% data loss)
  • 10 systems: ~50 s total (vs ~5 s parallel with 90% data loss)
  • Still completes within the 5-minute cron interval for most deployments

Commit Messages:

  • Single commit contains summary: Update monitoring data [skip ci]
  • Check output shows: 🟩 api is up (200 in 145 ms), 🟨 website degraded, etc.

Compress Archives (compress-archives.yml)

Runs daily at 00:05 UTC to compress yesterday's JSONL files.

Steps:

  1. Checkout repository
  2. Find yesterday's uncompressed JSONL file
  3. Compress with gzip
  4. Commit with message: 📦 Compress archive for YYYY-MM-DD
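
Steps 2 and 3 amount to something like the following Node sketch (the workflow may equally use the gzip CLI; this is illustrative):

// Gzip a JSONL archive in place using Node's built-in zlib.
import { gzipSync } from 'zlib';
import { existsSync, readFileSync, writeFileSync, unlinkSync } from 'fs';

function compressArchive(jsonlPath: string) {
  if (!existsSync(jsonlPath)) return; // nothing to compress yet
  writeFileSync(`${jsonlPath}.gz`, gzipSync(readFileSync(jsonlPath)));
  unlinkSync(jsonlPath); // the .gz file replaces the uncompressed original
}

compressArchive('status-data/archives/2025/11/history-2025-11-02.jsonl');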

File Lifecycle:

  • Today - history-2025-11-03.jsonl (uncompressed, actively appending)
  • Tomorrow - File becomes history-2025-11-03.jsonl.gz (compressed, read-only)

Data Consumption

The plugin's loadHistoricalData() function has been updated to read from current.json:

// Fetch current.json (14-day rolling window)
const response = await fetch(`/${dataPath}/current.json`);
const readings: CompactReading[] = await response.json();

// Filter by service name
const serviceReadings = readings.filter(r => r.svc === 'api');

// Convert to legacy format for backward compatibility
const history = serviceReadings.map(r => ({
  timestamp: new Date(r.t).toISOString(),
  status: r.state,
  code: r.code,
  responseTime: r.lat,
}));

Fallback: If current.json doesn't exist, the function falls back to the legacy systems/*.json format.
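
In outline, the fallback check might look like this (helper names are illustrative):

// Prefer current.json; fall back to the legacy per-system files.
const response = await fetch(`/${dataPath}/current.json`);
const history = response.ok
  ? convertCompactReadings(await response.json())                         // new format, as above
  : await loadLegacySystemFile(`/${dataPath}/systems/${systemName}.json`); // legacy fallback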

Benefits

1. No Git History Pollution

  • Each check appends one line to today's JSONL file
  • Only one file changes per check (not entire JSON array)
  • Git diffs are minimal and readable

2. Fast Site Loads

  • current.json contains only last 14 days (~4,000 readings for 5-minute checks)
  • Small file size (~200-400 KB) loads quickly
  • No need to parse years of historical data

3. Efficient Storage

  • Daily JSONL files are compressed after 24 hours
  • Gzip compression typically achieves 80-90% reduction
  • Old data remains accessible but doesn't bloat repository

4. Simple Append Operations

  • No JSON parsing/stringifying for every check
  • Just append one line: echo '{"t":...}' >> file.jsonl
  • Works efficiently even with large files

5. Easy to Query

  • JSONL is line-oriented, perfect for streaming
  • Can use grep, jq, or other CLI tools
  • Easy to merge multiple days for analysis

Migration from Legacy Format

The plugin automatically detects and supports both formats:

  1. New format (current.json) - Used if available
  2. Legacy format (systems/*.json) - Fallback for backward compatibility

No migration required - Start using the new monitoring script and it will create the new format. The plugin will automatically use it once current.json exists.

Configuration Example

Create .monitorrc.json in your repository root:

{
  "systems": [
    {
      "system": "api",
      "url": "https://api.example.com/health",
      "method": "GET",
      "timeout": 10000,
      "expectedCodes": [200, 301, 302],
      "maxResponseTime": 30000
    },
    {
      "system": "website",
      "url": "https://example.com",
      "method": "GET",
      "timeout": 10000,
      "expectedCodes": [200],
      "maxResponseTime": 30000
    }
  ]
}

Then update your workflow to use it:

- name: Monitor all systems
  run: node scripts/monitor.js --config .monitorrc.json

Troubleshooting

No data appearing in current.json

  1. Check that status-data/archives/ directory exists
  2. Verify monitoring script is running (check GitHub Actions logs)
  3. Ensure workflow has contents: write permission

Old format still being used

  1. Delete status-data/systems/*.json files
  2. Wait for next monitoring run to create current.json
  3. Plugin will automatically switch to new format

Archive compression not working

  1. Check that compress-archives.yml workflow is enabled
  2. Verify it's running daily (check workflow runs)
  3. Ensure yesterday's JSONL file exists before compression

Performance Characteristics

Monitoring Script:

  • HTTP request: ~100-500ms (depends on endpoint)
  • JSONL append: ~1-5ms (simple file write)
  • current.json rebuild: ~50-200ms (scan 14 days)
  • Total time per check: ~200-700ms

Data Loading:

  • current.json fetch: ~50-200ms (200-400 KB file)
  • Parsing 4,000 readings: ~10-30ms
  • Total load time: ~100-300ms

Storage:

  • One day of 5-minute checks: ~50 KB uncompressed
  • After gzip: ~5-10 KB compressed
  • 14 days in current.json: ~200-400 KB
  • One year of archives: ~2-4 MB (compressed)

Future Enhancements

Issue #7 proposes creating a reusable GitHub Action (similar to Upptime's upptime/uptime-monitor@v1.41.0). This would:

  • Package the monitoring script as a Docker action
  • Provide standardized configuration
  • Make it easy to use in any repository
  • Support advanced features (Globalping, custom checks, etc.)

Example usage:

- uses: amiable-dev/status-monitor-action@v1
  with:
    systems: |
      api: https://api.example.com/health
      website: https://example.com

Related Issues

  • Issue #7: Create reusable GitHub Action for endpoint monitoring
  • Issue #19: Status Data Optimization with hot file + daily archives
  • Issue #22: Add Globalping configuration options
  • Issue #23: Implement ICMP ping support