Buffden · Mar 18, 2026 · Mar 17, 2026
diff --git a/docs/implementation/PHASE_3_ALERT_BASELINE.md b/docs/implementation/PHASE_3_ALERT_BASELINE.md
@@ -0,0 +1,33 @@
+# Phase 3 Alert Baseline
+
+This baseline defines starter alerts for runtime reliability. Thresholds should be tuned with production traffic history.
+
+## SLO-aligned Starter Alerts
+
+1. Error rate alert
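+- Condition: 5xx responses exceed 1% of all requests over a 5-minute window
+- Signal: `tinyurl.http.server.requests.total` with `status_class=5xx` (exported to Prometheus as `tinyurl_http_server_requests_total`)
+- Action: check the most recent deploy, downstream DB state, and app logs filtered by `correlation_id`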
+
+2. Latency alert
+- Condition: P99 latency > 500 ms over 10 minutes
+- Signal: `http.server.requests` percentile metrics from Actuator/Prometheus
+- Action: inspect slow endpoints, DB pool pressure, and host CPU/memory
+
+3. Readiness degradation
+- Condition: `/actuator/health/readiness` reports a status other than `UP` for two or more consecutive checks
+- Signal: readiness endpoint health status
+- Action: inspect DB connectivity and Flyway startup validation state
+
+4. Liveness instability
+- Condition: repeated container restarts or liveness failures within a 10-minute window (choose the count from observed behavior, as with the other thresholds)
+- Signal: container restart count + `/actuator/health/liveness`
+- Action: inspect fatal exceptions, memory pressure, and image/runtime mismatches
+
+## Triage Flow
+
+1. Confirm user impact from error and latency graphs.
+2. Filter logs by `correlation_id` and endpoint path.
+3. Validate dependency health (`db`, readiness group).
+4. Roll back recent deploy if regression is confirmed.
+5. Record a post-incident action item, including any metric threshold adjustment that is needed.
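Review note on the error rate alert above: the condition "5xx rate > 1% over 5 minutes" is a ratio of counter increases across the window, not a ratio of raw counter values. The Python sketch below only illustrates that arithmetic. It assumes two samples of the request counter taken five minutes apart; `CounterSample`, `error_rate`, and `error_rate_breached` are hypothetical names, and the numbers are made up. In practice this check would more likely live in the monitoring system's alert rules than in a script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Cumulative counter values at one scrape (hypothetical shape)."""
    total: float       # all requests counted so far
    errors_5xx: float  # requests tagged status_class=5xx so far


def error_rate(start: CounterSample, end: CounterSample) -> float:
    """Fraction of requests in the window that ended in a 5xx status."""
    requests = end.total - start.total
    errors = end.errors_5xx - start.errors_5xx
    if requests <= 0:
        # Idle window, or the counter reset after a restart: do not alert.
        return 0.0
    return max(errors, 0.0) / requests


def error_rate_breached(start: CounterSample, end: CounterSample,
                        threshold: float = 0.01) -> bool:
    """True when the windowed 5xx share is above the threshold (1% by default)."""
    return error_rate(start, end) > threshold


# Illustrative samples five minutes apart: 30 new errors out of 2000 new requests.
window_start = CounterSample(total=10_000, errors_5xx=20)
window_end = CounterSample(total=12_000, errors_5xx=50)
print(error_rate(window_start, window_end))
print(error_rate_breached(window_start, window_end))
```

Returning 0.0 for an empty or negative window is a design choice of this sketch: it keeps an idle service or a freshly restarted container from tripping the alert on its own.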
diff --git a/docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md b/docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md
@@ -0,0 +1,149 @@
+# Phase 3 Implementation Explanation
+
+This document explains what was implemented in Phase 3 (Observability), why it was added, and how to verify it before committing.
+
+## Goal of Phase 3
+
+Add production-grade observability for:
+
+1. Structured logging with correlation id
+2. Request/error metrics and latency visibility
+3. Readiness and liveness health visibility
+4. Basic alerting and log-aggregation operational guidance
+
+## What Was Implemented
+
+### 1) Structured JSON logging
+
+Files:
+
+- tinyurl/src/main/resources/logback-spring.xml
+- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java
+
+What changed:
+
+- Logging output is now JSON using logstash-logback-encoder.
+- A correlation id is propagated through the X-Correlation-Id HTTP header.
+- The correlation id is added to the logging MDC under the key correlation_id.
+- Request metadata is logged (method, route, status, duration, client_ip, user_agent).
+
+Why:
+
+- Makes logs machine-parseable and searchable.
+- Enables request-level traceability across services and failures.
+
+### 2) Metrics instrumentation
+
+Files:
+
+- tinyurl/build.gradle.kts
+- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java
+- tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java
+- tinyurl/src/main/resources/application.yaml
+
+What changed:
+
+- Added Prometheus registry dependency.
+- Added request counter metric:
+  - tinyurl.http.server.requests.total
+- Added request duration metric:
+  - tinyurl.http.server.request.duration
+- Added explicit error counter metric in exception handlers:
+  - tinyurl.http.server.errors.total
+- Enabled actuator metrics and prometheus endpoints.
+- Enabled percentiles/histogram for http.server.requests.
+- Added common metric tag: application=tinyurl.
+
+Why:
+
+- Supports request rate, latency, and error-rate monitoring.
+- Enables the baseline alert conditions (error rate and P99 latency).
+
+### 3) Health model hardening
+
+File:
+
+- tinyurl/src/main/resources/application.yaml
+
+What changed:
+
+- Enabled health details and per-component status in the health endpoint response.
+- Defined explicit health groups:
+  - readiness: readinessState, db, diskSpace, ping
+  - liveness: livenessState, ping
+- Exposed endpoints:
+  - /actuator/health
+  - /actuator/metrics
+  - /actuator/prometheus
+
+Why:
+
+- Improves dependency-aware readiness behavior.
+- Makes startup/degraded dependency states visible and actionable.
+
+### 4) Operational documentation for alerting and logs
+
+Files:
+
+- docs/implementation/PHASE_3_ALERT_BASELINE.md
+- docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md
+- docs/implementation/PHASE_3_OBSERVABILITY.md
+
+What changed:
+
+- Added starter alert thresholds and triage flow.
+- Added baseline log aggregation architecture and checklist.
+- Linked both documents from the Phase 3 observability execution guide (PHASE_3_OBSERVABILITY.md).
+
+Why:
+
+- Ensures implementation is operable, not only code-complete.
+- Gives clear next actions for production rollout.
+
+## Verification Steps Used
+
+1. Unit/integration tests:
+
+- run the app-context, service-logic, and encoder tests
+- expected result: all tests pass
+
+2. Runtime verification:
+
+- docker compose up -d --build
+- check the readiness endpoint (/actuator/health/readiness)
+- check the liveness endpoint (/actuator/health/liveness)
+- check that the Prometheus endpoint (/actuator/prometheus) exports:
+  - tinyurl_http_server_requests_total
+  - tinyurl_http_server_errors_total
+  - hikaricp metrics
+
+3. Error metric trigger check:
+
+- send a known invalid request (for example, a short code in an unrecognized format)
+- confirm that tinyurl_http_server_errors_total increments and carries the expected tags
+
+## Commit Scope (Phase 3)
+
+Code/config files:
+
+- tinyurl/build.gradle.kts
+- tinyurl/src/main/resources/application.yaml
+- tinyurl/src/main/resources/logback-spring.xml
+- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java
+- tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java
+
+Docs:
+
+- docs/implementation/PHASE_3_OBSERVABILITY.md
+- docs/implementation/PHASE_3_ALERT_BASELINE.md
+- docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md
+- docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md
+
+## Suggested Commit Message
+
+feat(observability): implement phase 3 logging, metrics, health groups, and operational baselines
+
+## Notes
+
+- This implementation intentionally stops at the Phase 3 baseline level.
+- Full external dashboard platform rollout and distributed tracing are still out of scope for this phase.
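Review note on the runtime verification step above: the check that the Prometheus endpoint exports the expected metrics can be scripted instead of eyeballed. The sketch below is one way to do it in Python, assuming the response body of /actuator/prometheus has already been fetched into a string. `exported_metric_names`, `missing_metrics`, and `has_hikari_metrics` are hypothetical helper names, and the sample text is invented for illustration rather than captured from the running service.

```python
import re
from typing import Iterable, Set

# Metric names listed in the verification steps.
REQUIRED_METRICS = {
    "tinyurl_http_server_requests_total",
    "tinyurl_http_server_errors_total",
}

# A sample line looks like: metric_name{label="value"} 12.0
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+\S+")


def exported_metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text: str,
                    required: Iterable[str] = REQUIRED_METRICS) -> Set[str]:
    """Required metric names that do not appear in the exposition output."""
    return set(required) - exported_metric_names(exposition_text)


def has_hikari_metrics(exposition_text: str) -> bool:
    """True when at least one connection-pool metric is exported."""
    return any(name.startswith("hikaricp_")
               for name in exported_metric_names(exposition_text))


# Invented sample body; the error counter is deliberately absent.
sample = """\
# TYPE tinyurl_http_server_requests_total counter
tinyurl_http_server_requests_total{application="tinyurl",status_class="2xx"} 41.0
hikaricp_connections_active{application="tinyurl"} 0.0
"""
print(sorted(missing_metrics(sample)))
print(has_hikari_metrics(sample))
```

A counter may not appear in the output until it has been incremented at least once, so an empty result from `missing_metrics` is most meaningful after the error metric trigger check has been run.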
diff --git a/docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md b/docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md
@@ -0,0 +1,63 @@
+# Phase 3 Log Aggregation Baseline
+
+This document defines a minimal production-ready approach for collecting and querying structured application logs.
+
+## Goal
+
+Ensure logs from all services can be searched by time range, severity, endpoint, and correlation id.
+
+## Required Log Fields
+
+All application logs should include at least:
+
+- `@timestamp`
+- `level`
+- `message`
+- `service`
+- `correlation_id`
+- `logger_name`
+- request metadata fields when available (`method`, `route`, `status`, `duration_ms`)
+
+## Recommended Pipeline (Baseline)
+
+1. App emits JSON logs to stdout.
+2. Container runtime captures stdout/stderr.
+3. Log shipper (CloudWatch Agent, Fluent Bit, Filebeat, or Vector) forwards logs.
+4. Central store indexes logs (CloudWatch Logs, ELK/OpenSearch, Grafana Loki).
+5. Dashboards and alerts query centralized logs.
+
+## Minimum Alerts for Logs
+
+1. Error volume spike
+- Trigger when the volume of ERROR logs stays above its normal baseline for 5 minutes.
+
+2. Correlation-id missing rate
+- Trigger when log events without a `correlation_id` exceed 1% of all log events.
+
+3. Exception signature surge
+- Trigger when the same exception signature suddenly recurs far more often than usual.
+
+## Triage Query Examples
+
+1. Correlate one failing request
+- Filter by `correlation_id` and inspect all matching events.
+
+2. Endpoint failure analysis
+- Filter by `route` and `status>=500`, then group by exception type or error code.
+
+3. Slow request analysis
+- Filter by high `duration_ms`, group by route and time window.
+
+## Security and Privacy Notes
+
+- Never log credentials, tokens, or other secrets, in full or in part.
+- Mask sensitive fields before logging.
+- Keep retention and access controls aligned with security policy.
+
+## Rollout Checklist
+
+- [ ] JSON logs enabled in all environments
+- [ ] Log shipping configured for app containers
+- [ ] Correlation id searchable in central logs
+- [ ] Error and latency dashboards created
+- [ ] Alert rules validated with test events
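Review note on the triage queries and the correlation-id alert above: before a central store is in place, the same questions can be asked of raw JSON log lines. The Python sketch below is a minimal illustration that assumes one JSON object per line with the field names from the required-fields list. The helper names, the routes, and the sample events are all hypothetical; a real log store would express these as queries in its own language.

```python
import json


def parse_events(lines):
    """Parse JSON log lines, skipping blanks and anything that is not a JSON object."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON output such as a startup banner
        if isinstance(event, dict):
            events.append(event)
    return events


def by_correlation_id(events, correlation_id):
    """All events that belong to one request."""
    return [e for e in events if e.get("correlation_id") == correlation_id]


def server_errors_by_route(events):
    """Count events with status >= 500, grouped by route."""
    counts = {}
    for e in events:
        status = e.get("status")
        if isinstance(status, int) and status >= 500:
            route = e.get("route", "unknown")
            counts[route] = counts.get(route, 0) + 1
    return counts


def missing_correlation_rate(events):
    """Share of events with no correlation_id (0.0 when there are no events)."""
    if not events:
        return 0.0
    missing = sum(1 for e in events if not e.get("correlation_id"))
    return missing / len(events)


# Invented sample events; routes and ids are placeholders.
raw_logs = [
    '{"level":"INFO","route":"/{code}","status":302,"duration_ms":4,"correlation_id":"abc-1"}',
    '{"level":"ERROR","route":"/api/urls","status":500,"duration_ms":120,"correlation_id":"abc-2"}',
    '{"level":"ERROR","message":"db timeout","correlation_id":"abc-2"}',
    '{"level":"WARN","message":"event without an id"}',
]
events = parse_events(raw_logs)
print(len(by_correlation_id(events, "abc-2")))
print(server_errors_by_route(events))
print(missing_correlation_rate(events))
```

In this sample one event in four lacks a `correlation_id`, which would be well above the 1% trigger from the minimum alerts list.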