Merged
33 changes: 33 additions & 0 deletions docs/implementation/PHASE_3_ALERT_BASELINE.md
@@ -0,0 +1,33 @@
# Phase 3 Alert Baseline

This baseline defines starter alerts for runtime reliability. Thresholds should be tuned with production traffic history.

## SLO-aligned Starter Alerts

1. Error rate alert
   - Condition: 5xx responses exceed 1% of all requests over 5 minutes
   - Signal: `tinyurl.http.server.requests.total` with `status_class=5xx`
   - Action: check recent deploy, downstream DB state, and app logs by `correlation_id`

2. Latency alert
   - Condition: P99 latency > 500 ms over 10 minutes
   - Signal: `http.server.requests` percentile metrics from Actuator/Prometheus
   - Action: inspect slow endpoints, DB pool pressure, and host CPU/memory

3. Readiness degradation
   - Condition: `/actuator/health/readiness` not `UP` for 2+ checks
   - Signal: readiness endpoint health status
   - Action: inspect DB connectivity and Flyway startup validation state

4. Liveness instability
   - Condition: frequent restarts or liveness failures in 10 minutes
   - Signal: container restart count + `/actuator/health/liveness`
   - Action: inspect fatal exceptions, memory pressure, and image/runtime mismatches
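
The error rate condition above can be sketched in plain Java. This is only an illustration of the arithmetic, not the alerting rule itself: the `ErrorRateAlert` class, the `Sample` record, and the idea of summing per-interval counts are assumptions made for this example, and in practice the evaluation would normally live in the monitoring system.

```java
import java.util.List;

/** Hypothetical sketch: evaluates a "5xx rate > 1% over the window" condition. */
final class ErrorRateAlert {

    /** Request counts observed in one scrape interval (hypothetical type). */
    record Sample(long totalRequests, long serverErrors) {}

    /** Fraction of requests in the window that returned a 5xx status. */
    static double errorRate(List<Sample> window) {
        long total = 0;
        long errors = 0;
        for (Sample s : window) {
            total += s.totalRequests();
            errors += s.serverErrors();
        }
        // No traffic means no measurable error rate; return 0 instead of dividing by zero.
        return total == 0 ? 0.0 : (double) errors / total;
    }

    /** True when the error rate is strictly above the threshold (0.01 for 1%). */
    static boolean shouldFire(List<Sample> window, double threshold) {
        return errorRate(window) > threshold;
    }

    public static void main(String[] args) {
        // Illustrative values only: five one-minute samples.
        List<Sample> window = List.of(
                new Sample(200, 1), new Sample(200, 2), new Sample(200, 3),
                new Sample(200, 4), new Sample(200, 5));
        System.out.println("error rate = " + errorRate(window));
        System.out.println("fire = " + shouldFire(window, 0.01));
    }
}
```

Computing the rate over summed counts, rather than averaging per-interval rates, keeps a quiet minute with a single failure from dominating the result.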

## Triage Flow

1. Confirm user impact from error and latency graphs.
2. Filter logs by `correlation_id` and endpoint path.
3. Validate dependency health (`db`, readiness group).
4. Roll back recent deploy if regression is confirmed.
5. Add post-incident action item with metric threshold adjustment if needed.
149 changes: 149 additions & 0 deletions docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md
@@ -0,0 +1,149 @@
# Phase 3 Implementation Explanation

This document explains what was implemented in Phase 3 (Observability), why it was added, and how to verify it before committing.

## Goal of Phase 3

Add production-grade observability for:

1. Structured logging with correlation id
2. Request/error metrics and latency visibility
3. Readiness and liveness health visibility
4. Basic alerting and log-aggregation operational guidance

## What Was Implemented

### 1) Structured JSON logging

Files:

- tinyurl/src/main/resources/logback-spring.xml
- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java

What changed:

- Logging output is now JSON using logstash-logback-encoder.
- A correlation id is propagated through X-Correlation-Id.
- Correlation id is added to MDC as correlation_id.
- Request metadata is logged (method, route, status, duration, client_ip, user_agent).

Why:

- Makes logs machine-parseable and searchable.
- Enables request-level traceability across services and failures.
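
As a rough illustration of the correlation id handling described above, the sketch below resolves the id for one request. The header name and MDC key come from this document; the `CorrelationIds` class and the fallback to a random UUID when the header is absent are assumptions, since the filter's actual behavior is not shown here.

```java
import java.util.UUID;

/** Hypothetical helper: picks the correlation id for one request. */
final class CorrelationIds {

    /** Header carrying the id between services (from this document). */
    static final String HEADER = "X-Correlation-Id";

    /** MDC key under which the id is logged (from this document). */
    static final String MDC_KEY = "correlation_id";

    /**
     * Reuses the caller's id when one was sent, so a request keeps a single id
     * across services. Generating a UUID otherwise is an assumption of this sketch.
     */
    static String resolve(String incomingHeaderValue) {
        if (incomingHeaderValue != null && !incomingHeaderValue.isBlank()) {
            return incomingHeaderValue.trim();
        }
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("abc-123")); // caller-supplied id is kept
        System.out.println(resolve(null));      // a fresh id is generated
    }
}
```

In a servlet filter the resolved value would typically be placed in the SLF4J MDC under `correlation_id` before the request is handled and removed in a `finally` block, so that ids do not leak between requests on pooled threads.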

### 2) Metrics instrumentation

Files:

- tinyurl/build.gradle.kts
- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java
- tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java
- tinyurl/src/main/resources/application.yaml

What changed:

- Added Prometheus registry dependency.
- Added request counter metric:
  - `tinyurl.http.server.requests.total`
- Added request duration metric:
  - `tinyurl.http.server.request.duration`
- Added explicit error counter metric in exception handlers:
  - `tinyurl.http.server.errors.total`
- Enabled actuator metrics and prometheus endpoints.
- Enabled percentiles/histogram for `http.server.requests`.
- Added common metric tag: `application=tinyurl`.

Why:

- Supports request rate, latency, and error-rate monitoring.
- Enables baseline alert conditions (error and p99 latency).
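
The alert baseline filters the request counter by `status_class=5xx`. How the filter derives that tag is not shown in this change, so the following is only a plausible sketch; the `StatusClass` class and the `unknown` fallback are assumptions.

```java
/** Hypothetical helper: maps an HTTP status code to a low-cardinality tag value. */
final class StatusClass {

    /** Returns "1xx" through "5xx", or "unknown" for values outside 100-599. */
    static String of(int status) {
        if (status < 100 || status > 599) {
            return "unknown";
        }
        return (status / 100) + "xx";
    }

    public static void main(String[] args) {
        System.out.println(of(200));
        System.out.println(of(404));
        System.out.println(of(503));
    }
}
```

Tagging by class rather than by exact status code keeps the number of time series per route small while still supporting the 5xx error-rate condition.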

### 3) Health model hardening

File:

- tinyurl/src/main/resources/application.yaml

What changed:

- Enabled health details and per-component health status.
- Defined explicit health groups:
  - readiness: `readinessState`, `db`, `diskSpace`, `ping`
  - liveness: `livenessState`, `ping`
- Exposed endpoints:
  - `/actuator/health`
  - `/actuator/metrics`
  - `/actuator/prometheus`

Why:

- Improves dependency-aware readiness behavior.
- Makes startup/degraded dependency states visible and actionable.
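
To show why the group membership matters, here is a deliberately simplified model of group aggregation. It assumes only two statuses and treats a group as `UP` only when every member is `UP`; Spring Boot's own aggregation also handles statuses such as `OUT_OF_SERVICE` and `UNKNOWN`, and the `HealthGroups` class is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified, hypothetical model of health group aggregation. */
final class HealthGroups {

    enum Status { UP, DOWN }

    /** A group is UP only if every component in it is UP. */
    static Status aggregate(Map<String, Status> components) {
        for (Status s : components.values()) {
            if (s != Status.UP) {
                return Status.DOWN;
            }
        }
        return Status.UP;
    }

    public static void main(String[] args) {
        // Readiness includes db, so a database outage takes the instance out of rotation.
        Map<String, Status> readiness = new LinkedHashMap<>();
        readiness.put("readinessState", Status.UP);
        readiness.put("db", Status.DOWN);
        readiness.put("diskSpace", Status.UP);
        readiness.put("ping", Status.UP);

        // Liveness excludes db, so the same outage does not trigger a restart.
        Map<String, Status> liveness = new LinkedHashMap<>();
        liveness.put("livenessState", Status.UP);
        liveness.put("ping", Status.UP);

        System.out.println("readiness = " + aggregate(readiness));
        System.out.println("liveness = " + aggregate(liveness));
    }
}
```

Keeping `db` out of the liveness group is the design point: a dependency outage should stop traffic to the instance, not cause restart loops.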

### 4) Operational documentation for alerting and logs

Files:

- docs/implementation/PHASE_3_ALERT_BASELINE.md
- docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md
- docs/implementation/PHASE_3_OBSERVABILITY.md

What changed:

- Added starter alert thresholds and triage flow.
- Added baseline log aggregation architecture and checklist.
- Linked both documents in the Phase 3 observability execution guide.

Why:

- Ensures implementation is operable, not only code-complete.
- Gives clear next actions for production rollout.

## Verification Steps Used

1. Unit/integration tests:

- run tests for app context, service logic, and encoder tests
- expected result: all passing

2. Runtime verification:

- docker compose up -d --build
- check readiness endpoint
- check liveness endpoint
- check prometheus endpoint exports:
  - `tinyurl_http_server_requests_total`
  - `tinyurl_http_server_errors_total`
  - hikaricp metrics

3. Error metric trigger check:

- send a known invalid request (for example unknown short code format)
- confirm tinyurl_http_server_errors_total increments with tags
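
The Prometheus endpoint check in the runtime verification step can be automated with a small parser over the scrape body. The sketch below is an assumption-laden example, not part of this change: `ScrapeCheck` is a hypothetical class, it handles only the basic text exposition layout (comment lines starting with `#`, then `name{labels} value`), and fetching the body from `/actuator/prometheus` is left to the caller.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: checks a Prometheus text scrape for expected metric names. */
final class ScrapeCheck {

    /** Extracts metric names from the scrape body, skipping comments and blank lines. */
    static Set<String> metricNames(String scrapeBody) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : scrapeBody.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // The name ends at the first '{' (labels) or the first space (value).
            int end = trimmed.length();
            int brace = trimmed.indexOf('{');
            int space = trimmed.indexOf(' ');
            if (brace >= 0) {
                end = Math.min(end, brace);
            }
            if (space >= 0) {
                end = Math.min(end, space);
            }
            names.add(trimmed.substring(0, end));
        }
        return names;
    }

    /** Returns the expected names that do not appear in the scrape body. */
    static List<String> missing(String scrapeBody, List<String> expected) {
        Set<String> present = metricNames(scrapeBody);
        return expected.stream().filter(name -> !present.contains(name)).toList();
    }
}
```

Calling `ScrapeCheck.missing(body, List.of("tinyurl_http_server_requests_total", "tinyurl_http_server_errors_total"))` and expecting an empty list gives a repeatable version of the manual check.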

## Commit Scope (Phase 3)

Code/config files:

- tinyurl/build.gradle.kts
- tinyurl/src/main/resources/application.yaml
- tinyurl/src/main/resources/logback-spring.xml
- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java
- tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java

Docs:

- docs/implementation/PHASE_3_OBSERVABILITY.md
- docs/implementation/PHASE_3_ALERT_BASELINE.md
- docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md
- docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md

## Suggested Commit Message

feat(observability): implement phase 3 logging, metrics, health groups, and operational baselines

## Notes

- This implementation intentionally stops at Phase 3 baseline level.
- Full external dashboard platform rollout and distributed tracing are still out of scope for this phase.
63 changes: 63 additions & 0 deletions docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md
@@ -0,0 +1,63 @@
# Phase 3 Log Aggregation Baseline

This document defines a minimal production-ready approach for collecting and querying structured application logs.

## Goal

Ensure logs from all services can be searched by time range, severity, endpoint, and correlation id.

## Required Log Fields

All application logs should include at least:

- `@timestamp`
- `level`
- `message`
- `service`
- `correlation_id`
- `logger_name`
- request metadata fields when available (`method`, `route`, `status`, `duration_ms`)
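
A check like the following could be used in a test or in a log pipeline to flag events that lack the required fields. It is a sketch only: `LogFieldCheck` is a hypothetical class, and it assumes the JSON event has already been parsed into a map.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports which required fields a parsed log event lacks. */
final class LogFieldCheck {

    /** Always-required fields, taken from the list above. */
    static final List<String> REQUIRED = List.of(
            "@timestamp", "level", "message", "service", "correlation_id", "logger_name");

    /** Returns required field names that are absent or blank in the event. */
    static List<String> missingFields(Map<String, ?> event) {
        return REQUIRED.stream()
                .filter(field -> {
                    Object value = event.get(field);
                    return value == null || value.toString().isBlank();
                })
                .toList();
    }
}
```

The request metadata fields are left out of `REQUIRED` on purpose, because the document only expects them when a request context is available.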

## Recommended Pipeline (Baseline)

1. App emits JSON logs to stdout.
2. Container runtime captures stdout/stderr.
3. Log shipper (CloudWatch Agent, Fluent Bit, Filebeat, or Vector) forwards logs.
4. Central store indexes logs (CloudWatch Logs, ELK/OpenSearch, Grafana Loki).
5. Dashboards and alerts query centralized logs.

## Minimum Alerts for Logs

1. Error volume spike
   - Trigger when ERROR logs exceed baseline over 5 minutes.

2. Correlation-id missing rate
   - Trigger when more than 1% of log events lack `correlation_id`.

3. Exception signature surge
   - Trigger on sudden spikes for repeated exception signatures.

## Triage Query Examples

1. Correlate one failing request
- Filter by `correlation_id` and inspect all matching events.

2. Endpoint failure analysis
- Filter by `route` and `status>=500`, group by exception/error code.

3. Slow request analysis
- Filter by high `duration_ms`, group by route and time window.
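
The endpoint failure analysis query can be expressed over in-memory events to make its shape concrete. The real query syntax depends on the chosen log store; the `EndpointFailures` class, the `LogEvent` record, and its `errorCode` field are hypothetical names used only for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical sketch of "filter by route and status>=500, group by error code". */
final class EndpointFailures {

    /** Minimal parsed log event; errorCode is an assumed field and must be non-null. */
    record LogEvent(String route, int status, String errorCode) {}

    /** Counts 5xx events for one route, keyed by error code in sorted order. */
    static Map<String, Long> failuresByErrorCode(List<LogEvent> events, String route) {
        return events.stream()
                .filter(e -> e.route().equals(route) && e.status() >= 500)
                .collect(Collectors.groupingBy(
                        LogEvent::errorCode, TreeMap::new, Collectors.counting()));
    }
}
```

The slow request analysis follows the same pattern, with the filter replaced by a `duration_ms` threshold and the grouping key replaced by route and time bucket.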

## Security and Privacy Notes

- Never log credentials, tokens, or full secrets.
- Mask sensitive fields before logging.
- Keep retention and access controls aligned with security policy.

## Rollout Checklist

- [ ] JSON logs enabled in all environments
- [ ] Log shipping configured for app containers
- [ ] Correlation id searchable in central logs
- [ ] Error and latency dashboards created
- [ ] Alert rules validated with test events