diff --git a/docs/implementation/PHASE_3_ALERT_BASELINE.md b/docs/implementation/PHASE_3_ALERT_BASELINE.md new file mode 100644 index 0000000..84daff1 --- /dev/null +++ b/docs/implementation/PHASE_3_ALERT_BASELINE.md @@ -0,0 +1,33 @@ +# Phase 3 Alert Baseline + +This baseline defines starter alerts for runtime reliability. Thresholds should be tuned with production traffic history. + +## SLO-aligned Starter Alerts + +1. Error rate alert +- Condition: 5xx rate > 1% over 5 minutes +- Signal: `tinyurl.http.server.requests.total` with `status_class=5xx` +- Action: check recent deploy, downstream DB state, and app logs by `correlation_id` + +2. Latency alert +- Condition: P99 latency > 500 ms over 10 minutes +- Signal: `http.server.requests` percentile metrics from Actuator/Prometheus +- Action: inspect slow endpoints, DB pool pressure, and host CPU/memory + +3. Readiness degradation +- Condition: `/actuator/health/readiness` not `UP` for 2+ checks +- Signal: readiness endpoint health status +- Action: inspect DB connectivity and Flyway startup validation state + +4. Liveness instability +- Condition: frequent restarts or liveness failures in 10 minutes +- Signal: container restart count + `/actuator/health/liveness` +- Action: inspect fatal exceptions, memory pressure, and image/runtime mismatches + +## Triage Flow + +1. Confirm user impact from error and latency graphs. +2. Filter logs by `correlation_id` and endpoint path. +3. Validate dependency health (`db`, readiness group). +4. Roll back recent deploy if regression is confirmed. +5. Add post-incident action item with metric threshold adjustment if needed. 
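The error-rate condition in alert 1 is a ratio of two counter deltas taken over the same 5-minute window. A minimal Java sketch of that arithmetic follows, assuming the deltas have already been read from `tinyurl.http.server.requests.total` (the `status_class=5xx` series and the sum of all series). The class and method names are hypothetical and not part of this change; in practice the rule would normally be evaluated by the alerting system rather than by application code.

```java
/** Hypothetical helper illustrating the "5xx rate > 1% over 5 minutes" starter condition. */
final class ErrorRateAlertSketch {

    // Starter threshold from the baseline; expected to be tuned with production history.
    static final double THRESHOLD = 0.01;

    /** Share of 5xx responses among all requests in the window; 0.0 when there was no traffic. */
    static double errorRate(long fiveXxDelta, long totalDelta) {
        if (totalDelta <= 0) {
            return 0.0; // no traffic in the window, so there is nothing to alert on
        }
        return (double) fiveXxDelta / totalDelta;
    }

    /** True only when the rate is strictly above the threshold, matching "> 1%". */
    static boolean shouldFire(long fiveXxDelta, long totalDelta) {
        return errorRate(fiveXxDelta, totalDelta) > THRESHOLD;
    }

    public static void main(String[] args) {
        // Example window: 12 server errors out of 800 requests.
        System.out.println("rate=" + errorRate(12, 800) + " fire=" + shouldFire(12, 800));
    }
}
```

Treating an empty window as "no alert" is a design choice in this sketch; a separate traffic-absence alert may be more appropriate for detecting a fully unreachable service.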
diff --git a/docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md b/docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md new file mode 100644 index 0000000..655dc9d --- /dev/null +++ b/docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md @@ -0,0 +1,149 @@ +# Phase 3 Implementation Explanation + +This document explains what was implemented in Phase 3 (Observability), why it was added, and how to verify it before committing. + +## Goal of Phase 3 + +Add production-grade observability for: + +1. Structured logging with correlation id +2. Request/error metrics and latency visibility +3. Readiness and liveness health visibility +4. Basic alerting and log-aggregation operational guidance + +## What Was Implemented + +### 1) Structured JSON logging + +Files: + +- tinyurl/src/main/resources/logback-spring.xml +- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java + +What changed: + +- Logging output is now JSON using logstash-logback-encoder. +- A correlation id is propagated through X-Correlation-Id. +- Correlation id is added to MDC as correlation_id. +- Request metadata is logged (method, route, status, duration, client_ip, user_agent). + +Why: + +- Makes logs machine-parseable and searchable. +- Enables request-level traceability across services and failures. + +### 2) Metrics instrumentation + +Files: + +- tinyurl/build.gradle.kts +- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java +- tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java +- tinyurl/src/main/resources/application.yaml + +What changed: + +- Added Prometheus registry dependency. +- Added request counter metric: + - tinyurl.http.server.requests.total +- Added request duration metric: + - tinyurl.http.server.request.duration +- Added explicit error counter metric in exception handlers: + - tinyurl.http.server.errors.total +- Enabled actuator metrics and prometheus endpoints. 
+- Enabled percentiles/histogram for http.server.requests. +- Added common metric tag: application=tinyurl. + +Why: + +- Supports request rate, latency, and error-rate monitoring. +- Enables baseline alert conditions (error and p99 latency). + +### 3) Health model hardening + +File: + +- tinyurl/src/main/resources/application.yaml + +What changed: + +- Enabled component-level health details and components. +- Defined explicit health groups: + - readiness: readinessState, db, diskSpace, ping + - liveness: livenessState, ping +- Exposed endpoints: + - /actuator/health + - /actuator/metrics + - /actuator/prometheus + +Why: + +- Improves dependency-aware readiness behavior. +- Makes startup/degraded dependency states visible and actionable. + +### 4) Operational documentation for alerting and logs + +Files: + +- docs/implementation/PHASE_3_ALERT_BASELINE.md +- docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md +- docs/implementation/PHASE_3_OBSERVABILITY.md + +What changed: + +- Added starter alert thresholds and triage flow. +- Added baseline log aggregation architecture and checklist. +- Linked both documents in Phase 3 observability execution guide. + +Why: + +- Ensures implementation is operable, not only code-complete. +- Gives clear next actions for production rollout. + +## Verification Steps Used + +1. Unit/integration tests: + +- run tests for app context, service logic, and encoder tests +- expected result: all passing + +2. Runtime verification: + +- docker compose up -d --build +- check readiness endpoint +- check liveness endpoint +- check prometheus endpoint exports: + - tinyurl_http_server_requests_total + - tinyurl_http_server_errors_total + - hikaricp metrics + +3. 
Error metric trigger check: + +- send a known invalid request (for example unknown short code format) +- confirm tinyurl_http_server_errors_total increments with tags + +## Commit Scope (Phase 3) + +Code/config files: + +- tinyurl/build.gradle.kts +- tinyurl/src/main/resources/application.yaml +- tinyurl/src/main/resources/logback-spring.xml +- tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java +- tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java + +Docs: + +- docs/implementation/PHASE_3_OBSERVABILITY.md +- docs/implementation/PHASE_3_ALERT_BASELINE.md +- docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md +- docs/implementation/PHASE_3_IMPLEMENTATION_EXPLANATION.md + +## Suggested Commit Message + +feat(observability): implement phase 3 logging, metrics, health groups, and operational baselines + +## Notes + +- This implementation intentionally stops at Phase 3 baseline level. +- Full external dashboard platform rollout and distributed tracing are still out of scope for this phase. diff --git a/docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md b/docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md new file mode 100644 index 0000000..d94f761 --- /dev/null +++ b/docs/implementation/PHASE_3_LOG_AGGREGATION_BASELINE.md @@ -0,0 +1,63 @@ +# Phase 3 Log Aggregation Baseline + +This document defines a minimal production-ready approach for collecting and querying structured application logs. + +## Goal + +Ensure logs from all services can be searched by time range, severity, endpoint, and correlation id. + +## Required Log Fields + +All application logs should include at least: + +- `@timestamp` +- `level` +- `message` +- `service` +- `correlation_id` +- `logger_name` +- request metadata fields when available (`method`, `route`, `status`, `duration_ms`) + +## Recommended Pipeline (Baseline) + +1. App emits JSON logs to stdout. +2. Container runtime captures stdout/stderr. +3. 
Log shipper (CloudWatch Agent, Fluent Bit, Filebeat, or Vector) forwards logs. +4. Central store indexes logs (CloudWatch Logs, ELK/OpenSearch, Grafana Loki). +5. Dashboards and alerts query centralized logs. + +## Minimum Alerts for Logs + +1. Error volume spike +- Trigger when ERROR logs exceed baseline over 5 minutes. + +2. Correlation-id missing rate +- Trigger when logs without `correlation_id` exceed 1%. + +3. Exception signature surge +- Trigger on sudden spikes for repeated exception signatures. + +## Triage Query Examples + +1. Correlate one failing request +- Filter by `correlation_id` and inspect all matching events. + +2. Endpoint failure analysis +- Filter by `route` and `status>=500`, group by exception/error code. + +3. Slow request analysis +- Filter by high `duration_ms`, group by route and time window. + +## Security and Privacy Notes + +- Never log credentials, tokens, or full secrets. +- Mask sensitive fields before logging. +- Keep retention and access controls aligned with security policy. + +## Rollout Checklist + +- [ ] JSON logs enabled in all environments +- [ ] Log shipping configured for app containers +- [ ] Correlation id searchable in central logs +- [ ] Error and latency dashboards created +- [ ] Alert rules validated with test events diff --git a/docs/implementation/PHASE_3_LOG_STORAGE_MIGRATION_LOKI_PROMTAIL_GRAFANA.md b/docs/implementation/PHASE_3_LOG_STORAGE_MIGRATION_LOKI_PROMTAIL_GRAFANA.md new file mode 100644 index 0000000..2eeec5d --- /dev/null +++ b/docs/implementation/PHASE_3_LOG_STORAGE_MIGRATION_LOKI_PROMTAIL_GRAFANA.md @@ -0,0 +1,230 @@ +# Phase 3 Log Storage Migration Plan (Free): Loki + Promtail + Grafana + +This document provides a production-oriented migration plan for storing logs securely at zero license cost using the recommended stack: + +- Grafana Loki (log store) +- Promtail (log shipper) +- Grafana (query, dashboards, alerting) + +## 1) Why this stack + +### Free options considered + +1. 
Loki + Promtail + Grafana (recommended) +- Pros: low-cost indexing model, simple operations, good Docker support, strong Grafana integration +- Cons: log query language is different from Elasticsearch ecosystem + +2. OpenSearch + Fluent Bit/Filebeat +- Pros: powerful full-text search, mature ecosystem +- Cons: heavier memory/storage footprint and higher ops complexity + +3. ELK (Elasticsearch + Logstash + Kibana) +- Pros: mature and widely known +- Cons: most resource-heavy for small teams + +### Recommendation + +Use Loki + Promtail + Grafana for v1/v2 scale because it is easiest to operate and secure on a single-host or small-cluster deployment. + +--- + +## 2) Target architecture (production) + +1. TinyURL app writes structured JSON logs to stdout. +2. Container runtime writes stdout/stderr to local log files. +3. Promtail tails container logs and ships to Loki over private network. +4. Loki stores log streams on encrypted disk. +5. Grafana queries Loki and provides dashboards/alerts. + +Logical flow: + +`App -> stdout -> Promtail -> Loki -> Grafana` + +--- + +## 3) Security controls (minimum baseline) + +1. Network isolation +- Run Loki and Promtail on private network only. +- Do not expose Loki directly to public internet. + +2. Transport security +- Use TLS for Grafana access. +- If Loki is remote, use TLS/mTLS between Promtail and Loki. + +3. Authentication and authorization +- Enable Grafana login (no anonymous access in production). +- Use strong admin password and role-based access. +- Restrict datasource edit rights to admins only. + +4. Data at rest +- Store Loki data on encrypted volume. +- Restrict filesystem permissions for log directories. + +5. Retention and lifecycle +- Set finite retention (for example 14-30 days to start). +- Enforce deletion and compaction to control risk/cost. + +6. Sensitive data handling +- Never log secrets, tokens, or passwords. +- Use Promtail pipeline stages to drop/mask sensitive patterns if needed. 
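For the sensitive data handling control above, masking can also be applied in the application before a line is ever written, so that the shipper-side stages act only as a second safety net. The following is a minimal Java sketch under stated assumptions: the key names (`password`, `token`, `secret`) and the `LogMaskingSketch` class are hypothetical, and the pattern only covers plain `key=value` or `key: value` text, not quoted JSON properties.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: masks values of sensitive-looking keys in a log message. */
final class LogMaskingSketch {

    // Assumed key names; matches "key=value" or "key: value", case-insensitively.
    private static final Pattern SENSITIVE =
            Pattern.compile("(?i)(password|token|secret)(\\s*[=:]\\s*)(\\S+)");

    /** Returns the message with each matched value replaced by "***". */
    static String mask(String message) {
        if (message == null) {
            return null;
        }
        // $1 keeps the key, $2 keeps the separator, the value itself is dropped.
        return SENSITIVE.matcher(message).replaceAll("$1$2***");
    }

    public static void main(String[] args) {
        System.out.println(mask("login failed user=alice password=hunter2"));
    }
}
```

A pattern like this is best treated as defence in depth; the primary control remains not passing secrets to the logger at all.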
+ +--- + +## 4) Migration strategy (phased) + +### Phase A: Prepare + +1. Confirm JSON logging is enabled in application. +2. Define required labels (`service`, `env`, `level`); keep `correlation_id` as a parsed log field rather than a label, since it is unique per request. +3. Define retention target and incident triage queries. + +### Phase B: Deploy logging stack + +1. Deploy Loki with persistent encrypted storage. +2. Deploy Promtail on same host(s) as app containers. +3. Deploy Grafana and add Loki datasource. + +### Phase C: Connect TinyURL logs + +1. Configure Promtail scrape job for container logs. +2. Parse JSON fields from app log lines. +3. Map key labels (`service=tinyurl`, `env=prod`, `level`, optional `route`). + +### Phase D: Validate + +1. Generate synthetic requests and errors. +2. Search by `correlation_id` end-to-end. +3. Verify alert rules trigger for error spikes. + +### Phase E: Harden + +1. Enable TLS and auth on Grafana endpoint. +2. Restrict network ingress to admin CIDRs/VPN. +3. Tune retention and label cardinality. + +--- + +## 5) Example production Compose skeleton + +Use this as a conceptual baseline and adapt to your deployment model. 
+ +```yaml +services: + loki: + image: grafana/loki:3.0.0 + command: -config.file=/etc/loki/config.yaml + volumes: + - ./observability/loki/config.yaml:/etc/loki/config.yaml:ro + - loki-data:/loki + networks: [observability] + restart: unless-stopped + + promtail: + image: grafana/promtail:3.0.0 + command: -config.file=/etc/promtail/config.yaml + volumes: + - ./observability/promtail/config.yaml:/etc/promtail/config.yaml:ro + - /var/lib/docker/containers:/var/lib/docker/containers:ro + - /var/run/docker.sock:/var/run/docker.sock:ro + networks: [observability] + restart: unless-stopped + + grafana: + image: grafana/grafana:11.0.0 + environment: + - GF_SECURITY_ADMIN_USER=admin + - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD} + - GF_AUTH_ANONYMOUS_ENABLED=false + volumes: + - grafana-data:/var/lib/grafana + ports: + - "3000:3000" + # Published ports are not reachable on an internal-only network, so Grafana also joins `edge`. + networks: [observability, edge] + restart: unless-stopped + +volumes: + loki-data: + grafana-data: + +networks: + observability: + internal: true + edge: +``` + +Notes: + +- Prefer not to publish Loki port publicly. +- Publish Grafana through secure reverse proxy with TLS. + +--- + +## 6) Promtail parsing recommendations + +1. Use docker/container scrape configs. +2. Apply JSON pipeline stage to extract: +- `level` +- `correlation_id` +- `logger_name` +- `message` + +3. Keep label cardinality low: +- Good labels: service, env, level +- Avoid high-cardinality labels: user_id, request_id, correlation_id +- Keep high-cardinality fields in log body, not labels + +--- + +## 7) Dashboard and alert baseline + +Create Grafana panels for: + +1. Error volume by level and route +2. Top exception signatures +3. Missing-correlation-id count +4. Slow-request logs (`duration_ms` threshold) + +Starter alerts: + +1. Error log rate spike in 5 minutes +2. Repeated exception signature spike +3. Promtail ingestion failures +4. Loki disk usage threshold breach + +--- + +## 8) Operational runbook (minimum) + +1. 
Incident lookup flow +- Start with alert time window +- Filter `service=tinyurl` +- Pivot by `correlation_id` +- Correlate with metrics and health endpoints + +2. Capacity checks +- Loki disk growth/day +- Query latency in Grafana Explore +- Promtail backlog/retry behavior + +3. Backup/restore +- Backup Loki persistent volume snapshots +- Test restore at least once per quarter + +--- + +## 9) Acceptance criteria for migration completion + +1. Logs searchable in Grafana for all app instances. +2. Correlation-id trace works for one request across full flow. +3. Security controls enabled (auth, TLS, network restrictions). +4. Retention policy active and verified. +5. At least three log-based alerts configured and tested. + +--- + +## 10) Future upgrades (optional) + +1. Move from Promtail to Grafana Alloy when standardizing telemetry agents. +2. Add long-term object storage backend for Loki if retention grows. +3. Add SSO for Grafana access control. +4. Add trace correlation once distributed tracing is introduced. 
diff --git a/docs/implementation/PHASE_3_OBSERVABILITY.md b/docs/implementation/PHASE_3_OBSERVABILITY.md index 911ff5e..eb62e1f 100644 --- a/docs/implementation/PHASE_3_OBSERVABILITY.md +++ b/docs/implementation/PHASE_3_OBSERVABILITY.md @@ -75,6 +75,12 @@ Tasks: - Define target alert thresholds (error and latency) - Document triage paths for common failures +Reference baseline: + +- [PHASE_3_ALERT_BASELINE.md](PHASE_3_ALERT_BASELINE.md) +- [PHASE_3_LOG_AGGREGATION_BASELINE.md](PHASE_3_LOG_AGGREGATION_BASELINE.md) +- [PHASE_3_LOG_STORAGE_MIGRATION_LOKI_PROMTAIL_GRAFANA.md](PHASE_3_LOG_STORAGE_MIGRATION_LOKI_PROMTAIL_GRAFANA.md) + ## Deliverables - Structured logs enabled and validated diff --git a/tinyurl/build.gradle.kts b/tinyurl/build.gradle.kts index d0be12f..6fd08ae 100644 --- a/tinyurl/build.gradle.kts +++ b/tinyurl/build.gradle.kts @@ -31,6 +31,8 @@ dependencies { implementation("org.springframework.boot:spring-boot-starter-web") implementation("org.flywaydb:flyway-core") implementation("org.flywaydb:flyway-database-postgresql") + implementation("io.micrometer:micrometer-registry-prometheus") + implementation("net.logstash.logback:logstash-logback-encoder:7.4") compileOnly("org.projectlombok:lombok") developmentOnly("org.springframework.boot:spring-boot-devtools") runtimeOnly("org.postgresql:postgresql") diff --git a/tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java b/tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java index 075d301..4c90c97 100644 --- a/tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java +++ b/tinyurl/src/main/java/com/tinyurl/controller/GlobalExceptionHandler.java @@ -3,6 +3,8 @@ import com.tinyurl.dto.ErrorResponse; import com.tinyurl.exception.GoneException; import com.tinyurl.exception.NotFoundException; +import io.micrometer.core.instrument.Counter; +import io.micrometer.core.instrument.MeterRegistry; import jakarta.validation.ConstraintViolationException; import 
jakarta.persistence.PersistenceException; import org.springframework.dao.DataAccessException; @@ -17,6 +19,12 @@ @RestControllerAdvice public class GlobalExceptionHandler { + private final MeterRegistry meterRegistry; + + public GlobalExceptionHandler(MeterRegistry meterRegistry) { + this.meterRegistry = meterRegistry; + } + @ExceptionHandler(MethodArgumentNotValidException.class) public ResponseEntity handleValidation(MethodArgumentNotValidException ex) { String code = "INVALID_REQUEST"; @@ -24,11 +32,13 @@ public ResponseEntity handleValidation(MethodArgumentNotValidExce if (fieldError != null && fieldError.getDefaultMessage() != null) { code = fieldError.getDefaultMessage(); } + incrementErrorMetric(HttpStatus.BAD_REQUEST, code); return ResponseEntity.badRequest().body(new ErrorResponse(code, messageForCode(code))); } @ExceptionHandler(ConstraintViolationException.class) public ResponseEntity handleConstraintViolation(ConstraintViolationException ex) { + incrementErrorMetric(HttpStatus.BAD_REQUEST, "INVALID_URL"); return ResponseEntity.badRequest() .body(new ErrorResponse("INVALID_URL", messageForCode("INVALID_URL"))); } @@ -39,6 +49,7 @@ public ResponseEntity handleIllegalArgument(IllegalArgumentExcept HttpStatus status = "INVALID_EXPIRY".equals(code) || "INVALID_URL".equals(code) ? HttpStatus.BAD_REQUEST : HttpStatus.INTERNAL_SERVER_ERROR; + incrementErrorMetric(status, code); return ResponseEntity.status(status).body(new ErrorResponse(code, messageForCode(code))); } @@ -46,6 +57,7 @@ public ResponseEntity handleIllegalArgument(IllegalArgumentExcept public ResponseEntity handleServiceUnavailable(Exception ex) { HttpHeaders headers = new HttpHeaders(); headers.add("Retry-After", "30"); + incrementErrorMetric(HttpStatus.SERVICE_UNAVAILABLE, "SERVICE_UNAVAILABLE"); return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE) .headers(headers) .body(new ErrorResponse("SERVICE_UNAVAILABLE", "The service is temporarily unavailable. 
Please try again.")); @@ -53,22 +65,46 @@ public ResponseEntity handleServiceUnavailable(Exception ex) { @ExceptionHandler(NotFoundException.class) public ResponseEntity handleNotFound(NotFoundException ex) { + incrementErrorMetric(HttpStatus.NOT_FOUND, "NOT_FOUND"); return ResponseEntity.status(HttpStatus.NOT_FOUND) .body(new ErrorResponse("NOT_FOUND", ex.getMessage())); } @ExceptionHandler(GoneException.class) public ResponseEntity handleGone(GoneException ex) { + incrementErrorMetric(HttpStatus.GONE, "GONE"); return ResponseEntity.status(HttpStatus.GONE) .body(new ErrorResponse("GONE", ex.getMessage())); } @ExceptionHandler(Exception.class) public ResponseEntity handleUnexpected(Exception ex) { + incrementErrorMetric(HttpStatus.INTERNAL_SERVER_ERROR, "INTERNAL_ERROR"); return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR) .body(new ErrorResponse("INTERNAL_ERROR", "An unexpected error occurred. Please try again.")); } + private void incrementErrorMetric(HttpStatus status, String errorCode) { + String normalizedCode = normalizeErrorCode(errorCode); + Counter.builder("tinyurl.http.server.errors.total") + .tag("status", Integer.toString(status.value())) + .tag("error_code", normalizedCode) + .register(meterRegistry) + .increment(); + } + + private String normalizeErrorCode(String errorCode) { + if (errorCode == null || errorCode.isBlank()) { + return "UNKNOWN_ERROR"; + } + // Map to a bounded set of known error codes; anything else shares the single UNKNOWN_ERROR bucket + return switch (errorCode) { + case "INVALID_URL", "INVALID_EXPIRY", "INVALID_REQUEST" -> errorCode; + case "SERVICE_UNAVAILABLE", "NOT_FOUND", "GONE", "INTERNAL_ERROR" -> errorCode; + default -> "UNKNOWN_ERROR"; + }; + } + private String messageForCode(String code) { return switch (code) { case "INVALID_URL" -> "URL must be a valid HTTP or HTTPS address (max 2048 characters)."; diff --git a/tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java b/tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java 
new file mode 100644 index 0000000..29c0281 --- /dev/null +++ b/tinyurl/src/main/java/com/tinyurl/observability/RequestObservabilityFilter.java @@ -0,0 +1,102 @@ +package com.tinyurl.observability; + +import io.micrometer.core.instrument.Counter; +import io.micrometer.core.instrument.MeterRegistry; +import io.micrometer.core.instrument.Timer; +import jakarta.servlet.FilterChain; +import jakarta.servlet.ServletException; +import jakarta.servlet.http.HttpServletRequest; +import jakarta.servlet.http.HttpServletResponse; +import java.io.IOException; +import java.util.UUID; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.slf4j.MDC; +import org.springframework.core.Ordered; +import org.springframework.core.annotation.Order; +import org.springframework.stereotype.Component; +import org.springframework.web.filter.OncePerRequestFilter; +import org.springframework.web.servlet.HandlerMapping; + +@Component +@Order(Ordered.HIGHEST_PRECEDENCE) +public class RequestObservabilityFilter extends OncePerRequestFilter { + + private static final Logger log = LoggerFactory.getLogger(RequestObservabilityFilter.class); + private static final String CORRELATION_ID_HEADER = "X-Correlation-Id"; + private static final String CORRELATION_ID_MDC_KEY = "correlation_id"; + + private final MeterRegistry meterRegistry; + + public RequestObservabilityFilter(MeterRegistry meterRegistry) { + this.meterRegistry = meterRegistry; + } + + @Override + protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain filterChain) + throws ServletException, IOException { + + String correlationId = request.getHeader(CORRELATION_ID_HEADER); + if (correlationId == null || correlationId.isBlank()) { + correlationId = UUID.randomUUID().toString(); + } + + MDC.put(CORRELATION_ID_MDC_KEY, correlationId); + response.setHeader(CORRELATION_ID_HEADER, correlationId); + + long startNanos = System.nanoTime(); + int statusCode = 
HttpServletResponse.SC_INTERNAL_SERVER_ERROR; + + try { + filterChain.doFilter(request, response); + statusCode = response.getStatus(); + } finally { + try { + String method = request.getMethod(); + String route = (String) request.getAttribute(HandlerMapping.BEST_MATCHING_PATTERN_ATTRIBUTE); + if (route == null || route.isBlank()) { + route = "UNMAPPED"; + } + + String status = Integer.toString(statusCode); + String statusClass = (statusCode / 100) + "xx"; + String outcome = statusCode >= 400 ? "error" : "success"; + + long durationNanos = System.nanoTime() - startNanos; + Timer.builder("tinyurl.http.server.request.duration") + .tag("method", method) + .tag("route", route) + .tag("status", status) + .register(meterRegistry) + .record(durationNanos, java.util.concurrent.TimeUnit.NANOSECONDS); + + Counter.builder("tinyurl.http.server.requests.total") + .tag("method", method) + .tag("route", route) + .tag("status_class", statusClass) + .tag("outcome", outcome) + .register(meterRegistry) + .increment(); + + log.info( + "http_request method={} route={} status={} duration_ms={} client_ip={} user_agent={}", + method, + route, + statusCode, + durationNanos / 1_000_000, + request.getRemoteAddr(), + sanitize(request.getHeader("User-Agent")) + ); + } finally { + MDC.remove(CORRELATION_ID_MDC_KEY); + } + } + } + + private String sanitize(String value) { + if (value == null) { + return "unknown"; + } + return value.replaceAll("[\r\n]", " "); + } +} diff --git a/tinyurl/src/main/resources/application-prod.yaml b/tinyurl/src/main/resources/application-prod.yaml new file mode 100644 index 0000000..9a3dc92 --- /dev/null +++ b/tinyurl/src/main/resources/application-prod.yaml @@ -0,0 +1,7 @@ +# Production-specific configuration +management: + endpoints: + web: + exposure: + # Production: only expose health endpoint. 
Metrics/prometheus require separate admin port + authentication + include: health diff --git a/tinyurl/src/main/resources/application.yaml b/tinyurl/src/main/resources/application.yaml index 31f2cdc..04137e9 100644 --- a/tinyurl/src/main/resources/application.yaml +++ b/tinyurl/src/main/resources/application.yaml @@ -19,11 +19,26 @@ management: health: probes: enabled: true + show-components: when_authorized show-details: when_authorized + group: + readiness: + include: readinessState,db,diskSpace,ping + liveness: + include: livenessState,ping endpoints: web: exposure: - include: health + # Default: expose metrics for dev/test. Restricted in production via application-prod.yaml + include: health,metrics,prometheus + metrics: + tags: + application: ${spring.application.name} + distribution: + percentiles-histogram: + http.server.requests: true + percentiles: + http.server.requests: 0.95,0.99 tinyurl: base-url: ${TINYURL_BASE_URL:http://localhost} diff --git a/tinyurl/src/main/resources/logback-spring.xml b/tinyurl/src/main/resources/logback-spring.xml new file mode 100644 index 0000000..7e34c6b --- /dev/null +++ b/tinyurl/src/main/resources/logback-spring.xml @@ -0,0 +1,14 @@ + + + + + + {"service":"${appName}"} + correlation_id + + + + + + +
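The `correlation_id` MDC key referenced by the Logback configuration is populated by `RequestObservabilityFilter`, which reuses the inbound `X-Correlation-Id` header when present and otherwise generates a UUID. The following is a minimal Java sketch of that resolution step in isolation. The length cap and character allow-list are assumptions added for illustration and are not part of this diff; `CorrelationIdSketch` and `MAX_LENGTH` are hypothetical names.

```java
import java.util.UUID;

/** Hypothetical helper mirroring the filter's reuse-or-generate correlation id logic. */
final class CorrelationIdSketch {

    // Assumption for this sketch: cap inbound ids so a client cannot inflate every log line.
    static final int MAX_LENGTH = 64;

    /** Returns the inbound id when it looks safe, otherwise a freshly generated UUID. */
    static String resolve(String headerValue) {
        if (headerValue == null || headerValue.isBlank()) {
            return UUID.randomUUID().toString(); // same fallback as the filter in this diff
        }
        String trimmed = headerValue.trim();
        // Assumed hardening, not in the diff: accept only short, token-like values.
        if (trimmed.length() > MAX_LENGTH || !trimmed.matches("[A-Za-z0-9._-]+")) {
            return UUID.randomUUID().toString();
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(resolve("req-42"));
        System.out.println(resolve(null));
    }
}
```

Replacing a rejected id with a new UUID, rather than failing the request, keeps tracing best-effort; whether that is the right trade-off depends on how upstream callers use the header.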