Sakeeb91
diff --git a/.github/PR_DESCRIPTION.md b/.github/PR_DESCRIPTION.md
Lines changed: 126 additions & 0 deletions
diff --git a/.github/workflows/cd.yml b/.github/workflows/cd.yml
Lines changed: 2 additions & 1 deletion
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 10 additions & 8 deletions
diff --git a/docs/issues/0005-health-check-system.md b/docs/issues/0005-health-check-system.md
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,126 @@
+## Summary
+
+Implements a standardized health check system across all microservices to enable reliable monitoring, load balancing, and orchestration. This PR adds consistent health endpoints that can be consumed by Kubernetes liveness/readiness probes, API gateways, and monitoring systems.
+
+## Changes
+
+### New Package: `@scribemed/health`
+
+- Created reusable health check utilities package
+- Supports liveness, readiness, and comprehensive health checks
+- Database connectivity checks
+- Memory usage monitoring
+- Configurable health check handlers
+
+### Service Updates
+
+- **Transcription Service**: Added `/health`, `/health/live`, and `/health/ready` endpoints
+- **Documentation Service**: Added health endpoints with database connectivity checks
+- **Coding Service**: Added standardized health endpoints
+- All services return consistent JSON response format
+
+### Kubernetes Integration
+
+- Created deployment manifests for all services with liveness/readiness probes
+- Configured appropriate timeouts and thresholds
+- Added health check probes to staging environment
+
+### Testing
+
+- Unit tests for health package (9 tests, all passing)
+- Integration tests for service health endpoints
+- Updated existing service tests
+
+## Verification
+
+- [x] `pnpm lint` - All code passes linting
+- [x] `pnpm test` - All tests passing
+  - Health package: 9/9 tests passing
+  - Coding service: 5/5 tests passing
+  - Transcription service: 4/4 tests passing
+  - Documentation service: 4/4 tests passing
+- [x] `pnpm build` - All packages build successfully
+
+## Health Check Endpoints
+
+All services now expose three standardized endpoints:
+
+- **`GET /health/live`** - Liveness probe (always returns healthy if process is running)
+- **`GET /health/ready`** - Readiness probe (checks critical dependencies like database)
+- **`GET /health`** - Comprehensive health check (includes all checks + memory usage)
+
+### Response Format
+
+```jsonc
+{
+  "status": "healthy", // one of "healthy", "degraded", "unhealthy"
+  "timestamp": "2024-01-01T00:00:00.000Z",
+  "service": "service-name",
+  "checks": {
+    "database": {
+      "status": "healthy",
+      "responseTime": 5
+    },
+    "memory": {
+      "status": "healthy",
+      "heapUsedMB": 45.2,
+      "heapUsagePercent": 45.2
+    }
+  }
+}
+```
+
+## Kubernetes Probes
+
+Example probe configuration:
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health/live
+    port: 8080
+  initialDelaySeconds: 10
+  periodSeconds: 10
+
+readinessProbe:
+  httpGet:
+    path: /health/ready
+    port: 8080
+  initialDelaySeconds: 5
+  periodSeconds: 5
+```
+
+## Files Changed
+
+- **New Files:**
+  - `packages/health/` - Complete health check package
+  - `infrastructure/kubernetes/staging/*-service.yaml` - Service deployments with probes
+  - `docs/issues/0005-health-check-system.md` - Issue documentation
+  - `docs/issues/0006-health-check-enhancements.md` - Follow-up enhancements issue
+
+- **Modified Files:**
+  - `services/*/package.json` - Added health package dependency
+  - `services/*/src/server.js` - Integrated health endpoints
+  - `services/*/tests/server.test.js` - Added health check tests
+
+## Related Issues
+
+- Closes #5
+- Related: #6 (follow-up enhancements identified during implementation)
+
+## Documentation
+
+- Health package README: `packages/health/README.md`
+- Issue documentation: `docs/issues/0005-health-check-system.md`
+- Follow-up enhancements: `docs/issues/0006-health-check-enhancements.md`
+
+## Next Steps
+
+After merge, consider implementing enhancements from issue #6:
+
+- Timeout management for health checks
+- Metrics integration (Prometheus)
+- Circuit breaker pattern
+- Health check result caching
+- Configuration flexibility
+
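Reviewer note on the "Response Format" section of the PR description above: the sketch below shows one way the per-check results could be folded into the top-level `status` and mapped to an HTTP code. The helper names (`aggregateStatus`, `buildHealthResponse`, `httpStatusFor`) are hypothetical and are not taken from `@scribemed/health`. The "worst check wins" rule and the choice to return 200 for `degraded` are assumptions; only the 503 for `unhealthy` is stated in the issue document.

```javascript
// Hypothetical helpers for illustration; the real @scribemed/health API may differ.

// Severity order used to pick the worst status among the individual checks.
const SEVERITY = { healthy: 0, degraded: 1, unhealthy: 2 };

// Combine per-check results into one overall status (assumption: worst one wins).
function aggregateStatus(checks) {
  let worst = "healthy";
  for (const result of Object.values(checks)) {
    if (SEVERITY[result.status] > SEVERITY[worst]) {
      worst = result.status;
    }
  }
  return worst;
}

// Build a body in the shape shown under "Response Format".
function buildHealthResponse(service, checks, now = new Date()) {
  return {
    status: aggregateStatus(checks),
    timestamp: now.toISOString(),
    service,
    checks,
  };
}

// Map the overall status to an HTTP code.
// Assumption: only "unhealthy" fails the probe; "degraded" still returns 200.
function httpStatusFor(status) {
  return status === "unhealthy" ? 503 : 200;
}
```

A handler for `GET /health` could call `buildHealthResponse` with the service name and the collected checks, then send the body with `httpStatusFor(body.status)`.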
@@ -59,7 +59,8 @@ jobs:
       - name: Run environment smoke checks
         run: ${{ steps.kubeconfig.outputs.health_script }}
       - name: Notify Slack
-        if: secrets.SLACK_WEBHOOK != ''
+        if: always()
+        continue-on-error: true
         uses: slackapi/slack-github-action@v1.25.0
         with:
           payload: |
 
@@ -31,7 +31,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: 20
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Run ESLint
@@ -59,7 +58,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: ${{ matrix.node }}
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Run unit test suite
@@ -77,7 +75,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: 20
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Run integration suite
@@ -140,7 +137,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: 20
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Build packages and services
@@ -167,6 +163,9 @@ jobs:
       contents: read
     steps:
       - uses: actions/checkout@v4
+      - name: Derive image metadata
+        id: image-meta
+        run: echo "repository=$(echo '${{ github.repository }}' | tr '[:upper:]' '[:lower:]')" >> "$GITHUB_OUTPUT"
       - name: Log in to GitHub Container Registry
         uses: docker/login-action@v3
         with:
@@ -179,7 +178,9 @@ jobs:
           context: .
           file: services/${{ matrix.service }}/Dockerfile
           push: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
-          tags: ghcr.io/${{ github.repository }}/${{ matrix.service }}:${{ github.sha }}
+          tags: |
+            ghcr.io/${{ steps.image-meta.outputs.repository }}/${{ matrix.service }}-service:${{ github.sha }}
+            ghcr.io/${{ steps.image-meta.outputs.repository }}/${{ matrix.service }}-service:latest
           cache-from: type=gha
           cache-to: type=gha,mode=max
 
@@ -204,7 +205,8 @@ jobs:
       - name: Run staging smoke tests
         run: scripts/ci/health-check-staging.sh
       - name: Notify Slack
-        if: secrets.SLACK_WEBHOOK != ''
+        if: always()
+        continue-on-error: true
         uses: slackapi/slack-github-action@v1.25.0
         with:
           payload: |
@@ -235,7 +237,8 @@ jobs:
       - name: Run production smoke tests
         run: scripts/ci/health-check-production.sh
       - name: Notify Slack
-        if: secrets.SLACK_WEBHOOK != ''
+        if: always()
+        continue-on-error: true
         uses: slackapi/slack-github-action@v1.25.0
         with:
           payload: |
@@ -244,4 +247,3 @@ jobs:
             }
         env:
           SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
-
 
@@ -0,0 +1,86 @@
+# Issue #5: Implement Standardized Health Check System
+
+## Summary
+
+Implement a standardized health check system across all microservices to enable reliable monitoring, load balancing, and orchestration. This system should provide consistent health endpoints that can be consumed by Kubernetes liveness/readiness probes, API gateways, and monitoring systems.
+
+## Background
+
+Currently, services lack standardized health check endpoints. This makes it difficult to:
+
+- Configure Kubernetes liveness and readiness probes
+- Monitor service health in production
+- Implement proper load balancing
+- Detect and handle degraded service states
+- Integrate with monitoring and alerting systems
+
+## Proposed Changes
+
+1. **Create shared health check package**
+   - Add `packages/health/` package with reusable health check utilities
+   - Support for basic health, readiness, and liveness checks
+   - Database connectivity checks
+   - Dependency health checks (external services, etc.)
+
+2. **Implement health endpoints in all services**
+   - Add `/health`, `/health/ready`, and `/health/live` endpoints
+   - Integrate with existing services (transcription, documentation, coding)
+   - Return standardized JSON responses
+
+3. **Update Kubernetes configurations**
+   - Add liveness and readiness probes to service deployments
+   - Configure appropriate timeouts and thresholds
+
+4. **Add health check tests**
+   - Unit tests for health check logic
+   - Integration tests for health endpoints
+
+## Acceptance Criteria
+
+- [x] `packages/health/` package created with reusable health check utilities
+- [x] All services expose `/health`, `/health/ready`, and `/health/live` endpoints
+- [x] Health endpoints return standardized JSON responses
+- [x] Database connectivity is checked in readiness probes
+- [x] Kubernetes manifests updated with liveness/readiness probes
+- [x] Unit and integration tests added for health checks
+- [x] Documentation updated with health check usage
+
+## Implementation Details
+
+### Health Check Types
+
+- **Liveness**: Indicates if the service is running (should always return 200 if process is alive)
+- **Readiness**: Indicates if the service is ready to accept traffic (checks dependencies like DB)
+- **Health**: Comprehensive health status including dependencies
+
+### Runtime Behavior
+
+- Services with external dependencies keep `/health/ready` in a failing state until critical checks pass. For example, the documentation service now surfaces an `unhealthy` status (503) whenever the PostgreSQL pool cannot be established, preventing Kubernetes from routing traffic prematurely.
+- Dependency initialization is retried automatically. The retry interval is controlled by the optional `DATABASE_RETRY_DELAY_MS` environment variable (default: 5000 ms).
+- Comprehensive `/health` responses always include dependency check details, even when readiness is blocked, to aid debugging and alerting.
+
+### Response Format
+
+```jsonc
+{
+  "status": "healthy", // one of "healthy", "degraded", "unhealthy"
+  "timestamp": "2024-01-01T00:00:00Z",
+  "checks": {
+    "database": {
+      "status": "healthy",
+      "responseTime": 5
+    },
+    "memory": {
+      "status": "healthy",
+      "usage": 45.2
+    }
+  }
+}
+```
+
+## Status: Completed
+
+## Related Issues
+
+- Issue #2: Monorepo Developer Experience
+- Issue #15: CI/CD Pipeline (health checks needed for deployment)
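
Reviewer note on the "Runtime Behavior" section of the issue document above: a minimal sketch of how readiness could stay at 503 until a dependency connects, with retries spaced by `DATABASE_RETRY_DELAY_MS`. The names `retryDelayMs` and `createDependencyGate`, the injected `connect` and `schedule` parameters, and the response bodies are hypothetical. This is not the documentation service's actual code, only an illustration of the described behavior.

```javascript
// Hypothetical sketch; names and shapes are illustrative assumptions.

// Read the retry delay from the environment, falling back to the documented
// 5000 ms default when the variable is missing or not a non-negative integer.
function retryDelayMs(env = process.env) {
  const parsed = Number.parseInt(env.DATABASE_RETRY_DELAY_MS ?? "", 10);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : 5000;
}

// Track whether a dependency is ready and keep retrying until it connects.
// `connect` is any async function that throws on failure (e.g. a pool check).
// `schedule` defaults to setTimeout and is injectable so tests can avoid waiting.
function createDependencyGate(
  connect,
  { delayMs = retryDelayMs(), schedule = setTimeout } = {}
) {
  const state = { ready: false, lastError: null };

  async function attempt() {
    try {
      await connect();
      state.ready = true;
      state.lastError = null;
    } catch (err) {
      state.lastError = err.message;
      schedule(attempt, delayMs); // try again after the configured delay
    }
  }

  return {
    start: attempt,
    // Result for /health/ready: 503 "unhealthy" until a connect attempt succeeds.
    readiness() {
      return state.ready
        ? { code: 200, body: { status: "healthy" } }
        : { code: 503, body: { status: "unhealthy", error: state.lastError } };
    },
  };
}
```

Calling `gate.start()` at boot and returning `gate.readiness()` from the `/health/ready` handler would keep Kubernetes from routing traffic until the first successful connection, which matches the behavior the issue describes.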