Commit cc6f765

Merge pull request #16 from Sakeeb91/feature/issue-5-health-check-system
feat: Implement Standardized Health Check System (Issue #5)
2 parents dc0174b + 545575e commit cc6f765

29 files changed

Lines changed: 4750 additions & 26 deletions

.github/PR_DESCRIPTION.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
## Summary

Implements a standardized health check system across all microservices to enable reliable monitoring, load balancing, and orchestration. This PR adds consistent health endpoints that can be consumed by Kubernetes liveness/readiness probes, API gateways, and monitoring systems.

## Changes

### New Package: `@scribemed/health`

- Created reusable health check utilities package
- Supports liveness, readiness, and comprehensive health checks
- Database connectivity checks
- Memory usage monitoring
- Configurable health check handlers

### Service Updates

- **Transcription Service**: Added `/health`, `/health/live`, and `/health/ready` endpoints
- **Documentation Service**: Added health endpoints with database connectivity checks
- **Coding Service**: Added standardized health endpoints
- All services return consistent JSON response format

### Kubernetes Integration

- Created deployment manifests for all services with liveness/readiness probes
- Configured appropriate timeouts and thresholds
- Added health check probes to staging environment

### Testing

- Unit tests for health package (9 tests, all passing)
- Integration tests for service health endpoints
- Updated existing service tests

## Testing

- [x] `pnpm lint` - All code passes linting
- [x] `pnpm test` - All tests passing
  - Health package: 9/9 tests passing
  - Coding service: 5/5 tests passing
  - Transcription service: 4/4 tests passing
  - Documentation service: 4/4 tests passing
- [x] `pnpm build` - All packages build successfully

## Health Check Endpoints

All services now expose three standardized endpoints:

- **`GET /health/live`** - Liveness probe (always returns healthy if process is running)
- **`GET /health/ready`** - Readiness probe (checks critical dependencies like database)
- **`GET /health`** - Comprehensive health check (includes all checks + memory usage)
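
The split between the three endpoints above could be sketched roughly as follows. This is a minimal illustration, not the `@scribemed/health` API: `createHealthHandler`, the `critical`/`informational` option names, and the choice to answer 200 for `degraded` are all assumptions made for the example. Only the 503-on-`unhealthy` behavior is taken from the issue documentation in this commit.

```javascript
// Hypothetical helper, not the actual @scribemed/health API.
// `critical` and `informational` map a check name to an async function
// that resolves to an object with at least a `status` field.
function createHealthHandler(serviceName, { critical = {}, informational = {} } = {}) {
  const routes = new Map([
    ["/health/live", {}],                               // process is up, nothing else is checked
    ["/health/ready", critical],                        // dependencies required to take traffic
    ["/health", { ...critical, ...informational }],     // every registered check
  ]);

  return async function handleHealth(req, res) {
    const path = (req.url || "").split("?")[0];
    if (req.method !== "GET" || !routes.has(path)) return false; // not a health route

    // Run each check; a thrown error counts as an unhealthy result.
    const results = {};
    for (const [name, check] of Object.entries(routes.get(path))) {
      try {
        results[name] = await check();
      } catch (err) {
        results[name] = { status: "unhealthy", error: err.message };
      }
    }

    // The worst individual status decides the overall status.
    const statuses = Object.values(results).map((r) => r.status);
    const status = statuses.includes("unhealthy")
      ? "unhealthy"
      : statuses.includes("degraded") ? "degraded" : "healthy";

    const body = { status, timestamp: new Date().toISOString(), service: serviceName };
    if (path !== "/health/live") body.checks = results;

    // Assumption: only `unhealthy` maps to 503; `degraded` still answers 200.
    res.writeHead(status === "unhealthy" ? 503 : 200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(body));
    return true;
  };
}
```

The handler takes the same `(req, res)` pair that Node's `http.createServer` callback receives and resolves to `false` for any non-health route, so a service can try it first and fall through to its own routing.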

### Response Format

```json
{
  "status": "healthy" | "degraded" | "unhealthy",
  "timestamp": "2024-01-01T00:00:00.000Z",
  "service": "service-name",
  "checks": {
    "database": {
      "status": "healthy",
      "responseTime": 5
    },
    "memory": {
      "status": "healthy",
      "heapUsedMB": 45.2,
      "heapUsagePercent": 45.2
    }
  }
}
```
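
As an illustration of how the `memory` entry in the response above might be produced, here is a minimal sketch built on Node's `process.memoryUsage()`. The function name `checkMemory` and the 80% / 95% thresholds are assumptions chosen for the example; the package's actual thresholds are not stated in this PR.

```javascript
// Hypothetical memory check; the thresholds are illustrative assumptions.
function checkMemory(usage = process.memoryUsage(), thresholds = { degraded: 80, unhealthy: 95 }) {
  const round1 = (n) => Math.round(n * 10) / 10; // keep one decimal place

  const heapUsedMB = usage.heapUsed / (1024 * 1024);
  const heapUsagePercent = (usage.heapUsed / usage.heapTotal) * 100;

  let status = "healthy";
  if (heapUsagePercent >= thresholds.unhealthy) status = "unhealthy";
  else if (heapUsagePercent >= thresholds.degraded) status = "degraded";

  return {
    status,
    heapUsedMB: round1(heapUsedMB),
    heapUsagePercent: round1(heapUsagePercent),
  };
}
```

Passing `usage` as a parameter keeps the function testable without depending on the real heap state.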

## Kubernetes Probes

Example probe configuration:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

## Files Changed

- **New Files:**
  - `packages/health/` - Complete health check package
  - `infrastructure/kubernetes/staging/*-service.yaml` - Service deployments with probes
  - `docs/issues/0005-health-check-system.md` - Issue documentation
  - `docs/issues/0006-health-check-enhancements.md` - Follow-up enhancements issue

- **Modified Files:**
  - `services/*/package.json` - Added health package dependency
  - `services/*/src/server.js` - Integrated health endpoints
  - `services/*/tests/server.test.js` - Added health check tests

## Related Issues

- Closes #5
- Related: #6 (follow-up enhancements identified during implementation)

## Documentation

- Health package README: `packages/health/README.md`
- Issue documentation: `docs/issues/0005-health-check-system.md`
- Follow-up enhancements: `docs/issues/0006-health-check-enhancements.md`

## Next Steps

After merge, consider implementing enhancements from issue #6:

- Timeout management for health checks
- Metrics integration (Prometheus)
- Circuit breaker pattern
- Health check result caching
- Configuration flexibility

.github/workflows/cd.yml

Lines changed: 2 additions & 1 deletion
@@ -59,7 +59,8 @@ jobs:
       - name: Run environment smoke checks
         run: ${{ steps.kubeconfig.outputs.health_script }}
       - name: Notify Slack
-        if: secrets.SLACK_WEBHOOK != ''
+        if: always()
+        continue-on-error: true
         uses: slackapi/slack-github-action@v1.25.0
         with:
           payload: |

.github/workflows/ci.yml

Lines changed: 10 additions & 8 deletions
@@ -31,7 +31,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: 20
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Run ESLint
@@ -59,7 +58,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: ${{ matrix.node }}
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Run unit test suite
@@ -77,7 +75,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: 20
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Run integration suite
@@ -140,7 +137,6 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: 20
-          cache: pnpm
       - name: Install dependencies
         run: pnpm install --frozen-lockfile
       - name: Build packages and services
@@ -167,6 +163,9 @@ jobs:
       contents: read
     steps:
       - uses: actions/checkout@v4
+      - name: Derive image metadata
+        id: image-meta
+        run: echo "repository=$(echo '${{ github.repository }}' | tr '[:upper:]' '[:lower:]')" >> "$GITHUB_OUTPUT"
       - name: Log in to GitHub Container Registry
         uses: docker/login-action@v3
         with:
@@ -179,7 +178,9 @@ jobs:
           context: .
           file: services/${{ matrix.service }}/Dockerfile
           push: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
-          tags: ghcr.io/${{ github.repository }}/${{ matrix.service }}:${{ github.sha }}
+          tags: |
+            ghcr.io/${{ steps.image-meta.outputs.repository }}/${{ matrix.service }}-service:${{ github.sha }}
+            ghcr.io/${{ steps.image-meta.outputs.repository }}/${{ matrix.service }}-service:latest
           cache-from: type=gha
           cache-to: type=gha,mode=max

@@ -204,7 +205,8 @@ jobs:
       - name: Run staging smoke tests
         run: scripts/ci/health-check-staging.sh
       - name: Notify Slack
-        if: secrets.SLACK_WEBHOOK != ''
+        if: always()
+        continue-on-error: true
         uses: slackapi/slack-github-action@v1.25.0
         with:
           payload: |
@@ -235,7 +237,8 @@ jobs:
       - name: Run production smoke tests
         run: scripts/ci/health-check-production.sh
       - name: Notify Slack
-        if: secrets.SLACK_WEBHOOK != ''
+        if: always()
+        continue-on-error: true
         uses: slackapi/slack-github-action@v1.25.0
         with:
           payload: |
@@ -244,4 +247,3 @@ jobs:
             }
         env:
           SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
-
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# Issue #5: Implement Standardized Health Check System

## Summary

Implement a standardized health check system across all microservices to enable reliable monitoring, load balancing, and orchestration. This system should provide consistent health endpoints that can be consumed by Kubernetes liveness/readiness probes, API gateways, and monitoring systems.

## Background

Currently, services lack standardized health check endpoints. This makes it difficult to:

- Configure Kubernetes liveness and readiness probes
- Monitor service health in production
- Implement proper load balancing
- Detect and handle degraded service states
- Integrate with monitoring and alerting systems

## Proposed Changes

1. **Create shared health check package**
   - Add `packages/health/` package with reusable health check utilities
   - Support for basic health, readiness, and liveness checks
   - Database connectivity checks
   - Dependency health checks (external services, etc.)

2. **Implement health endpoints in all services**
   - Add `/health`, `/health/ready`, and `/health/live` endpoints
   - Integrate with existing services (transcription, documentation, coding)
   - Return standardized JSON responses

3. **Update Kubernetes configurations**
   - Add liveness and readiness probes to service deployments
   - Configure appropriate timeouts and thresholds

4. **Add health check tests**
   - Unit tests for health check logic
   - Integration tests for health endpoints

## Acceptance Criteria

- [x] `packages/health/` package created with reusable health check utilities
- [x] All services expose `/health`, `/health/ready`, and `/health/live` endpoints
- [x] Health endpoints return standardized JSON responses
- [x] Database connectivity is checked in readiness probes
- [x] Kubernetes manifests updated with liveness/readiness probes
- [x] Unit and integration tests added for health checks
- [x] Documentation updated with health check usage

## Implementation Details

### Health Check Types

- **Liveness**: Indicates if the service is running (should always return 200 if process is alive)
- **Readiness**: Indicates if the service is ready to accept traffic (checks dependencies like DB)
- **Health**: Comprehensive health status including dependencies

### Runtime Behavior

- Services with external dependencies keep `/health/ready` in a failing state until critical checks pass. For example, the documentation service now surfaces an `unhealthy` status (503) whenever the PostgreSQL pool cannot be established, preventing Kubernetes from routing traffic prematurely.
- Dependency initialization is retried automatically. The retry cadence is controlled via the optional `DATABASE_RETRY_DELAY_MS` environment variable (defaults to 5000ms).
- Comprehensive `/health` responses always include dependency check details, even when readiness is blocked, to aid debugging and alerting.
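
The retry behavior described above might look something like the sketch below. It is an illustration under stated assumptions rather than the documentation service's actual code: `startDependencyInit`, the shape of the `state` object, and the injectable `sleep` are hypothetical. Only the `DATABASE_RETRY_DELAY_MS` variable and its 5000ms default come from the text.

```javascript
// Hypothetical retry loop: keeps readiness failing until `connect` succeeds.
function startDependencyInit(connect, options = {}) {
  const {
    // Falls back to 5000ms when the variable is unset or not a number.
    retryDelayMs = Number(process.env.DATABASE_RETRY_DELAY_MS) || 5000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  const state = { ready: false, attempts: 0, lastError: null };

  // Runs in the background; callers may await `done` but do not have to.
  const done = (async () => {
    while (!state.ready) {
      state.attempts += 1;
      try {
        await connect();
        state.ready = true;
        state.lastError = null;
      } catch (err) {
        state.lastError = err.message;
        await sleep(retryDelayMs);
      }
    }
  })();

  // Result suitable for use as a readiness check.
  const readiness = () =>
    state.ready
      ? { status: "healthy" }
      : { status: "unhealthy", error: state.lastError ?? "initializing" };

  return { state, done, readiness };
}
```

Because `readiness()` reports `unhealthy` until the first successful connection, a readiness endpoint built on it would stay failing through startup and through any run of failed attempts.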

### Response Format

```json
{
  "status": "healthy" | "degraded" | "unhealthy",
  "timestamp": "2024-01-01T00:00:00Z",
  "checks": {
    "database": {
      "status": "healthy",
      "responseTime": 5
    },
    "memory": {
      "status": "healthy",
      "usage": 45.2
    }
  }
}
```
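
One way the `database` entry's `responseTime` could be measured is to time an arbitrary probe, as in the sketch below. `timedCheck` is a hypothetical name, and treating `responseTime` as whole milliseconds is an assumption; the issue does not state the unit.

```javascript
// Hypothetical wrapper: runs `probe` and reports status plus elapsed time.
// `now` is injectable so the timing can be controlled in tests.
async function timedCheck(probe, now = () => Date.now()) {
  const start = now();
  try {
    await probe();
    return { status: "healthy", responseTime: Math.round(now() - start) };
  } catch (err) {
    return {
      status: "unhealthy",
      responseTime: Math.round(now() - start),
      error: err.message,
    };
  }
}
```

With a node-postgres pool (assumed here, since the text mentions a PostgreSQL pool), the probe could be as small as `() => pool.query("SELECT 1")`.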

## Status: Completed

## Related Issues

- Issue #2: Monorepo Developer Experience
- Issue #15: CI/CD Pipeline (health checks needed for deployment)
