
Infrastructure Improvements v3.17.0

Summary

Fixed 3 critical infrastructure issues to improve production reliability, prevent memory leaks, and protect against abuse.


Issue #4: Error Recovery Mechanisms ✅

Problem: No automatic retry for failed normalizations. Transient errors (network issues, worker crashes) caused permanent data loss.

Solution: Added automatic retry with exponential backoff to ChunkedNormalizer.

Changes Made:

  1. Added retry configuration to ChunkedNormalizerConfig (sketched after this list):

    • maxRetries (default: 3)
    • retryDelayMs (default: 1000ms)
  2. Implemented processChunkWithRetry() method:

    • Exponential backoff: 1s, 2s, 4s, 8s, ..., capped at 30s
    • Tracks retry attempts in stats (retriedChunks)
    • Logs warnings with attempt count and delay
    • Throws error only after max retries exceeded
  3. Updated ProcessingStats interface:

    • Added retriedChunks: number field
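
As a sketch, the additions look like this (surrounding fields omitted; their exact placement in the interfaces is an assumption):

interface ChunkedNormalizerConfig {
  maxRetries: number;    // default: 3
  retryDelayMs: number;  // default: 1000 (1s base delay)
  // ...existing fields
}

interface ProcessingStats {
  retriedChunks: number; // chunks that needed at least one retry
  // ...existing fields
}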

Code Example:

// Retry logic with exponential backoff
const delay = Math.min(
  this.config.retryDelayMs * Math.pow(2, attempt),
  30000 // max 30 seconds
);

console.warn(
  `Chunk ${chunkIndex} failed (attempt ${attempt + 1}/${this.config.maxRetries}), ` +
  `retrying in ${delay}ms...`,
  error
);
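
Put together, processChunkWithRetry() might look like the following (a sketch; processChunk(), Chunk, and ChunkResult are assumed names for the existing per-chunk pipeline):

private async processChunkWithRetry(chunk: Chunk, chunkIndex: number): Promise<ChunkResult> {
  let lastError: unknown;

  for (let attempt = 0; attempt < this.config.maxRetries; attempt++) {
    try {
      return await this.processChunk(chunk, chunkIndex);
    } catch (error) {
      lastError = error;
      if (attempt === this.config.maxRetries - 1) break; // retries exhausted

      if (attempt === 0) this.stats.retriedChunks++; // count this chunk as retried (once per chunk)

      const delay = Math.min(
        this.config.retryDelayMs * Math.pow(2, attempt),
        30000 // max 30 seconds
      );
      console.warn(
        `Chunk ${chunkIndex} failed (attempt ${attempt + 1}/${this.config.maxRetries}), ` +
        `retrying in ${delay}ms...`,
        error
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  // Only reached after maxRetries attempts have failed
  throw lastError;
}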

Impact:

  • ✅ Transient errors automatically recovered
  • ✅ Reduced data loss from temporary failures
  • ✅ Better user experience (fewer failed jobs)
  • ✅ Detailed logging for debugging

Issue #5: Memory Leaks in Worker Pool ✅

Problem: Workers were never recycled, causing memory buildup over time, and worker tracking data was never cleaned up.

Solution: Added automatic worker recycling and proper cleanup.

Changes Made:

  1. Added memory monitoring configuration:

    • maxWorkerMemoryMB (default: 500MB)
    • workerRecycleAfterChunks (default: 100 chunks)
  2. Implemented worker chunk tracking:

    • Added workerChunkCounts: Map<Worker, number> to track chunks per worker
    • Incremented count after each chunk processed
  3. Implemented getWorker() method:

    • Checks if worker has processed too many chunks
    • Terminates and recreates worker if threshold exceeded
    • Resets chunk count for new worker
  4. Enhanced terminateWorkers() (see the sketch after the code example below):

    • Clears workerChunkCounts map
    • Logs cleanup confirmation
  5. Added lifecycle logging:

    • Worker initialization: [Worker 0] Initialized
    • Worker recycling: [Worker 0] Recycling after 100 chunks
    • Cleanup: [Workers] All workers terminated and cleaned up

Code Example:

// Recycle worker if it has processed too many chunks
const chunkCount = this.workerChunkCounts.get(worker) ?? 0;
if (chunkCount >= this.config.workerRecycleAfterChunks) {
  console.log(`[Worker ${workerIndex}] Recycling after ${chunkCount} chunks`);
  worker.terminate();
  this.workerChunkCounts.delete(worker); // drop the old worker's tracking entry

  // Create a replacement worker with a fresh chunk count
  const newWorker = new Worker(
    new URL('../../../client/src/workers/normalization.worker.ts', import.meta.url),
    { type: 'module' }
  );
  this.workers[workerIndex] = newWorker;
  this.workerChunkCounts.set(newWorker, 0);
}
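
The enhanced terminateWorkers() cleanup described in item 4 could be as simple as (a sketch; the log message matches the lifecycle logging above):

private terminateWorkers(): void {
  for (const worker of this.workers) {
    worker.terminate();
  }
  this.workers = [];
  this.workerChunkCounts.clear(); // release all tracking entries
  console.log('[Workers] All workers terminated and cleaned up');
}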

Impact:

  • ✅ Prevents memory leaks from long-running workers
  • ✅ Automatic worker recycling every 100 chunks
  • ✅ Proper cleanup on termination
  • ✅ Better observability with lifecycle logging

Issue #6: Rate Limiting ✅

Problem: No rate limiting on API endpoints. Users could abuse the system by submitting unlimited jobs.

Solution: Added Redis-based rate limiting using a sliding-window algorithm.

Changes Made:

  1. Created server/_core/rateLimit.ts:

    • Redis client with connection retry
    • checkRateLimit() function with sliding window algorithm
    • rateLimitMiddleware() for tRPC procedures
    • Predefined rate limit configurations
  2. Implemented sliding window algorithm (see the sketch after this list):

    • Uses Redis sorted sets with timestamps as scores
    • Removes old entries outside time window
    • Counts current requests in window
    • Adds new request if under limit
    • Returns remaining requests and reset time
  3. Added rate limits:

    • Job creation: 10 jobs per hour
    • Job listing: 100 requests per minute
    • Report submission: 5 reports per hour
  4. Applied rate limiting to jobRouter:

    • Added rateLimitMiddleware() to job creation endpoint
    • Returns TOO_MANY_REQUESTS error with reset time
  5. Fail-open design:

    • If Redis fails, allows request (doesn't block users)
    • Logs error for monitoring
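
A minimal sketch of the server/_core/rateLimit.ts core, based on the description above (an ioredis client is assumed; the name field, key format, and return shape are illustrative assumptions):

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export interface RateLimitConfig {
  name: string;     // key namespace per endpoint (assumed)
  limit: number;    // max requests per window
  windowMs: number; // window length in milliseconds
}

// Predefined configurations matching the limits listed above
export const RateLimits: Record<string, RateLimitConfig> = {
  JOB_CREATE: { name: 'job-create', limit: 10, windowMs: 60 * 60 * 1000 },
  JOB_LIST: { name: 'job-list', limit: 100, windowMs: 60 * 1000 },
  REPORT_SUBMIT: { name: 'report-submit', limit: 5, windowMs: 60 * 60 * 1000 },
};

export async function checkRateLimit(
  userId: string,
  config: RateLimitConfig
): Promise<{ allowed: boolean; remaining: number; resetMs: number }> {
  const key = `ratelimit:${config.name}:${userId}`;
  const now = Date.now();

  try {
    // 1. Remove entries that have slid out of the time window
    await redis.zremrangebyscore(key, 0, now - config.windowMs);
    // 2. Count requests still inside the window
    const count = await redis.zcard(key);
    if (count >= config.limit) {
      // 3a. Over the limit: report when the oldest entry will expire
      const oldest = await redis.zrange(key, 0, 0, 'WITHSCORES');
      const oldestScore = oldest.length >= 2 ? Number(oldest[1]) : now;
      return { allowed: false, remaining: 0, resetMs: oldestScore + config.windowMs - now };
    }
    // 3b. Under the limit: record this request (timestamp as the score)
    await redis.zadd(key, now, `${now}:${Math.random()}`);
    await redis.pexpire(key, config.windowMs);
    return { allowed: true, remaining: config.limit - count - 1, resetMs: config.windowMs };
  } catch (error) {
    // Fail open: if Redis is unreachable, allow the request and log for monitoring
    console.error('[RateLimit] Redis error, failing open:', error);
    return { allowed: true, remaining: config.limit, resetMs: 0 };
  }
}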

Code Example:

// Rate limiting in job creation
.mutation(async ({ ctx, input }) => {
  // Rate limiting: 10 jobs per hour
  await rateLimitMiddleware(ctx.user.id, RateLimits.JOB_CREATE);
  
  // ... rest of job creation logic
});
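
The rateLimitMiddleware() call above could be a thin wrapper over checkRateLimit() (a sketch; TOO_MANY_REQUESTS is a standard tRPC error code, and the message format matches the response shown below):

import { TRPCError } from '@trpc/server';
import { checkRateLimit, type RateLimitConfig } from './rateLimit';

// Throws a tRPC error when the caller is over the limit; no-op otherwise
export async function rateLimitMiddleware(
  userId: string,
  config: RateLimitConfig
): Promise<void> {
  const result = await checkRateLimit(userId, config);
  if (!result.allowed) {
    throw new TRPCError({
      code: 'TOO_MANY_REQUESTS',
      message: `Rate limit exceeded. Try again in ${Math.ceil(result.resetMs / 1000)} seconds.`,
    });
  }
}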

Rate Limit Response:

{
  "error": {
    "code": "TOO_MANY_REQUESTS",
    "message": "Rate limit exceeded. Try again in 3456 seconds."
  }
}

Impact:

  • ✅ Prevents abuse of job submission
  • ✅ Protects system resources
  • ✅ Fair usage across all users
  • ✅ Clear error messages with reset time
  • ✅ Fail-open design (doesn't block on Redis failure)

Files Modified

Issue #4 (Error Recovery):

  • shared/normalization/intelligent/ChunkedNormalizer.ts
    • Added maxRetries and retryDelayMs config
    • Added retriedChunks to stats
    • Implemented processChunkWithRetry() method

Issue #5 (Memory Leaks):

  • shared/normalization/intelligent/ChunkedNormalizer.ts
    • Added maxWorkerMemoryMB and workerRecycleAfterChunks config
    • Added workerChunkCounts Map for tracking
    • Implemented getWorker() method for recycling
    • Enhanced terminateWorkers() with cleanup
    • Added lifecycle logging

Issue #6 (Rate Limiting):

  • server/_core/rateLimit.ts (NEW FILE)
    • Redis client setup
    • Sliding window algorithm
    • Rate limit middleware
    • Predefined configurations
  • server/jobRouter.ts
    • Added rate limiting to job creation endpoint

Testing Notes

Error Recovery:

  • Simulate worker failures to test retry logic
  • Verify exponential backoff delays
  • Check retriedChunks stat increments correctly

Memory Leaks:

  • Process 200+ chunks to trigger worker recycling
  • Monitor memory usage over time
  • Verify workers are properly terminated

Rate Limiting:

  • Submit 11 jobs within 1 hour to trigger rate limit
  • Verify error message includes reset time
  • Test Redis failure scenario (fail-open)
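
For the rate-limiting case, a test along these lines would cover both the limit and the error path (a sketch using vitest against a running Redis; the import path assumes the file layout above):

import { describe, expect, it } from 'vitest';
import { checkRateLimit, RateLimits } from '../server/_core/rateLimit';

describe('job creation rate limit', () => {
  it('allows 10 jobs, then blocks the 11th with a reset time', async () => {
    const userId = `test-user-${Date.now()}`; // fresh key so runs don't interfere

    // The first 10 requests in the window should pass
    for (let i = 0; i < 10; i++) {
      const result = await checkRateLimit(userId, RateLimits.JOB_CREATE);
      expect(result.allowed).toBe(true);
    }

    // The 11th should be rejected and report when the window resets
    const blocked = await checkRateLimit(userId, RateLimits.JOB_CREATE);
    expect(blocked.allowed).toBe(false);
    expect(blocked.resetMs).toBeGreaterThan(0);
  });
});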

Known Issues

  1. Redis Connection: Rate limiting requires Redis to be running. If Redis is down, rate limiting is disabled (fail-open).

  2. TypeScript Errors: 112 type errors in PhoneEnhanced.ts (non-blocking, app runs correctly).


Next Steps

  1. Add rate limiting to more endpoints:

    • Job listing
    • Report submission
    • File uploads
  2. Add rate limit headers to responses:

    • X-RateLimit-Limit
    • X-RateLimit-Remaining
    • X-RateLimit-Reset
  3. Add memory monitoring dashboard:

    • Track worker memory usage
    • Alert on high memory consumption
    • Visualize recycling events
  4. Fix PhoneEnhanced TypeScript errors:

    • Address 112 type safety issues
    • Improve type definitions

Time Spent

  • Issue #4 (Error Recovery): ~45 minutes
  • Issue #5 (Memory Leaks): ~45 minutes
  • Issue #6 (Rate Limiting): ~1 hour
  • Total: ~2.5 hours

Version History

  • v3.16.1 - Critical deployment fix (environment validation)
  • v3.16.0 - Infrastructure fixes (TypeScript, Redis, env validation)
  • v3.17.0 - Infrastructure improvements (error recovery, memory leaks, rate limiting)