
Critical Fixes for Prometheus/Grafana Monitoring System

Overview

This document outlines critical issues identified during the code review of PR #1 (feature/prometheus-grafana-export) and provides actionable fix tasks with priority levels, effort estimates, and implementation guidance.

Critical Issues by Category

🔴 CRITICAL: Performance Issues

P1.1: Unbounded Query Results

Issue: Loads the entire invocation history into memory every 15 seconds.
Location: tools/prometheus_exporter.py:331-337
Impact: At 1M records, roughly 2 GB of RAM and an 8 s collection time.
Effort: 2 hours

Fix Tasks:

# Task 1: Add time-based filtering
- Modify query to only fetch recent data (last hour/day)
- Add LIMIT clause to prevent unbounded growth
- Example: WHERE timestamp > datetime('now', '-1 hour') LIMIT 10000

# Task 2: Implement pagination for large datasets
- Add offset/limit parameters
- Process data in chunks
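A minimal sketch combining both tasks, assuming the exporter reads the agent_workflow.db SQLite file and that the table and column names match the queries elsewhere in this document; fetch_recent_invocations is an illustrative name, and the window and page size would be tuned per deployment:

import sqlite3

def fetch_recent_invocations(db_path="agent_workflow.db",
                             window="-1 hour", page_size=10000):
    """Yield recent rows in bounded pages instead of loading all history."""
    conn = sqlite3.connect(db_path)
    try:
        offset = 0
        while True:
            rows = conn.execute(
                """SELECT * FROM agent_invocations
                   WHERE timestamp > datetime('now', ?)
                   ORDER BY timestamp
                   LIMIT ? OFFSET ?""",
                (window, page_size, offset),
            ).fetchall()
            if not rows:
                break
            yield rows  # caller processes one bounded chunk at a time
            offset += page_size
    finally:
        conn.close()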

P1.2: N+1 Query Pattern

Issue: 250+ database queries per collection cycle.
Location: tools/prometheus_exporter.py:435-565
Impact: 500-1250 ms of query latency per cycle.
Effort: 3 hours

Fix Tasks:

-- Task 1: Create single aggregated query
SELECT 
    session_id,
    COUNT(DISTINCT agent_name) as unique_agents,
    COUNT(*) as total_invocations,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as success_count,
    AVG(duration) as avg_duration
FROM agent_invocations
WHERE session_id IN (SELECT session_id FROM sessions
                     ORDER BY start_time DESC LIMIT 50)
GROUP BY session_id

-- Task 2: Use JOIN instead of nested queries
-- Task 3: Cache session results for reuse (sketched below)
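A sketch of how the exporter might wire these tasks together; collect_session_stats and the per-cycle cache are illustrative names, and the query assumes sessions has a start_time column (it is indexed in P1.3 below):

SESSION_STATS_QUERY = """
    SELECT session_id,
           COUNT(DISTINCT agent_name) AS unique_agents,
           COUNT(*) AS total_invocations,
           SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS success_count,
           AVG(duration) AS avg_duration
    FROM agent_invocations
    WHERE session_id IN (SELECT session_id FROM sessions
                         ORDER BY start_time DESC LIMIT 50)
    GROUP BY session_id
"""

_cycle_cache = {}  # Task 3: clear this at the start of each collection cycle

def collect_session_stats(conn):
    """Replace 250+ per-session queries with one aggregated round trip."""
    if "session_stats" not in _cycle_cache:
        _cycle_cache["session_stats"] = conn.execute(SESSION_STATS_QUERY).fetchall()
    return _cycle_cache["session_stats"]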

P1.3: Missing Database Indexes

Issue: No indexes on frequently queried columns.
Impact: Queries roughly 10x slower than necessary.
Effort: 30 minutes

Fix Tasks:

-- Task 1: Create performance indexes (IF NOT EXISTS keeps re-runs safe)
CREATE INDEX IF NOT EXISTS idx_invocations_timestamp ON agent_invocations(timestamp);
CREATE INDEX IF NOT EXISTS idx_invocations_session ON agent_invocations(session_id, start_time, end_time);
CREATE INDEX IF NOT EXISTS idx_invocations_agent ON agent_invocations(agent_name, phase);
CREATE INDEX IF NOT EXISTS idx_tool_uses_agent ON agent_tool_uses(agent_name, agent_invocation_id);
CREATE INDEX IF NOT EXISTS idx_sessions_status ON sessions(status, start_time);

-- Task 2: Add composite indexes for complex queries
CREATE INDEX IF NOT EXISTS idx_invocations_composite ON agent_invocations(session_id, agent_name, status);
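If the indexes are applied from the exporter rather than by hand, an idempotent startup migration is one option; a sketch using the statements above (ensure_indexes is an illustrative name):

import sqlite3

INDEX_DDL = [
    "CREATE INDEX IF NOT EXISTS idx_invocations_timestamp ON agent_invocations(timestamp)",
    "CREATE INDEX IF NOT EXISTS idx_sessions_status ON sessions(status, start_time)",
    # ...remaining CREATE INDEX statements from the list above
]

def ensure_indexes(db_path="agent_workflow.db"):
    """Idempotent index migration, run once at exporter startup."""
    with sqlite3.connect(db_path) as conn:
        for ddl in INDEX_DDL:
            conn.execute(ddl)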

P1.4: No Incremental Processing

Issue: Recalculates all metrics from scratch on every cycle.
Impact: Wasted CPU; collection time grows with history size.
Effort: 4 hours

Fix Tasks:

# Task 1: Track last processed timestamp
import json
import os

CHECKPOINT_FILE = "metrics_checkpoint.json"  # assumed path; configure as needed

class MetricsCollector:
    def __init__(self, conn):
        self.conn = conn  # an open sqlite3 connection
        self.last_processed_timestamp = self.load_checkpoint()

    def load_checkpoint(self):
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return json.load(f)["timestamp"]
        return "1970-01-01 00:00:00"  # first run processes everything

    def collect_incremental(self):
        cursor = self.conn.execute('''
            SELECT * FROM agent_invocations
            WHERE timestamp > ?
            ORDER BY timestamp
        ''', (self.last_processed_timestamp,))
        return cursor.fetchall()

    def save_checkpoint(self, timestamp):
        # Persist so a restart resumes from the last processed row
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump({"timestamp": timestamp}, f)

# Task 2: Implement differential metrics (sketched below)
- Only update changed metrics
- Keep running totals in memory
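A sketch of the running-totals idea from Task 2, assuming rows from collect_incremental() expose agent_name and status; the tuple shape is illustrative:

from collections import defaultdict

class RunningTotals:
    """Apply only the new rows each cycle instead of recomputing everything."""

    def __init__(self):
        self.invocations = defaultdict(int)
        self.completions = defaultdict(int)

    def apply(self, rows):
        # rows: (agent_name, status) pairs from the incremental batch
        for agent_name, status in rows:
            self.invocations[agent_name] += 1
            if status == "completed":
                self.completions[agent_name] += 1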

🔴 CRITICAL: Security Issues

S1.1: No Authentication on Metrics Endpoint

Issue: Anyone who can reach port 9090 can read the metrics.
Location: tools/prometheus_exporter.py:736
Impact: Information disclosure; potential DDoS.
Effort: 3 hours

Fix Tasks:

# Task 1: Add Basic Authentication
# Note: prometheus_client's basic_auth_handler authenticates *outbound*
# pushgateway requests; it does not protect the scrape endpoint. Protect
# the server with a reverse proxy (Task 4) or wrap make_wsgi_app();
# see the sketch below.

# Task 2: Implement Bearer Token Authentication
import hmac
import os

def verify_token(token):
    expected = os.environ.get('METRICS_TOKEN')
    if not expected:
        return False  # fail closed when no token is configured
    return hmac.compare_digest(token, expected)

# Task 3: Bind to localhost only
from prometheus_client import start_http_server
start_http_server(port, addr='127.0.0.1', registry=registry)

# Task 4: Add reverse proxy configuration
# nginx.conf example for production
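A minimal sketch of server-side basic auth for Task 1, wrapping prometheus_client's WSGI app in an auth check; METRICS_USER and METRICS_PASS are illustrative environment variable names, not part of the existing code:

import base64
import hmac
import os
from wsgiref.simple_server import make_server

from prometheus_client import make_wsgi_app

metrics_app = make_wsgi_app()

def auth_app(environ, start_response):
    """Serve metrics only to requests carrying the expected Basic header."""
    creds = f"{os.environ['METRICS_USER']}:{os.environ['METRICS_PASS']}"
    expected = "Basic " + base64.b64encode(creds.encode()).decode()
    supplied = environ.get("HTTP_AUTHORIZATION", "")
    if hmac.compare_digest(supplied.encode(), expected.encode()):
        return metrics_app(environ, start_response)
    start_response("401 Unauthorized",
                   [("WWW-Authenticate", 'Basic realm="metrics"')])
    return [b"unauthorized\n"]

if __name__ == "__main__":
    # Bind to localhost; front with nginx/TLS in production (Task 4)
    make_server("127.0.0.1", 9090, auth_app).serve_forever()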

S1.2: Information Disclosure in Metrics

Issue: Metrics expose sensitive paths and raw session IDs.
Location: Multiple locations
Impact: System architecture exposure.
Effort: 2 hours

Fix Tasks:

# Task 1: Hash sensitive identifiers
import hashlib

def hash_session_id(session_id):
    return hashlib.sha256(session_id.encode()).hexdigest()[:8]

# Task 2: Remove absolute paths
database_info.info({
    'database': 'agent_workflow.db',  # Not full path
    'total_invocations': str(total_invocations),
})

# Task 3: Add configuration for metric sensitivity
SENSITIVE_METRICS = ['session_id', 'file_path', 'agent_name']
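A sketch of how the sensitivity list from Task 3 could be applied before labels reach a metric; scrub_labels is a hypothetical helper operating on a plain label dict:

def scrub_labels(labels):
    """Return a copy with sensitive label values reduced to short hashes."""
    return {
        key: hashlib.sha256(str(value).encode()).hexdigest()[:8]
        if key in SENSITIVE_METRICS else value
        for key, value in labels.items()
    }

# e.g. scrub_labels({'session_id': 'sess-123', 'status': 'ok'})
#   -> {'session_id': '<8-char hash>', 'status': 'ok'}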

S1.3: Temporary File Vulnerability

Issue: The installer writes to a predictable temp file.
Location: install-logging.sh:182
Impact: Symlink attacks; code injection.
Effort: 1 hour

Fix Tasks:

# Task 1: Use mktemp for unique files (GNU mktemp treats the trailing
# ".py" as a suffix after the random XXXXXX portion)
TEMP_FILE=$(mktemp /tmp/update_claude_XXXXXX.py)
trap 'rm -f "$TEMP_FILE"' EXIT

# Task 2: Set restrictive permissions
chmod 600 "$TEMP_FILE"

# Task 3: Use a user-specific temp directory
TEMP_DIR="${TMPDIR:-/tmp}/${USER}"
mkdir -p -m 700 "$TEMP_DIR"

S1.4: Unvalidated JSON Input

Issue: Hook input is parsed without schema validation.
Location: hooks/log_tool_operations.py:256
Impact: Injection attacks; crashes.
Effort: 2 hours

Fix Tasks:

# Task 1: Add JSON schema validation
import jsonschema

HOOK_SCHEMA = {
    "type": "object",
    "properties": {
        "session_id": {"type": "string", "maxLength": 100},
        "tool_name": {"type": "string", "pattern": "^[a-zA-Z0-9_-]+$"},
        "tool_input": {"type": "object"}
    },
    "required": ["session_id", "tool_name"]
}

def validate_input(data):
    jsonschema.validate(data, HOOK_SCHEMA)

# Task 2: Add size limits
MAX_INPUT_SIZE = 10 * 1024  # 10KB
if len(input_data) > MAX_INPUT_SIZE:
    raise ValueError("Input too large")
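A sketch of how the hook's entry point could combine both tasks, assuming the hook receives its JSON payload on stdin; read_validated_input is an illustrative name:

import json
import sys

import jsonschema

def read_validated_input():
    """Read, size-check, parse, and schema-validate the hook payload."""
    raw = sys.stdin.read(MAX_INPUT_SIZE + 1)
    if len(raw) > MAX_INPUT_SIZE:
        raise ValueError("Input too large")
    data = json.loads(raw)
    jsonschema.validate(data, HOOK_SCHEMA)  # HOOK_SCHEMA from Task 1
    return data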

🔴 CRITICAL: Testing Gap

T1.1: Zero Test Coverage

Issue: 1,079 lines of new code ship with zero tests.
Impact: Regression risk; unreliable deployment.
Effort: 8 hours

Fix Tasks:

# Task 1: Create test structure
tests/
├── unit/
│   ├── test_prometheus_exporter.py
│   ├── test_log_tool_operations.py
│   └── test_metrics_calculations.py
├── integration/
│   ├── test_database_operations.py
│   └── test_prometheus_integration.py
└── fixtures/
    └── sample_data.sql

# Task 2: Implement critical path tests
def test_database_connection_failure():
    """Test graceful handling of DB errors"""
    
def test_metric_calculation_accuracy():
    """Verify metric calculations are correct"""
    
def test_large_dataset_performance():
    """Ensure performance with 100k+ records"""

# Task 3: Add CI/CD test execution
# .github/workflows/test.yml
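As a concrete starting point, a pytest sketch for the first two stubs; the table schema and values here are illustrative, not the project's real data:

import sqlite3
import pytest

def test_database_connection_failure(tmp_path):
    """sqlite3 raises OperationalError when the path cannot be opened."""
    bad_path = tmp_path / "not_a_db"
    bad_path.mkdir()  # a directory cannot be opened as a database
    with pytest.raises(sqlite3.OperationalError):
        sqlite3.connect(str(bad_path))

def test_metric_calculation_accuracy(tmp_path):
    """AVG/SUM aggregates match hand-computed values."""
    conn = sqlite3.connect(str(tmp_path / "t.db"))
    conn.execute("CREATE TABLE agent_invocations"
                 " (agent_name TEXT, status TEXT, duration REAL)")
    conn.executemany("INSERT INTO agent_invocations VALUES (?, ?, ?)",
                     [("planner", "completed", 1.0), ("planner", "failed", 3.0)])
    avg, completed = conn.execute(
        "SELECT AVG(duration), SUM(status = 'completed') FROM agent_invocations"
    ).fetchone()
    assert (avg, completed) == (2.0, 1)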

Implementation Priority Matrix

| Priority | Issue | Effort | Impact | Dependencies |
|----------|-------|--------|--------|--------------|
| 1 | S1.1: No Authentication | 3h | Critical | None |
| 2 | P1.3: Missing Indexes | 30m | High | None |
| 3 | P1.1: Unbounded Queries | 2h | Critical | P1.3 |
| 4 | S1.3: Temp File Vuln | 1h | High | None |
| 5 | P1.2: N+1 Queries | 3h | High | P1.3 |
| 6 | T1.1: Test Coverage | 8h | Critical | All above |
| 7 | S1.2: Info Disclosure | 2h | Medium | S1.1 |
| 8 | P1.4: Incremental Processing | 4h | High | P1.1 |
| 9 | S1.4: JSON Validation | 2h | Medium | None |

Quick Wins (< 1 hour each)

  1. Add database indexes - Immediate 10x performance boost
  2. Bind to localhost - Quick security fix
  3. Fix temp file vulnerability - Simple mktemp change
  4. Add query limits - Prevent memory exhaustion

Comprehensive Fix (Full Sprint)

Week 1: Critical Security & Performance

  • Day 1: Add authentication (S1.1)
  • Day 2: Fix unbounded queries and add indexes (P1.1, P1.3)
  • Day 3: Fix N+1 pattern (P1.2)
  • Day 4: Add basic tests (T1.1 partial)
  • Day 5: Security fixes (S1.2, S1.3, S1.4)

Week 2: Optimization & Testing

  • Day 1-2: Incremental processing (P1.4)
  • Day 3-4: Comprehensive test suite (T1.1 complete)
  • Day 5: Performance testing and optimization

Validation Checklist

After implementing fixes, verify:

  • Metrics endpoint requires authentication
  • Database queries complete in < 100ms
  • Memory usage stays under 100MB
  • Test coverage > 80%
  • No sensitive data in metrics
  • Graceful handling of all error cases
  • Performance with 1M+ records acceptable
  • Security scan passes
  • Load testing successful
  • Documentation updated

Rollback Plan

If issues occur after deployment:

  1. Immediate: Disable prometheus_exporter service
  2. Short-term: Revert to previous monitoring solution
  3. Recovery:
    • Restore from database backup if corrupted
    • Clear prometheus metrics cache
    • Restart with reduced collection frequency

Success Metrics

Post-fix deployment should achieve:

  • Collection time < 100ms for 100k records
  • Memory usage < 100MB constant
  • Zero unauthorized access attempts
  • 95%+ test coverage on critical paths
  • < 5ms p99 metric query response time

Next Steps

  1. Prioritize which fixes to implement first based on environment (dev/staging/prod)
  2. Assign tasks to team members based on expertise
  3. Schedule implementation sprint
  4. Review this plan with security and performance teams
  5. Execute fixes in priority order
  6. Validate each fix with tests before moving to next

Notes

  • All code examples are simplified for clarity
  • Actual implementation may require additional error handling
  • Consider feature flags for gradual rollout
  • Monitor metrics after each fix deployment
  • Keep this document updated as fixes are completed