This document outlines critical issues identified during the code review of PR #1 (feature/prometheus-grafana-export) and provides actionable fix tasks with priority levels, effort estimates, and implementation guidance.
Issue P1.1 (Unbounded Queries): Loading entire history into memory every 15 seconds
Location: tools/prometheus_exporter.py:331-337
Impact: At 1M records = 2GB RAM, 8s collection time
Effort: 2 hours
Fix Tasks:
# Task 1: Add time-based filtering
- Modify query to only fetch recent data (last hour/day)
- Add LIMIT clause to prevent unbounded growth
- Example: WHERE timestamp > datetime('now', '-1 hour') LIMIT 10000
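A sketch of Task 1's bounded query in Python (sqlite3 assumed, matching the datetime() syntax above; the function name is illustrative):

```python
import sqlite3

def fetch_recent_invocations(conn: sqlite3.Connection, limit: int = 10000):
    # Bound both the time window and the row count
    return conn.execute('''
        SELECT * FROM agent_invocations
        WHERE timestamp > datetime('now', '-1 hour')
        ORDER BY timestamp
        LIMIT ?
    ''', (limit,)).fetchall()
```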
# Task 2: Implement pagination for large datasets
- Add offset/limit parameters
- Process data in chunks (see the keyset-pagination sketch below)
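A chunked-iteration sketch for Task 2. It pages on the last seen timestamp (keyset pagination) rather than OFFSET, which degrades as the offset grows; this is a suggested approach, not the PR's current code:

```python
import sqlite3

def iter_invocations(conn: sqlite3.Connection, chunk_size: int = 1000):
    conn.row_factory = sqlite3.Row
    last_ts = ''
    while True:
        rows = conn.execute('''
            SELECT * FROM agent_invocations
            WHERE timestamp > ?
            ORDER BY timestamp
            LIMIT ?
        ''', (last_ts, chunk_size)).fetchall()
        if not rows:
            break
        yield from rows
        # Assumes timestamps are unique; add a rowid tiebreaker if they are not
        last_ts = rows[-1]['timestamp']
```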

Issue P1.2 (N+1 Queries): 250+ database queries per collection cycle
Location: tools/prometheus_exporter.py:435-565
Impact: 500-1250ms query latency
Effort: 3 hours
Fix Tasks:
-- Task 1: Create single aggregated query
SELECT
    session_id,
    COUNT(DISTINCT agent_name) AS unique_agents,
    COUNT(*) AS total_invocations,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS success_count,
    AVG(duration) AS avg_duration
FROM agent_invocations
WHERE session_id IN (SELECT session_id FROM sessions LIMIT 50)
GROUP BY session_id
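Consumed in one round trip per cycle instead of a per-session loop (a sketch; `AGGREGATED_QUERY` holds the SELECT above and the gauge objects are illustrative, not names from the PR):

```python
import sqlite3

def update_session_metrics(conn, unique_agents_gauge, success_rate_gauge):
    # One round trip replaces the former per-session query loop
    conn.row_factory = sqlite3.Row
    for row in conn.execute(AGGREGATED_QUERY):
        session = row['session_id']
        unique_agents_gauge.labels(session=session).set(row['unique_agents'])
        success_rate_gauge.labels(session=session).set(
            row['success_count'] / max(row['total_invocations'], 1))
```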
-- Task 2: Use JOIN instead of nested queries
-- Task 3: Cache session results for reuse (see the TTL-cache sketch below)
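A minimal TTL-cache sketch for Task 3 (the helper and the 60-second TTL are assumptions, not the PR's code):

```python
import time

_CACHE: dict = {}

def cached(key, ttl_seconds, compute):
    # Reuse a computed result until it is ttl_seconds old
    value, expires = _CACHE.get(key, (None, 0.0))
    if time.monotonic() >= expires:
        value = compute()
        _CACHE[key] = (value, time.monotonic() + ttl_seconds)
    return value

# Example: sessions = cached('sessions', 60, fetch_sessions)
```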

Issue P1.3 (Missing Indexes): No indexes on frequently queried columns
Impact: 10x slower queries
Effort: 30 minutes
Fix Tasks:
-- Task 1: Create performance indexes
CREATE INDEX idx_invocations_timestamp ON agent_invocations(timestamp);
CREATE INDEX idx_invocations_session ON agent_invocations(session_id, start_time, end_time);
CREATE INDEX idx_invocations_agent ON agent_invocations(agent_name, phase);
CREATE INDEX idx_tool_uses_agent ON agent_tool_uses(agent_name, agent_invocation_id);
CREATE INDEX idx_sessions_status ON sessions(status, start_time);
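After creating the indexes, SQLite's EXPLAIN QUERY PLAN can confirm they are actually used (the expected output shown is typical, not exact):

```sql
EXPLAIN QUERY PLAN
SELECT * FROM agent_invocations
WHERE timestamp > datetime('now', '-1 hour');
-- Expect: SEARCH agent_invocations USING INDEX idx_invocations_timestamp (timestamp>?)
```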
-- Task 2: Add composite indexes for complex queries
CREATE INDEX idx_invocations_composite ON agent_invocations(session_id, agent_name, status);

Issue P1.4 (Incremental Processing): Recalculates all metrics from scratch
Impact: CPU waste, growing collection time
Effort: 4 hours
Fix Tasks:
# Task 1: Track last processed timestamp
class MetricsCollector:
    CHECKPOINT_FILE = 'metrics_checkpoint.txt'

    def __init__(self, conn):
        self.conn = conn  # sqlite3 connection
        self.last_processed_timestamp = self.load_checkpoint()

    def load_checkpoint(self):
        try:
            with open(self.CHECKPOINT_FILE) as f:
                return f.read().strip()
        except FileNotFoundError:
            return '1970-01-01 00:00:00'  # first run: process everything

    def collect_incremental(self):
        # Only fetch rows added since the last checkpoint
        cursor = self.conn.execute('''
            SELECT * FROM agent_invocations
            WHERE timestamp > ?
            ORDER BY timestamp
        ''', (self.last_processed_timestamp,))
        return cursor.fetchall()

    def save_checkpoint(self, timestamp):
        # Persist to file or database
        with open(self.CHECKPOINT_FILE, 'w') as f:
            f.write(timestamp)
# Task 2: Implement differential metrics
- Only update changed metrics
- Keep running totals in memory (see the sketch below)
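A sketch of in-memory running totals, fed by the incremental rows from Task 1 (column names follow the queries above; the class itself is illustrative and assumes sqlite3.Row rows):

```python
from collections import Counter

class RunningTotals:
    def __init__(self):
        self.invocations = Counter()
        self.successes = Counter()

    def apply(self, rows):
        # Apply only the rows fetched since the last checkpoint
        for row in rows:
            self.invocations[row['agent_name']] += 1
            if row['status'] == 'completed':
                self.successes[row['agent_name']] += 1
```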

Issue S1.1 (No Authentication): Anyone can access metrics on port 9090
Location: tools/prometheus_exporter.py:736
Impact: Information disclosure, potential DDoS
Effort: 3 hours
Fix Tasks:
# Task 1: Add Basic Authentication
# Note: prometheus_client's basic_auth_handler authenticates *outgoing*
# pushgateway requests, not inbound scrapes, so it cannot secure this
# endpoint; wrap the exporter's WSGI app with an auth check instead
# (see the sketch below).
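A minimal sketch of such a wrapper, assuming the exporter can switch from start_http_server to serving prometheus_client's make_wsgi_app(); the credential handling is illustrative only:

```python
import base64
import hmac
from wsgiref.simple_server import make_server

from prometheus_client import make_wsgi_app

def start_secure_server(port, username, password):
    # Precompute the expected Authorization header value
    expected = 'Basic ' + base64.b64encode(
        f'{username}:{password}'.encode()).decode()
    metrics_app = make_wsgi_app()

    def auth_app(environ, start_response):
        # Reject requests without matching Basic credentials
        supplied = environ.get('HTTP_AUTHORIZATION', '')
        if not hmac.compare_digest(supplied, expected):
            start_response('401 Unauthorized',
                           [('WWW-Authenticate', 'Basic realm="metrics"')])
            return [b'Unauthorized']
        return metrics_app(environ, start_response)

    make_server('127.0.0.1', port, auth_app).serve_forever()
```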
# Task 2: Implement Bearer Token Authentication
import hmac
import os

def verify_token(token):
    expected = os.environ.get('METRICS_TOKEN', '')
    return bool(expected) and hmac.compare_digest(token, expected)
# Task 3: Bind to localhost only
start_http_server(port, addr='127.0.0.1', registry=registry)
# Task 4: Add reverse proxy configuration
# nginx.conf example for production:
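A hedged sketch of that nginx configuration, assuming the exporter is bound to 127.0.0.1:9090 per Task 3; the hostname, certificate paths, and htpasswd file are placeholders:

```nginx
server {
    listen 443 ssl;
    server_name metrics.example.com;                  # placeholder hostname
    ssl_certificate     /etc/nginx/tls/metrics.crt;   # placeholder cert
    ssl_certificate_key /etc/nginx/tls/metrics.key;

    location /metrics {
        auth_basic           "Metrics";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090/metrics;
    }
}
```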

Issue S1.2 (Info Disclosure): Exposing sensitive paths and session IDs
Location: Multiple locations
Impact: System architecture exposure
Effort: 2 hours
Fix Tasks:
# Task 1: Hash sensitive identifiers
import hashlib
def hash_session_id(session_id):
    return hashlib.sha256(session_id.encode()).hexdigest()[:8]
# Task 2: Remove absolute paths
database_info.info({
    'database': 'agent_workflow.db',  # Not full path
    'total_invocations': str(total_invocations),
})
# Task 3: Add configuration for metric sensitivity
SENSITIVE_METRICS = ['session_id', 'file_path', 'agent_name']
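One way the sensitivity list could be applied, as a sketch: scrub_labels is a hypothetical helper that reuses hash_session_id from Task 1.

```python
def scrub_labels(labels: dict) -> dict:
    # Hash any label value whose key is marked sensitive; pass others through
    return {k: (hash_session_id(v) if k in SENSITIVE_METRICS else v)
            for k, v in labels.items()}
```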

Issue S1.3 (Temp File Vuln): Predictable temp file in installation
Location: install-logging.sh:182
Impact: Symlink attacks, code injection
Effort: 1 hour
Fix Tasks:
# Task 1: Use mktemp for unique files
# (mktemp requires the X's to be trailing; pass the extension via --suffix on GNU)
TEMP_FILE=$(mktemp --suffix=.py /tmp/update_claude_XXXXXX)
trap 'rm -f "$TEMP_FILE"' EXIT
# Task 2: Set restrictive permissions
chmod 600 "$TEMP_FILE"
# Task 3: Use user-specific temp directory
TEMP_DIR="${TMPDIR:-/tmp}/${USER}"
mkdir -p -m 700 "$TEMP_DIR"

Issue S1.4 (JSON Validation): No schema validation for hook input
Location: hooks/log_tool_operations.py:256
Impact: Injection attacks, crashes
Effort: 2 hours
Fix Tasks:
# Task 1: Add JSON schema validation
import jsonschema
HOOK_SCHEMA = {
    "type": "object",
    "properties": {
        "session_id": {"type": "string", "maxLength": 100},
        "tool_name": {"type": "string", "pattern": "^[a-zA-Z0-9_-]+$"},
        "tool_input": {"type": "object"}
    },
    "required": ["session_id", "tool_name"]
}

def validate_input(data):
    jsonschema.validate(data, HOOK_SCHEMA)
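A usage sketch, assuming the hook reads its payload from stdin (jsonschema.ValidationError is the exception validate() raises on bad input):

```python
import json
import sys

try:
    validate_input(json.loads(sys.stdin.read()))
except (json.JSONDecodeError, jsonschema.ValidationError) as exc:
    sys.exit(f"Rejected hook input: {exc}")
```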
# Task 2: Add size limits
MAX_INPUT_SIZE = 10 * 1024 # 10KB
if len(input_data) > MAX_INPUT_SIZE:
    raise ValueError("Input too large")

Issue T1.1 (Test Coverage): 1,079 lines of new code with no tests
Impact: Regression risk, unreliable deployment
Effort: 8 hours
Fix Tasks:
# Task 1: Create test structure
tests/
├── unit/
│   ├── test_prometheus_exporter.py
│   ├── test_log_tool_operations.py
│   └── test_metrics_calculations.py
├── integration/
│   ├── test_database_operations.py
│   └── test_prometheus_integration.py
└── fixtures/
    └── sample_data.sql
# Task 2: Implement critical path tests
def test_database_connection_failure():
    """Test graceful handling of DB errors."""

def test_metric_calculation_accuracy():
    """Verify metric calculations are correct."""

def test_large_dataset_performance():
    """Ensure performance with 100k+ records."""
# Task 3: Add CI/CD test execution
# .github/workflows/test.yml (sketch below)
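A hedged sketch of that workflow file; the Python version and dependency install step are assumptions to adapt to the repo:

```yaml
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt pytest
      - run: pytest tests/ -v
```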

Priority Order:

| Priority | Issue | Effort | Impact | Dependencies |
|---|---|---|---|---|
| 1 | S1.1: No Authentication | 3h | Critical | None |
| 2 | P1.3: Missing Indexes | 30m | High | None |
| 3 | P1.1: Unbounded Queries | 2h | Critical | P1.3 |
| 4 | S1.3: Temp File Vuln | 1h | High | None |
| 5 | P1.2: N+1 Queries | 3h | High | P1.3 |
| 6 | T1.1: Test Coverage | 8h | Critical | All above |
| 7 | S1.2: Info Disclosure | 2h | Medium | S1.1 |
| 8 | P1.4: Incremental Processing | 4h | High | P1.1 |
| 9 | S1.4: JSON Validation | 2h | Medium | None |

Quick Wins:
- Add database indexes - Immediate 10x performance boost
- Bind to localhost - Quick security fix
- Fix temp file vulnerability - Simple mktemp change
- Add query limits - Prevent memory exhaustion

Week 1:
- Day 1: Add authentication (S1.1)
- Day 2: Fix unbounded queries and add indexes (P1.1, P1.3)
- Day 3: Fix N+1 pattern (P1.2)
- Day 4: Add basic tests (T1.1 partial)
- Day 5: Security fixes (S1.2, S1.3, S1.4)

Week 2:
- Day 1-2: Incremental processing (P1.4)
- Day 3-4: Comprehensive test suite (T1.1 complete)
- Day 5: Performance testing and optimization
After implementing fixes, verify:
- Metrics endpoint requires authentication
- Database queries complete in < 100ms
- Memory usage stays under 100MB
- Test coverage > 80%
- No sensitive data in metrics
- Graceful handling of all error cases
- Performance with 1M+ records acceptable
- Security scan passes
- Load testing successful
- Documentation updated
If issues occur after deployment:
- Immediate: Disable prometheus_exporter service
- Short-term: Revert to previous monitoring solution
- Recovery:
- Restore from database backup if corrupted
- Clear prometheus metrics cache
- Restart with reduced collection frequency
Post-fix deployment should achieve:
- Collection time < 100ms for 100k records
- Memory usage < 100MB constant
- Zero unauthorized access attempts
- 95%+ test coverage on critical paths
- < 5ms p99 metric query response time

Next Steps:
- Prioritize which fixes to implement first based on environment (dev/staging/prod)
- Assign tasks to team members based on expertise
- Schedule implementation sprint
- Review this plan with security and performance teams
- Execute fixes in priority order
- Validate each fix with tests before moving to next

Notes:
- All code examples are simplified for clarity
- Actual implementation may require additional error handling
- Consider feature flags for gradual rollout
- Monitor metrics after each fix deployment
- Keep this document updated as fixes are completed