test: add continuous data validation framework

rophy · rophy · commit eabd7c6623ab · 2026-03-25T12:42:17.000+08:00
- Dockerfile.swingbench: containerized Swingbench load generator
- validator.py: tails LogMiner + OLR JSONL files, matches events by
  content in real-time, stops swingbench on mismatch via Docker socket
- docker-compose.yaml: full stack with receiver, dbz-logminer, dbz-olr,
  swingbench, validator, prometheus
- VALIDATION-PLAN.md: architecture and design decisions

Designed for long-running (hours/days) continuous validation. On
mismatch, swingbench is stopped immediately to preserve redo logs
and event history for offline replay.
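
The content-based matching described above can be sketched as follows. This is a minimal illustration, not the code in validator.py: the `WindowedMatcher` class and the channel names "lm" and "olr" are hypothetical, and the sketch assumes each channel's pending events form a multiset so that repeated identical events are matched one-for-one.

```python
import time
from collections import defaultdict, deque


class WindowedMatcher:
    """Pairs events from two channels by key (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        # channel -> key -> deque of arrival timestamps (one multiset per channel)
        self.pending = {"lm": defaultdict(deque), "olr": defaultdict(deque)}
        self.matched = 0

    def add(self, channel, key, now=None):
        """Record one event; cancel it against the other channel if possible."""
        now = time.time() if now is None else now
        other = "olr" if channel == "lm" else "lm"
        waiting = self.pending[other].get(key)
        if waiting:
            waiting.popleft()  # cancel the oldest waiting event with this key
            if not waiting:
                del self.pending[other][key]
            self.matched += 1
        else:
            self.pending[channel][key].append(now)

    def expired(self, now=None):
        """Return (channel, key) for every event older than the match window."""
        now = time.time() if now is None else now
        return [(channel, key)
                for channel, by_key in self.pending.items()
                for key, stamps in by_key.items()
                for ts in stamps
                if now - ts > self.window]
```

Any non-empty result from `expired()` corresponds to the mismatch condition that stops swingbench.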
diff --git a/tests/sql/environments/rac/debezium/perf/Dockerfile.swingbench b/tests/sql/environments/rac/debezium/perf/Dockerfile.swingbench
@@ -0,0 +1,16 @@
+FROM eclipse-temurin:21-jre
+
+ARG SWINGBENCH_URL=https://github.com/domgiles/swingbench-public/releases/download/production/swingbenchlatest.zip
+
+RUN apt-get update && apt-get install -y --no-install-recommends unzip curl && \
+    curl -fsSL -o /tmp/swingbench.zip "$SWINGBENCH_URL" && \
+    unzip -qo /tmp/swingbench.zip -d /opt && \
+    rm /tmp/swingbench.zip && \
+    chmod +x /opt/swingbench/bin/* && \
+    apt-get remove -y unzip curl && apt-get autoremove -y && rm -rf /var/lib/apt/lists/*
+
+ENV PATH="/opt/swingbench/bin:${PATH}"
+
+WORKDIR /opt/swingbench
+
+ENTRYPOINT ["charbench"]
diff --git a/tests/sql/environments/rac/debezium/perf/VALIDATION-PLAN.md b/tests/sql/environments/rac/debezium/perf/VALIDATION-PLAN.md
@@ -0,0 +1,93 @@
+# Continuous Data Validation Framework
+
+## Goal
+
+Run OLR and LogMiner Debezium adapters simultaneously under sustained load,
+continuously validate both produce identical events, and stop immediately on
+any mismatch — preserving redo logs and event history for replay.
+
+## Architecture
+
+```
+Oracle RAC (VM)
+  └── OLR container (reads redo → TCP:5000)
+
+Host (docker-compose):
+  swingbench       → continuous OLTP load via CMAN (port 1521)
+  dbz-logminer     → LogMiner adapter → POST /logminer → receiver
+  dbz-olr          → OLR adapter      → POST /olr      → receiver
+  receiver         → writes logminer.jsonl + olr.jsonl
+  validator        → tails both files, matches events, stops on mismatch
+```
+
+## Components
+
+### receiver (existing, no changes needed)
+- Writes events to `logminer.jsonl` and `olr.jsonl`
+- Provides `/metrics` for throughput/latency monitoring
+
+### swingbench (new container)
+- `Dockerfile.swingbench`: eclipse-temurin:21-jre + Swingbench
+- Connects to Oracle via CMAN (`VM_IP:1521`)
+- Configurable users and runtime via env vars / command args
+- Stopped by validator on mismatch
+
+### validator (new)
+- Python script that tails both JSONL files
+- Extracts match key: `(table, op, sorted(after_columns))`
+- Maintains two multisets (one per adapter)
+- Match window: events from one adapter are held for N seconds waiting
+  for the matching event from the other adapter
+- On timeout (event in one adapter but not the other): MISMATCH → stop
+- On content diff (same key but different values): MISMATCH → stop
+- On match: remove from both sets, increment match counter
+- Logs progress every 10s: matched count, pending LM, pending OLR
+
+### On mismatch:
+1. Validator stops the `perf-swingbench` container via the Docker socket (DML stops)
+2. Logs the mismatched events with full detail
+3. Redo logs on VM are preserved (no log switch)
+4. JSONL files preserved for offline replay
+5. Exit with non-zero code
+
+## Docker Compose
+
+```yaml
+services:
+  receiver:     # existing
+  dbz-logminer: # existing
+  dbz-olr:      # existing
+  swingbench:
+    image: swingbench:latest
+    network_mode: host
+    command: ["-cs", "//VM_IP:1521/ORCLPDB", "-u", "soe", "-p", "soe",
+              "-c", "/opt/swingbench/configs/SOE_Server_Side_V2.xml",
+              "-uc", "4", "-rt", "99:00.00", "-nc", "-nr", "-s"]
+  validator:
+    image: python:3.12-slim
+    network_mode: host
+    volumes:
+      - ./output:/app/output:ro
+      - /var/run/docker.sock:/var/run/docker.sock
+    command: ["python3", "/app/validator.py",
+              "--logminer", "/app/output/logminer.jsonl",
+              "--olr", "/app/output/olr.jsonl",
+              "--match-window", "60"]
+```
+
+## Match Key Design
+
+- INSERT: `(table, "c", hash(sorted(after_columns)))`
+- UPDATE: `(table, "u", hash(sorted(before_columns)), hash(sorted(after_columns)))`
+- DELETE: `(table, "d", hash(sorted(before_columns)))`
+
+Using a hash of the column values (not the full content) keeps per-event memory
+small and fixed-size for long-running tests. Store full content only for recent
+unmatched events (within match window) for mismatch reporting.
+
+## Open Questions
+
+- Match window duration: 60s? 120s? Depends on max lag between adapters.
+- Should validator also check event count periodically?
+- Memory management for very long runs (hours/days)?
+- Should we also validate ordering within the same table/key?
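
The per-operation key from the Match Key Design section of the plan could be sketched as follows. This is an illustration under assumptions, not the committed implementation: `hashed_key` and `_digest` are hypothetical names, and SHA-256 over canonical JSON is an assumed choice (the plan only says "hash"). Values are stringified the same way `normalize_value` does in validator.py.

```python
import hashlib
import json


def _digest(columns):
    """Stable SHA-256 digest of a column dict, independent of key order."""
    canonical = json.dumps(
        {k: None if v is None else str(v) for k, v in (columns or {}).items()},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hashed_key(event):
    """Build the per-operation match key; returns None for other ops."""
    table = (event.get("source") or {}).get("table", "")
    op = event.get("op")
    if op == "c":
        return (table, op, _digest(event.get("after")))
    if op == "u":
        return (table, op, _digest(event.get("before")), _digest(event.get("after")))
    if op == "d":
        return (table, op, _digest(event.get("before")))
    return None
```

A cryptographic digest is used here instead of Python's built-in `hash()` because string hashes are salted per process by default, so only a digest stays comparable across validator runs and offline replays.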
diff --git a/tests/sql/environments/rac/debezium/perf/docker-compose.yaml b/tests/sql/environments/rac/debezium/perf/docker-compose.yaml
@@ -36,6 +36,50 @@ services:
       - ../../../../../debezium/lib/debezium-connector-oracle-3.5.0.Beta1.jar:/debezium/lib/debezium-connector-oracle-3.5.0.Beta1.jar:ro
       - dbz-olr-data:/debezium/data
 
+  swingbench:
+    image: swingbench:latest
+    container_name: perf-swingbench
+    network_mode: host
+    # Default: 4 users, run for 99 hours (effectively forever until stopped)
+    # Override with: SWINGBENCH_USERS=8 SWINGBENCH_RUNTIME=01:00.00 docker compose up swingbench
+    command:
+      - "-cs"
+      - "//192.168.122.130:1521/ORCLPDB"
+      - "-u"
+      - "soe"
+      - "-p"
+      - "soe"
+      - "-c"
+      - "/opt/swingbench/configs/SOE_Server_Side_V2.xml"
+      - "-uc"
+      - "${SWINGBENCH_USERS:-4}"
+      - "-rt"
+      - "${SWINGBENCH_RUNTIME:-99:00.00}"
+      - "-nc"
+      - "-nr"
+      - "-s"
+
+  validator:
+    image: python:3.12-slim
+    container_name: perf-validator
+    network_mode: host
+    depends_on:
+      receiver:
+        condition: service_started
+    volumes:
+      - ./validator.py:/app/validator.py:ro
+      - ./output:/app/output:ro
+      - /var/run/docker.sock:/var/run/docker.sock
+    command:
+      - "python3"
+      - "/app/validator.py"
+      - "--logminer"
+      - "/app/output/logminer.jsonl"
+      - "--olr"
+      - "/app/output/olr.jsonl"
+      - "--match-window"
+      - "${MATCH_WINDOW:-120}"
+
   prometheus:
     image: prom/prometheus:latest
     container_name: perf-prometheus
diff --git a/tests/sql/environments/rac/debezium/perf/validator.py b/tests/sql/environments/rac/debezium/perf/validator.py
@@ -0,0 +1,223 @@
+#!/usr/bin/env python3
+"""Real-time validator: tails LogMiner and OLR JSONL files, matches events.
+
+Stops the swingbench container on mismatch. Designed for long-running
+continuous validation of OLR vs LogMiner data correctness.
+
+Usage:
+    python3 validator.py --logminer output/logminer.jsonl --olr output/olr.jsonl
+"""
+
+import argparse
+import json
+import os
+import subprocess
+import sys
+import time
+from collections import defaultdict
+
+SENTINEL_TABLE = 'DEBEZIUM_SENTINEL'
+POLL_INTERVAL = 1.0     # seconds between file polls
+REPORT_INTERVAL = 10.0  # seconds between progress reports
+DEFAULT_MATCH_WINDOW = 120  # seconds to wait for matching event
+
+
+def normalize_value(v):
+    if v is None:
+        return None
+    return str(v)
+
+
+def event_key(event):
+    """Extract a content-based match key from a Debezium event."""
+    source = event.get('source', {})
+    table = source.get('table', '')
+    op = event.get('op', '')
+
+    if table == SENTINEL_TABLE:
+        return None  # skip sentinel
+
+    after = event.get('after') or {}
+    before = event.get('before') or {}
+
+    after_norm = tuple(sorted((k, normalize_value(v)) for k, v in after.items()))
+    before_norm = tuple(sorted((k, normalize_value(v)) for k, v in before.items()))
+
+    if op == 'c':
+        return (table, op, after_norm)
+    elif op == 'u':
+        return (table, op, before_norm, after_norm)
+    elif op == 'd':
+        return (table, op, before_norm)
+    else:
+        return None  # skip unknown ops (heartbeats, etc.)
+
+
+def tail_file(path, position):
+    """Read complete new lines from byte offset position. Returns (lines, new_position)."""
+    try:
+        size = os.path.getsize(path)
+    except OSError:
+        return [], position
+
+    if size <= position:
+        return [], position
+    lines = []
+    with open(path, 'rb') as f:  # binary mode so position is a true byte offset
+        f.seek(position)
+        for raw in f:
+            if not raw.endswith(b'\n'):
+                break  # half-written line (no '\n' yet): retry on the next poll
+            position += len(raw)
+            if raw.strip():
+                lines.append(raw.decode('utf-8').strip())
+    return lines, position
+
+
+def stop_swingbench():
+    """Stop the swingbench container via Docker socket."""
+    import socket
+    try:
+        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        sock.connect('/var/run/docker.sock')
+        request = (
+            'POST /v1.40/containers/perf-swingbench/stop HTTP/1.1\r\n'
+            'Host: localhost\r\n'
+            'Content-Length: 0\r\n'
+            '\r\n'
+        )
+        sock.sendall(request.encode())
+        response = sock.recv(4096).decode()
+        sock.close()
+        if '204' in response or '304' in response:
+            print('  Swingbench stopped', flush=True)
+        else:
+            print(f'  WARNING: Unexpected response: {response[:100]}', flush=True)
+    except Exception as e:
+        print(f'  WARNING: Failed to stop swingbench: {e}', flush=True)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Real-time OLR vs LogMiner validator')
+    parser.add_argument('--logminer', required=True, help='Path to logminer.jsonl')
+    parser.add_argument('--olr', required=True, help='Path to olr.jsonl')
+    parser.add_argument('--match-window', type=int, default=DEFAULT_MATCH_WINDOW,
+                        help=f'Seconds to wait for matching event (default: {DEFAULT_MATCH_WINDOW})')
+    parser.add_argument('--stop-on-fail', action=argparse.BooleanOptionalAction, default=True,
+                        help='Stop swingbench on mismatch (default: true; use --no-stop-on-fail to disable)')
+    args = parser.parse_args()
+
+    print('Validator starting', flush=True)
+    print(f'  LogMiner: {args.logminer}', flush=True)
+    print(f'  OLR:      {args.olr}', flush=True)
+    print(f'  Match window: {args.match_window}s', flush=True)
+    print(flush=True)
+
+    # Pending events per channel: key -> [(timestamp, raw_line), ...] (a multiset,
+    # so repeated identical events are kept). Matching keys cancel out oldest-first.
+    lm_pending = defaultdict(list)
+    olr_pending = defaultdict(list)
+
+    lm_pos = 0
+    olr_pos = 0
+    matched = 0
+    lm_total = 0
+    olr_total = 0
+    skipped = 0
+    last_report = time.time()
+
+    while True:
+        now = time.time()
+
+        # Tail both files
+        lm_lines, lm_pos = tail_file(args.logminer, lm_pos)
+        olr_lines, olr_pos = tail_file(args.olr, olr_pos)
+
+        # Process LogMiner events
+        for line in lm_lines:
+            try:
+                event = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+            key = event_key(event)
+            if key is None:
+                skipped += 1
+                continue
+            lm_total += 1
+            if key in olr_pending:
+                olr_pending[key].pop(0)  # match: cancel the oldest pending OLR event
+                if not olr_pending[key]:
+                    del olr_pending[key]
+                matched += 1
+            else:
+                lm_pending[key].append((now, line))
+
+        # Process OLR events
+        for line in olr_lines:
+            try:
+                event = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+            key = event_key(event)
+            if key is None:
+                skipped += 1
+                continue
+            olr_total += 1
+            if key in lm_pending:
+                lm_pending[key].pop(0)  # match: cancel the oldest pending LogMiner event
+                if not lm_pending[key]:
+                    del lm_pending[key]
+                matched += 1
+            else:
+                olr_pending[key].append((now, line))
+
+        # Check for expired events (exceeded match window)
+        expired_lm = [(k, ts, line) for k, entries in lm_pending.items()
+                       for ts, line in entries if now - ts > args.match_window]
+        expired_olr = [(k, ts, line) for k, entries in olr_pending.items()
+                        for ts, line in entries if now - ts > args.match_window]
+
+        if expired_lm or expired_olr:
+            print(flush=True)
+            print('!!! MISMATCH DETECTED !!!', flush=True)
+            print(f'  Matched so far: {matched}', flush=True)
+            print(f'  LogMiner total: {lm_total}, OLR total: {olr_total}', flush=True)
+            print(f'  LogMiner pending: {len(lm_pending)}, OLR pending: {len(olr_pending)}', flush=True)
+            print(flush=True)
+
+            if expired_lm:
+                print(f'  Events in LogMiner but NOT in OLR ({len(expired_lm)} expired):', flush=True)
+                for key, ts, line in expired_lm[:5]:
+                    age = now - ts
+                    print(f'    [{age:.0f}s old] table={key[0]} op={key[1]}', flush=True)
+                    print(f'      {line[:200]}', flush=True)
+
+            if expired_olr:
+                print(f'  Events in OLR but NOT in LogMiner ({len(expired_olr)} expired):', flush=True)
+                for key, ts, line in expired_olr[:5]:
+                    age = now - ts
+                    print(f'    [{age:.0f}s old] table={key[0]} op={key[1]}', flush=True)
+                    print(f'      {line[:200]}', flush=True)
+
+            if args.stop_on_fail:
+                stop_swingbench()
+
+            print(flush=True)
+            print('VALIDATION FAILED', flush=True)
+            sys.exit(1)
+
+        # Progress report
+        if now - last_report >= REPORT_INTERVAL:
+            print(f'[{time.strftime("%H:%M:%S")}] '
+                  f'matched={matched:,} '
+                  f'lm={lm_total:,} olr={olr_total:,} '
+                  f'pending: lm={len(lm_pending):,} olr={len(olr_pending):,} '
+                  f'skipped={skipped:,}',
+                  flush=True)
+            last_report = now
+
+        time.sleep(POLL_INTERVAL)
+
+
+if __name__ == '__main__':
+    main()
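
stop_swingbench() treats any response containing "204" or "304" as success, which can also match a header value or body text. A stricter check would look only at the HTTP status line. The sketch below is a hypothetical helper (`classify_stop_response` is not part of this commit); the status codes it names are the ones the Docker Engine API documents for the container stop endpoint.

```python
def classify_stop_response(raw):
    """Map a raw HTTP response from POST /containers/{id}/stop to an outcome."""
    # Only the first line ("HTTP/1.1 204 No Content") carries the status code.
    status_line = raw.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[1].isdigit():
        return "unparseable"
    code = int(parts[1])
    outcomes = {
        204: "stopped",
        304: "already stopped",
        404: "no such container",
    }
    return outcomes.get(code, f"error {code}")
```

With this helper, a 404 response whose headers happen to contain "204" is reported as a missing container rather than as a successful stop.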