
feat: add SQLite persistent storage for benchmark results (#26) #34

Merged
Neal006 merged 3 commits into Neal006:main from Sugaria0427:feat/sqlite-persistent-storage
Jun 8, 2026

Conversation

@Sugaria0427

Contributor

Summary

Replace flat JSON/CSV log storage with a queryable SQLite database for benchmark results, while maintaining full backward compatibility.

Related issue

Closes #26

Type of change

  • API or infrastructure

How was this tested?

python tests/test_pipeline.py  (29/29 pass)
python main.py --turns 10 --checkpoints 5 10 --backends naive --log
python utils/migrate_legacy_logs.py

Checklist

  • All existing tests pass
  • New tests added (6 new tests)
  • Docstrings added on all new public classes and functions
  • Type hints used on all new function signatures
  • No API key required to run any new tests
  • Backward compatible — JSON/CSV output unchanged

@Neal006

Neal006 commented Jun 3, 2026

Owner

Code Review — PR #34 (SQLite persistent storage)

Thanks for the contribution! The overall design is clean and the backward-compatibility approach (JSON/CSV unchanged, SQLite as opt-in) is exactly right. I found 3 bugs that will cause crashes in production and 2 lower-priority issues that need addressing before merge.


🔴 Bug 1 — Duplicate results rows corrupt get_run() on run-id collision

File: utils/storage.py, save_run() (~line 1508)

self.conn.execute(
    "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) VALUES (?, ?, ?)",
    ...
)
# then: self.conn.executemany("INSERT INTO results ...", rows)

INSERT OR REPLACE replaces the row in the runs table but does not cascade-delete the existing rows in results (there is no ON DELETE CASCADE on the FK). On a second call with the same run_id (e.g., a re-run, a test cleanup failure, or the migration script running twice), results accumulates duplicate metric rows. get_run() then sees 2× the data per checkpoint and reconstructs wrong metric arrays.

Fix: delete old results before inserting new ones:

# inside `with self.conn:`, before executemany
self.conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))

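Or add ON DELETE CASCADE to the FK definition and keep INSERT OR REPLACE. Note that SQLite does not enforce foreign keys (cascades included) by default, so this route also needs PRAGMA foreign_keys = ON on every connection.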
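
The duplication and the DELETE-first fix can be reproduced against an in-memory database. This is a minimal sketch, not the PR's code: the schema is modelled on the PR description, and the names `save_run`, `count_results`, `demo`, the `delete_first` flag, and the placeholder timestamp are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema modelled on the PR description; not copied from utils/storage.py.
SCHEMA = """
CREATE TABLE runs (run_id TEXT PRIMARY KEY, timestamp TEXT, config_json TEXT);
CREATE TABLE results (
    run_id TEXT REFERENCES runs(run_id),
    backend TEXT, turn INTEGER, metric TEXT, value REAL
);
"""


def save_run(conn, run_id, rows, delete_first):
    """Insert one run; optionally clear stale result rows first."""
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) VALUES (?, ?, ?)",
            (run_id, "2026-06-03T00:00:00", "{}"),  # placeholder values
        )
        if delete_first:
            conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO results (run_id, backend, turn, metric, value) VALUES (?, ?, ?, ?, ?)",
            [(run_id, *row) for row in rows],
        )


def count_results(conn, run_id):
    """Number of result rows stored for one run."""
    cur = conn.execute("SELECT COUNT(*) FROM results WHERE run_id = ?", (run_id,))
    return cur.fetchone()[0]


def demo(delete_first):
    """Save the same run twice and report how many result rows remain."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        rows = [("naive", 5, "recall", 0.8), ("naive", 10, "recall", 0.7)]
        save_run(conn, "run-1", rows, delete_first)
        save_run(conn, "run-1", rows, delete_first)
        return count_results(conn, "run-1")
    finally:
        conn.close()
```

Under these assumptions, `demo(False)` leaves four result rows for two logical metrics, while `demo(True)` leaves two. Doing the DELETE inside the same transaction as the inserts means a failed save cannot leave the run with no results at all.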


🔴 Bug 2 — IndexError crash in _append_csv_summary when rows is empty

File: evaluation/logger.py, _append_csv_summary() (~line 442)

with open(csv_path, "a", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())  # ← crashes if rows == []

If checkpoints is [] or every key in display_data is filtered out ("checkpoints" / "has_llm_eval"), rows is empty and rows[0] raises IndexError. The PR adds the has_llm_eval filter (good fix for the pre-existing TypeError) but still doesn't guard against empty rows.

Fix:

if not rows:
    return

Add this guard immediately before the with open(...) block.
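
A small stand-alone version of the guard, as a sketch only. The helper name `append_rows` and its file-like `fh` parameter are hypothetical; the real `_append_csv_summary()` opens `csv_path` itself and may differ in other ways.

```python
import csv
import io


def append_rows(fh, rows):
    """Append dict rows as CSV lines; write nothing when rows is empty."""
    if not rows:
        return 0  # guard: rows[0] below would raise IndexError on an empty list
    writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
    writer.writerows(rows)
    return len(rows)


# Usage with an in-memory buffer standing in for the opened CSV file.
buf = io.StringIO()
append_rows(buf, [])                                    # no-op, no exception
append_rows(buf, [{"backend": "naive", "recall": 0.8}])  # writes one line
```

Returning early before the file is opened also avoids creating an empty CSV as a side effect, which is why the review places the guard ahead of the `with open(...)` block.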


🔴 Bug 3 — Storage() connections are never closed → file handle leak

File: evaluation/logger.py, log_run() and list_runs()

# log_run:
Storage().save_run(run_id, config, display_data)   # connection never closed

# list_runs:
store = Storage()
runs = store.list_runs(limit=50)
if runs:
    return runs   # returns without calling store.close()

Every call to log_run() (and every SQLite-path call to list_runs()) leaks a sqlite3.Connection. In the Streamlit dashboard, each benchmark run fires log_run and a re-render can trigger list_runs multiple times. On Windows, open SQLite file handles also block file deletion/migration.

Fix — option A (minimal): add try/finally:

# log_run:
store = Storage()   # construct outside try so `store` is always bound in finally
try:
    store.save_run(run_id, config, display_data)
finally:
    store.close()

# list_runs:
store = Storage()
try:
    runs = store.list_runs(limit=50)
    if runs:
        return runs
finally:
    store.close()

Fix — option B (cleaner): add __enter__/__exit__ to Storage so callers can use with Storage() as store:.
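
Option B could look roughly like the following. This is a sketch under stated assumptions: the class below is a reduced stand-in for the PR's `Storage`, and its `db_path` parameter and `":memory:"` default are hypothetical rather than taken from `utils/storage.py`.

```python
import sqlite3


class Storage:
    """Hypothetical stand-in for the PR's Storage, reduced to connection handling."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)

    def close(self):
        self.conn.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the with-body


# Usage: the connection is closed on normal exit and on error alike.
with Storage() as store:
    store.conn.execute("SELECT 1")
```

With this in place, an early `return runs` inside the `with` block still closes the connection, so callers do not need their own try/finally.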


🟡 Issue 4 — get_run() returns None in metric arrays; downstream callers may crash

File: utils/storage.py, get_run() (~line 1574)

display[backend][metric] = [
    turn_map[t].get(metric) if metric in turn_map.get(t, {}) else None
    for t in cps
]

Metrics absent for a checkpoint are stored as None. compare_runs() and the dashboard iterate these arrays expecting floats. For example, compare_runs does list(v["recall"]) — that works — but any caller doing arithmetic (sum(arr), Plotly chart) on the returned list will encounter TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'.

Suggested fix: filter None at the point of reconstruction, or document clearly that consumers must handle None.
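
For consumers that do arithmetic on these arrays, one way to handle the None contract is sketched below. The helper name `mean_ignoring_none` is hypothetical and is not part of the PR.

```python
def mean_ignoring_none(values):
    """Average the numeric entries of a metric array, skipping None placeholders."""
    present = [v for v in values if v is not None]
    # Return None rather than dividing by zero when no checkpoint has the metric.
    return sum(present) / len(present) if present else None


recall = [0.5, None, 1.0]            # made-up array with one missing checkpoint
average = mean_ignoring_none(recall)  # 0.75; sum(recall) would raise TypeError
```

Note that filtering at reconstruction time (the first suggested fix) shortens the array, so it no longer lines up index-for-index with the checkpoints list. Filtering at the point of use, as here, keeps that alignment intact.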


🟡 Issue 5 — Tests leak Storage connections and use raw SQL for cleanup

File: tests/test_pipeline.py, test_logger_writes_sqlite and test_list_runs_returns_sqlite_runs

Both tests do cleanup via store.conn.execute("DELETE FROM ...") but never call store.close(). The raw-SQL cleanup also bypasses the public API, making the tests fragile if the schema changes.

Suggested fix: call store.close() in each test's finally block (consistent with the other storage tests that already do this correctly).
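
One pattern that addresses both points is to give each test its own throwaway database and close the connection in a finally block. The sketch below uses a plain `sqlite3` connection as a stand-in, because whether `Storage` accepts a custom path is an assumption; the function name `run_isolated_check` is hypothetical.

```python
import os
import sqlite3
import tempfile


def run_isolated_check():
    """Exercise a throwaway database, then release it without raw-SQL cleanup."""
    with tempfile.TemporaryDirectory() as tmpdir:
        db_path = os.path.join(tmpdir, "test.db")
        conn = sqlite3.connect(db_path)  # stand-in for a Storage pointed at db_path
        try:
            conn.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY)")
            conn.execute("INSERT INTO runs VALUES ('run-1')")
            conn.commit()
            count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
        finally:
            conn.close()  # close before the directory is removed (matters on Windows)
    # The temporary directory, and the database file in it, are gone at this point.
    return count, os.path.exists(db_path)
```

Because the whole file is discarded, no DELETE statements are needed and the test does not depend on table names.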


Summary table

| # | Severity | File | Issue |
|---|----------|------|-------|
| 1 | 🔴 High | utils/storage.py | Duplicate results rows corrupt get_run() on repeated run_id |
| 2 | 🔴 High | evaluation/logger.py | IndexError when rows is empty in _append_csv_summary |
| 3 | 🔴 High | evaluation/logger.py | Storage() connections never closed → file handle leak |
| 4 | 🟡 Medium | utils/storage.py | None values in metric arrays break arithmetic downstream |
| 5 | 🟡 Low | tests/test_pipeline.py | Integration tests leak connections and use raw SQL for cleanup |

Once the three 🔴 issues are fixed I'm happy to approve. The design and test coverage are solid — these are all mechanical fixes.

Sugaria0427 added a commit to Sugaria0427/memorylens that referenced this pull request Jun 3, 2026
- Bug 1: DELETE stale results before INSERT in save_run() to prevent
  duplicate rows on repeated run_id (fixes get_run() corruption)
- Bug 2: Add empty-rows guard in _append_csv_summary() to prevent IndexError
- Bug 3: Close Storage() connections in log_run() and list_runs() via
  try/finally to prevent file handle leaks on Windows
- Issue 4: Document None-handling contract in get_run() docstring;
  filter None from compare_runs() recall arrays
- Issue 5: Close Storage() in test cleanup to match the Storage unit tests

Adds test_storage_save_run_idempotent to verify fix for Bug 1.
@Sugaria0427 Sugaria0427 force-pushed the feat/sqlite-persistent-storage branch from 022423e to 6fcc8db Compare June 3, 2026 18:13
@Sugaria0427

Sugaria0427 commented Jun 3, 2026

Contributor Author

[@Neal006] Thanks for the thorough review! All 5 issues have been addressed:

🔴 Bug 1 — Added DELETE FROM results WHERE run_id = ? before INSERT in save_run() to prevent duplicate rows (with test test_storage_save_run_idempotent)
🔴 Bug 2 — Added if not rows: return guard in _append_csv_summary() to prevent IndexError on empty checkpoints
🔴 Bug 3 — Wrapped Storage() usage in log_run() and list_runs() with try/finally + close() to prevent connection leaks

🟡 Issue 4 — Added docstring note in get_run() about None handling; compare_runs() now filters None from recall arrays
🟡 Issue 5 — Added store.close() in integration test cleanup blocks

All 30 tests pass (23 existing + 7 new).
I’ve pushed the changes; if you have a moment, a second review would be much appreciated. Thanks!

@Neal006

Neal006 commented Jun 8, 2026

Owner

Code Review — Approved with Minor Notes

Hey @Sugaria0427, great work on this PR! The implementation is solid and correct — the schema matches the spec in issue #26 exactly, it's backward compatible, the idempotency handling is done right, and the pre-existing has_llm_eval bug fix is a nice bonus. I've reviewed it carefully and it's clear you read the codebase thoroughly before writing a single line.

I verified that nothing in the codebase reads the path key from list_runs() (not dashboard.py, not main.py, not anywhere), so that API change is safe.


Minor issues to fix before this can land

  1. Merge conflict — the branch is currently conflicting with main. Please rebase on the latest main and resolve the conflicts. This is blocking the merge.

  2. sys.exit dead branch in migrate_legacy_logs.py: total is always >= 0, so the exit-1 branch is unreachable:

    # current (always exits 0)
    sys.exit(0 if total >= 0 else 1)
    
    # fix: just do this
    sys.exit(0)

    Or if you want to exit 1 on any uncaught error, remove the try/except inside migrate() and let exceptions propagate.

  3. import warnings inside except block in evaluation/logger.py — move it to the top of the file with the other imports. It works where it is, but it's inconsistent with the rest of the module.

  4. list_runs() fallback silently hides filesystem runs once any SQLite run exists — consider adding a one-line comment above the if runs: return runs line pointing users to run the migration script:

    # Once any SQLite run exists, filesystem is bypassed — run migrate_legacy_logs.py first
    if runs:
        return runs
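
The effect described in item 4 can be seen in a reduced form of the lookup order. This is a sketch only: the two-argument `list_runs` below is hypothetical (the real function queries SQLite and scans the filesystem itself), and the run names are made up.

```python
def list_runs(sqlite_runs, filesystem_runs):
    """Hypothetical reduction of the lookup order: SQLite first, filesystem as fallback."""
    if sqlite_runs:
        return sqlite_runs  # legacy filesystem runs are not merged in
    return filesystem_runs


legacy = ["run-a", "run-b", "run-c"]           # made-up legacy JSON runs
before_first_save = list_runs([], legacy)       # all three legacy runs
after_first_save = list_runs(["run-d"], legacy)  # only the SQLite run
```

As soon as one run is saved to SQLite, the three legacy runs drop out of the listing until they are migrated, which is the behaviour the suggested comment warns about.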

None of these are blocking except the merge conflict. Once you resolve that and push, this is ready to go.

Thanks again for the contribution — this is exactly what the project needed!

Replaces flat JSON/CSV log storage with a queryable SQLite database
while maintaining full backward compatibility.

New files:
- utils/storage.py — Storage class wrapping Python stdlib sqlite3.
  Schema: runs (run_id, timestamp, config_json) + results (run_id FK,
  backend, turn, metric, value). API: save_run, get_run, list_runs,
  compare_runs.
- utils/migrate_legacy_logs.py — one-shot idempotent migration script
  that imports existing experiment_logs/*.json into SQLite.

Modified files:
- evaluation/logger.py — log_run() now calls Storage().save_run()
  alongside existing JSON+CSV writes. list_runs() queries SQLite
  first, falls back to filesystem scan. Fixes pre-existing
  has_llm_eval bug in _append_csv_summary.
- tests/test_pipeline.py — 6 new tests (4 Storage CRUD + 2 logger
  integration).
- CHANGELOG.md — documented the new feature.
- .gitignore — added experiment_logs/memorylens.db.

Closes Neal006#26
…, comment

- Resolve merge conflict in CHANGELOG.md (rebase on upstream/main)
- Remove dead sys.exit branch in migrate_legacy_logs.py (unreachable)
- Move import warnings to top of evaluation/logger.py (consistent style)
- Add clarifying comment in list_runs() about migration bypass
@Sugaria0427 Sugaria0427 force-pushed the feat/sqlite-persistent-storage branch from 6fcc8db to 61f98d7 Compare June 8, 2026 06:30
@Sugaria0427

Contributor Author

All 4 minor issues are now fixed and the rebase is complete:

  1. Merge conflict resolved — rebased on latest main, CHANGELOG.md conflict resolved (both our SQLite entries and the docs clarifications are preserved)
  2. sys.exit dead branch — simplified to sys.exit(0) in migrate_legacy_logs.py
  3. import warnings — moved to the top of evaluation/logger.py with other imports
  4. Comment added — above if runs: return runs pointing users to migrate_legacy_logs.py

All 30 tests still pass.

@Neal006 Neal006 merged commit 07b3a27 into Neal006:main Jun 8, 2026
@Neal006

Neal006 commented Jun 8, 2026

Owner

Merged! Great work @Sugaria0427

You addressed every issue from the review — import warnings moved to the top, the dead sys.exit branch cleaned up, and the fallback comment added. The rebase was clean too.

This is a genuinely well-crafted contribution.

Really appreciated the care you put into this. Welcome to the contributor list!

@Sugaria0427

Contributor Author

Thanks @Neal006! It was a pleasure working on this. The codebase is well-structured and your review feedback was very thorough.
Happy to contribute more in the future!



Development

Successfully merging this pull request may close these issues.

feat: SQLite persistent storage — replace flat JSON logs with a queryable database

2 participants