feat: add SQLite persistent storage for benchmark results (#26) by Sugaria0427 · Pull Request #34 · Neal006/memorylens

Sugaria0427 · 2026-05-24T12:49:43Z

Summary

Replace flat JSON/CSV log storage with a queryable SQLite database for benchmark results, while maintaining full backward compatibility.

Related issue

Closes #26

Type of change

API or infrastructure

How was this tested?

python tests/test_pipeline.py  (29/29 pass)
python main.py --turns 10 --checkpoints 5 10 --backends naive --log
python utils/migrate_legacy_logs.py

Checklist

All existing tests pass
New tests added (6 new tests)
Docstrings added on all new public classes and functions
Type hints used on all new function signatures
No API key required to run any new tests
Backward compatible — JSON/CSV output unchanged

Neal006 · 2026-06-03T04:13:35Z

Code Review — PR #34 (SQLite persistent storage)

Thanks for the contribution! The overall design is clean and the backward-compatibility approach (JSON/CSV unchanged, SQLite as opt-in) is exactly right. I found 3 bugs that can cause crashes, corrupted results, or leaked file handles in production and 2 lower-priority issues that need addressing before merge.

🔴 Bug 1 — Duplicate `results` rows corrupt `get_run()` on run-id collision

File: utils/storage.py, save_run() (~line 1508)

self.conn.execute(
    "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) VALUES (?, ?, ?)",
    ...
)
# then: self.conn.executemany("INSERT INTO results ...", rows)

INSERT OR REPLACE replaces the row in the runs table but does not cascade-delete the existing rows in results (there is no ON DELETE CASCADE on the FK). On a second call with the same run_id (e.g., a re-run, a test cleanup failure, or the migration script running twice), results accumulates duplicate metric rows. get_run() then sees 2× the data per checkpoint and reconstructs wrong metric arrays.

Fix: delete old results before inserting new ones:

# inside `with self.conn:`, before executemany
self.conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))

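Or add ON DELETE CASCADE to the FK definition and keep INSERT OR REPLACE. Note that SQLite only enforces foreign keys (cascades included) when PRAGMA foreign_keys = ON is set on the connection; enforcement is off by default.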
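A minimal sketch of the delete-then-insert fix, run against an in-memory database. The schema below is a simplified guess based on the column names mentioned in this PR, and the free function `save_run` with its `(backend, turn, metric, value)` row tuples is a hypothetical stand-in for the real `Storage.save_run()`, not the project's actual code.

```python
import json
import sqlite3

# Hypothetical, simplified schema; column types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id TEXT PRIMARY KEY,
    timestamp TEXT,
    config_json TEXT
);
CREATE TABLE IF NOT EXISTS results (
    run_id TEXT REFERENCES runs(run_id),
    backend TEXT,
    turn INTEGER,
    metric TEXT,
    value REAL
);
"""


def save_run(conn, run_id, timestamp, config, rows):
    """Insert or overwrite a run; rows are (backend, turn, metric, value) tuples."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) VALUES (?, ?, ?)",
            (run_id, timestamp, json.dumps(config)),
        )
        # Drop stale metric rows so a repeated run_id does not accumulate duplicates.
        conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO results (run_id, backend, turn, metric, value) VALUES (?, ?, ?, ?, ?)",
            [(run_id, *row) for row in rows],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
rows = [("naive", 5, "recall", 0.8), ("naive", 10, "recall", 0.6)]
save_run(conn, "run-1", "2026-06-03T00:00:00Z", {"turns": 10}, rows)
save_run(conn, "run-1", "2026-06-03T00:00:00Z", {"turns": 10}, rows)  # same run_id again
count = conn.execute(
    "SELECT COUNT(*) FROM results WHERE run_id = ?", ("run-1",)
).fetchone()[0]
print(count)  # number of metric rows stored for run-1 after two saves
```

Because the DELETE and the inserts share one transaction, a failure partway through leaves the previous rows for that run_id intact rather than half-replaced.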

🔴 Bug 2 — `IndexError` crash in `_append_csv_summary` when `rows` is empty

File: evaluation/logger.py, _append_csv_summary() (~line 442)

with open(csv_path, "a", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())  # ← crashes if rows == []

If checkpoints is [] or every key in display_data is filtered out ("checkpoints" / "has_llm_eval"), rows is empty and rows[0] raises IndexError. The PR adds the has_llm_eval filter (good fix for the pre-existing TypeError) but still doesn't guard against empty rows.

Fix:

if not rows:
    return

Add this guard immediately before the with open(...) block.
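To illustrate where the guard sits, here is a small self-contained sketch. The function `append_csv_summary`, its return value, and the header handling are illustrative assumptions; the real `_append_csv_summary()` builds `rows` from `display_data` and may differ.

```python
import csv
import os
import tempfile


def append_csv_summary(csv_path, rows):
    """Append dict rows to csv_path; return how many rows were written."""
    if not rows:
        return 0  # nothing to write, and rows[0] below would raise IndexError
    # Write a header only when the file is new or empty.
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "summary.csv")
written_empty = append_csv_summary(path, [])  # returns early, no file created
written_one = append_csv_summary(path, [{"backend": "naive", "turn": 5, "recall": 0.8}])
```

Returning before `open(...)` also avoids creating an empty CSV file when there is nothing to log.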

🔴 Bug 3 — `Storage()` connections are never closed → file handle leak

File: evaluation/logger.py, log_run() and list_runs()

# log_run:
Storage().save_run(run_id, config, display_data)   # connection never closed

# list_runs:
store = Storage()
runs = store.list_runs(limit=50)
if runs:
    return runs   # returns without calling store.close()

Every call to log_run() (and every SQLite-path call to list_runs()) leaks a sqlite3.Connection. In the Streamlit dashboard, each benchmark run fires log_run and a re-render can trigger list_runs multiple times. On Windows, open SQLite file handles also block file deletion/migration.

Fix — option A (minimal): add try/finally:

# log_run:
store = Storage()   # create outside try so `store` is always bound in finally
try:
    store.save_run(run_id, config, display_data)
finally:
    store.close()

# list_runs:
store = Storage()
try:
    runs = store.list_runs(limit=50)
    if runs:
        return runs
finally:
    store.close()

Fix — option B (cleaner): add __enter__/__exit__ to Storage so callers can use with Storage() as store:.
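Option B could look roughly like the sketch below. This `Storage` is a stripped-down stand-in, not the class from utils/storage.py: the `db_path` default and the `closed` flag are assumptions added for illustration.

```python
import sqlite3


class Storage:
    """Minimal stand-in showing context-manager support around a connection."""

    def __init__(self, db_path=":memory:"):  # default path is an assumption
        self.conn = sqlite3.connect(db_path)
        self.closed = False

    def close(self):
        # Safe to call more than once.
        if not self.closed:
            self.conn.close()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the with block


with Storage() as store:
    value = store.conn.execute("SELECT 1").fetchone()[0]
# The connection is closed here, even if the block had raised.
```

With this in place, `log_run()` and `list_runs()` can use `with Storage() as store:` and an early `return` inside the block still closes the connection.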

🟡 Issue 4 — `get_run()` returns `None` in metric arrays; downstream callers may crash

File: utils/storage.py, get_run() (~line 1574)

display[backend][metric] = [
    turn_map[t].get(metric) if metric in turn_map.get(t, {}) else None
    for t in cps
]

Metrics absent for a checkpoint are stored as None. compare_runs() and the dashboard iterate these arrays expecting floats. For example, compare_runs does list(v["recall"]) — that works — but any caller doing arithmetic (sum(arr), Plotly chart) on the returned list will encounter TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'.

Suggested fix: filter None at the point of reconstruction, or document clearly that consumers must handle None.
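One way a consumer could handle the None entries is sketched below. The helper names `drop_missing` and `mean_or_none` are hypothetical and do not exist in the codebase.

```python
def drop_missing(values):
    """Return only the entries that hold a real value."""
    return [v for v in values if v is not None]


def mean_or_none(values):
    """Average the present values; return None when every entry is missing."""
    present = drop_missing(values)
    return sum(present) / len(present) if present else None


recall = [0.5, None, 1.0]  # e.g. the metric is absent at the second checkpoint
average = mean_or_none(recall)
all_missing = mean_or_none([None, None])
```

Filtering shortens the list, so it no longer lines up index-for-index with the checkpoints. A chart that needs that alignment is better served by keeping the None entries and handling them at plot time.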

🟡 Issue 5 — Tests leak `Storage` connections and use raw SQL for cleanup

File: tests/test_pipeline.py, test_logger_writes_sqlite and test_list_runs_returns_sqlite_runs

Both tests do cleanup via store.conn.execute("DELETE FROM ...") but never call store.close(). The raw-SQL cleanup also bypasses the public API, making the tests fragile if the schema changes.

Suggested fix: call store.close() in each test's finally block (consistent with the other storage tests that already do this correctly).

Summary table

#	Severity	File	Issue
1	🔴 High	`utils/storage.py`	Duplicate `results` rows corrupt `get_run()` on repeated `run_id`
2	🔴 High	`evaluation/logger.py`	`IndexError` when `rows` is empty in `_append_csv_summary`
3	🔴 High	`evaluation/logger.py`	`Storage()` connections never closed → file handle leak
4	🟡 Medium	`utils/storage.py`	`None` values in metric arrays break arithmetic downstream
5	🟡 Low	`tests/test_pipeline.py`	Integration tests leak connections and use raw SQL for cleanup

Once the three 🔴 issues are fixed I'm happy to approve. The design and test coverage are solid — these are all mechanical fixes.

- Bug 1: DELETE stale results before INSERT in save_run() to prevent duplicate rows on repeated run_id (fixes get_run() corruption)
- Bug 2: Add empty-rows guard in _append_csv_summary() to prevent IndexError
- Bug 3: Close Storage() connections in log_run() and list_runs() via try/finally to prevent file handle leaks on Windows
- Issue 4: Document None-handling contract in get_run() docstring; filter None from compare_runs() recall arrays
- Issue 5: Close Storage() in test cleanup to match the Storage unit tests

Adds test_storage_save_run_idempotent to verify fix for Bug 1.

Sugaria0427 · 2026-06-03T18:13:14Z

@Neal006 Thanks for the thorough review! All 5 issues have been addressed:

🔴 Bug 1 — Added DELETE FROM results WHERE run_id = ? before INSERT in save_run() to prevent duplicate rows (with test test_storage_save_run_idempotent)
🔴 Bug 2 — Added if not rows: return guard in _append_csv_summary() to prevent IndexError on empty checkpoints
🔴 Bug 3 — Wrapped Storage() usage in log_run() and list_runs() with try/finally + close() to prevent connection leaks

🟡 Issue 4 — Added docstring note in get_run() about None handling; compare_runs() now filters None from recall arrays
🟡 Issue 5 — Added store.close() in integration test cleanup blocks

All 30 tests pass (23 existing + 7 new).
I’ve pushed the changes; if you have a moment, a second review would be much appreciated. Thanks!

Neal006 · 2026-06-08T06:14:44Z

Code Review — Approved with Minor Notes

Hey @Sugaria0427, great work on this PR! The implementation is solid and correct — the schema matches the spec in issue #26 exactly, it's backward compatible, the idempotency handling is done right, and the pre-existing has_llm_eval bug fix is a nice bonus. I've reviewed it carefully and it's clear you read the codebase thoroughly before writing a single line.

I verified that nothing in the codebase reads the path key from list_runs() (not dashboard.py, not main.py, not anywhere), so that API change is safe.

Minor issues to fix before this can land

Merge conflict — the branch is currently conflicting with main. Please rebase on the latest main and resolve the conflicts. This is blocking the merge.
sys.exit dead branch in migrate_legacy_logs.py — total is always >= 0, so the exit-1 branch is unreachable:
```
# current (always exits 0)
sys.exit(0 if total >= 0 else 1)

# fix: just do this
sys.exit(0)
```
Or if you want to exit 1 on any uncaught error, remove the try/except inside migrate() and let exceptions propagate.
import warnings inside except block in evaluation/logger.py — move it to the top of the file with the other imports. It works where it is, but it's inconsistent with the rest of the module.
list_runs() fallback silently hides filesystem runs once any SQLite run exists — consider adding a one-line comment above the if runs: return runs line pointing users to run the migration script:
```
# Once any SQLite run exists, filesystem is bypassed — run migrate_legacy_logs.py first
if runs:
    return runs
```
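For the sys.exit item above, a third option sits between always exiting 0 and letting exceptions propagate: keep the per-file try/except but count failures and derive the exit code from that count. The sketch below is an illustration only; `migrate`'s signature, `exit_code`, and `fake_import` are hypothetical and are not taken from migrate_legacy_logs.py.

```python
import sys


def migrate(paths, import_one):
    """Try to import each path; return (imported, failed) counts."""
    imported, failed = 0, 0
    for path in paths:
        try:
            import_one(path)
            imported += 1
        except Exception as exc:
            # Report the problem but keep going with the remaining files.
            print(f"skipped {path}: {exc}", file=sys.stderr)
            failed += 1
    return imported, failed


def exit_code(failed):
    """Map the failure count to a process exit status."""
    return 1 if failed else 0


def fake_import(path):
    # Hypothetical stand-in for the real per-file import step.
    if path.endswith("bad.json"):
        raise ValueError("malformed log")


imported, failed = migrate(["a.json", "bad.json"], fake_import)
code = exit_code(failed)
# In a script entry point this would end with: sys.exit(code)
```

This keeps the migration going after a bad file while still giving callers a non-zero status to act on.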

None of these are blocking except the merge conflict. Once you resolve that and push, this is ready to go.

Thanks again for the contribution — this is exactly what the project needed!

Replaces flat JSON/CSV log storage with a queryable SQLite database while maintaining full backward compatibility.

New files:
- utils/storage.py: Storage class wrapping Python stdlib sqlite3. Schema: runs (run_id, timestamp, config_json) + results (run_id FK, backend, turn, metric, value). API: save_run, get_run, list_runs, compare_runs.
- utils/migrate_legacy_logs.py: one-shot idempotent migration script that imports existing experiment_logs/*.json into SQLite.

Modified files:
- evaluation/logger.py: log_run() now calls Storage().save_run() alongside existing JSON+CSV writes. list_runs() queries SQLite first, falls back to filesystem scan. Fixes pre-existing has_llm_eval bug in _append_csv_summary.
- tests/test_pipeline.py: 6 new tests (4 Storage CRUD + 2 logger integration).
- CHANGELOG.md: documented the new feature.
- .gitignore: added experiment_logs/memorylens.db.

Closes Neal006#26

…, comment

- Resolve merge conflict in CHANGELOG.md (rebase on upstream/main)
- Remove dead sys.exit branch in migrate_legacy_logs.py (unreachable)
- Move import warnings to top of evaluation/logger.py (consistent style)
- Add clarifying comment in list_runs() about migration bypass

Sugaria0427 · 2026-06-08T06:30:50Z

All 4 minor issues are now fixed and the rebase is complete:

✅ Merge conflict resolved — rebased on latest main, CHANGELOG.md conflict resolved (both our SQLite entries and the docs clarifications are preserved)
✅ sys.exit dead branch — simplified to sys.exit(0) in migrate_legacy_logs.py
✅ import warnings — moved to the top of evaluation/logger.py with other imports
✅ Comment added — above if runs: return runs pointing users to migrate_legacy_logs.py

All 30 tests still pass.

Neal006 · 2026-06-08T06:57:22Z

Merged! Great work @Sugaria0427

You addressed every issue from the review — import warnings moved to the top, the dead sys.exit branch cleaned up, and the fallback comment added. The rebase was clean too.

This is a genuinely well-crafted contribution:

Matches the spec in issue #26 ("feat: SQLite persistent storage — replace flat JSON logs with a queryable database") exactly
Zero new dependencies (stdlib sqlite3 only)
Backward compatible — existing JSON/CSV output untouched
Idempotency handled correctly
8 solid tests with proper isolation
Even fixed a pre-existing TypeError bug as a bonus

Really appreciated the care you put into this. Welcome to the contributor list!

Sugaria0427 · 2026-06-08T07:04:05Z

Thanks @Neal006 ! It was a pleasure working on this — the codebase is well-structured and your review feedback was very thorough.
Happy to contribute more in the future!

Sugaria0427 force-pushed the feat/sqlite-persistent-storage branch from 022423e to 6fcc8db Compare June 3, 2026 18:13

Sugaria0427 added 3 commits June 8, 2026 14:29

Sugaria0427 force-pushed the feat/sqlite-persistent-storage branch from 6fcc8db to 61f98d7 Compare June 8, 2026 06:30

Neal006 merged commit 07b3a27 into Neal006:main Jun 8, 2026

Neal006 merged 3 commits into Neal006:main from Sugaria0427:feat/sqlite-persistent-storage
