feat(storage): SQLAlchemy storage provider with PostgreSQL support #1161

Open

vringar wants to merge 9 commits into master from feat/sqlalchemy-postgresql-v2

Conversation

@vringar (Contributor) commented Apr 5, 2026

Summary

  • Rewrite SQLiteStorageProvider to use SQLAlchemy, enabling multi-backend support
  • Add PostgreSQL test scenario using pytest-postgresql
  • Use BigInteger for columns holding large values (visit_id, request_id, etc.), Integer for bounded values (response_status, duration)
  • Conditional imports: SQLite tests work without pytest-postgresql installed
  • PostgreSQL fixtures properly gated behind HAS_PYTEST_POSTGRESQL flag

Supersedes #1149. Incorporates fixes from 5 rounds of adversarial review (VDD methodology).
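The Integer/BigInteger split described above can be sketched in SQLAlchemy Core as follows. This is an illustration only, not the PR's actual `sqlalchemy_schema.py`; column names follow the summary, the full table definitions live in the PR.

```python
# Illustrative sketch: the Integer vs BigInteger split from the summary.
from sqlalchemy import BigInteger, Column, Integer, MetaData, Table, Text

metadata = MetaData()

http_requests = Table(
    "http_requests",
    metadata,
    # Large identifiers can exceed PostgreSQL's 32-bit INTEGER range
    # (max 2**31 - 1), so they must map to BIGINT on that backend.
    Column("visit_id", BigInteger),
    Column("request_id", BigInteger),
    Column("url", Text),
    # Bounded values (HTTP status codes, durations) fit in 32 bits.
    Column("response_status", Integer),
)
```

On SQLite both types render as INTEGER, so the distinction only matters for PostgreSQL DDL.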

VDD Review History

  • Round 1: Found Integer overflow on PostgreSQL, global scenario pollution
  • Round 2: Fixed BigInteger, isolated pg scenario — but unconditional import broke SQLite-only envs
  • Round 3: Conditional imports, conftest.py fixtures — but gating flag defined and never used
  • Round 4: Wired flag to scenarios — but kept wrong psycopg2 variant (abandoned psycopg2-binary)
  • Round 5: Correct dependency (psycopg2=2.9.11)
  • Final verdict: "...Fine."

Test plan

  • SQLite storage tests pass without PostgreSQL installed
  • PostgreSQL tests pass when pytest-postgresql is available
  • pre-commit run --all-files passes
  • No psycopg2-binary references remain

Copilot AI review requested due to automatic review settings April 5, 2026 18:51

Copilot AI left a comment

Pull request overview

Adds a new SQLAlchemy-based structured storage provider and schema definitions to enable multi-backend SQL storage (SQLite + PostgreSQL), updates test scaffolding to include SQLAlchemy/PG scenarios, and adjusts generated test values to fit PostgreSQL INTEGER bounds for certain columns.

Changes:

  • Introduces SQLAlchemyStorageProvider (SQLAlchemy Core) and a shared SQLAlchemy schema (TABLE_MAP) for OpenWPM’s structured tables.
  • Updates storage test fixtures and adds SQLAlchemy provider tests, including optional PostgreSQL coverage gated on pytest-postgresql.
  • Pins/adds PostgreSQL + psycopg2 + SQLAlchemy + pytest-postgresql dependencies; adjusts random test value ranges for duration/response_status.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| openwpm/storage/sqlalchemy_provider.py | New SQLAlchemy-backed StructuredStorageProvider implementation (incl. reflection fallback + coercions). |
| openwpm/storage/sqlalchemy_schema.py | SQLAlchemy Core table definitions for the structured schema, with BigInteger choices for PG overflow avoidance. |
| openwpm/storage/sql_provider.py | Rewrites SQLiteStorageProvider into a thin wrapper delegating to SQLAlchemyStorageProvider. |
| test/storage/test_sqlalchemy_provider.py | New tests for schema equivalence, all-tables insert smoke tests, and _coerce_record behavior; adds optional PG scenario test. |
| test/storage/fixtures.py | Adds sqlalchemy_sqlite scenario and gated postgresql scenario to structured provider fixtures. |
| test/storage/conftest.py | Adds conditional pytest-postgresql fixtures for discovery when the dependency is installed. |
| test/storage/test_values.py | Narrows random ranges for duration/response_status to 32-bit int bounds. |
| environment.yaml | Adds pinned deps: PostgreSQL, psycopg2, pytest-postgresql, SQLAlchemy. |
| scripts/environment-unpinned.yaml | Adds corresponding unpinned dependency entries. |
| .gitignore | Ignores Crosslink-managed local state files. |


Comment on lines +66 to +80
```python
    self._connection.execute(sa_table.insert(), record)
    self._sql_counter += 1
except Exception as e:
    self.logger.error(
        "Unsupported record:\n%s\n%s\ntable=%s\n%s\n"
        % (type(e), e, table, repr(record))
    )
    # On PostgreSQL, a failed statement aborts the entire transaction.
    # All subsequent statements would fail with "InFailedSqlTransaction"
    # until a ROLLBACK is issued. We must rollback here so that
    # subsequent inserts can succeed.
    try:
        self._connection.rollback()
    except Exception as rollback_err:
        self.logger.error("Rollback failed: %s", rollback_err)
```
Copilot AI Apr 5, 2026

store_record rolls back the entire transaction on any insert error. That rollback will also undo any previous successful inserts since the last commit, causing silent data loss for a whole batch/visit when a single bad record arrives. Consider isolating each insert with a SAVEPOINT / nested transaction (e.g., begin_nested()), or committing more frequently and only rolling back the failed statement, while still recovering from PostgreSQL's aborted-transaction state.

Suggested change:

```diff
-    self._connection.execute(sa_table.insert(), record)
-    self._sql_counter += 1
-except Exception as e:
-    self.logger.error(
-        "Unsupported record:\n%s\n%s\ntable=%s\n%s\n"
-        % (type(e), e, table, repr(record))
-    )
-    # On PostgreSQL, a failed statement aborts the entire transaction.
-    # All subsequent statements would fail with "InFailedSqlTransaction"
-    # until a ROLLBACK is issued. We must rollback here so that
-    # subsequent inserts can succeed.
-    try:
-        self._connection.rollback()
-    except Exception as rollback_err:
-        self.logger.error("Rollback failed: %s", rollback_err)
+    # Isolate each insert in a SAVEPOINT so a single bad record does
+    # not roll back earlier successful inserts in the outer transaction.
+    # This also recovers cleanly from PostgreSQL's aborted-transaction
+    # state by rolling back only the failed statement scope.
+    with self._connection.begin_nested():
+        self._connection.execute(sa_table.insert(), record)
+        self._sql_counter += 1
+except Exception as e:
+    self.logger.error(
+        "Unsupported record:\n%s\n%s\ntable=%s\n%s\n"
+        % (type(e), e, table, repr(record))
+    )
```

Comment on lines +159 to +163
```python
    for table_name, test_data in test_table.items():
        await structured_provider.store_record(
            TableName(table_name), test_data["visit_id"], test_data
        )
```

Copilot AI Apr 5, 2026

This test passes the full test_data dict to store_record, but test_values injects a visit_id key into every table record (including task/crawl, which don't have a visit_id column). With SQLAlchemy inserts, that will raise (extra/unknown column) and, since the provider swallows exceptions, the test can succeed without actually inserting rows. Consider mirroring StorageController behavior here by deleting visit_id when it is INVALID_VISIT_ID (and/or asserting that inserts succeeded, e.g., by validating row counts).
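A minimal sketch of the fix this comment suggests, under the assumption that the sentinel constant and helper name look roughly like this (both are illustrative, not the PR's actual code): strip the injected `visit_id` before inserting into tables such as `task` and `crawl` that have no such column.

```python
# Hypothetical sketch mirroring StorageController behavior: drop the
# injected visit_id when it carries the "no visit" sentinel value.
# INVALID_VISIT_ID and prepare_record are illustrative names.
INVALID_VISIT_ID = -1  # assumed sentinel for records not tied to a visit

def prepare_record(record: dict) -> dict:
    """Return a copy of record without a sentinel visit_id key."""
    record = dict(record)  # copy so shared test data is not mutated
    if record.get("visit_id") == INVALID_VISIT_ID:
        del record["visit_id"]
    return record
```

For example, `prepare_record({"task_id": 1, "visit_id": -1})` keeps only `task_id`, so the insert no longer carries an unknown column.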

Comment on lines +38 to +39
```python
    Compares column names, types, NOT NULL constraints, default values,
    and AUTOINCREMENT status for every table.
```
Copilot AI Apr 5, 2026

The docstring states this test compares default values, but the implementation never checks dflt_value from PRAGMA table_info. Either add an assertion for defaults (being careful about SQLite quoting differences) or adjust the docstring so it matches the actual coverage.

Suggested change:

```diff
-    Compares column names, types, NOT NULL constraints, default values,
-    and AUTOINCREMENT status for every table.
+    Compares column names, types, NOT NULL constraints, and AUTOINCREMENT
+    status for every table.
```

Comment on lines +29 to +33
```python
    async def init(self) -> None:
        self._engine = create_engine(self.db_url, **self.engine_kwargs)
        self._connection = self._engine.connect()
        metadata.create_all(self._engine)
```

Copilot AI Apr 5, 2026

These methods are async but perform synchronous (potentially slow) DB work directly on the event loop (engine creation/connection + DDL). In the StorageController this can still block socket handling/other tasks, especially with PostgreSQL. Consider using SQLAlchemy's asyncio support (sqlalchemy.ext.asyncio) or offloading blocking DB work to a thread/executor.
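The lighter of the two remedies the reviewer mentions can be sketched as below: keep `init()` async but push the blocking engine/DDL work off the event loop with `asyncio.to_thread`. This is an illustrative sketch, not the PR's implementation; the attribute names follow the quoted snippet.

```python
import asyncio

from sqlalchemy import MetaData, create_engine

metadata = MetaData()

class SketchProvider:
    """Sketch only: async init() that offloads blocking SQLAlchemy work."""

    def __init__(self, db_url: str, **engine_kwargs):
        self.db_url = db_url
        self.engine_kwargs = engine_kwargs

    async def init(self) -> None:
        # create_engine, connect() and create_all() all do synchronous
        # I/O; asyncio.to_thread runs them in a worker thread so other
        # tasks (e.g. the StorageController's socket handling) keep running.
        self._engine = await asyncio.to_thread(
            create_engine, self.db_url, **self.engine_kwargs
        )
        self._connection = await asyncio.to_thread(self._engine.connect)
        await asyncio.to_thread(metadata.create_all, self._engine)
```

The deeper fix the reviewer names is `sqlalchemy.ext.asyncio` (`create_async_engine`), which makes every statement awaitable rather than just moving blocking calls to threads.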

vringar and others added 8 commits April 7, 2026 23:37
Add SQLAlchemyStorageProvider backed by SQLAlchemy Core (not ORM) that
supports any SQLAlchemy-compatible database. SQLiteStorageProvider is
now a thin wrapper delegating to SQLAlchemyStorageProvider with a
sqlite:/// URL.

New files:
- sqlalchemy_schema.py: All 13 tables as SQLAlchemy Table objects
- sqlalchemy_provider.py: StructuredStorageProvider implementation
- test_sqlalchemy_provider.py: Schema equivalence, all-tables, coercion tests

Key decisions:
- DATETIME columns → Text for cross-dialect compatibility
- No foreign keys (test data violates them, SQLite ignores them)
- PostgreSQL transaction abort recovery via rollback in store_record
- Table reflection fallback for custom tables (e.g. page_links)

Closes #1143
… [CL-5]
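The "table reflection fallback" decision in this commit can be sketched like this, assuming illustrative names (`TABLE_MAP`, `lookup_table`) rather than the PR's exact API: records targeting a table absent from the static schema get a Table object reflected from the live database.

```python
# Hedged sketch of a reflection fallback for custom tables (e.g. a
# command-defined page_links table outside the static schema).
from sqlalchemy import MetaData, Table, create_engine, text

engine = create_engine("sqlite://")
with engine.begin() as conn:
    # Simulate a custom table created outside the declared schema.
    conn.execute(text("CREATE TABLE page_links (visit_id INTEGER, href TEXT)"))

metadata = MetaData()
TABLE_MAP: dict = {}  # the statically declared tables would live here

def lookup_table(name: str) -> Table:
    """Return a declared table, reflecting unknown tables on demand."""
    if name not in TABLE_MAP:
        # autoload_with inspects the database for the column definitions
        TABLE_MAP[name] = Table(name, metadata, autoload_with=engine)
    return TABLE_MAP[name]
```

Reflected tables are cached in the map, so each custom table is inspected only once per provider lifetime.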

Add psycopg2, pytest-postgresql, and postgresql to environment.yaml and extend
test/storage/fixtures.py with a postgresql scenario backed by SQLAlchemyStorageProvider,
so the existing parametrized all-tables insertion tests also run against a real PostgreSQL instance.
…tures [CL-5]

- Change Integer to BigInteger for columns holding large values
- Gate PostgreSQL scenario so existing SQLite tests run without pg
- Add missing deps to environment-unpinned.yaml
- Gate postgresql_scenarios on HAS_PYTEST_POSTGRESQL so SQLite tests
  work without pytest-postgresql installed
- Remove psycopg2 from deps (mutually exclusive with psycopg2-binary)
- Add explanatory comment for intentional ImportError pass in conftest
psycopg2 is a C extension that requires libpq, which is incompatible
with python_abi 3.14 cp314t on conda-forge. psycopg (v3) is a pure
Python driver that works on all Python versions.

- Replace psycopg2 with psycopg in environment files
- Remove postgresql server package (pytest-postgresql manages its own)
- Update SQLAlchemy dialect from postgresql+psycopg2 to postgresql+psycopg
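The dialect switch in this commit amounts to changing the driver segment of the SQLAlchemy URL; a small illustration (credentials and host are placeholders):

```python
# The "+driver" suffix selects psycopg (v3) instead of psycopg2.
from sqlalchemy.engine import make_url

old = make_url("postgresql+psycopg2://user:pw@localhost:5432/openwpm")
new = make_url("postgresql+psycopg://user:pw@localhost:5432/openwpm")
```

`make_url` only parses the string; no driver needs to be installed until an engine is actually created from it.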
@vringar vringar force-pushed the feat/sqlalchemy-postgresql-v2 branch from 75e9c42 to 267ac02 Compare April 7, 2026 23:37
SQLite requires exactly INTEGER (not BIGINT) for AUTOINCREMENT primary
keys. The SQLAlchemy schema used BigInteger for task_id, which rendered
as BIGINT in DDL — causing OperationalError during table creation.
Since SQLiteStorageProvider now delegates to SQLAlchemyStorageProvider,
this crashed the StorageController process, and TaskManager blocked
forever on status_queue.get(), hanging ALL test groups in CI.

Changes:
- task.task_id: BigInteger → Integer (matches schema.sql)
- crawl.task_id: BigInteger → Integer (foreign key consistency)
- test_values: cap task_id range to 2^31-1 for PostgreSQL compat
- dns_responses: add missing redirect_url column from schema.sql
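The constraint behind this fix can be demonstrated in a few lines: SQLite honours AUTOINCREMENT only on a column typed exactly INTEGER PRIMARY KEY, so `task_id` must be `Integer`, not `BigInteger`. The column set here is trimmed for illustration; the real task table has more columns.

```python
# Sketch: an AUTOINCREMENT primary key that works on SQLite.
from sqlalchemy import Column, Integer, MetaData, Table, Text, create_engine

metadata = MetaData()
task = Table(
    "task",
    metadata,
    # BigInteger would render BIGINT here, and SQLite rejects
    # "BIGINT PRIMARY KEY AUTOINCREMENT" with an OperationalError.
    Column("task_id", Integer, primary_key=True),
    Column("manager_params", Text),
    sqlite_autoincrement=True,  # emits AUTOINCREMENT in the DDL
)

engine = create_engine("sqlite://")
metadata.create_all(engine)  # succeeds because task_id is INTEGER
```

The matching test-side change caps generated `task_id` values at 2^31 - 1 so the same values also fit PostgreSQL's INTEGER.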

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.73%. Comparing base (a088fd3) to head (46e2d22).

Files with missing lines Patch % Lines
openwpm/storage/sqlalchemy_provider.py 93.82% 5 Missing ⚠️
openwpm/storage/sql_provider.py 90.90% 1 Missing ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #1161      +/-   ##
==========================================
+ Coverage   62.18%   62.73%   +0.55%
==========================================
  Files          40       42       +2
  Lines        3898     3953      +55
==========================================
+ Hits         2424     2480      +56
+ Misses       1474     1473       -1
```

☔ View full report in Codecov by Sentry.
