[DCP] Remove DB and Table Auto Creation from DCP Service + Modify the schema to match Ingestion Pipeline#26
Conversation
… service. The Ingestion Pipeline will be responsible for this moving forward.
…stion pipeline. Verified this works with retrieval
Code Review
This pull request removes the database initialization logic and simplifies the EdgeModel schema by deleting several columns and an index. However, these changes are breaking as they will cause runtime failures in GraphService, which still references the removed attributes and index. Additionally, making the provenance column non-nullable may lead to integrity errors in existing service methods.
    object_value_tokenlist = deferred(
        sa.Column(Text(), nullable=True)
    )  # TOKENLIST is a Spanner type, but represented as String in SQLAlchemy
    provenance = sa.Column(String(1024), primary_key=True, nullable=False)
The removal of the object_value, object_bytes, and object_hash columns, along with the EdgeByObjectValue index, is a breaking change that will cause runtime failures in GraphService.
Specifically:
- **AttributeError:** `GraphService.node_model_to_graph_node` and `GraphService.coerce_edge_val_for_db_write` still attempt to access these deleted attributes on `EdgeModel` instances.
- **Database Error:** The `drop_tables` command in `api_cli.py` calls `GraphService.drop_tables()`, which executes a raw SQL `DROP INDEX EdgeByObjectValue`. This will fail if the index is no longer present in the schema.
- **Integrity Error:** Changing `provenance` to `nullable=False` while it is part of the primary key will cause failures in `GraphService.create_edge_model` if a provenance is not provided (as it defaults to `None`).
Please update GraphService to align with the new schema or include those changes in this PR to maintain system integrity.
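One way to resolve the integrity error above is to coerce a missing provenance to an explicit sentinel in the service layer before the value ever reaches the database. This is a minimal sketch only; the names (`create_edge_model`, `DEFAULT_PROVENANCE`, the dataclass fields) are illustrative assumptions, not the actual GraphService API:

```python
# Sketch: if `provenance` is a non-nullable primary-key member, the service
# layer must guarantee a value before insert. All names here are hypothetical.
from dataclasses import dataclass
from typing import Optional

DEFAULT_PROVENANCE = "dc/base"  # assumed sentinel; use whatever the pipeline defines


@dataclass
class EdgeModel:
    subject_id: str
    predicate: str
    object_id: str
    provenance: str  # part of the primary key, must never be None


def create_edge_model(subject_id: str, predicate: str, object_id: str,
                      provenance: Optional[str] = None) -> EdgeModel:
    # Coerce a missing provenance to an explicit sentinel instead of letting
    # None reach the database and trigger an IntegrityError.
    if provenance is None:
        provenance = DEFAULT_PROVENANCE
    return EdgeModel(subject_id, predicate, object_id, provenance)
```

The alternative is to make `provenance` a required keyword argument and push the burden onto callers; the sentinel approach keeps existing call sites working.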
/gemini review
Code Review
This pull request simplifies the database schema by removing several fields from the EdgeModel and eliminating the database initialization logic. It also updates the graph service to treat all edges as object IDs. Feedback highlights several critical issues: the change to make provenance mandatory in the model conflicts with existing service methods, leading to potential IntegrityError and AttributeError exceptions. Additionally, the removal of the EdgeByObjectValue index will break the drop_tables method which contains hardcoded SQL. Finally, there is a concern regarding JSON-LD semantics where using @id for literal values might break downstream RDF consumers.
    object_value_tokenlist = deferred(
        sa.Column(Text(), nullable=True)
    )  # TOKENLIST is a Spanner type, but represented as String in SQLAlchemy
    provenance = sa.Column(String(1024), primary_key=True, nullable=False)
The provenance column is now nullable=False and part of the primary key. However, the create_edge_model function in graph_service.py (line 127) still treats provenance as an optional argument. This inconsistency will lead to IntegrityError during database inserts if a provenance value is not provided. Additionally, the removal of object_value, object_bytes, and object_hash fields will cause AttributeError in coerce_edge_val_for_db_write and insert_node_models_batch within graph_service.py. While the write path is planned for removal, it is best practice to update or stub these dependent methods in this PR to avoid leaving the codebase in a broken state.
    # Indexes
    __table_args__ = (
        # Index for object_value lookups
        sa.Index("EdgeByObjectValue", "object_value"),
Removing the EdgeByObjectValue index here will cause the GraphService.drop_tables() method (located in packages/datacommons-api/datacommons_api/services/graph_service.py, line 512) to fail. That method contains a hardcoded SQL command DROP INDEX EdgeByObjectValue which will now target a non-existent index. Please update the drop_tables method to reflect this schema change.
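Rather than hardcoding the `DROP INDEX`, `drop_tables` could build its DDL from the set of indexes that actually exist. A minimal sketch under that assumption (`build_drop_statements` and the table names are hypothetical; the existing-index set would come from querying the information schema):

```python
# Sketch: only emit DROP INDEX when the index is actually present, so the
# method survives schema versions with and without EdgeByObjectValue.
# Names and table list are illustrative, not the real GraphService code.
def build_drop_statements(existing_indexes: set) -> list:
    statements = []
    if "EdgeByObjectValue" in existing_indexes:
        statements.append("DROP INDEX EdgeByObjectValue")
    # Tables are dropped unconditionally, indexes conditionally.
    statements.append("DROP TABLE Edge")
    statements.append("DROP TABLE Node")
    return statements
```

If the target Spanner version supports `IF EXISTS` in DDL, the guard could instead be pushed into the SQL itself; the Python-side check avoids depending on that.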
    - # If the edge has an object id, add it to the property value
    + # All edges are object IDs
      property_value["@id"] = edge.object_id
The assumption that all edges are object IDs might lead to semantically incorrect JSON-LD output for literal properties (such as names, descriptions, or numerical values). If the ingestion pipeline now stores literal values within the object_id column, they should be represented using the @value key in the JSON-LD document instead of @id. In JSON-LD, @id is reserved for IRIs/CURIEs, and using it for literal data can break downstream consumers expecting valid RDF semantics.
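The `@id`/`@value` split the review describes could be sketched as follows. This is a hedged illustration only: the `is_reference` heuristic is an assumption (the real distinction now comes from literals living on the node table, per the discussion below), and `to_property_value` is not a function from this codebase:

```python
# Sketch: route node references to "@id" and literals to "@value" when
# building JSON-LD property values. The reference heuristic is illustrative.
def to_property_value(raw: str) -> dict:
    def is_reference(v: str) -> bool:
        # Assumed heuristic: Data Commons DCIDs look like CURIE-style paths
        # (e.g. "geoId/06"), while literals typically contain spaces or are
        # plain text/numbers. A real implementation would use schema info.
        return "/" in v and " " not in v

    return {"@id": raw} if is_reference(raw) else {"@value": raw}
```

In JSON-LD terms, `{"@id": ...}` produces a node object (an IRI reference) while `{"@value": ...}` produces a value object (a literal), which is why conflating the two breaks RDF consumers.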
Gemini's right; it's a little more complicated than this. Literal values are now stored back on the node table.
Christie has had a PR up for a few weeks (😓) that also addresses these schema updates. Let's both review that PR, get it in, and then rebase this PR on top of it. Wdyt @gmechali?
        return session()


    def initialize_db(project_id: str, instance_id: str, database_name: str):
With this function removed, can you add some instructions to the README on how users can initialize Spanner?
It might include something like:
1. Option 1: use Terraform to deploy to GCP, with instructions for provisioning Spanner from there.
2. Option 2: without Terraform, clone the https://github.com/datacommonsorg/import/ repo and use the import pipeline with DirectRunner to initialize a Spanner DB.
Yes, but I will do it in a follow-up so I can reference the Terraform configs once they're checked in! Will do it tomorrow at the latest!
As discussed, we let Christie's PR get submitted first; now the only diff in this PR is the removal of the DB initialization. In a follow-up I will update the README to describe initializing the DB via the ingestion workflow. @dwnoble Any other concerns with this one?
/gemini review
Deletes the flow within the DCP service responsible for initializing the tables. We have moved this responsibility to the ingestion pipeline in datacommonsorg/import#494