[DCP] Remove DB and Table Auto Creation from DCP Service + Modify the schema to match Ingestion Pipeline#26
Conversation
… service. The Ingestion Pipeline will be responsible for this moving forward.
…stion pipeline. Verified this works with retrieval
Code Review
This pull request removes the database initialization logic and simplifies the EdgeModel schema by deleting several columns and an index. However, these changes are breaking as they will cause runtime failures in GraphService, which still references the removed attributes and index. Additionally, making the provenance column non-nullable may lead to integrity errors in existing service methods.
    object_value_tokenlist = deferred(
        sa.Column(Text(), nullable=True)
    )  # TOKENLIST is a Spanner type, but represented as String in SQLAlchemy
    provenance = sa.Column(String(1024), primary_key=True, nullable=False)
The removal of the object_value, object_bytes, and object_hash columns, along with the EdgeByObjectValue index, is a breaking change that will cause runtime failures in GraphService.
Specifically:
- **AttributeError:** `GraphService.node_model_to_graph_node` and `GraphService.coerce_edge_val_for_db_write` still attempt to access these deleted attributes on `EdgeModel` instances.
- **Database Error:** The `drop_tables` command in `api_cli.py` calls `GraphService.drop_tables()`, which executes a raw SQL `DROP INDEX EdgeByObjectValue`. This will fail if the index is no longer present in the schema.
- **Integrity Error:** Changing `provenance` to `nullable=False` while it is part of the primary key will cause failures in `GraphService.create_edge_model` if a provenance is not provided (as it defaults to `None`).
Please update GraphService to align with the new schema or include those changes in this PR to maintain system integrity.
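One way to resolve the integrity error above is to coerce a missing provenance to an explicit sentinel in the service layer before the value ever reaches the database. This is a minimal sketch only; the names (`create_edge_model`, `DEFAULT_PROVENANCE`, the dataclass fields) are illustrative assumptions, not the actual GraphService API:

```python
# Sketch: if `provenance` is a non-nullable primary-key member, the service
# layer must guarantee a value before insert. All names here are hypothetical.
from dataclasses import dataclass
from typing import Optional

DEFAULT_PROVENANCE = "dc/base"  # assumed sentinel; use whatever the pipeline defines


@dataclass
class EdgeModel:
    subject_id: str
    predicate: str
    object_id: str
    provenance: str  # part of the primary key, must never be None


def create_edge_model(subject_id: str, predicate: str, object_id: str,
                      provenance: Optional[str] = None) -> EdgeModel:
    # Coerce a missing provenance to an explicit sentinel instead of letting
    # None reach the database and trigger an IntegrityError.
    if provenance is None:
        provenance = DEFAULT_PROVENANCE
    return EdgeModel(subject_id, predicate, object_id, provenance)
```

The alternative is to make `provenance` a required keyword argument and push the burden onto callers; the sentinel approach keeps existing call sites working.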
/gemini review
Code Review
This pull request simplifies the database schema by removing several fields from the EdgeModel and eliminating the database initialization logic. It also updates the graph service to treat all edges as object IDs. Feedback highlights several critical issues: the change to make provenance mandatory in the model conflicts with existing service methods, leading to potential IntegrityError and AttributeError exceptions. Additionally, the removal of the EdgeByObjectValue index will break the drop_tables method which contains hardcoded SQL. Finally, there is a concern regarding JSON-LD semantics where using @id for literal values might break downstream RDF consumers.
    object_value_tokenlist = deferred(
        sa.Column(Text(), nullable=True)
    )  # TOKENLIST is a Spanner type, but represented as String in SQLAlchemy
    provenance = sa.Column(String(1024), primary_key=True, nullable=False)
The provenance column is now nullable=False and part of the primary key. However, the create_edge_model function in graph_service.py (line 127) still treats provenance as an optional argument. This inconsistency will lead to IntegrityError during database inserts if a provenance value is not provided. Additionally, the removal of object_value, object_bytes, and object_hash fields will cause AttributeError in coerce_edge_val_for_db_write and insert_node_models_batch within graph_service.py. While the write path is planned for removal, it is best practice to update or stub these dependent methods in this PR to avoid leaving the codebase in a broken state.
    # Indexes
    __table_args__ = (
        # Index for object_value lookups
        sa.Index("EdgeByObjectValue", "object_value"),
Removing the EdgeByObjectValue index here will cause the GraphService.drop_tables() method (located in packages/datacommons-api/datacommons_api/services/graph_service.py, line 512) to fail. That method contains a hardcoded SQL command DROP INDEX EdgeByObjectValue which will now target a non-existent index. Please update the drop_tables method to reflect this schema change.
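Rather than hardcoding the `DROP INDEX`, `drop_tables` could build its DDL from the set of indexes that actually exist. A minimal sketch under that assumption (`build_drop_statements` and the table names are hypothetical; the existing-index set would come from querying the information schema):

```python
# Sketch: only emit DROP INDEX when the index is actually present, so the
# method survives schema versions with and without EdgeByObjectValue.
# Names and table list are illustrative, not the real GraphService code.
def build_drop_statements(existing_indexes: set) -> list:
    statements = []
    if "EdgeByObjectValue" in existing_indexes:
        statements.append("DROP INDEX EdgeByObjectValue")
    # Tables are dropped unconditionally, indexes conditionally.
    statements.append("DROP TABLE Edge")
    statements.append("DROP TABLE Node")
    return statements
```

If the target Spanner version supports `IF EXISTS` in DDL, the guard could instead be pushed into the SQL itself; the Python-side check avoids depending on that.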
    - # If the edge has an object id, add it to the property value
    + # All edges are object IDs
      property_value["@id"] = edge.object_id
The assumption that all edges are object IDs might lead to semantically incorrect JSON-LD output for literal properties (such as names, descriptions, or numerical values). If the ingestion pipeline now stores literal values within the object_id column, they should be represented using the @value key in the JSON-LD document instead of @id. In JSON-LD, @id is reserved for IRIs/CURIEs, and using it for literal data can break downstream consumers expecting valid RDF semantics.
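The `@id`/`@value` split the review describes could be sketched as follows. This is a hedged illustration only: the `is_reference` heuristic is an assumption (the real distinction now comes from literals living on the node table, per the discussion below), and `to_property_value` is not a function from this codebase:

```python
# Sketch: route node references to "@id" and literals to "@value" when
# building JSON-LD property values. The reference heuristic is illustrative.
def to_property_value(raw: str) -> dict:
    def is_reference(v: str) -> bool:
        # Assumed heuristic: Data Commons DCIDs look like CURIE-style paths
        # (e.g. "geoId/06"), while literals typically contain spaces or are
        # plain text/numbers. A real implementation would use schema info.
        return "/" in v and " " not in v

    return {"@id": raw} if is_reference(raw) else {"@value": raw}
```

In JSON-LD terms, `{"@id": ...}` produces a node object (an IRI reference) while `{"@value": ...}` produces a value object (a literal), which is why conflating the two breaks RDF consumers.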
Gemini's right; it's a little more complicated than this. Literal values are now stored back on the node table.
Christie has had a PR up for a few weeks (😓) that also addresses these schema updates. Let's both review that PR, get it in, and then rebase this PR on top of it. Wdyt @gmechali?
        return session()


    def initialize_db(project_id: str, instance_id: str, database_name: str):
With this function removed, can you add some instructions to the README on how users can initialize Spanner?
It might include something like:
1. Option 1: use Terraform to deploy to GCP, with instructions for provisioning Spanner from there.
2. Option 2: without Terraform, clone the https://github.com/datacommonsorg/import/ repo and use the import pipeline with DirectRunner to initialize a Spanner DB.
Yes, but I will do it in a follow-up so I can reference the Terraform configs once they're checked in! Will do it tomorrow at the latest!
As discussed, we let Christie's PR get submitted first; now the only diff in this PR is the removal of the DB initialization. In a follow-up I will update the README to describe initializing the DB via the ingestion workflow. @dwnoble Any other concerns with this one?
/gemini review
Deletes the flow within the DCP service responsible for initializing the tables. We have moved this responsibility to the ingestion pipeline in datacommonsorg/import#494