Skip to content

[DCP] Remove DB and Table Auto Creation from DCP Service + Modify the schema to match Ingestion Pipeline#26

Open
gmechali wants to merge 5 commits intodatacommonsorg:mainfrom
gmechali:removeAutoCreate
Open

[DCP] Remove DB and Table Auto Creation from DCP Service + Modify the schema to match Ingestion Pipeline#26
gmechali wants to merge 5 commits intodatacommonsorg:mainfrom
gmechali:removeAutoCreate

Conversation

@gmechali
Copy link
Copy Markdown
Contributor

@gmechali gmechali commented Apr 7, 2026

Deletes the flow within the DCP service responsible for initializing the tables. We have moved this responsibility to the ingestion pipeline in datacommonsorg/import#494

gmechali added 2 commits April 7, 2026 09:53
… service. The Ingestion Pipeline will be responsible for this moving forward.
…stion pipeline. Verified this works with retrieval
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the database initialization logic and simplifies the EdgeModel schema by deleting several columns and an index. However, these changes are breaking as they will cause runtime failures in GraphService, which still references the removed attributes and index. Additionally, making the provenance column non-nullable may lead to integrity errors in existing service methods.

object_value_tokenlist = deferred(
sa.Column(Text(), nullable=True)
) # TOKENLIST is a Spanner type, but represented as String in SQLAlchemy
provenance = sa.Column(String(1024), primary_key=True, nullable=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The removal of the object_value, object_bytes, and object_hash columns, along with the EdgeByObjectValue index, is a breaking change that will cause runtime failures in GraphService.

Specifically:

  • AttributeError: GraphService.node_model_to_graph_node and GraphService.coerce_edge_val_for_db_write still attempt to access these deleted attributes on EdgeModel instances.
  • Database Error: The drop_tables command in api_cli.py calls GraphService.drop_tables(), which executes a raw SQL DROP INDEX EdgeByObjectValue. This will fail if the index is no longer present in the schema.
  • Integrity Error: Changing provenance to nullable=False while it is part of the primary key will cause failures in GraphService.create_edge_model if a provenance is not provided (as it defaults to None).

Please update GraphService to align with the new schema or include those changes in this PR to maintain system integrity.

@gmechali
Copy link
Copy Markdown
Contributor Author

gmechali commented Apr 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request simplifies the database schema by removing several fields from the EdgeModel and eliminating the database initialization logic. It also updates the graph service to treat all edges as object IDs. Feedback highlights several critical issues: the change to make provenance mandatory in the model conflicts with existing service methods, leading to potential IntegrityError and AttributeError exceptions. Additionally, the removal of the EdgeByObjectValue index will break the drop_tables method which contains hardcoded SQL. Finally, there is a concern regarding JSON-LD semantics where using @id for literal values might break downstream RDF consumers.

object_value_tokenlist = deferred(
sa.Column(Text(), nullable=True)
) # TOKENLIST is a Spanner type, but represented as String in SQLAlchemy
provenance = sa.Column(String(1024), primary_key=True, nullable=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The provenance column is now nullable=False and part of the primary key. However, the create_edge_model function in graph_service.py (line 127) still treats provenance as an optional argument. This inconsistency will lead to IntegrityError during database inserts if a provenance value is not provided. Additionally, the removal of object_value, object_bytes, and object_hash fields will cause AttributeError in coerce_edge_val_for_db_write and insert_node_models_batch within graph_service.py. While the write path is planned for removal, it is best practice to update or stub these dependent methods in this PR to avoid leaving the codebase in a broken state.

# Indexes
__table_args__ = (
# Index for object_value lookups
sa.Index("EdgeByObjectValue", "object_value"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Removing the EdgeByObjectValue index here will cause the GraphService.drop_tables() method (located in packages/datacommons-api/datacommons_api/services/graph_service.py, line 512) to fail. That method contains a hardcoded SQL command DROP INDEX EdgeByObjectValue which will now target a non-existent index. Please update the drop_tables method to reflect this schema change.

# If the edge has an object id, add it to the property value
property_value["@id"] = edge.object_id
# All edges are object IDs
property_value["@id"] = edge.object_id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The assumption that all edges are object IDs might lead to semantically incorrect JSON-LD output for literal properties (such as names, descriptions, or numerical values). If the ingestion pipeline now stores literal values within the object_id column, they should be represented using the @value key in the JSON-LD document instead of @id. In JSON-LD, @id is reserved for IRIs/CURIEs, and using it for literal data can break downstream consumers expecting valid RDF semantics.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwnoble can you confirm?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gemini's right, it's a little more complicated than this. Now literal values are stored back on the node table.

Christie has had a PR up for a few weeks (😓 ) that also addresses these schema updates. Lets both review that PR, get it in, and then rebase this PR on top of it-wdyt @gmechali ?

@gmechali gmechali requested review from clincoln8 and dwnoble April 7, 2026 14:24
# If the edge has an object id, add it to the property value
property_value["@id"] = edge.object_id
# All edges are object IDs
property_value["@id"] = edge.object_id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gemini's right, it's a little more complicated than this. Now literal values are stored back on the node table.

Christie has had a PR up for a few weeks (😓 ) that also addresses these schema updates. Lets both review that PR, get it in, and then rebase this PR on top of it-wdyt @gmechali ?

return session()


def initialize_db(project_id: str, instance_id: str, database_name: str):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this function removed, can you add some instructions to the README on how users can get initialize spanner?

It might include something like:
(1) option 1, use terraform to deploy to GCP, and give instructions for provisioning spanner from there
(2) option 2, without terraform, clone the https://github.com/datacommonsorg/import/ repo, and use the import pipeline w / DirectRunner to initialize a spanner db

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I will do it in a follow up so i can reference the terraforms once they're checked in! Will do it tomorrow at the latest!

@gmechali
Copy link
Copy Markdown
Contributor Author

As discussed, we let Christie's PR get submitted - now the only diffs in this PR are the removal of the DB initialization. In a follow up I will update the README to initialize the DB using the ingestion workflow.

@dwnoble Any other concerns with this one?

@gmechali gmechali requested a review from dwnoble April 14, 2026 17:56
@gmechali
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the database initialization logic from the API CLI and the database session module, specifically deleting the initialize_db function and its related imports and constants. I have no feedback to provide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants