Skip to content

[DCP] Auto create tables when DB is empty#494

Merged
gmechali merged 7 commits intodatacommonsorg:masterfrom
gmechali:autoCreateTable
Apr 8, 2026
Merged

[DCP] Auto create tables when DB is empty#494
gmechali merged 7 commits intodatacommonsorg:masterfrom
gmechali:autoCreateTable

Conversation

@gmechali
Copy link
Copy Markdown
Contributor

@gmechali gmechali commented Apr 7, 2026

Modify the ingestion pipeline's logic for DB creation to be a bit more flexible. Before this change, it just checks whether the DB exists, and if not it creates it with the appropriate scheme and protodescriptor.

With this change, we retain the ability to create the DB and initialize it properly, but we also allow customers to provide an empty DB, and have it be initialized with the proper schema and proto descriptor.
This is important for DCP, where the DB should be created via Terraform at setup time in order to also connect the DCP service to the DB, however the schema is not necessary at that point, and so we can expect the ingestion pipeline to validate the schema and create it when necessary.

I've tested this on all the following scenarios:

  • DB exists and is properly initialized
  • DB exists but has no tables or proto descriptor
  • DB exists but has no tables only
  • DB does not exist.

@codacy-production
Copy link
Copy Markdown

codacy-production bot commented Apr 7, 2026

Not up to standards ⛔

🔴 Issues 3 medium · 3 minor

Alerts:
⚠ 6 issues (≤ 0 issues of at least minor severity)

Results:
6 new issues

Category Results
CodeStyle 3 minor
Complexity 3 medium

View in Codacy

🟢 Metrics 22 complexity · 0 duplication

Metric Results
Complexity 22
Duplication 0

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the SpannerClient to handle existing but uninitialized databases by checking for the presence of core tables and applying DDL statements if the database is empty. It also introduces unit tests for these new initialization checks. Review feedback suggests using configured table names instead of hardcoded strings for consistency, optimizing I/O by deferring DDL reading until needed, removing redundant fully qualified class names, and adding a missing assertion in the test suite.

@gmechali
Copy link
Copy Markdown
Contributor Author

gmechali commented Apr 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Spanner database initialization process by replacing the createDatabase method with a more comprehensive validateOrInitializeDatabase method. This new logic verifies the existence of required tables and proto bundles, handling cases where the database might be partially initialized or empty. Additionally, the PR introduces unit tests for SpannerClient and updates the project's dependencies to support testing. Review feedback suggests improving the robustness of DDL statement detection by using more specific string matching and adhering to the principle of programming to an interface by using List instead of ArrayList in method signatures.

@gmechali gmechali requested review from dwnoble and vish-cs April 7, 2026 13:44
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@vish-cs
Copy link
Copy Markdown
Contributor

vish-cs commented Apr 8, 2026

Thanks for making this change. This was one of the pending items since the previous logic was incompatible with the Terraform setup.
https://github.com/datacommonsorg/data/blob/master/import-automation/terraform/main.tf

FYI, Terraform deployment script creates additional tables in the DB for import automation.
https://github.com/datacommonsorg/data/blob/master/import-automation/workflow/spanner_schema.sql

@gmechali
Copy link
Copy Markdown
Contributor Author

gmechali commented Apr 8, 2026

Thanks for making this change. This was one of the pending items since the previous logic was incompatible with the Terraform setup. https://github.com/datacommonsorg/data/blob/master/import-automation/terraform/main.tf

FYI, Terraform deployment script creates additional tables in the DB for import automation. https://github.com/datacommonsorg/data/blob/master/import-automation/workflow/spanner_schema.sql

Got it - so the DCP Terraform script now can also just keep its current logic of creating the DataBase, but letting the ingestion script initialize the tables.

We also could have initialized the tables in the Terraform script, but I think that holds some pretty significant risk. If a user ends up making a change to the Table Schema and reapplies terraform, it does not modify the table, it recreates it, and therefore could drop all the data in the tables.
That was why I did not want the DCP TF to create the Node/Edge/Obs tables. Maybe for ImportStatus it's a bit less critical...but im not sure.

I will put some thought into it in a few days when I get to the TF Import Status setup for DCP.

@gmechali gmechali merged commit fced75e into datacommonsorg:master Apr 8, 2026
5 of 6 checks passed
@gmechali gmechali deleted the autoCreateTable branch April 8, 2026 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants