Skip to content

Align datastore_search timestamp output with the dump endpoint#9

Merged
sagargg merged 1 commit into
mainfrom
feat/iso-timestamp-projection
Jun 8, 2026
Merged

Align datastore_search timestamp output with the dump endpoint#9
sagargg merged 1 commit into
mainfrom
feat/iso-timestamp-projection

Conversation

@sagargg

@sagargg sagargg commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

datastore_search and datastore_dump (CSV / NDJSON) now emit TIMESTAMP / DATETIME columns in the same fixed ISO 8601 shape2026-06-08T00:00:00, UTC implicit, no offset, no fractional. Round-trip stays clean: BigQuery's CAST(STRING AS TIMESTAMP) accepts the no-TZ form and assumes UTC, so a dumped value can be re-upserted unchanged.

datastore_search_sql is intentionally untouched — user SQL stays verbatim.

Why

Before this change, a single TIMESTAMP value rendered three different ways:

Path Output
datastore_dump (CSV / NDJSON) "2026-06-08T00:00:00"
datastore_search — JSON "2026-06-08T00:00:00+00:00" (orjson default)
datastore_search — CSV / TSV "2026-06-08 00:00:00+00:00" (str(datetime) via csv.writer)

Clients that string-compare or naively parse ISO 8601 hit a footgun. Unifying on a single BigQuery-side cast removes the divergence at source.

Code change

  • bigquery/lib.py — new format_select_column(name, bq_type) helper. Single source of truth for the SELECT-list expression. TIMESTAMP → FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%S', col, 'UTC') AS col; DATETIME → FORMAT_DATETIME(...); other types pass through.
  • backend.py::_build_export_select (dump) — thin loop calling the helper. No behaviour change; existing dump tests still pass byte-for-byte.
  • bigquery/search.py::_project_column — translates Frictionless datetime → BQ TIMESTAMP via bigquery_type and calls the same helper. Used by both build_search and build_count so the data query and the DISTINCT count agree.
  • build_search ORDER BY now references t.\col`` rather than the bare column — defensive against the projection alias shadowing the native TIMESTAMP at sort time. ISO 8601 happens to lex-sort the same way, so behaviour is unchanged today; the explicit reference future-proofs us against format tweaks.

Trade-offs called out

  • Sub-second precision is lost in both endpoints (BigQuery stores µs, we emit whole seconds). Parquet dumps keep native types — use that if you need precision.
  • Column type in the search response is now STRING for the columns that used to be TIMESTAMP. The Frictionless schema returned by datastore_info still says datetime — we're just formatting the value at read time, not changing storage.
  • DISTINCT semantics shift to second-precision under the new format (two rows differing only in microseconds dedupe to one). Acceptable given the explicit "whole-second ISO" requirement.

Tests

  • tests/engines/bigquery/test_tables.py — SQL-shape assertions updated for the new ORDER BY (t.\col`) and the wrapped _updated_at` projection.
  • tests/test_datastore_dump.py — already pinned to the new format.
  • 3 NotFoundError matchers flipped "not declared""not found" to match the shortened messages on main's upsert / search / info paths (the delete + dump paths kept their longer messages and aren't touched).

Test-infra: hermetic AUTH_TYPE in conftest

While debugging the test suite locally I tripped over an existing latent issue: tests/conftest.py already neutralizes BigQuery env vars and CKAN_URL, but not AUTH_TYPE. A developer .env containing AUTH_TYPE=anonymous silently reroutes the dict-resource create branch and skips the read-only force guard — ~7 endpoint tests start failing locally even though CI is fine (no .env).

Fix in this PR: force os.environ["AUTH_TYPE"] = "ckan" at the top of conftest.py (not setdefault, because the dev .env is set). The CKAN_URL line above stays setdefault so a CI-supplied value still wins. Bundled here because it's directly tied to the test updates.

Verification

  • pytest -q: 333 passing, no skips, no flake.
  • ruff check: clean.
  • Live spec sanity: search SQL projects FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%S', \col`, 'UTC') AS `col`; dump produces an identical column expression in EXPORT DATA`.

Summary by CodeRabbit

  • Bug Fixes

    • Standardized error messages for resource-not-found scenarios to use consistent wording across all operations.
  • Improvements

    • Aligned datetime and timestamp formatting across search results and data exports to use a consistent ISO-8601 format (YYYY-MM-DDTHH:MM:SS) without timezone offsets or fractional seconds.

Both `datastore_search` and the `datastore_dump` (CSV / NDJSON) now emit TIMESTAMP / DATETIME columns in the same fixed-shape ISO 8601 form — `2026-06-08T00:00:00`, UTC implicit, no offset, no fractional. Round-trip works in both directions: BigQuery's `CAST(STRING AS TIMESTAMP)` accepts the no-TZ form and assumes UTC, so a dumped value can be re-upserted unchanged.

- New `format_select_column(name, bq_type)` helper in `bigquery/lib.py` — single source of truth for the SELECT-list expression. TIMESTAMP -> `FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%S', col, 'UTC') AS col`; DATETIME -> `FORMAT_DATETIME('%Y-%m-%dT%H:%M:%S', col) AS col`; everything else passes through.
- `backend.py::_build_export_select` (dump) — thin loop calling the helper. Same SQL as before, no behaviour change.
- `bigquery/search.py::_project_column` — wraps Frictionless `datetime` columns by translating to BQ `TIMESTAMP` via `bigquery_type` and calling the helper. `_project_column` is used by both `build_search` and `build_count` so DISTINCT semantics stay consistent between the data query and the count.
- `build_search` ORDER BY now references `t.\`col\`` rather than the bare column name — the projection alias for datetime columns shouldn't shadow the native TIMESTAMP at sort time. ISO 8601 sorts lexicographically the same as the underlying TIMESTAMP, so this is defensive future-proofing.
- `datastore_search_sql` is intentionally untouched (user-supplied SQL stays verbatim).

Tests:
- `tests/engines/bigquery/test_tables.py` — SQL-shape assertions updated for the new ORDER BY and the wrapped `_updated_at` projection.
- `tests/test_datastore_dump.py` — already pinned to the new format.
- 3 `NotFoundError` matchers in `test_tables.py` flipped "not declared" -> "not found" to match the shortened messages on main (upsert / search / info paths).

Test infrastructure:
- `tests/conftest.py` forces `AUTH_TYPE=ckan` (not setdefault) so a developer .env that has `AUTH_TYPE=anonymous` can't silently reroute the dict-resource create branch + the read-only force guard, which would skip ~7 endpoint tests locally. CI is unaffected (no .env). The CKAN_URL line above stays `setdefault` for the same reason.
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1cd9fa6e-7a7c-417d-a2c0-f31e717e4fbe

📥 Commits

Reviewing files that changed from the base of the PR and between dd0995a and 58d963b.

📒 Files selected for processing (6)
  • datastore/infrastructure/engines/bigquery/backend.py
  • datastore/infrastructure/engines/bigquery/lib.py
  • datastore/infrastructure/engines/bigquery/search.py
  • tests/conftest.py
  • tests/engines/bigquery/test_tables.py
  • tests/test_datastore_dump.py

📝 Walkthrough

Walkthrough

This PR standardizes BigQuery datetime output across search, export, and count operations by introducing a shared column formatter that converts TIMESTAMP/DATETIME to UTC ISO-8601 strings without fractional seconds. Error messages for missing resources are unified, and test infrastructure enforces consistent CKAN authentication throughout the suite.

Changes

DateTime Formatting & Query Standardization

Layer / File(s) Summary
Shared datetime format helper
datastore/infrastructure/engines/bigquery/lib.py
New format_select_column(name, bq_type) function detects TIMESTAMP and DATETIME types and formats them to UTC ISO-8601 strings without fractional seconds, otherwise returns the column reference unchanged.
Search and count query formatting
datastore/infrastructure/engines/bigquery/search.py, tests/engines/bigquery/test_tables.py
New _project_column helper maps Frictionless types to BigQuery types and applies the shared formatter. build_search uses formatted projections and qualifies ORDER BY with table alias. build_count aligns its inner SELECT projection with build_search for consistent distinct semantics on datetime columns. Test expectations updated for qualified column references and formatted _updated_at projection.
Export data formatting
datastore/infrastructure/engines/bigquery/backend.py, tests/test_datastore_dump.py
_build_export_select for non-Parquet exports now uses format_select_column(field.name, field.field_type) for each column, aligning export rendering with search response formatting. Test assertions updated for second-precision ISO-8601 datetime format without timezone suffix or fractional seconds.
Unified error messages for missing resources
datastore/infrastructure/engines/bigquery/backend.py, tests/engines/bigquery/test_tables.py
Three missing-resource NotFoundError messages in upsert, search, and info are standardized to "resource <id> not found.". Test expectations updated to match the new unified message.
Test authentication enforcement
tests/conftest.py
Module-level explicit AUTH_TYPE override to "ckan" at import time ensures CKAN-auth-guarded code paths execute consistently during endpoint tests, regardless of developer .env settings.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • datopian/datastore#2: Both PRs adjust BigQuery SQL projection/EXPORT SELECT logic to format TIMESTAMP/DATETIME columns into the same ISO-like YYYY-MM-DDTHH:MM:SS form (main PR via format_select_column used by export/search; retrieved PR via dump/export select generation used by the new /datastore/dump/{resource_id} flow).

Poem

🐰 A datetime dance in BigQuery's embrace,
Format them all in one standard place!
No fractional seconds, no UTC sign—
Just clean ISO strings, oh how they align.
From search to export, from count to the logs,
The timestamps now hop through without any clogs!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: aligning timestamp output formats between two datastore endpoints for consistency.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/iso-timestamp-projection

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sagargg sagargg merged commit aa64e30 into main Jun 8, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant