feat(cwop): v0.2 persistence + history() backtest replay#83
Merged
Conversation
Adds a standalone CWOP persistence layer and a parity-safe access path so ML
strategies can replay personal-weather-station history for model training —
without ever wiring CWOP into the parity-critical Kalshi settlement join.
New surface (weather.cwop):
- history(station, from_date, to_date, *, qc_status=None) -> schema.cwop.v1 DataFrame
(source="cwop.cache"); the backtest-replay entry point. Raises NoCWOPDataError
on an empty range — never an empty frame.
- persist_observations(observations) -> int — write live observations to the
monthly parquet cache (the collection seam for snapshot()/stream() output).
- snapshot(..., persist=True) — collect AND persist in one call (side effect;
the returned live frame is unchanged, still source="cwop.live").
Persistence (weather/cwop/_cache.py):
- $HOME/.mostlyright/cache/cwop/{station}/{year}/{month}.parquet
(honors MOSTLYRIGHT_CACHE_DIR).
- Standalone island: does NOT import the parity-coupled weather/cache.py.
- filelock-guarded read-modify-write merge, dedup by (station_id, observed_at)
first-seen-wins, atomic .tmp + os.replace.
- NO current-month skip — CWOP is ephemeral (no REST backfill), so the current
month is exactly what a live collector must retain.
- Own _CWOP_STATION_RE path validator (CWOP ids are not 4-letter ICAO).
Schema: schema.cwop.v1 now accepts both "cwop.live" and "cwop.cache" via
_registered_sources; build_cwop_dataframe takes a source param.
Parity firewall unchanged: merge/observations.py, research.py, live/_sources.py,
and schema.observation.v1 are all untouched (verified by diff). No include_cwop
on research() — that path aggregates max/min source-blind and would corrupt
NHIGH/NLOW targets.
Tests: +40 (30 cache, 10 history); full non-live suite 3233 passed. New-module
line coverage 94% (_cache) / 96% (_history) via the sysmon lane.
Independent adversarial review of the v0.2 persistence/history feature. BLOCKER — mixed naive/aware observed_at defeated dedup and tz-stripped the stored timestamp column. write_cwop_cache now normalizes observed_at/ knowledge_time to tz-aware UTC at the write chokepoint (before the dedup key is computed), so a naive and an aware spelling of the same instant collapse to one row and the on-disk pyarrow column is uniformly timestamp[us, UTC]. MEDIUM — history() range guard was stricter than the window it gated: it collapsed a bare `to` date to midnight while read_cwop_window expands it to end-of-day, spuriously rejecting valid asymmetric ranges like (datetime 18:00, date same-day). The guard now uses the same _as_utc_datetime normalization as the window. MEDIUM — a legitimate cwop.cache read logged a false `source_drift_allowed` audit event, because the validator's audit block keys off the legacy singular _registered_source. CwopSchema now declares the union via _registered_sources ONLY (no singular), so genuine union-source reads no longer pollute the train/infer-mismatch audit trail. No validator change. LOW — history() rebuilt qc_status via dict-default, which leaves an explicit None cell as None and fails the non-nullable enum check (SchemaValidationError instead of the documented NoCWOPDataError/DataFrame contract). Now coerces with `or "unknown"`. +6 regression tests. Full non-live suite 3239 passed; new-module line coverage 93% (_cache) / 100% (_history).
…e docs
- history() qc_status filter now matches on the SAME null→"unknown" coercion
_row_to_observation applies, so a persisted row stored with a null verdict is
reachable via qc_status="unknown" (predicate now matches the materialized
value the caller sees). +1 regression test.
- Refresh CwopSchema docstring + the `source` ColumnSpec note for the
{cwop.live, cwop.cache} source union (were stale after the v0.2 change).
|
✅ Docs-required check: PASS API-surface change includes docs updates — no reminder needed. API-surface files changed: Docs files changed: |
|
Parity ticket gate: PASSED See |
Stamp the partition's normalized (upper) station id onto every persisted row, mirroring the authoritative source stamp. Persisting via a lowercase id (cw0875) lands in the CW0875 partition but previously kept the raw station_id, so the (station_id, observed_at) dedup key diverged and casing variants stored as distinct rows — history() then returned duplicates for one station. (codex P2)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CWOP v0.2: a standalone persistence layer + a parity-safe backtest-replay path so ML strategies can train on Personal Weather Station history — without ever wiring CWOP into the parity-critical Kalshi NHIGH/NLOW settlement join.
CWOP stays on its own island. The four parity-frozen files are untouched (verified by
git diff main):merge/observations.py,research.py,live/_sources.py,schema.observation.v1. There is noinclude_cwoponresearch()— that path aggregatesmax/min(temp_f)source-blind, so a single hot-rooftop or indoor PWS would corrupt a settlement target. CWOP reaches models only through its ownhistory(), a DataFrame the strategy joins itself.New public surface (
mostlyright.weather.cwop)history(station, from_date, to_date, *, qc_status=None) -> schema.cwop.v1 DataFrame— backtest replay from the persisted cache (source="cwop.cache"). Inclusive date range; optional QC filter (e.g.qc_status="clean"). RaisesNoCWOPDataErroron an empty range — never an empty frame.persist_observations(observations) -> int— write live observations to the monthly parquet cache (the collection seam forsnapshot()/stream()output).snapshot(..., persist=True)— collect and persist in one call (side effect; the returned live frame is unchanged, stillsource="cwop.live").Persistence (
weather/cwop/_cache.py)$HOME/.mostlyright/cache/cwop/{station}/{year}/{month}.parquet(honorsMOSTLYRIGHT_CACHE_DIR).weather/cache.py.filelock-guarded read-modify-write merge under a single per-partition lock (no lost updates), dedup by(station_id, observed_at)first-seen-wins, atomic.tmp+os.replace._CWOP_STATION_REpath validator +assert_path_underbackstop.timestamp[us, UTC].Schema
schema.cwop.v1now accepts the source union{"cwop.live", "cwop.cache"}via_registered_sources(singular_registered_sourcedropped so legitimate cache reads don't pollute the source-drift audit trail).build_cwop_dataframegained asourceparam. The parity-frozenschema.observation.v1enum is untouched — noPWSmember.Review
Architecture self-review + two independent adversarial review rounds (codex was rate-limited mid-run, so an independent reviewer substituted). Round 1 found 1 BLOCKER (mixed naive/aware
observed_atdefeating dedup + tz-stripping storage), 2 MEDIUM (history range-guard stricter than the window; falsesource_drift_allowedaudit on cache reads), 1 LOW (nullqc_status→SchemaValidationError). All fixed; round 2 verified the fixes and confirmed no new issues and the firewall intact. The two non-blocking follow-ups (qc-filter/materialized-value consistency, stale docstrings) are also resolved.Tests
uv run pytest -m "not live"→ 3240 passed, 414 skipped (baseline 3193 + 47 new CWOP persistence/history tests). New-module line coverage (sysmon lane): 93%_cache/ 100%_history. Lint + format clean.Parity safety
git diff mainshows zero changes tomerge/observations.py,research.py,live/_sources.py, and the canonicalcore/schemas/. No"cwop"key in any merge/live/research file. CWOP persistence is a physically separate cache tree and a separate module that never imports the parity cache.TS Parity
N/A for this phase (Python-only), parity ticket deferred. The v0.1 CWOP adapter already ships Python-first with a TS parity ticket for the live path (APRS socket → Worker/ReadableStream). v0.2 persistence is a local parquet cache — TS has no equivalent local-disk story in-browser, and the
history()query maps to whatever local/edge storage the TS SDK adopts later. No TS-visible public-API contract changes here that aren't already gated behind the existing CWOP TS parity ticket. No browser/bundle constraints introduced (purepyarrow/filelock, server-side).python_only: true — CWOP v0.2 persistence + history() is a Python-only local-cache feature behind the CWOP firewall; no cross-SDK contract, tracked by the existing CWOP TS parity ticket.
🤖 Generated with Claude Code