Skip to content

feat(remote): publish sanitized cloud SQLite archive objects#77

Merged
vincentkoc merged 1 commit into
mainfrom
feature/cloudflare-remote-archives
May 28, 2026
Merged

feat(remote): publish sanitized cloud SQLite archive objects#77
vincentkoc merged 1 commit into
mainfrom
feature/cloudflare-remote-archives

Conversation

@vincentkoc
Copy link
Copy Markdown
Member

Summary

  • update crawlkit to v0.10.0
  • upload a sanitized SQLite cloud archive during discrawl cloud publish after D1 ingest
  • keep @me rows, raw JSON, attachment URLs, embeddings, and local-only tables out of the remote SQLite artifact

Verification

  • GOWORK=off go test ./internal/cli -run 'TestCloudPublishSendsNonDMRows|TestCloud'\n- GOWORK=off go test ./...\n- built and uploaded a sanitized discrawl/openclaw chunked R2 bundle: 845,058,048 bytes, 1,417,329 messages, @me excluded

@vincentkoc vincentkoc marked this pull request as ready for review May 28, 2026 09:20
@vincentkoc vincentkoc requested a review from a team as a code owner May 28, 2026 09:20
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 28, 2026

Codex review: found issues before merge. Reviewed May 28, 2026, 5:56 AM ET / 09:56 UTC.

Summary
This PR bumps crawlkit to v0.10.0, makes discrawl cloud publish build and upload a sanitized SQLite archive object after row ingest, adds test coverage, and edits changelog text.

Reproducibility: yes. Source inspection shows the PR unconditionally calls uploadSQLiteArchive after final D1 ingest, so a Worker that supports /ingest but lacks PUT /sqlite would fail the new success path.

Review metrics: 3 noteworthy metrics.

  • Mandatory upload path: 1 PUT /sqlite call added after 4 ingest tables. This is the compatibility-sensitive success-path change existing Workers must support.
  • Dependency surface: github.com/openclaw/crawlkit v0.8.0 -> v0.10.0. The runtime behavior depends on the new remote SQLite upload API from the dependency bump.
  • Patch size: 5 files changed, +258/-11. The diff spans runtime behavior, tests, dependency metadata, and release notes.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦞 diamond lobster
Patch quality: 🦐 gold shrimp
Result: needs maintainer review before merge.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Preserve or explicitly migrate the existing D1-only Worker publish path.
  • Get maintainer approval for the SQLite object sanitizer, retention, and Worker/R2 deployment boundary.
  • Move release-note context out of CHANGELOG.md and keep it in the PR body or commit message.

Risk before merge

  • [P1] Existing Worker deployments that support current D1 ingest but not PUT /sqlite or R2 permissions can accept rows and then make discrawl cloud publish fail.
  • [P1] Publishing a bulk SQLite object expands the remote artifact retention and sanitizer contract beyond row ingest, even though the PR narrows the exported tables.
  • [P1] The PR edits CHANGELOG.md, which should stay release-owned for normal OpenClaw PR work.

Maintainer options:

  1. Preserve D1-only publish compatibility (recommended)
    Gate SQLite upload on remote contract or capability support, or make it explicitly optional, and add coverage for both D1-only and SQLite-upload paths.
  2. Require a coordinated Worker/R2 migration
    Maintainers can accept the breaking upgrade if release guidance and deployment proof show the Worker route and R2 permissions are already in place.
  3. Pause for sanitizer ownership review
    Hold the PR if the long-term SQLite object schema, privacy filters, or remote retention policy need to be settled in the remote service first.

Next step before merge

  • [P2] This needs a maintainer choice on mandatory versus gated SQLite upload and the remote sanitizer boundary, not an automated repair lane.

Security
Needs attention: The diff has no obvious sanitizer leak, but it expands publication from row ingest to a bulk SQLite object and needs maintainer approval before release.

Review findings

  • [P1] Keep D1-only cloud publish compatible — internal/cli/cloud_commands.go:119-121
  • [P3] Leave release notes to the release flow — CHANGELOG.md:17-19
Review details

Best possible solution:

Land this only after preserving D1-only publish compatibility or explicitly coordinating a Worker/R2 migration, with the SQLite sanitizer and retention contract approved outside normal changelog edits.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows the PR unconditionally calls uploadSQLiteArchive after final D1 ingest, so a Worker that supports /ingest but lacks PUT /sqlite would fail the new success path.

Is this the best way to solve the issue?

No, not yet. The sanitizer implementation is focused, but making the upload mandatory is not the safest upgrade path unless maintainers explicitly require a coordinated Worker/R2 migration.

Full review comments:

  • [P1] Keep D1-only cloud publish compatible — internal/cli/cloud_commands.go:119-121
    This new call runs unconditionally after the existing D1 ingest succeeds. A Worker that supports /ingest but has not yet been upgraded for PUT /sqlite or R2 permissions will now fail the command after accepting rows, so please capability-gate or explicitly migrate this upload before making it mandatory.
    Confidence: 0.88
  • [P3] Leave release notes to the release flow — CHANGELOG.md:17-19
    This PR adds normal release-note text to CHANGELOG.md, but OpenClaw release notes are release-owned. Please keep the user-visible context in the PR body or commit message and let the release process update the changelog.
    Confidence: 0.78

Overall correctness: patch is incorrect
Overall confidence: 0.87

AGENTS.md: not found in the target repository.

Codex review notes: model gpt-5.5, reasoning high; reviewed against dcf0c9c8dbb3.

Label changes

Label justifications:

  • P2: This is a normal-priority feature PR with limited blast radius but real upgrade and security-boundary decisions before merge.
  • merge-risk: 🚨 compatibility: The PR makes a new Worker/R2 upload endpoint mandatory for an existing cloud publish workflow that currently succeeds with D1 ingest alone.
  • merge-risk: 🚨 security-boundary: Publishing a bulk SQLite object expands the remote data boundary beyond row ingest and needs maintainer approval of the sanitizer contract.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦞 diamond lobster and patch quality is 🦐 gold shrimp.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (live_output): The PR body reports a real sanitized R2 bundle upload with concrete byte/message counts and @me exclusion, which is sufficient runtime proof for this non-visual CLI change.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body reports a real sanitized R2 bundle upload with concrete byte/message counts and @me exclusion, which is sufficient runtime proof for this non-visual CLI change.
Evidence reviewed

Security concerns:

  • [medium] Bulk SQLite remote artifact changes the data boundary — internal/cli/cloud_commands.go:119
    The PR uploads a generated SQLite database to the Worker-backed object store; even with explicit non-DM table exports, maintainers should approve the schema, excluded tables, and retention expectations before shipping this new publication path.
    Confidence: 0.76

What I checked:

  • Target policy: No AGENTS.md exists under the discrawl target repository root; the only AGENTS.md found was outside this target checkout and was not applied as target policy.
  • Current main behavior: Current main sends the four D1 ingest tables and returns publish counts without any SQLite object upload in the success path. (internal/cli/cloud_commands.go:94, dcf0c9c8dbb3)
  • PR mandatory upload path: The PR calls uploadSQLiteArchive immediately after final message ingest and returns that error, making the new PUT /sqlite route required for publish success. (internal/cli/cloud_commands.go:119, 693e98058134)
  • User-facing docs: The current README describes cloud publish as sending non-DM rows into the Worker-backed D1 archive and excluding @me rows, without a required R2/SQLite object migration. (README.md:560, dcf0c9c8dbb3)
  • Upstream contract: crawlkit v0.10.0 adds the publisher-authenticated PUT /v1/apps/:app/archives/:archive/sqlite route and UploadSQLite client API used by this PR. (github.com/openclaw/crawlkit/remote/contract.go:77, 7db936ce95ca)
  • Sanitizer coverage: The PR test inspects the uploaded SQLite payload, verifies the four expected cloud tables, and checks zero direct-message rows. (internal/cli/cli_test.go:1301, 693e98058134)

Likely related people:

  • vincentkoc: Current main blame and commit metadata show Vincent Koc authored the Cloudflare remote archive mode, including runCloudPublish, its tests, and the README publish docs. (role: current remote archive implementer; confidence: high; commits: dcf0c9c8dbb3; files: internal/cli/cloud_commands.go, internal/cli/cli_test.go, README.md)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. labels May 28, 2026
@vincentkoc vincentkoc force-pushed the feature/cloudflare-remote-archives branch from 1fd8c3c to 279f3db Compare May 28, 2026 09:29
@socket-security
Copy link
Copy Markdown

socket-security Bot commented May 28, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedgithub.com/​openclaw/​crawlkit@​v0.8.0 ⏵ v0.10.095100100100100

View full report

@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 28, 2026
@vincentkoc vincentkoc force-pushed the feature/cloudflare-remote-archives branch 2 times, most recently from 693e980 to b625044 Compare May 28, 2026 10:01
@vincentkoc vincentkoc force-pushed the feature/cloudflare-remote-archives branch from b625044 to db23005 Compare May 28, 2026 10:05
@vincentkoc vincentkoc merged commit 6de2e0e into main May 28, 2026
12 checks passed
@vincentkoc vincentkoc deleted the feature/cloudflare-remote-archives branch May 28, 2026 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant