
Tracking results storage and exposure #1241

@Ndpnt


Context

The engine periodically tracks whether online documents are accessible and extractable. For each service/terms pair, the outcome is either success (content fetched and version recorded) or failure (content inaccessible or extraction broken). When tracking fails, the Reporter module creates an issue on a third-party software forge (GitHub or GitLab). When tracking resumes, it closes the issue. These forge issues are the sole persistent record of tracking results.

Problem statement

Tracking results have no local persistence. Even when everything works, the persistence of tracking data depends entirely on an external forge. If the forge is unreachable or no API token is configured, failures are logged but not persisted, and the information is lost once the process ends. There is no way to query the current tracking status of a service without calling an external forge API.

This RFC proposes to introduce a local storage layer that persists tracking results independently of any external service, and to expose them through the collection API. This storage becomes the single source of truth for tracking status. External integrations (such as the current GitHub and GitLab issue managers) would then be extracted into separate modules that consume tracking results through this API. The active consumption of this data (creating issues, sending notifications) is out of scope.

Alternatives considered

Two existing standards and four alternative storage strategies were evaluated before arriving at the proposed design. None was retained. A dedicated Git repository with a custom JSON format was chosen instead.

Existing standards considered

TAP — Test Anything Protocol

TAP is a text-based protocol for reporting test results. The analogy with tracking is genuine: for each service/terms pair, the engine verifies that content is accessible and extractable, producing a pass or fail result.

TAP is not retained for two reasons. First, the standard only structures a small fraction of the data: TAP provides a binary ok/not ok outcome, but the engine needs to persist richer information (error reasons, source document metadata, snapshot references, timing data, transient errors). All of this must go into free-form YAML diagnostic blocks, outside the standard's structure:

```
TAP version 14
1..1
not ok 1 - Facebook · Terms of Service
  ---
  date: "2026-03-15T10:30:00Z"
  runId: "f47ac10b-58cc-4372-a567-0e02b2c3d479"
  reasons:
    - "CSS selector \".content\" has no match in the document"
  sourceDocuments:
    - id: legal-terms
      fetch: "https://www.facebook.com/legal/terms"
      select: ".content"
      snapshotId: abc123
      mimeType: text/html
  ...
```

Only the `not ok` line is structured by TAP. Everything in the `---`/`...` block is free-form YAML that no TAP tool can interpret. Second, existing TAP tools (`tap-parser`, `tap-mocha-reporter`) are designed for CI/CD terminal output, not for the consumers needed here (issue sync, API endpoints, federation aggregation). Since the standard neither structures the data nor provides usable tooling, adopting it would impose format constraints without any corresponding benefit.

EARL — Evaluation and Report Language

EARL is a W3C vocabulary for expressing test results. The semantic fit is the strongest: its concepts of Assertion, TestSubject, TestCriterion, and five-valued OutcomeValue map naturally to the tracking domain.

EARL is not retained for the same two reasons. The EARL vocabulary covers the outcome, the assertor (engine version), the test subject (URL), and the test criterion (terms type), but everything else must be added as custom properties outside the standard:

```json
{
  "@context": "http://www.w3.org/ns/earl#",
  "@type": "Assertion",
  "assertedBy": { "@type": "Software", "title": "Open Terms Archive Engine", "version": "11.0.0" },
  "subject": { "@type": "TestSubject", "source": "https://www.facebook.com/legal/terms" },
  "test": { "@type": "TestCriterion", "title": "Terms of Service" },
  "result": { "@type": "TestResult", "outcome": "earl:failed", "description": "CSS selector \".content\" has no match" },
  "date": "2026-03-15T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "sourceDocuments": [{ "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "snapshotId": "abc123", "mimeType": "text/html" }]
}
```

The top half (EARL vocabulary) structures the outcome, assertor, subject, and criterion. The bottom half (`date`, `runId`, `sourceDocuments`) is custom, outside the standard. And existing EARL tooling is entirely limited to WCAG accessibility evaluation (axe, Pa11y, WCAG-EM Report Tool); no generic viewer, dashboard, or aggregator exists. The RDF foundation (`@context`, `@type`, nested objects) adds structural overhead without any ecosystem to leverage in return.

Alternative storage strategies considered

Storing in the declarations repository

The declarations repository defines what to track and how to track it. Tracking results describe whether tracking works. Colocating them might seem logical, and it would mirror the current practice of attaching GitHub/GitLab issues to the declarations repository. A `tracking-results/` folder could hold one file per service/terms pair alongside a `run.json`.

This approach is not retained because the declarations repository has a fundamentally different operational model. It is designed for human contributions via pull requests: contributors add or modify declarations, reviewers approve them, and the engine reads them. The engine currently has read-only access and never commits to this repository. Adding machine-generated commits every 12 hours would require giving the engine write access, pollute the commit history and notifications for contributors who watch the repository, and mix two fundamentally different types of data: human-authored declarations and machine-generated tracking results. The repository's role as a curated, human-managed collection of declarations would be blurred.

Storing as Git trailers in snapshot and version commits

The engine already stores metadata as Git trailers in snapshot and version commits (X-engine-version, X-fetcher, X-source-document-location). Tracking results could be added as additional trailers. For failures (where no content commit exists), empty commits could carry the tracking data. A run.json equivalent could be stored as a specially formatted empty commit at the end of each run.

This approach is not retained because it requires stacking workarounds that add complexity without benefit. Trailers are flat key-value pairs, but the tracking data includes nested structures (sourceDocuments is an array of objects with 8 fields each, transientError is an object); encoding these as flat trailers would be verbose and fragile. Empty commits for failures and run summaries would clutter the repository history alongside content commits, making both harder to read. Tracking results would be dispersed across two repositories (snapshots and versions), mixing tracking data with content data in both and requiring cross-repository queries to reconstruct the tracking state of a single service. In practice, this amounts to building an ad-hoc storage subsystem inside repositories that serve a different purpose, at which point a dedicated repository is simpler and cleaner.

Storing as files in the snapshots repository

Tracking result files could be committed to the snapshots repository alongside snapshot files, using the existing write infrastructure. A run.json at the root would not interfere with existing snapshot files, which are organized in service/terms subdirectories.

This approach is not retained because the snapshots repository is already the fastest-growing repository in the system. On large collections, it has already reached size and commit count limits that required operational workarounds. Adding tracking result commits on top of snapshot commits would further increase this pressure. Beyond size concerns, it mixes two types of data with different semantics and commit patterns. Consumers that process snapshots (dataset generation, version extraction) would encounter tracking result files alongside content files. The commit history would interleave high-frequency snapshot recordings with low-frequency tracking state transitions, making both harder to navigate. And if tracking results need to be published separately (for federation, for a public dashboard) without exposing raw snapshots, this is impossible when everything is in the same repository.

Storing as a local state file

A `./data/tracking-results.json` file, not tracked in Git, would be updated during each run. The file could preserve the full history of entries (not just the current state) and be queried using a lightweight embedded database like lowdb.

This approach is not retained because it sacrifices three important properties. First, publication and federation: a Git repository can be cloned, pushed to a remote, and mirrored where a local JSON file cannot. There is no stable URL to give to a federation aggregator, no way to replicate the data without building custom synchronization. Second, auditability and data integrity: the snapshots and versions repositories are public Git repositories that anyone can clone and independently verify, and no one can silently rewrite history. A local JSON file offers no such guarantee; its content can be modified at any time without detection. Third, durability: a local file exists only on the machine that runs the engine. A disk failure, a machine recreation, or an accidental deletion destroys the entire tracking history with no possibility of recovery. A Git repository pushed to a remote is inherently replicated, and any clone serves as a full backup. Rebuilding publication, replication, durability, and integrity guarantees on top of a local file would mean reconstructing a subset of what Git already provides, while the project already has the full infrastructure to manage Git repositories.

Proposed solution

Tracking results are stored in a new dedicated Git repository managed by the engine alongside the existing snapshots and versions repositories. This follows the established pattern where each repository has a clear, single purpose: declarations define what to track, snapshots store raw content, versions store extracted content, and tracking results store whether tracking works and why it fails.

The operational cost of a fourth repository (creation, deployment, CI configuration) is real but incremental. Operators already manage three repositories per instance; a fourth follows the same pattern.

Repository structure

The tracking results repository contains one JSON file per declared service/terms pair, reflecting its current tracking status. Every declared terms has a file, whether it is tracking successfully or failing. A run.json file at the root captures run-level information.

```
tracking-results/
├── run.json
├── README.md
├── Facebook/
│   └── Terms of Service.json              ← status: failed
├── Google/
│   ├── Terms of Service.json              ← status: ok
│   └── Privacy Policy.json                ← status: ok
└── …
```

This approach stores the complete tracking history in Git. Each file is updated only when its content changes: status transitions, failure reason changes, or transient error appearances and disappearances. The Git history of each file tells the full story: when tracking first succeeded, when it failed, why it failed, when it resumed. A git checkout at any past commit gives the exact state of the entire collection at that point in time.
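
The history walk described above can be sketched in Node. To stay self-contained, the snippet builds a throwaway repository with two commits to one terms file; in a real collection, the repository location and file path would come from the instance configuration.

```javascript
// Sketch: read a terms file's Git history back as a status timeline.
// The demo repository and its two commits are fixtures for illustration.
import { execSync } from 'node:child_process';
import { mkdtempSync, mkdirSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

const repo = mkdtempSync(join(tmpdir(), 'tracking-results-'));
const git = (cmd) => execSync(`git ${cmd}`, { cwd: repo }).toString();

git('init -q');
git('config user.email demo@example.com');
git('config user.name demo');

const file = 'Facebook/Terms of Service.json';
mkdirSync(join(repo, 'Facebook'));

// Run 1: tracking succeeds
writeFileSync(join(repo, file), JSON.stringify({ status: 'ok', date: '2026-01-10T10:30:00Z' }));
git('add .');
git('commit -q -m "Track Facebook Terms of Service"');

// Run 2: tracking breaks
writeFileSync(join(repo, file), JSON.stringify({
  status: 'failed',
  date: '2026-03-15T10:30:00Z',
  reasons: ['CSS selector ".content" has no match in the document'],
}));
git('add .');
git('commit -q -m "Stop tracking Facebook Terms of Service"');

// Walk the file's commits oldest-first and collect each recorded status
const shas = git(`log --reverse --format=%H -- "${file}"`).trim().split('\n');
const timeline = shas.map((sha) => JSON.parse(git(`show ${sha}:"${file}"`)).status);
console.log(timeline); // [ 'ok', 'failed' ]
```

The same walk over `git log` recovers the full timeline of any terms file without any storage beyond the repository itself.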

File format for tracking results

Each service/terms file contains a JSON object. The content depends on the current tracking status.

When tracking is successful:

```json
{
  "status": "ok",
  "date": "2026-01-10T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Google",
  "sourceDocuments": [
    {
      "id": "terms",
      "fetch": "https://policies.google.com/terms",
      "select": ".content",
      "remove": ".banner",
      "filter": ["removeLinks"],
      "executeClientScripts": false,
      "snapshotId": "def456",
      "mimeType": "text/html"
    }
  ]
}
```

When tracking is successful but a transient error occurred and was resolved by retry:

```json
{
  "status": "ok",
  "date": "2026-01-10T10:30:00Z",
  "runId": "a1b2c3d4-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Google",
  "sourceDocuments": [
    {
      "id": "terms",
      "fetch": "https://policies.google.com/terms",
      "select": ".content",
      "remove": ".banner",
      "filter": ["removeLinks"],
      "executeClientScripts": false,
      "snapshotId": "def456",
      "mimeType": "text/html"
    }
  ],
  "transientError": {
    "date": "2026-04-07T10:30:00Z",
    "reasons": ["Fetch failed: HTTP code 503"]
  }
}
```

When tracking is failing:

```json
{
  "status": "failed",
  "date": "2026-03-15T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Facebook",
  "reasons": [
    "CSS selector \".content\" has no match in the document"
  ],
  "sourceDocuments": [
    {
      "id": "legal-terms",
      "fetch": "https://www.facebook.com/legal/terms",
      "select": ".content",
      "remove": null,
      "filter": null,
      "executeClientScripts": false,
      "snapshotId": "abc123",
      "mimeType": "text/html"
    }
  ]
}
```

| Field | Type | Present when | Description |
| --- | --- | --- | --- |
| `status` | `"ok"` or `"failed"` | Always | Current tracking status |
| `date` | ISO 8601 string | Always | Datetime when the current status started. Preserved across runs as long as the status does not change |
| `runId` | UUID v4 string | Always | Identifier of the run that last updated this file. Matches the `runId` in `run.json` |
| `transientError` | object | `ok`, only when a transient error occurred during the last run | Transient error that was resolved by retry |
| `transientError.date` | ISO 8601 string | With `transientError` | Datetime when the transient error occurred |
| `transientError.reasons` | string[] | With `transientError` | Human-readable error reasons |
| `serviceName` | string | Always | Human-readable service name (e.g., "Facebook"), as distinct from the service ID used in file paths |
| `reasons` | string[] | `failed` | Human-readable error reasons |
| `sourceDocuments` | object[] | Always | Source documents of the tracked terms, with their declaration metadata and last snapshot information. Field names follow the declaration format |
| `sourceDocuments[].id` | string | Always | Identifier of the source document, generated from its URL |
| `sourceDocuments[].fetch` | string | Always | URL of the source document |
| `sourceDocuments[].select` | string, object, or array | Always | CSS selectors for content to include |
| `sourceDocuments[].remove` | string, object, array, or null | Always | CSS selectors for content to exclude |
| `sourceDocuments[].filter` | string[] or null | Always | Names of filters applied to the content |
| `sourceDocuments[].executeClientScripts` | boolean | Always | Whether fetching requires a headless browser |
| `sourceDocuments[].snapshotId` | string or null | Always | ID of the last recorded snapshot, if any |
| `sourceDocuments[].mimeType` | string or null | Always | MIME type of the last recorded snapshot (e.g., `text/html`, `application/pdf`) |

The runId field is a UUID v4 generated at the start of each tracking run. It is written both in run.json and in every individual file that is updated during the run. In individual files, it enables correlating file changes to runs without relying on commit timestamp proximity, and detecting interrupted runs where run.json carries a runId that some individual files never received, revealing which terms were not processed. In run.json, it enables efficient API polling: a consumer can compare the runId to determine whether new data is available since its last check.

The date field records when the current status started, not when the file was last updated. For failures, if a failure persists for 22 days but the reasons change on day 10, date is preserved from day 1 and the reasons are updated; the Git history shows when the reasons changed. For successes, date records when tracking resumed after a failure or when the terms was first successfully tracked. This allows consumers to know at a glance how long a service has been in its current state without traversing Git history.
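
The date-preservation rule above can be captured in a small pure function. This is a sketch under assumed names (`nextTrackingResult` and its argument shapes are illustrative, not engine API): `date` is carried over as long as `status` is unchanged, while the rest of the entry always reflects the last run.

```javascript
// Sketch: compute the next tracking result, preserving `date` across runs
// that do not change the status. Names are illustrative, not engine code.
function nextTrackingResult(previous, current) {
  const statusUnchanged = previous && previous.status === current.status;

  return {
    ...current,
    date: statusUnchanged ? previous.date : current.date,
  };
}

// Failure persists but the reasons change: `date` stays at the original failure date
const day1 = { status: 'failed', date: '2026-03-01T10:30:00Z', reasons: ['Fetch failed: HTTP code 404'] };
const day10 = { status: 'failed', date: '2026-03-10T10:30:00Z', reasons: ['CSS selector ".content" has no match'] };
console.log(nextTrackingResult(day1, day10).date); // 2026-03-01T10:30:00Z

// Recovery: `date` resets to when tracking resumed
const day22 = { status: 'ok', date: '2026-03-22T10:30:00Z' };
console.log(nextTrackingResult(day10, day22).date); // 2026-03-22T10:30:00Z
```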

The transientError field captures errors that were likely transient (HTTP 503, DNS temporary failure, timeout, etc.) and were resolved by the engine's automatic retry mechanism. This field is present only when the last run encountered such an error; if the next run succeeds without transient error, the field is removed. This provides visibility on "fragile" services that succeed but only after retry. The Git history of the file shows when transient errors occurred and disappeared, allowing pattern detection over time.
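
The lifecycle of the field can be sketched as follows; the helper name and shapes are assumptions for illustration. The field is attached when the last run needed a retry and dropped again on the next clean run.

```javascript
// Sketch: attach `transientError` after a retried run, drop it on a clean run.
// `withTransientError` is an illustrative helper, not engine API.
function withTransientError(result, retryError) {
  if (retryError) return { ...result, transientError: retryError };
  const { transientError, ...rest } = result; // clean run: remove any leftover field
  return rest;
}

// Run succeeded, but only after a retry: the field is present
const fragile = withTransientError(
  { status: 'ok', date: '2026-01-10T10:30:00Z' },
  { date: '2026-04-07T10:30:00Z', reasons: ['Fetch failed: HTTP code 503'] }
);
console.log('transientError' in fragile); // true

// Next run succeeded without retry: the field disappears
const recovered = withTransientError(fragile, null);
console.log('transientError' in recovered); // false
```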

The sourceDocuments array is present on all entries, regardless of tracking status. Each entry contains the full declaration metadata of the source document (using the field names from the declaration format: fetch, select, remove, filter, executeClientScripts) alongside the last snapshot information (snapshotId, mimeType). This ensures that any consumer can understand not only whether tracking works, but also what is being tracked and how it is configured, without needing to read the service declaration or query the snapshots repository.

Run file

A run.json file at the repository root is updated at the end of every run, providing run-level information that cannot be derived from individual tracking result files:

```json
{
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "collectionId": "france",
  "schedule": "30 */12 * * *",
  "lastRun": {
    "startDate": "2026-04-06T10:30:00Z",
    "endDate": "2026-04-06T10:42:34Z",
    "engineVersion": "11.0.0"
  },
  "declared": {
    "services": 523,
    "terms": 1523
  },
  "tracked": {
    "ok": 1476,
    "failed": 47
  },
  "transitions": {
    "newFailures": [
      { "serviceId": "Facebook", "termsType": "Terms of Service" },
      { "serviceId": "Twitter", "termsType": "Privacy Policy" }
    ],
    "recoveries": [
      { "serviceId": "Google", "termsType": "Privacy Policy" }
    ],
    "reasonChanges": []
  },
  "transientErrors": 23
}
```

| Field | Description |
| --- | --- |
| `runId` | UUID v4 identifier for the current run, generated at the start of each run |
| `collectionId` | ID of the collection this instance tracks |
| `schedule` | Configured cron expression for tracking runs |
| `lastRun.startDate` | Datetime when the last run started |
| `lastRun.endDate` | Datetime when the last run ended |
| `lastRun.engineVersion` | Version of the engine that performed the last run |
| `declared.services` | Total number of declared services |
| `declared.terms` | Total number of declared terms across all services |
| `tracked.ok` | Number of terms currently tracking successfully |
| `tracked.failed` | Number of terms currently failing |
| `transitions.newFailures` | Array of `{ serviceId, termsType }` objects for terms that started failing in the last run |
| `transitions.recoveries` | Array of `{ serviceId, termsType }` objects for terms that recovered in the last run |
| `transitions.reasonChanges` | Array of `{ serviceId, termsType }` objects for terms whose failure reasons changed in the last run |
| `transientErrors` | Number of terms that encountered a transient error during the last run, whether or not it was resolved by retry |

This file serves several purposes. The Git history of run.json has one commit per run, so a gap in commits reveals that the tracking was not running. The tracked.ok/tracked.failed ratio at each run provides a health curve over time. The lastRun.engineVersion field allows correlating spikes in failures with engine upgrades. The collection and schedule fields allow a federation aggregator to identify and contextualize each instance.
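
The transitions summary can be derived by diffing the previous and current state of each terms entry. The following is a sketch under assumed shapes (keyed by `serviceId/termsType`); the diffing helper is illustrative, not engine code.

```javascript
// Sketch: derive run-level transitions by comparing previous and current state.
// Keys of the form "serviceId/termsType" are an assumption for illustration.
function computeTransitions(previousState, currentState) {
  const transitions = { newFailures: [], recoveries: [], reasonChanges: [] };

  for (const [key, now] of Object.entries(currentState)) {
    const before = previousState[key];
    const [serviceId, termsType] = key.split('/');
    const entry = { serviceId, termsType };

    if (before?.status === 'ok' && now.status === 'failed') {
      transitions.newFailures.push(entry);
    } else if (before?.status === 'failed' && now.status === 'ok') {
      transitions.recoveries.push(entry);
    } else if (before?.status === 'failed' && now.status === 'failed'
        && JSON.stringify(before.reasons) !== JSON.stringify(now.reasons)) {
      transitions.reasonChanges.push(entry);
    }
  }

  return transitions;
}

const previousState = {
  'Facebook/Terms of Service': { status: 'ok' },
  'Google/Privacy Policy': { status: 'failed', reasons: ['Fetch failed: HTTP code 404'] },
};
const currentState = {
  'Facebook/Terms of Service': { status: 'failed', reasons: ['CSS selector ".content" has no match'] },
  'Google/Privacy Policy': { status: 'ok' },
};
console.log(computeTransitions(previousState, currentState));
// newFailures: Facebook / Terms of Service; recoveries: Google / Privacy Policy
```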

Collection API extension

The existing collection API exposes service declarations, collection metadata, and version content over HTTP. It would be extended with the following endpoints to expose tracking results, using the existing naming conventions (plural for collections as /services, singular for specific resources as /service/:id):

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/tracking-results` | Returns the tracking status of all declared terms |
| GET | `/tracking-result/:serviceId` | Returns the tracking status of all terms for a given service |
| GET | `/tracking-result/:serviceId/:termsType` | Returns the tracking status of a specific service/terms pair |
| GET | `/tracking-results/run` | Returns the latest run information (`run.json` content) |

`GET /tracking-results` returns an array of all tracking result objects, each enriched with `serviceId` and `termsType` fields derived from the file path. It supports filtering by status (`GET /tracking-results?status=failed`). Pagination is not included in the initial design, as the `?status=failed` filter covers the most common current use case (opening issues on a forge). These endpoints always return the current state; querying tracking results at a specific date is not supported initially, but can be added later following the same approach used for the `/version` endpoint if the need arises.
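
The enrichment and filtering behaviour can be sketched as a pure function over the repository's files; the handler name and the in-memory `files` map are assumptions for illustration.

```javascript
// Sketch: enrich results with serviceId/termsType derived from the file path,
// and filter by status. Illustrative handler logic, not engine code.
function listTrackingResults(files, { status } = {}) {
  const results = Object.entries(files).map(([path, result]) => {
    // "Facebook/Terms of Service.json" → serviceId "Facebook", termsType "Terms of Service"
    const [serviceId, filename] = path.split('/');
    return { serviceId, termsType: filename.replace(/\.json$/, ''), ...result };
  });

  return status ? results.filter((result) => result.status === status) : results;
}

const files = {
  'Google/Terms of Service.json': { status: 'ok' },
  'Facebook/Terms of Service.json': {
    status: 'failed',
    reasons: ['CSS selector ".content" has no match in the document'],
  },
};
console.log(listTrackingResults(files, { status: 'failed' }));
// → one entry: Facebook / Terms of Service, with its failure reasons
```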

Response examples

`GET /tracking-results`:

```json
[
  {
    "serviceId": "Google",
    "termsType": "Terms of Service",
    "status": "ok",
    "date": "2026-01-10T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      { "id": "terms", "fetch": "https://policies.google.com/terms", "select": ".content", "remove": ".banner", "filter": ["removeLinks"], "executeClientScripts": false, "snapshotId": "def456", "mimeType": "text/html" }
    ]
  },
  {
    "serviceId": "Facebook",
    "termsType": "Terms of Service",
    "status": "failed",
    "date": "2026-03-15T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Facebook",
    "reasons": ["CSS selector \".content\" has no match in the document"],
    "sourceDocuments": [
      { "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "remove": null, "filter": null, "executeClientScripts": false, "snapshotId": "abc123", "mimeType": "text/html" }
    ]
  }
]
```

`GET /tracking-result/Google`:

```json
[
  {
    "serviceId": "Google",
    "termsType": "Terms of Service",
    "status": "ok",
    "date": "2026-01-10T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      { "id": "terms", "fetch": "https://policies.google.com/terms", "select": ".content", "remove": ".banner", "filter": ["removeLinks"], "executeClientScripts": false, "snapshotId": "def456", "mimeType": "text/html" }
    ]
  },
  {
    "serviceId": "Google",
    "termsType": "Privacy Policy",
    "status": "ok",
    "date": "2025-11-02T08:15:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      { "id": "privacy", "fetch": "https://policies.google.com/privacy", "select": ".content", "remove": null, "filter": null, "executeClientScripts": false, "snapshotId": "ghi789", "mimeType": "text/html" }
    ]
  }
]
```

`GET /tracking-result/Facebook/Terms%20of%20Service`:

```json
{
  "serviceId": "Facebook",
  "termsType": "Terms of Service",
  "status": "failed",
  "date": "2026-03-15T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Facebook",
  "reasons": ["CSS selector \".content\" has no match in the document"],
  "sourceDocuments": [
    { "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "remove": null, "filter": null, "executeClientScripts": false, "snapshotId": "abc123", "mimeType": "text/html" }
  ]
}
```

`GET /tracking-results/run`:

```json
{
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "collectionId": "france",
  "schedule": "30 */12 * * *",
  "lastRun": {
    "startDate": "2026-04-06T10:30:00Z",
    "endDate": "2026-04-06T10:42:34Z",
    "engineVersion": "11.0.0"
  },
  "declared": { "services": 523, "terms": 1523 },
  "tracked": { "ok": 1476, "failed": 47 },
  "transitions": {
    "newFailures": [{ "serviceId": "Facebook", "termsType": "Terms of Service" }],
    "recoveries": [{ "serviceId": "Google", "termsType": "Privacy Policy" }],
    "reasonChanges": []
  },
  "transientErrors": 23
}
```

The API always serves the state of the last completed run. It does this by resolving the commit SHA of the last run.json update and reading all files at that commit. While a new run is in progress and individual files are being updated with new commits, the API continues to serve the snapshot from the previous run's final commit. When the current run completes and run.json is committed, the API automatically starts serving the new state. This approach is resilient to crashes: if the engine crashes mid-run, the API keeps serving the last complete state.

Notification strategy

External connectors (GitHub issue manager, GitLab issue manager) need to know when a run completes and transitions have occurred. Push-based mechanisms were considered, including webhooks where the engine would send an HTTP POST at the end of each run, and Server-Sent Events where connectors would subscribe to a stream. Both add significant complexity in the form of payload signing, delivery retry logic, connection management, and additional configuration.

Since tracking runs occur on a fixed schedule (typically every 12 hours), connectors can simply poll GET /tracking-results/run at the same frequency. The cost of one lightweight request every 12 hours is negligible, and the runId field allows a connector to immediately determine whether new data is available since its last check. This polling approach requires no additional infrastructure in the engine.

Push-based notifications can be added later if the number of connectors grows or if lower latency is needed.
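
A connector's polling step can be sketched as follows, with the HTTP call injected so the `runId` comparison stays visible; `fetchRun` would wrap a real request to `GET /tracking-results/run`. All names here are illustrative, not engine API.

```javascript
// Sketch: act only when run.json carries a runId the connector has not seen yet.
// `fetchRun` and `onNewRun` are illustrative names, injected for testability.
async function pollOnce(fetchRun, lastSeenRunId, onNewRun) {
  const run = await fetchRun();
  if (run.runId === lastSeenRunId) return lastSeenRunId; // nothing new since last check
  await onNewRun(run); // e.g. sync forge issues from run.transitions
  return run.runId;
}

// Stubbed demonstration: the second poll sees the same runId and stays idle
const processed = [];
const run = { runId: 'f47ac10b-58cc-4372-a567-0e02b2c3d479', transitions: { newFailures: [] } };

let last = await pollOnce(async () => run, null, async (r) => processed.push(r.runId));
last = await pollOnce(async () => run, last, async (r) => processed.push(r.runId));
console.log(processed.length); // 1
```

Scheduling this once per tracking interval (e.g. with cron) is the only infrastructure a connector needs.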

Deployment

Each instance would create a new repository following the naming convention established for other repositories: {instance_name}-tracking-results. For example, the demo instance would have:

  • demo-declarations (existing)
  • demo-versions (existing)
  • demo-snapshots (existing)
  • demo-tracking-results (new)

The deployment configuration would be updated to include the new repository path and remote, following the same pattern as versions and snapshots.

Instances can adopt this incrementally: the tracking results repository can be created and configured at any time. On the first run after configuration, the engine populates it with the initial state of all declared terms and a first `run.json`. No migration of historical data from GitHub/GitLab issues is necessary; the repository starts recording from the first run onwards. Historical issue data remains accessible on the forges for reference.

Feedback expectations

We invite you to provide feedback to:

  • Point out any limitations or edge cases you see in the proposed implementation.
  • Suggest improvements or refinements to the current proposal.
  • Share any alternative approaches you believe could address the problem more effectively.

If you support the proposal as it stands, please react with a 👍 or leave a positive comment.

Please provide your feedback by April 29th.
