Tracking results storage and exposure
Context
The engine periodically tracks whether online documents are accessible and extractable. For each service/terms pair, the outcome is either success (content fetched and version recorded) or failure (content inaccessible or extraction broken). When tracking fails, the Reporter module creates an issue on a third-party software forge (GitHub or GitLab). When tracking resumes, it closes the issue. These forge issues are the sole persistent record of tracking results.
Problem statement
Tracking results have no local persistence. Even when everything works, the persistence of tracking data depends entirely on an external forge. If the forge is unreachable or no API token is configured, failures are logged but not persisted, and the information is lost once the process ends. There is no way to query the current tracking status of a service without calling an external forge API.
This RFC proposes to introduce a local storage layer that persists tracking results independently of any external service, and to expose them through the collection API. This storage becomes the single source of truth for tracking status. External integrations (such as the current GitHub and GitLab issue managers) would then be extracted into separate modules that consume tracking results through this API. The active consumption of this data (creating issues, sending notifications) is out of scope.
Alternatives considered
Two existing standards and four alternative storage strategies were evaluated before arriving at the proposed design. None was retained. A dedicated Git repository with a custom JSON format was chosen instead.
Existing standards considered
TAP — Test Anything Protocol
TAP is a text-based protocol for reporting test results. The analogy with tracking is genuine: for each service/terms pair, the engine verifies that content is accessible and extractable, producing a pass or fail result.
TAP is not retained for two reasons. First, the standard only structures a small fraction of the data: TAP provides a binary ok/not ok outcome, but the engine needs to persist richer information (error reasons, source document metadata, snapshot references, timing data, transient errors). All of this must go into free-form YAML diagnostic blocks, outside the standard's structure:
TAP version 14
1..1
not ok 1 - Facebook · Terms of Service
  ---
  date: "2026-03-15T10:30:00Z"
  runId: "f47ac10b-58cc-4372-a567-0e02b2c3d479"
  reasons:
    - "CSS selector \".content\" has no match in the document"
  sourceDocuments:
    - id: legal-terms
      fetch: "https://www.facebook.com/legal/terms"
      select: ".content"
      snapshotId: abc123
      mimeType: text/html
  ...
Only the not ok line is structured by TAP. Everything in the ---/... block is free-form YAML that no TAP tool can interpret. Second, existing TAP tools (tap-parser, tap-mocha-reporter) are designed for CI/CD terminal output, not for the consumers needed here (issue sync, API endpoints, federation aggregation). Since the standard neither structures the data nor provides usable tooling, adopting it would impose format constraints without any corresponding benefit.
EARL — Evaluation and Report Language
EARL is a W3C vocabulary for expressing test results. The semantic fit is the strongest: its concepts of Assertion, TestSubject, TestCriterion, and five-valued OutcomeValue map naturally to the tracking domain.
EARL is not retained for the same two reasons. The EARL vocabulary covers the outcome, the assertor (engine version), the test subject (URL), and the test criterion (terms type), but everything else must be added as custom properties outside the standard:
{
  "@context": "http://www.w3.org/ns/earl#",
  "@type": "Assertion",
  "assertedBy": { "@type": "Software", "title": "Open Terms Archive Engine", "version": "11.0.0" },
  "subject": { "@type": "TestSubject", "source": "https://www.facebook.com/legal/terms" },
  "test": { "@type": "TestCriterion", "title": "Terms of Service" },
  "result": { "@type": "TestResult", "outcome": "earl:failed", "description": "CSS selector \".content\" has no match" },
  "date": "2026-03-15T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "sourceDocuments": [{ "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "snapshotId": "abc123", "mimeType": "text/html" }]
}
The top half (EARL vocabulary) structures the outcome, assertor, subject, and criterion. The bottom half (date, runId, sourceDocuments) is custom, outside the standard. And existing EARL tooling is entirely limited to WCAG accessibility evaluation (axe, Pa11y, WCAG-EM Report Tool); no generic viewer, dashboard, or aggregator exists. The RDF foundation (@context, @type, nested objects) adds structural overhead without any ecosystem to leverage in return.
Alternative storage strategies considered
Storing in the declarations repository
The declarations repository defines what to track and how to track them. Tracking results describe whether tracking works. Colocating them could seem logical, and it mirrors the current practice of attaching GitHub/GitLab issues to the declarations repository. A tracking-results/ folder could hold one file per service/terms pair alongside a run.json.
This approach is not retained because the declarations repository has a fundamentally different operational model. It is designed for human contributions via pull requests: contributors add or modify declarations, reviewers approve them, and the engine reads them. The engine currently has read-only access and never commits to this repository. Adding machine-generated commits every 12 hours would require giving the engine write access, pollute the commit history and notifications for contributors who watch the repository, and mix two fundamentally different types of data: human-authored declarations and machine-generated tracking results. The repository's role as a curated, human-managed collection of declarations would be blurred.
Storing as Git trailers in snapshot and version commits
The engine already stores metadata as Git trailers in snapshot and version commits (X-engine-version, X-fetcher, X-source-document-location). Tracking results could be added as additional trailers. For failures (where no content commit exists), empty commits could carry the tracking data. A run.json equivalent could be stored as a specially formatted empty commit at the end of each run.
This approach is not retained because it requires stacking workarounds that add complexity without benefit. Trailers are flat key-value pairs, but the tracking data includes nested structures (sourceDocuments is an array of objects with 8 fields each, transientError is an object); encoding these as flat trailers would be verbose and fragile. Empty commits for failures and run summaries would clutter the repository history alongside content commits, making both harder to read. Tracking results would be dispersed across two repositories (snapshots and versions), mixing tracking data with content data in both and requiring cross-repository queries to reconstruct the tracking state of a single service. In practice, this amounts to building an ad-hoc storage subsystem inside repositories that serve a different purpose, at which point a dedicated repository is simpler and cleaner.
Storing as files in the snapshots repository
Tracking result files could be committed to the snapshots repository alongside snapshot files, using the existing write infrastructure. A run.json at the root would not interfere with existing snapshot files, which are organized in service/terms subdirectories.
This approach is not retained because the snapshots repository is already the fastest-growing repository in the system. On large collections, it has already reached size and commit count limits that required operational workarounds. Adding tracking result commits on top of snapshot commits would further increase this pressure. Beyond size concerns, it mixes two types of data with different semantics and commit patterns. Consumers that process snapshots (dataset generation, version extraction) would encounter tracking result files alongside content files. The commit history would interleave high-frequency snapshot recordings with low-frequency tracking state transitions, making both harder to navigate. And if tracking results need to be published separately (for federation, for a public dashboard) without exposing raw snapshots, this is impossible when everything is in the same repository.
Storing as a local state file
A ./data/tracking-results.json file, not tracked in Git, would be updated during each run. The file could preserve the full history of entries (not just the current state) and be queried using a lightweight embedded database such as lowdb.
This approach is not retained because it sacrifices three important properties. First, publication and federation: a Git repository can be cloned, pushed to a remote, and mirrored where a local JSON file cannot. There is no stable URL to give to a federation aggregator, no way to replicate the data without building custom synchronization. Second, auditability and data integrity: the snapshots and versions repositories are public Git repositories that anyone can clone and independently verify, and no one can silently rewrite history. A local JSON file offers no such guarantee; its content can be modified at any time without detection. Third, durability: a local file exists only on the machine that runs the engine. A disk failure, a machine recreation, or an accidental deletion destroys the entire tracking history with no possibility of recovery. A Git repository pushed to a remote is inherently replicated, and any clone serves as a full backup. Rebuilding publication, replication, durability, and integrity guarantees on top of a local file would mean reconstructing a subset of what Git already provides, while the project already has the full infrastructure to manage Git repositories.
Proposed solution
Tracking results are stored in a new dedicated Git repository managed by the engine alongside the existing snapshots and versions repositories. This follows the established pattern where each repository has a clear, single purpose: declarations define what to track, snapshots store raw content, versions store extracted content, and tracking results store whether tracking works and why it fails.
The operational cost of a fourth repository (creation, deployment, CI configuration) is real but incremental. Operators already manage three repositories per instance; a fourth follows the same pattern.
Repository structure
The tracking results repository contains one JSON file per declared service/terms pair, reflecting its current tracking status. Every declared terms has a file, whether it is tracking successfully or failing. A run.json file at the root captures run-level information.
tracking-results/
├── run.json
├── README.md
├── Facebook/
│   └── Terms of Service.json   ← status: failed
├── Google/
│   ├── Terms of Service.json   ← status: ok
│   └── Privacy Policy.json     ← status: ok
└── …
This approach stores the complete tracking history in Git. Each file is updated only when its content changes: status transitions, failure reason changes, or transient error appearances and disappearances. The Git history of each file tells the full story: when tracking first succeeded, when it failed, why it failed, when it resumed. A git checkout at any past commit gives the exact state of the entire collection at that point in time.
File format for tracking results
Each service/terms file contains a JSON object. The content depends on the current tracking status.
When tracking is successful:
{
  "status": "ok",
  "date": "2026-01-10T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Google",
  "sourceDocuments": [
    {
      "id": "terms",
      "fetch": "https://policies.google.com/terms",
      "select": ".content",
      "remove": ".banner",
      "filter": ["removeLinks"],
      "executeClientScripts": false,
      "snapshotId": "def456",
      "mimeType": "text/html"
    }
  ]
}
When tracking is successful but a transient error occurred and was resolved by retry:
{
  "status": "ok",
  "date": "2026-01-10T10:30:00Z",
  "runId": "a1b2c3d4-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Google",
  "sourceDocuments": [
    {
      "id": "terms",
      "fetch": "https://policies.google.com/terms",
      "select": ".content",
      "remove": ".banner",
      "filter": ["removeLinks"],
      "executeClientScripts": false,
      "snapshotId": "def456",
      "mimeType": "text/html"
    }
  ],
  "transientError": {
    "date": "2026-04-07T10:30:00Z",
    "reasons": ["Fetch failed: HTTP code 503"]
  }
}
When tracking is failing:
{
  "status": "failed",
  "date": "2026-03-15T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Facebook",
  "reasons": [
    "CSS selector \".content\" has no match in the document"
  ],
  "sourceDocuments": [
    {
      "id": "legal-terms",
      "fetch": "https://www.facebook.com/legal/terms",
      "select": ".content",
      "remove": null,
      "filter": null,
      "executeClientScripts": false,
      "snapshotId": "abc123",
      "mimeType": "text/html"
    }
  ]
}
| Field | Type | Present when | Description |
|-------|------|--------------|-------------|
| `status` | `"ok"` or `"failed"` | Always | Current tracking status |
| `date` | ISO 8601 string | Always | Datetime when the current status started. Preserved across runs as long as the status does not change |
| `runId` | UUID v4 string | Always | Identifier of the run that last updated this file. Matches the `runId` in `run.json` |
| `transientError` | object | `ok`, only when a transient error occurred during the last run | Transient error that was resolved by retry |
| `transientError.date` | ISO 8601 string | With `transientError` | Datetime when the transient error occurred |
| `transientError.reasons` | string[] | With `transientError` | Human-readable error reasons |
| `serviceName` | string | Always | Human-readable service name (e.g., "Facebook"), as distinct from the service ID used in file paths |
| `reasons` | string[] | `failed` | Human-readable error reasons |
| `sourceDocuments` | object[] | Always | Source documents of the tracked terms, with their declaration metadata and last snapshot information. Field names follow the declaration format |
| `sourceDocuments[].id` | string | Always | Identifier of the source document, generated from its URL |
| `sourceDocuments[].fetch` | string | Always | URL of the source document |
| `sourceDocuments[].select` | string, object, or array | Always | CSS selectors for content to include |
| `sourceDocuments[].remove` | string, object, array, or `null` | Always | CSS selectors for content to exclude |
| `sourceDocuments[].filter` | string[] or `null` | Always | Names of filters applied to the content |
| `sourceDocuments[].executeClientScripts` | boolean | Always | Whether fetching requires a headless browser |
| `sourceDocuments[].snapshotId` | string or `null` | Always | ID of the last recorded snapshot, if any |
| `sourceDocuments[].mimeType` | string or `null` | Always | MIME type of the last recorded snapshot (e.g., `text/html`, `application/pdf`) |
The runId field is a UUID v4 generated at the start of each tracking run. It is written both in run.json and in every individual file that is updated during the run. In individual files, it enables correlating file changes to runs without relying on commit timestamp proximity, and detecting interrupted runs where run.json carries a runId that some individual files never received, revealing which terms were not processed. In run.json, it enables efficient API polling: a consumer can compare the runId to determine whether new data is available since its last check.
The date field records when the current status started, not when the file was last updated. For failures, if a failure persists for 22 days but the reasons change on day 10, date is preserved from day 1 and the reasons are updated; the Git history shows when the reasons changed. For successes, date records when tracking resumed after a failure or when the terms was first successfully tracked. This allows consumers to know at a glance how long a service has been in its current state without traversing Git history.
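The preservation rule for `date` can be expressed as a small pure function applied when writing a new result. This is an illustrative sketch, not the engine's actual code; the function name is hypothetical:

```javascript
// Hypothetical helper: compute the file content to write for a terms, given
// the previously stored result (if any) and the result of the current run.
// `date` is carried over as long as the status is unchanged, even when other
// fields (such as `reasons`) are updated.
function nextTrackingResult(previous, current) {
  if (previous && previous.status === current.status) {
    return { ...current, date: previous.date }; // same status: keep the date it started
  }
  return current; // status transition (or first run): the new date marks the change
}
```

Under this rule, a failure whose reasons change on day 10 keeps its day-1 `date`, while the updated `reasons` produce a new commit in the file's history.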
The transientError field captures errors that were likely transient (HTTP 503, DNS temporary failure, timeout, etc.) and were resolved by the engine's automatic retry mechanism. This field is present only when the last run encountered such an error; if the next run succeeds without transient error, the field is removed. This provides visibility on "fragile" services that succeed but only after retry. The Git history of the file shows when transient errors occurred and disappeared, allowing pattern detection over time.
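The add-then-remove lifecycle of `transientError` can be sketched the same way (again a hypothetical helper, not the engine's API):

```javascript
// Hypothetical helper: attach the transient error observed during the last
// run, or strip the field entirely when the run completed without retries.
function recordTransientError(result, transientError) {
  if (transientError) {
    return { ...result, transientError }; // retry was needed: keep a trace of it
  }
  const { transientError: _removed, ...withoutTransientError } = result;
  return withoutTransientError; // clean run: the field disappears from the file
}
```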
The sourceDocuments array is present on all entries, regardless of tracking status. Each entry contains the full declaration metadata of the source document (using the field names from the declaration format: fetch, select, remove, filter, executeClientScripts) alongside the last snapshot information (snapshotId, mimeType). This ensures that any consumer can understand not only whether tracking works, but also what is being tracked and how it is configured, without needing to read the service declaration or query the snapshots repository.
Run file
A run.json file at the repository root is updated at the end of every run, providing run-level information that cannot be derived from individual tracking result files:
{
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "collectionId": "france",
  "schedule": "30 */12 * * *",
  "lastRun": {
    "startDate": "2026-04-06T10:30:00Z",
    "endDate": "2026-04-06T10:42:34Z",
    "engineVersion": "11.0.0"
  },
  "declared": {
    "services": 523,
    "terms": 1523
  },
  "tracked": {
    "ok": 1476,
    "failed": 47
  },
  "transitions": {
    "newFailures": [
      { "serviceId": "Facebook", "termsType": "Terms of Service" },
      { "serviceId": "Twitter", "termsType": "Privacy Policy" }
    ],
    "recoveries": [
      { "serviceId": "Google", "termsType": "Privacy Policy" }
    ],
    "reasonChanges": []
  },
  "transientErrors": 23
}
| Field | Description |
|-------|-------------|
| `runId` | UUID v4 identifier for the current run, generated at the start of each run |
| `collectionId` | ID of the collection this instance tracks |
| `schedule` | Configured cron expression for tracking runs |
| `lastRun.startDate` | Datetime when the last run started |
| `lastRun.endDate` | Datetime when the last run ended |
| `lastRun.engineVersion` | Version of the engine that performed the last run |
| `declared.services` | Total number of declared services |
| `declared.terms` | Total number of declared terms across all services |
| `tracked.ok` | Number of terms currently tracking successfully |
| `tracked.failed` | Number of terms currently failing |
| `transitions.newFailures` | Array of `{ serviceId, termsType }` objects for terms that started failing in the last run |
| `transitions.recoveries` | Array of `{ serviceId, termsType }` objects for terms that recovered in the last run |
| `transitions.reasonChanges` | Array of `{ serviceId, termsType }` objects for terms whose failure reasons changed in the last run |
| `transientErrors` | Number of terms that encountered a transient error during the last run, whether or not it was resolved by retry |
This file serves several purposes. The Git history of run.json has one commit per run, so a gap in commits reveals that the tracking was not running. The tracked.ok/tracked.failed ratio at each run provides a health curve over time. The lastRun.engineVersion field allows correlating spikes in failures with engine upgrades. The collection and schedule fields allow a federation aggregator to identify and contextualize each instance.
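As a sketch of how the `transitions` block could be derived, the following hypothetical function compares the previous and current state of all tracking results, keyed by `serviceId/termsType` strings (the names and input shapes are illustrative, not the engine's API):

```javascript
// Hypothetical helper: derive the `transitions` block of run.json by
// diffing two maps of tracking results keyed by "serviceId/termsType".
function computeTransitions(previous, current) {
  const transitions = { newFailures: [], recoveries: [], reasonChanges: [] };
  for (const [key, result] of Object.entries(current)) {
    const before = previous[key];
    const [serviceId, termsType] = key.split('/');
    const entry = { serviceId, termsType };
    if (result.status === 'failed' && (!before || before.status === 'ok')) {
      transitions.newFailures.push(entry); // was ok (or new), now failing
    } else if (result.status === 'ok' && before && before.status === 'failed') {
      transitions.recoveries.push(entry); // was failing, now ok
    } else if (result.status === 'failed' && before && before.status === 'failed'
               && JSON.stringify(before.reasons) !== JSON.stringify(result.reasons)) {
      transitions.reasonChanges.push(entry); // still failing, but differently
    }
  }
  return transitions;
}
```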
Collection API extension
The existing collection API exposes service declarations, collection metadata, and version content over HTTP. It would be extended with the following endpoints to expose tracking results, using the existing naming conventions (plural for collections as /services, singular for specific resources as /service/:id):
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/tracking-results` | Returns the tracking status of all declared terms |
| GET | `/tracking-result/:serviceId` | Returns the tracking status of all terms for a given service |
| GET | `/tracking-result/:serviceId/:termsType` | Returns the tracking status of a specific service/terms pair |
| GET | `/tracking-results/run` | Returns the latest run information (`run.json` content) |
GET /tracking-results returns an array of all tracking result objects, each enriched with serviceId and termsType fields derived from the file path. It supports filtering by status (GET /tracking-results?status=failed). Pagination is not included in the initial design, as the ?status=failed filter covers the most common current use case (opening issues on a forge). These endpoints always return the current state; querying tracking results at a specific date is not supported initially, but can be added later following the same approach used for the /version endpoint if the need arises.
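A sketch of the enrichment and filtering described above, assuming tracking results have been loaded as a map from file path to parsed JSON (the function name and input shape are illustrative):

```javascript
// Hypothetical handler core for GET /tracking-results: derive serviceId and
// termsType from each file path, then apply the optional ?status filter.
function listTrackingResults(files, { status } = {}) {
  const results = Object.entries(files).map(([filePath, content]) => {
    const [serviceId, fileName] = filePath.split('/');
    return { serviceId, termsType: fileName.replace(/\.json$/, ''), ...content };
  });
  return status ? results.filter((result) => result.status === status) : results;
}
```

A request for `GET /tracking-results?status=failed` would then reduce to `listTrackingResults(files, { status: 'failed' })`.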
Response examples
GET /tracking-results:
[
  {
    "serviceId": "Google",
    "termsType": "Terms of Service",
    "status": "ok",
    "date": "2026-01-10T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      { "id": "terms", "fetch": "https://policies.google.com/terms", "select": ".content", "remove": ".banner", "filter": ["removeLinks"], "executeClientScripts": false, "snapshotId": "def456", "mimeType": "text/html" }
    ]
  },
  {
    "serviceId": "Facebook",
    "termsType": "Terms of Service",
    "status": "failed",
    "date": "2026-03-15T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Facebook",
    "reasons": ["CSS selector \".content\" has no match in the document"],
    "sourceDocuments": [
      { "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "remove": null, "filter": null, "executeClientScripts": false, "snapshotId": "abc123", "mimeType": "text/html" }
    ]
  }
]
GET /tracking-result/Google:
[
  {
    "serviceId": "Google",
    "termsType": "Terms of Service",
    "status": "ok",
    "date": "2026-01-10T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      { "id": "terms", "fetch": "https://policies.google.com/terms", "select": ".content", "remove": ".banner", "filter": ["removeLinks"], "executeClientScripts": false, "snapshotId": "def456", "mimeType": "text/html" }
    ]
  },
  {
    "serviceId": "Google",
    "termsType": "Privacy Policy",
    "status": "ok",
    "date": "2025-11-02T08:15:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      { "id": "privacy", "fetch": "https://policies.google.com/privacy", "select": ".content", "remove": null, "filter": null, "executeClientScripts": false, "snapshotId": "ghi789", "mimeType": "text/html" }
    ]
  }
]
GET /tracking-result/Facebook/Terms%20of%20Service:
{
  "serviceId": "Facebook",
  "termsType": "Terms of Service",
  "status": "failed",
  "date": "2026-03-15T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Facebook",
  "reasons": ["CSS selector \".content\" has no match in the document"],
  "sourceDocuments": [
    { "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "remove": null, "filter": null, "executeClientScripts": false, "snapshotId": "abc123", "mimeType": "text/html" }
  ]
}
GET /tracking-results/run:
{
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "collectionId": "france",
  "schedule": "30 */12 * * *",
  "lastRun": {
    "startDate": "2026-04-06T10:30:00Z",
    "endDate": "2026-04-06T10:42:34Z",
    "engineVersion": "11.0.0"
  },
  "declared": { "services": 523, "terms": 1523 },
  "tracked": { "ok": 1476, "failed": 47 },
  "transitions": {
    "newFailures": [{ "serviceId": "Facebook", "termsType": "Terms of Service" }],
    "recoveries": [{ "serviceId": "Google", "termsType": "Privacy Policy" }],
    "reasonChanges": []
  },
  "transientErrors": 23
}
The API always serves the state of the last completed run. It does this by resolving the commit SHA of the last run.json update and reading all files at that commit. While a new run is in progress and individual files are being updated with new commits, the API continues to serve the snapshot from the previous run's final commit. When the current run completes and run.json is committed, the API automatically starts serving the new state. This approach is resilient to crashes: if the engine crashes mid-run, the API keeps serving the last complete state.
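The commit-resolution rule can be sketched as follows, assuming the commit log is available newest-first along with the list of files each commit touched (the shapes are illustrative):

```javascript
// Hypothetical helper: pick the commit whose tree the API should serve.
// `log` is ordered newest first, as produced by `git log --name-only`.
// The most recent commit that updated run.json marks the last completed run.
function commitToServe(log) {
  const lastRunCommit = log.find((commit) => commit.files.includes('run.json'));
  return lastRunCommit ? lastRunCommit.sha : null; // null: no completed run yet
}
```

Commits made after the serving commit (mid-run updates to individual files) are simply invisible to the API until a new `run.json` commit lands.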
Notification strategy
External connectors (GitHub issue manager, GitLab issue manager) need to know when a run completes and transitions have occurred. Push-based mechanisms were considered, including webhooks where the engine would send an HTTP POST at the end of each run, and Server-Sent Events where connectors would subscribe to a stream. Both add significant complexity in the form of payload signing, delivery retry logic, connection management, and additional configuration.
Since tracking runs occur on a fixed schedule (typically every 12 hours), connectors can simply poll GET /tracking-results/run at the same frequency. The cost of one lightweight request every 12 hours is negligible, and the runId field allows a connector to immediately determine whether new data is available since its last check. This polling approach requires no additional infrastructure in the engine.
Push-based notifications can be added later if the number of connectors grows or if lower latency is needed.
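A minimal polling sketch for a connector, with the HTTP call injected as a function so the example stays self-contained; in practice it would be an HTTP GET of `/tracking-results/run` (the function names are illustrative):

```javascript
// Hypothetical connector-side check: fetch the latest run information and
// compare its runId with the last one processed.
async function checkForNewRun(fetchRunInfo, lastSeenRunId) {
  const { runId } = await fetchRunInfo(); // e.g. GET /tracking-results/run
  return { hasNewData: runId !== lastSeenRunId, runId };
}
```

A connector scheduled at the same cadence as the engine (every 12 hours) would store the returned `runId` and skip processing when `hasNewData` is false.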
Deployment
Each instance would create a new repository following the naming convention established for other repositories: {instance_name}-tracking-results. For example, the demo instance would have:
- `demo-declarations` (existing)
- `demo-versions` (existing)
- `demo-snapshots` (existing)
- `demo-tracking-results` (new)
The deployment configuration would be updated to include the new repository path and remote, following the same pattern as versions and snapshots.
Instances can be deployed incrementally: the tracking results repository can be created and configured at any time. On the first run after configuration, the engine will populate it with the initial state of all declared terms and a first run.json. No migration of historical data from GitHub/GitLab issues is necessary; the repository starts recording from the first run onwards. Historical issue data remains accessible on the forges for reference.
Feedback expectations
We invite you to provide feedback to:
- Point out any limitations or edge cases you see in the proposed implementation.
- Suggest improvements or refinements to the current proposal.
- Share any alternative approaches you believe could address the problem more effectively.
If you support the proposal as it stands, please react with a 👍 or leave a positive comment.
Please provide your feedback by April 29th.
Tracking results storage and exposure
Context
The engine periodically tracks whether online documents are accessible and extractable. For each service/terms pair, the outcome is either success (content fetched and version recorded) or failure (content inaccessible or extraction broken). When tracking fails, the
Reportermodule creates an issue on a third-party software forge (GitHub or GitLab). When tracking resumes, it closes the issue. These forge issues are the sole persistent record of tracking results.Problem statement
Tracking results have no local persistence. Even when everything works, the persistence of tracking data depends entirely on an external forge. If the forge is unreachable or no API token is configured, failures are logged but not persisted, and the information is lost once the process ends. There is no way to query the current tracking status of a service without calling an external forge API.
This RFC proposes to introduce a local storage layer that persists tracking results independently of any external service, and to expose them through the collection API. This storage becomes the single source of truth for tracking status. External integrations (such as the current GitHub and GitLab issue managers) would then be extracted into separate modules that consume tracking results through this API. The active consumption of this data (creating issues, sending notifications) is out of scope.
Alternatives considered
Two existing standards and four alternative storage strategies were evaluated before arriving at the proposed design. None was retained. A dedicated Git repository with a custom JSON format was chosen instead.
Existing standards considered
TAP — Test Anything Protocol
TAP is a text-based protocol for reporting test results. The analogy with tracking is genuine: for each service/terms pair, the engine verifies that content is accessible and extractable, producing a pass or fail result.
TAP is not retained for two reasons. First, the standard only structures a small fraction of the data: TAP provides a binary
ok/not okoutcome, but the engine needs to persist richer information (error reasons, source document metadata, snapshot references, timing data, transient errors). All of this must go into free-form YAML diagnostic blocks, outside the standard's structure:Only the
not okline is structured by TAP. Everything in the---/...block is free-form YAML that no TAP tool can interpret. Second, existing TAP tools (tap-parser,tap-mocha-reporter) are designed for CI/CD terminal output, not for the consumers needed here (issue sync, API endpoints, federation aggregation). Since the standard neither structures the data nor provides usable tooling, adopting it would impose format constraints without any corresponding benefit.EARL — Evaluation and Report Language
EARL is a W3C vocabulary for expressing test results. The semantic fit is the strongest: its concepts of
Assertion,TestSubject,TestCriterion, and five-valuedOutcomeValuemap naturally to the tracking domain.EARL is not retained for the same two reasons. The EARL vocabulary covers the outcome, the assertor (engine version), the test subject (URL), and the test criterion (terms type), but everything else must be added as custom properties outside the standard:
{ "@context": "http://www.w3.org/ns/earl#", "@type": "Assertion", "assertedBy": { "@type": "Software", "title": "Open Terms Archive Engine", "version": "11.0.0" }, "subject": { "@type": "TestSubject", "source": "https://www.facebook.com/legal/terms" }, "test": { "@type": "TestCriterion", "title": "Terms of Service" }, "result": { "@type": "TestResult", "outcome": "earl:failed", "description": "CSS selector \".content\" has no match" }, "date": "2026-03-15T10:30:00Z", "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "sourceDocuments": [{ "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "snapshotId": "abc123", "mimeType": "text/html" }] }The top half (EARL vocabulary) structures the outcome, assertor, subject, and criterion. The bottom half (
date,runId,sourceDocuments) is custom, outside the standard. And existing EARL tooling is entirely limited to WCAG accessibility evaluation (axe, Pa11y, WCAG-EM Report Tool); no generic viewer, dashboard, or aggregator exists. The RDF foundation (@context,@type, nested objects) adds structural overhead without any ecosystem to leverage in return.Alternative storage strategies considered
Storing in the declarations repository
The declarations repository defines what to track and how to track them. Tracking results describe whether tracking works. Colocating them could seem logical, and it mirrors the current practice of attaching GitHub/GitLab issues to the declarations repository. A
tracking-results/folder could hold one file per service/terms pair alongside arun.json.This approach is not retained because the declarations repository has a fundamentally different operational model. It is designed for human contributions via pull requests: contributors add or modify declarations, reviewers approve them, and the engine reads them. The engine currently has read-only access and never commits to this repository. Adding machine-generated commits every 12 hours would require giving the engine write access, pollute the commit history and notifications for contributors who watch the repository, and mix two fundamentally different types of data: human-authored declarations and machine-generated tracking results. The repository's role as a curated, human-managed collection of declarations would be blurred.
Storing as Git trailers in snapshot and version commits
The engine already stores metadata as Git trailers in snapshot and version commits (`X-engine-version`, `X-fetcher`, `X-source-document-location`). Tracking results could be added as additional trailers. For failures (where no content commit exists), empty commits could carry the tracking data. A `run.json` equivalent could be stored as a specially formatted empty commit at the end of each run.

This approach is not retained because it requires stacking workarounds that add complexity without benefit. Trailers are flat key-value pairs, but the tracking data includes nested structures (`sourceDocuments` is an array of objects with 8 fields each, `transientError` is an object); encoding these as flat trailers would be verbose and fragile. Empty commits for failures and run summaries would clutter the repository history alongside content commits, making both harder to read. Tracking results would be dispersed across two repositories (snapshots and versions), mixing tracking data with content data in both and requiring cross-repository queries to reconstruct the tracking state of a single service. In practice, this amounts to building an ad-hoc storage subsystem inside repositories that serve a different purpose, at which point a dedicated repository is simpler and cleaner.

Storing as files in the snapshots repository
Tracking result files could be committed to the snapshots repository alongside snapshot files, using the existing write infrastructure. A `run.json` at the root would not interfere with existing snapshot files, which are organized in service/terms subdirectories.

This approach is not retained because the snapshots repository is already the fastest-growing repository in the system. On large collections, it has already reached size and commit count limits that required operational workarounds. Adding tracking result commits on top of snapshot commits would further increase this pressure. Beyond size concerns, it mixes two types of data with different semantics and commit patterns. Consumers that process snapshots (dataset generation, version extraction) would encounter tracking result files alongside content files. The commit history would interleave high-frequency snapshot recordings with low-frequency tracking state transitions, making both harder to navigate. And if tracking results need to be published separately (for federation, for a public dashboard) without exposing raw snapshots, this is impossible when everything is in the same repository.
Storing as a local state file
A `./data/tracking-results.json` file, not Git-tracked, updated during each run. The file could preserve the full history of entries (not just the current state) and be queried using a lightweight embedded database like lowdb.

This approach is not retained because it sacrifices three important properties. First, publication and federation: a Git repository can be cloned, pushed to a remote, and mirrored, where a local JSON file cannot. There is no stable URL to give to a federation aggregator, no way to replicate the data without building custom synchronization. Second, auditability and data integrity: the snapshots and versions repositories are public Git repositories that anyone can clone and independently verify, and no one can silently rewrite history. A local JSON file offers no such guarantee; its content can be modified at any time without detection. Third, durability: a local file exists only on the machine that runs the engine. A disk failure, a machine recreation, or an accidental deletion destroys the entire tracking history with no possibility of recovery. A Git repository pushed to a remote is inherently replicated, and any clone serves as a full backup. Rebuilding publication, replication, durability, and integrity guarantees on top of a local file would mean reconstructing a subset of what Git already provides, while the project already has the full infrastructure to manage Git repositories.
Proposed solution
Tracking results are stored in a new dedicated Git repository managed by the engine alongside the existing snapshots and versions repositories. This follows the established pattern where each repository has a clear, single purpose: declarations define what to track, snapshots store raw content, versions store extracted content, and tracking results store whether tracking works and why it fails.
The operational cost of a fourth repository (creation, deployment, CI configuration) is real but incremental. Operators already manage three repositories per instance; a fourth follows the same pattern.
Repository structure
The tracking results repository contains one JSON file per declared service/terms pair, reflecting its current tracking status. Every declared terms has a file, whether tracking is succeeding or failing. A `run.json` file at the root captures run-level information.

This approach stores the complete tracking history in Git. Each file is updated only when its content changes: status transitions, failure reason changes, or transient error appearances and disappearances. The Git history of each file tells the full story: when tracking first succeeded, when it failed, why it failed, when it resumed. A `git checkout` at any past commit gives the exact state of the entire collection at that point in time.

File format for tracking results
Each service/terms file contains a JSON object. The content depends on the current tracking status.
When tracking is successful:
{ "status": "ok", "date": "2026-01-10T10:30:00Z", "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "serviceName": "Google", "sourceDocuments": [ { "id": "terms", "fetch": "https://policies.google.com/terms", "select": ".content", "remove": ".banner", "filter": ["removeLinks"], "executeClientScripts": false, "snapshotId": "def456", "mimeType": "text/html" } ] }When tracking is successful but a transient error occurred and was resolved by retry:
{ "status": "ok", "date": "2026-01-10T10:30:00Z", "runId": "a1b2c3d4-58cc-4372-a567-0e02b2c3d479", "serviceName": "Google", "sourceDocuments": [ { "id": "terms", "fetch": "https://policies.google.com/terms", "select": ".content", "remove": ".banner", "filter": ["removeLinks"], "executeClientScripts": false, "snapshotId": "def456", "mimeType": "text/html" } ], "transientError": { "date": "2026-04-07T10:30:00Z", "reasons": ["Fetch failed: HTTP code 503"] } }When tracking is failing:
{ "status": "failed", "date": "2026-03-15T10:30:00Z", "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "serviceName": "Facebook", "reasons": [ "CSS selector \".content\" has no match in the document" ], "sourceDocuments": [ { "id": "legal-terms", "fetch": "https://www.facebook.com/legal/terms", "select": ".content", "remove": null, "filter": null, "executeClientScripts": false, "snapshotId": "abc123", "mimeType": "text/html" } ] }status"ok"or"failed"daterunIdrunIdinrun.jsontransientErrorok, only when a transient error occurred during the last runtransientError.datetransientErrortransientError.reasonstransientErrorserviceNamereasonsfailedsourceDocumentssourceDocuments[].idsourceDocuments[].fetchsourceDocuments[].selectsourceDocuments[].removesourceDocuments[].filtersourceDocuments[].executeClientScriptssourceDocuments[].snapshotIdsourceDocuments[].mimeTypetext/html,application/pdf)The
The `runId` field is a UUID v4 generated at the start of each tracking run. It is written both in `run.json` and in every individual file that is updated during the run. In individual files, it enables correlating file changes to runs without relying on commit timestamp proximity, and detecting interrupted runs where `run.json` carries a `runId` that some individual files never received, revealing which terms were not processed. In `run.json`, it enables efficient API polling: a consumer can compare the `runId` to determine whether new data is available since its last check.
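As an illustration, a consumer holding the parsed files can use this field to list which entries a given run updated (names below are illustrative, not the engine's API):

```javascript
// Sketch: select the tracking result entries whose last update came from a
// given run, by matching their runId against the one in run.json.
function updatedInRun(entries, run) {
  return entries.filter(entry => entry.runId === run.runId);
}

const run = { runId: 'f47ac10b-58cc-4372-a567-0e02b2c3d479' };
const entries = [
  { serviceName: 'Google', runId: 'f47ac10b-58cc-4372-a567-0e02b2c3d479' },
  { serviceName: 'Facebook', runId: 'a1b2c3d4-58cc-4372-a567-0e02b2c3d479' },
];

console.log(updatedInRun(entries, run).map(entry => entry.serviceName)); // [ 'Google' ]
```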
The `date` field records when the current status started, not when the file was last updated. For failures, if a failure persists for 22 days but the reasons change on day 10, `date` is preserved from day 1 and the reasons are updated; the Git history shows when the reasons changed. For successes, `date` records when tracking resumed after a failure or when the terms was first successfully tracked. This allows consumers to know at a glance how long a service has been in its current state without traversing Git history.
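For example, the duration of the current state can be derived from this field alone (a sketch; the helper name is illustrative):

```javascript
// Sketch: how long an entry has been in its current state, derived from the
// `date` field without traversing Git history.
function daysInCurrentState(entry, now = new Date()) {
  const MS_PER_DAY = 24 * 60 * 60 * 1000;
  return Math.floor((now - new Date(entry.date)) / MS_PER_DAY);
}

const entry = { status: 'failed', date: '2026-03-15T10:30:00Z' };
console.log(daysInCurrentState(entry, new Date('2026-04-06T10:30:00Z'))); // 22
```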
The `transientError` field captures errors that were likely transient (HTTP 503, DNS temporary failure, timeout, etc.) and were resolved by the engine's automatic retry mechanism. This field is present only when the last run encountered such an error; if the next run succeeds without a transient error, the field is removed. This provides visibility on "fragile" services that succeed but only after retry. The Git history of the file shows when transient errors occurred and disappeared, allowing pattern detection over time.
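The lifecycle described above (set after a retried success, removed after a clean success) can be sketched as a pure update function; names are illustrative, not the engine's implementation:

```javascript
// Sketch of the transientError lifecycle: set the field when the last run
// needed a retry to succeed, drop it when the run was clean.
function withTransientError(entry, retryErrorReasons, date) {
  const { transientError, ...rest } = entry; // strip any previous value
  if (retryErrorReasons.length === 0) return rest; // clean run: field removed
  return { ...rest, transientError: { date, reasons: retryErrorReasons } };
}

let entry = { status: 'ok', serviceName: 'Google' };
entry = withTransientError(entry, ['Fetch failed: HTTP code 503'], '2026-04-07T10:30:00Z');
console.log('transientError' in entry); // true
entry = withTransientError(entry, [], '2026-04-08T10:30:00Z');
console.log('transientError' in entry); // false
```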
The `sourceDocuments` array is present on all entries, regardless of tracking status. Each entry contains the full declaration metadata of the source document (using the field names from the declaration format: `fetch`, `select`, `remove`, `filter`, `executeClientScripts`) alongside the last snapshot information (`snapshotId`, `mimeType`). This ensures that any consumer can understand not only whether tracking works, but also what is being tracked and how it is configured, without needing to read the service declaration or query the snapshots repository.

Run file
A `run.json` file at the repository root is updated at the end of every run, providing run-level information that cannot be derived from individual tracking result files:

```json
{
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "collectionId": "france",
  "schedule": "30 */12 * * *",
  "lastRun": {
    "startDate": "2026-04-06T10:30:00Z",
    "endDate": "2026-04-06T10:42:34Z",
    "engineVersion": "11.0.0"
  },
  "declared": { "services": 523, "terms": 1523 },
  "tracked": { "ok": 1476, "failed": 47 },
  "transitions": {
    "newFailures": [
      { "serviceId": "Facebook", "termsType": "Terms of Service" },
      { "serviceId": "Twitter", "termsType": "Privacy Policy" }
    ],
    "recoveries": [
      { "serviceId": "Google", "termsType": "Privacy Policy" }
    ],
    "reasonChanges": []
  },
  "transientErrors": 23
}
```

| Field | Description |
|---|---|
| `runId` | UUID v4 generated at the start of the run |
| `collectionId` | Identifier of the collection |
| `schedule` | Cron expression of the tracking schedule |
| `lastRun.startDate` | Start date of the last run |
| `lastRun.endDate` | End date of the last run |
| `lastRun.engineVersion` | Version of the engine that executed the run |
| `declared.services` | Number of declared services |
| `declared.terms` | Number of declared terms |
| `tracked.ok` | Number of terms tracking successfully |
| `tracked.failed` | Number of terms currently failing |
| `transitions.newFailures` | Array of `{ serviceId, termsType }` objects for terms that started failing in the last run |
| `transitions.recoveries` | Array of `{ serviceId, termsType }` objects for terms that recovered in the last run |
| `transitions.reasonChanges` | Array of `{ serviceId, termsType }` objects for terms whose failure reasons changed in the last run |
| `transientErrors` | Number of transient errors encountered during the last run |

This file serves several purposes. The Git history of
`run.json` has one commit per run, so a gap in commits reveals that tracking was not running. The `tracked.ok`/`tracked.failed` ratio at each run provides a health curve over time. The `lastRun.engineVersion` field allows correlating spikes in failures with engine upgrades. The `collectionId` and `schedule` fields allow a federation aggregator to identify and contextualize each instance.

Collection API extension
The existing collection API exposes service declarations, collection metadata, and version content over HTTP. It would be extended with the following endpoints to expose tracking results, using the existing naming conventions (plural for collections as
`/services`, singular for specific resources as `/service/:id`):

- `GET /tracking-results`: all tracking results
- `GET /tracking-result/:serviceId`: tracking results for all terms of a service
- `GET /tracking-result/:serviceId/:termsType`: tracking result for a specific terms
- `GET /tracking-results/run`: run-level information (the `run.json` content)

`GET /tracking-results` returns an array of all tracking result objects, each enriched with `serviceId` and `termsType` fields derived from the file path. It supports filtering by status (`GET /tracking-results?status=failed`). Pagination is not included in the initial design, as the `?status=failed` filter covers the most common current use case (opening issues on a forge). These endpoints always return the current state; querying tracking results at a specific date is not supported initially, but can be added later following the same approach used for the `/version` endpoint if the need arises.

Response examples
`GET /tracking-results`:

```json
[
  {
    "serviceId": "Google",
    "termsType": "Terms of Service",
    "status": "ok",
    "date": "2026-01-10T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      {
        "id": "terms",
        "fetch": "https://policies.google.com/terms",
        "select": ".content",
        "remove": ".banner",
        "filter": ["removeLinks"],
        "executeClientScripts": false,
        "snapshotId": "def456",
        "mimeType": "text/html"
      }
    ]
  },
  {
    "serviceId": "Facebook",
    "termsType": "Terms of Service",
    "status": "failed",
    "date": "2026-03-15T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Facebook",
    "reasons": ["CSS selector \".content\" has no match in the document"],
    "sourceDocuments": [
      {
        "id": "legal-terms",
        "fetch": "https://www.facebook.com/legal/terms",
        "select": ".content",
        "remove": null,
        "filter": null,
        "executeClientScripts": false,
        "snapshotId": "abc123",
        "mimeType": "text/html"
      }
    ]
  }
]
```

`GET /tracking-result/Google`:

```json
[
  {
    "serviceId": "Google",
    "termsType": "Terms of Service",
    "status": "ok",
    "date": "2026-01-10T10:30:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      {
        "id": "terms",
        "fetch": "https://policies.google.com/terms",
        "select": ".content",
        "remove": ".banner",
        "filter": ["removeLinks"],
        "executeClientScripts": false,
        "snapshotId": "def456",
        "mimeType": "text/html"
      }
    ]
  },
  {
    "serviceId": "Google",
    "termsType": "Privacy Policy",
    "status": "ok",
    "date": "2025-11-02T08:15:00Z",
    "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "serviceName": "Google",
    "sourceDocuments": [
      {
        "id": "privacy",
        "fetch": "https://policies.google.com/privacy",
        "select": ".content",
        "remove": null,
        "filter": null,
        "executeClientScripts": false,
        "snapshotId": "ghi789",
        "mimeType": "text/html"
      }
    ]
  }
]
```

`GET /tracking-result/Facebook/Terms%20of%20Service`:

```json
{
  "serviceId": "Facebook",
  "termsType": "Terms of Service",
  "status": "failed",
  "date": "2026-03-15T10:30:00Z",
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "serviceName": "Facebook",
  "reasons": ["CSS selector \".content\" has no match in the document"],
  "sourceDocuments": [
    {
      "id": "legal-terms",
      "fetch": "https://www.facebook.com/legal/terms",
      "select": ".content",
      "remove": null,
      "filter": null,
      "executeClientScripts": false,
      "snapshotId": "abc123",
      "mimeType": "text/html"
    }
  ]
}
```

`GET /tracking-results/run`:

```json
{
  "runId": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "collectionId": "france",
  "schedule": "30 */12 * * *",
  "lastRun": {
    "startDate": "2026-04-06T10:30:00Z",
    "endDate": "2026-04-06T10:42:34Z",
    "engineVersion": "11.0.0"
  },
  "declared": { "services": 523, "terms": 1523 },
  "tracked": { "ok": 1476, "failed": 47 },
  "transitions": {
    "newFailures": [{ "serviceId": "Facebook", "termsType": "Terms of Service" }],
    "recoveries": [{ "serviceId": "Google", "termsType": "Privacy Policy" }],
    "reasonChanges": []
  },
  "transientErrors": 23
}
```

The API always serves the state of the last completed run. It does this by resolving the commit SHA of the last
`run.json` update and reading all files at that commit. While a new run is in progress and individual files are being updated with new commits, the API continues to serve the snapshot from the previous run's final commit. When the current run completes and `run.json` is committed, the API automatically starts serving the new state. This approach is resilient to crashes: if the engine crashes mid-run, the API keeps serving the last complete state.

Notification strategy
External connectors (GitHub issue manager, GitLab issue manager) need to know when a run completes and transitions have occurred. Push-based mechanisms were considered, including webhooks where the engine would send an HTTP POST at the end of each run, and Server-Sent Events where connectors would subscribe to a stream. Both add significant complexity in the form of payload signing, delivery retry logic, connection management, and additional configuration.
Since tracking runs occur on a fixed schedule (typically every 12 hours), connectors can simply poll
`GET /tracking-results/run` at the same frequency. The cost of one lightweight request every 12 hours is negligible, and the `runId` field allows a connector to immediately determine whether new data is available since its last check. This polling approach requires no additional infrastructure in the engine.

Push-based notifications can be added later if the number of connectors grows or if lower latency is needed.
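A connector following this strategy could look like the sketch below. The base URL is hypothetical and the loop bodies are placeholders; this is not the actual GitHub/GitLab connector code, and it assumes Node 18+ for the global `fetch`:

```javascript
// Sketch of a polling connector: fetch run-level data, compare runId with the
// last one seen, and react to transitions. API_BASE is a hypothetical URL.
const API_BASE = 'https://collection.example/api/v1';

function hasNewData(lastSeenRunId, run) {
  return run.runId !== lastSeenRunId;
}

async function poll(lastSeenRunId) {
  const response = await fetch(`${API_BASE}/tracking-results/run`);
  const run = await response.json();

  if (!hasNewData(lastSeenRunId, run)) return lastSeenRunId; // nothing new

  for (const { serviceId, termsType } of run.transitions.newFailures) {
    // e.g. open a forge issue for this serviceId/termsType
  }
  for (const { serviceId, termsType } of run.transitions.recoveries) {
    // e.g. close the corresponding forge issue
  }

  return run.runId; // remember for the next polling cycle
}

console.log(hasNewData('previous-run-id', { runId: 'f47ac10b-58cc-4372-a567-0e02b2c3d479' })); // true
```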
Deployment
Each instance would create a new repository following the naming convention established for other repositories:
`{instance_name}-tracking-results`. For example, the `demo` instance would have:

- `demo-declarations` (existing)
- `demo-versions` (existing)
- `demo-snapshots` (existing)
- `demo-tracking-results` (new)

The deployment configuration would be updated to include the new repository path and remote, following the same pattern as versions and snapshots.
Instances can be deployed incrementally: the tracking results repository can be created and configured at any time. On the first run after configuration, the engine will populate it with the initial state of all declared terms and a first
`run.json`. No migration of historical data from GitHub/GitLab issues is necessary; the repository starts recording from the first run onwards. Historical issue data remains accessible on the forges for reference.

Feedback expectations
We invite you to provide feedback to:
If you support the proposal as it stands, please react with a 👍 or leave a positive comment.
Please provide your feedback by April 29th.