Problem
@lde/pipeline’s SPARQL executor retries transient failures, but isTransientError (packages/pipeline/src/sparql/executor.ts) classifies only network errors and HTTP 502/503/504 as transient:
return status === 502 || status === 503 || status === 504;
HTTP 500 is treated as definitive, so a stage that hits a single 500 fails immediately with no retry. Underpowered dataset endpoints routinely return 500 under load (QLever, for one, aborts an over-budget query with a 500). When that happens mid-run the affected stages emit no output and the dataset’s analysis is silently incomplete.
Evidence
Dataset Knowledge Graph run 2026-06-19, dataset https://data.razu.nl/id/dataset/kranten (endpoint https://api.data.razu.nl/datasets/id/object/sparql, ~24M triples):
- class-property-object-classes.rq and class-property-languages.rq failed with Invalid SPARQL endpoint response (HTTP status 500) – no retry.
- subjects.rq and subject-uri-space.rq aborted on timeouts, with repeated adaptive-timeout tightening.
- Result: the dataset got only a void:classPartition – no subject namespaces, hence no persistent-URI check downstream.
The endpoint serves light queries fine (ASK in ~50ms) but times out or returns 500 on heavy aggregations – exactly the transient/overload pattern retries exist for.
Proposed change
Extend the retryable set in isTransientError to {500, 502, 503, 504, 408, 425, 429}. Keep other 4xx (400, 404) non-retryable – those are deterministic. The existing bounded retries (default 3) plus p-retry backoff already guard against hammering a struggling endpoint. This also makes the policy consistent with the DKG’s own per-URI dereference path, which already treats any status >= 500 as transient.
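A minimal sketch of the extended status check, for illustration only. The names RETRYABLE_STATUSES and isRetryableStatus are hypothetical; the real isTransientError in executor.ts also handles network errors and may be shaped differently.

```typescript
// Hypothetical sketch: status codes proposed as retryable in this issue.
// 408 Request Timeout, 425 Too Early, 429 Too Many Requests,
// 500 Internal Server Error, 502/503/504 gateway and availability errors.
const RETRYABLE_STATUSES: ReadonlySet<number> = new Set([
  408, 425, 429, 500, 502, 503, 504,
]);

// Returns true when an HTTP status should be treated as transient.
// Any status outside the set (for example 400 or 404) stays definitive.
function isRetryableStatus(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}
```

Using an explicit set rather than a range check such as status >= 500 keeps codes like 501 (Not Implemented) non-retryable, which seems closer to the intent of retrying only overload-style failures.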
Caveat
A 500 can also be deterministic (an endpoint that always chokes on a query at that size – e.g. razu’s COUNT(DISTINCT ?s) times out every time). Retrying reduces the frequency of incomplete analyses but cannot guarantee success, so it should be paired with surfacing stage-level incompleteness to consumers rather than relied on alone.
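One way to surface stage-level incompleteness could look like the sketch below. The StageOutcome and RunReport types and the summarizeRun function are assumptions made up for this example, not part of @lde/pipeline, and the stage name class-partition.rq is a placeholder.

```typescript
// Hypothetical per-stage result recorded after retries are exhausted.
type StageOutcome = {
  stage: string; // query file name
  ok: boolean; // true if the stage produced output
  attempts: number; // how many times the query was tried
};

// Hypothetical run-level summary that consumers could inspect.
type RunReport = {
  complete: boolean; // false if any stage failed
  failedStages: string[];
};

// Collects the stages that failed so the gap is reported instead of silent.
function summarizeRun(outcomes: StageOutcome[]): RunReport {
  const failedStages = outcomes.filter((o) => !o.ok).map((o) => o.stage);
  return { complete: failedStages.length === 0, failedStages };
}

// Illustrative values only; stage names loosely follow the run described above.
const report = summarizeRun([
  { stage: 'class-partition.rq', ok: true, attempts: 1 }, // placeholder name
  { stage: 'class-property-languages.rq', ok: false, attempts: 3 },
  { stage: 'subjects.rq', ok: false, attempts: 3 },
]);
console.log(report.complete, report.failedStages);
```

With a report like this, a downstream check such as the persistent-URI check could distinguish "no subject namespaces found" from "subject stage never completed".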