GraphQL search API #495

Description

@ddeboer

Summary

Add a presentation-facing GraphQL search API, generated from the same SHACL + search: source as the existing engine-agnostic projection (@lde/search) and engine adapter (@lde/search-typesense). Per the platform design, both the GraphQL API (schema + Mercurius resolvers) and the REST API (OpenAPI + Fastify handlers) are generated from one projection, so a single source serves N engines (Typesense is the v1 adapter; Elasticsearch/OpenSearch follow the same pattern).

See the platform reference: https://docs.nde.nl/stack/layers/platform#search-apis

Requirement: facet labels returned inline

Facets are keyed by IRI (organizations, classes, terminology sources). The GraphQL API MUST return each facet bucket with its human-readable, locale-resolved label attached inline – so clients never perform a separate IRI→label lookup. The same applies to entity references rendered from an IRI (e.g. the publisher on a result card).

This is an engine-agnostic contract. It says nothing about Typesense, sidecar collections, joins, or export: how labels are resolved is the engine adapter’s responsibility, hidden behind the framed-document projection. Consumers of the API never learn which engine is underneath (Typesense is only the v1 adapter; Elasticsearch/OpenSearch satisfy the same contract their own way).

Concretely:

  • Facet buckets carry the IRI value, its count, and the label as a locale-ordered [LanguageString!]! (Accept-Language preference), matching the platform’s multilingual-string convention.
  • Label resolution happens server-side, below the API surface. The adapter is free to choose its mechanism – e.g. the Typesense v1 adapter can draw on a sidecar label collection; an Elasticsearch adapter might use nested documents – but none of that leaks into the schema or the resolver contract.
  • No engine client, key, or bespoke lookup logic is shipped into any consumer.
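As an illustration of the locale-ordering part of this contract, here is a minimal TypeScript sketch of how a resolver could order a bucket's labels by Accept-Language. The `LanguageString` field names (`language`, `value`) and both helper functions are assumptions made for this example; the issue names the type but does not define its shape.

```typescript
// Hypothetical shape: the issue names LanguageString but does not define its fields.
interface LanguageString {
  language: string; // BCP 47 tag, e.g. "nl"
  value: string;
}

// Parse an Accept-Language header into language tags, most preferred first.
function parseAcceptLanguage(header: string): string[] {
  return header
    .split(",")
    .map((part, index) => {
      const [tag, ...params] = part.trim().split(";");
      const q = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      return { tag: tag.toLowerCase(), q: q ? parseFloat(q.slice(2)) : 1, index };
    })
    .filter((entry) => entry.tag !== "" && entry.tag !== "*" && entry.q > 0)
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map((entry) => entry.tag);
}

// Order labels so the client's preferred languages come first; labels in
// languages the client did not ask for keep their original order at the end.
function orderLabels(labels: LanguageString[], acceptLanguage: string): LanguageString[] {
  const preferred = parseAcceptLanguage(acceptLanguage);
  const rank = (label: LanguageString): number => {
    const lang = label.language.toLowerCase();
    // Match "nl" against "nl" and also against a regional preference such as "nl-NL".
    const position = preferred.findIndex(
      (tag) => tag === lang || tag.split("-")[0] === lang.split("-")[0],
    );
    return position === -1 ? preferred.length : position;
  };
  return labels
    .map((label, index) => ({ label, index }))
    .sort((a, b) => rank(a.label) - rank(b.label) || a.index - b.index)
    .map((entry) => entry.label);
}
```

Because the whole list is returned in preference order, a client can simply render the first entry, and the fallback policy stays on the server.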

Why inline

Resolving labels server-side and returning them inline:

  • removes a per-render round-trip and the need for clients to hold a Typesense client at all;
  • keeps the locale-fallback policy (e.g. nl→en→first) in one server-side place instead of duplicated per consumer;
  • preserves the engine-agnostic boundary – the API abstracts over the search engine (Typesense is the v1 adapter, not part of the contract), @lde/search / @lde/search-typesense stay domain-agnostic mechanics (no label concept; see the framed-document projection), and label semantics live in the API layer’s resolvers and the domain projection.

Notes on shape (from the platform reference)

  • Standard GraphQL idioms: Prisma-style typed filter inputs (where: { field: { eq, in, gt, … } }), Relay Connection pagination (first / after / opaque cursor), facet arrays with counts.
  • Localised content as [LanguageString!]! ordered by Accept-Language, not language-tagged literals.
  • Schema + resolvers generated from SHACL + search: annotations, not hand-written.

Full feature surface: what the GraphQL API must cover

Scoping is driven by a comparison of the Typesense search API and the Elasticsearch _search API. Four principles decide what is in the public schema vs. configured below the boundary:

  • Generated, not hand-written. Every field/filter/facet/sort is emitted from SHACL + search: + the projection. The surface is bounded by what the projection materialises (title_search_nl, publisher facet, size number, date fields); the API cannot expose a capability the projection does not produce.
  • Engine-agnostic. Anything exposed must be expressible on both Typesense and Elasticsearch (and resolvable by a future adapter). Engine-proprietary knobs stay below the boundary.
  • Presentation-facing. The audience is web developers who do not know RDF. Prisma where, Relay connections, plain SDL. Relevance-engineering knobs do not leak unless a UI genuinely drives them.
  • Folding contract. query MUST be fold()ed server-side with the same function used at index time (see @lde/search README), or matches silently miss. Invisible to the client.
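To make the folding contract concrete, here is a small sketch of why the same function has to run at index time and at query time. The `fold` below is a hypothetical stand-in (Unicode decomposition, diacritic stripping, lowercasing); the real function is the one described in the @lde/search README and may normalise differently.

```typescript
// Hypothetical stand-in for the fold() described in the @lde/search README;
// the real implementation may normalise differently.
function fold(text: string): string {
  return text
    .normalize("NFD") // split accented characters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase();
}

// Index time: store the folded form of each title next to the original.
const titles = ["Collectie Müller", "Café-archief"];
const index = titles.map((title) => ({ title, folded: fold(title) }));

// Query time: fold the user's query with the same function before matching.
function search(query: string): string[] {
  const needle = fold(query);
  return index.filter((doc) => doc.folded.includes(needle)).map((doc) => doc.title);
}

// Matching the raw query against folded fields is the silent miss the contract warns about.
function searchWithoutFolding(query: string): string[] {
  return index.filter((doc) => doc.folded.includes(query)).map((doc) => doc.title);
}
```

With this stand-in, `search("MULLER")` finds "Collectie Müller", while `searchWithoutFolding("Müller")` finds nothing even though the user typed the title exactly.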

Capability map (Typesense ∪ Elasticsearch → our GraphQL)

| Capability | Typesense | Elasticsearch | Expose in GraphQL? | How / where |
|---|---|---|---|---|
| Query / matching | | | | |
| Free-text query | q | query.match | Yes – query: String | Server folds it; fans out over all search-enabled per-locale fields |
| Which fields are searched | query_by | multi_match.fields | No (server-side) | Derived from projection’s search-enabled fields |
| Prefix / autocomplete | prefix, infix | edge-ngram / suggest | Yes (v1) | Separate suggest field/endpoint, see below |
| Typo tolerance | num_typos, *_threshold | fuzziness | No (server default) | Tuned in engine schema |
| Filtering | | | | |
| Typed filters | filter_by | bool.filter | Yes – Prisma where | Per-field { eq, in, gt, gte, lt, lte, exists } from datatype |
| Boolean composition | && \|\| | bool must/should/must_not | Yes – AND/OR/NOT on where | |
| IRI / reference filters | := exact | term | Yes – eq/in on URI-string fields | Facet IRIs are filterable |
| Faceting | | | | |
| Facet buckets + counts | facet_by | terms agg | Yes (core) | facets { value count label } |
| Inline locale-resolved labels | (sidecar) | (nested) | Yes – headline requirement | label: [LanguageString!]!, resolved below the boundary |
| Numeric / date facet ranges | facet ranges | range agg | Yes where projection has numbers/dates | Generated for number/date fields |
| Facet value search / max values | facet_query, max_facet_values | agg include/size | Yes – args on the facet field | For long facet lists (e.g. publishers) |
| Disjunctive facets (a facet’s own selection does not shrink its own counts) | per-facet filter behaviour | post_filter | Yes – must get right | Resolver applies selected filters as post_filter-equivalent so multi-select facets behave correctly |
| Sorting | | | | |
| By relevance | _text_match | _score | Yes (default) | |
| By field / per-locale sort | sort_by | sort | Yes – orderBy enum | Uses *_sort_${locale} fields, missing_values: last |
| By recency | date sort_by | sort on date | Yes | From date fields |
| Random | _rand() | random_score | Maybe | For discover surfaces |
| Pagination | | | | |
| Relay connection | page/per_page, offset | from/size, search_after | Yes | first/after, opaque cursor hides the offset-vs-search_after choice |
| Total count | always | track_total_hits | Yes – totalCount | |
| Ranking / boosting (see breakdown below) | | | | |
| Field weights (title > description) | query_by_weights | field ^boost | No (server-side) | From a new search:boost annotation |
| Locale weighting (user’s lang higher) | query_by_weights | field ^boost | No (server-side) | Computed from Accept-Language |
| Exact-match / token-position priority | prioritize_* | query DSL | No (server default) | |
| Editorial curation: pin / hide | pinned_hits, hidden_hits, overrides | pinned query / curation | Deferred (out of scope v1) | |
| Decay / recency / popularity boost | _eval(), decay | function_score, decay | No raw params | Surfaced only via the named relevance enum |
| Synonyms | synonym rules | synonym filter | No (server-side) | Engine config |
| Result shaping | | | | |
| Field selection | include_fields | _source | Free – GraphQL selection set | No param needed |
| Highlighting / snippets | highlight_* | highlight | Yes (v1) | highlights on result; engine-agnostic |
| Grouping / dedupe | group_by | collapse | Deferred | |
| Discovery | | | | |
| Did-you-mean / suggest | q + typos | suggest | Yes (v1) | Autocomplete field/endpoint |
| Vector / semantic / hybrid | vector_query | knn, RRF | Schema room (v1) | Reserve the shape; no implementation yet |
| Cross-cutting | | | | |
| Localization | | | Yes, everywhere | Accept-Language → [LanguageString!]! ordering |
| HTTP caching | use_cache | | REST twin | ETag / Cache-Control |
| Analytics / click tracking | analytics rules | | Out of scope v1 | |
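
The disjunctive-facets row is the one most easily implemented wrongly, so here is an engine-free sketch of the intended semantics: the hit list applies every selected filter, while each facet's counts apply every selection except that facet's own. The `Doc` shape, the `Selection` type and the function names are assumptions for illustration; a real adapter would express the same thing with the engine's own mechanism rather than in memory.

```typescript
interface Doc {
  publisher: string;
  format: string;
}

// Selected facet values per facet; an empty or missing list means "no filter".
type Selection = Partial<Record<keyof Doc, string[]>>;

// True when the document satisfies every selection, optionally ignoring one facet.
function matches(doc: Doc, selection: Selection, skip?: keyof Doc): boolean {
  return (Object.keys(selection) as (keyof Doc)[]).every((facet) => {
    const values = selection[facet] ?? [];
    return facet === skip || values.length === 0 || values.includes(doc[facet]);
  });
}

// Count buckets for one facet, applying every selection except the facet's own.
function facetCounts(docs: Doc[], selection: Selection, facet: keyof Doc): Map<string, number> {
  const counts = new Map<string, number>();
  for (const doc of docs) {
    if (!matches(doc, selection, facet)) continue;
    counts.set(doc[facet], (counts.get(doc[facet]) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical data: publisher values stand in for IRIs.
const docs: Doc[] = [
  { publisher: "org-a", format: "csv" },
  { publisher: "org-a", format: "json" },
  { publisher: "org-b", format: "csv" },
  { publisher: "org-c", format: "json" },
];
const selection: Selection = { publisher: ["org-a"], format: ["csv"] };

const hits = docs.filter((doc) => matches(doc, selection));
const publisherCounts = facetCounts(docs, selection, "publisher");
const formatCounts = facetCounts(docs, selection, "format");
```

Here the hit list contains only the org-a CSV dataset, but the publisher facet still offers org-b (which also has a CSV dataset), so the user can add it to the selection instead of seeing the facet collapse to a single bucket.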

Query boosting

Boosting is not one feature but four mechanisms that land on different sides of the boundary:

  1. Field weighting (title ranks above description) – query_by_weights / ^boost. Relevance engineering, not a client concern → driven by a new search:boost annotation per field; the generator emits the weights. Not a GraphQL parameter.
  2. Locale weighting (rank the Accept-Language locale’s *_search_${locale} fields higher) – already anticipated in the @lde/search README. Computed server-side from Accept-Language; the resolver reorders the weights so the request’s locale leads.
  3. Editorial curation (pin to top / hide) – pinned_hits/hidden_hits/overrides ↔ ES pinned query/curation. Out of scope v1; revisit when a real admin surface exists.
  4. Signal-based boosting (recency decay, popularity) – _eval()/decay ↔ function_score. Server-configured; surfaced to clients only via the named relevance enum, never as raw decay parameters.

Net: the only boosting in the public schema is the relevance/orderBy enum. Everything else lives in search: annotations and the engine schema, generated, below the boundary – so relevance tuning changes without a breaking schema change.
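
Mechanisms 1 and 2 can be combined in one server-side step. The sketch below is one possible way to do that, not the generator's actual output: the boost values, the `localeBonus` multiplier and the function name are all assumptions, and only the `*_search_${locale}` field naming is taken from this issue.

```typescript
// Hypothetical per-field boosts, standing in for values a search:boost annotation would carry.
const boosts: Record<string, number> = { title: 3, description: 1 };
// Hypothetical set of indexed locales.
const locales = ["nl", "en"];

interface WeightedField {
  field: string;
  weight: number;
}

// Build the weighted field list for one request: field boost times a locale bonus.
function weightedFields(requestLocale: string, localeBonus = 2): WeightedField[] {
  const fields: WeightedField[] = [];
  for (const [name, boost] of Object.entries(boosts)) {
    for (const locale of locales) {
      fields.push({
        field: `${name}_search_${locale}`,
        weight: locale === requestLocale ? boost * localeBonus : boost,
      });
    }
  }
  // Highest weight first, so the request's locale leads within each field.
  return fields.sort((a, b) => b.weight - a.weight);
}

const forEnglish = weightedFields("en");
const queryBy = forEnglish.map((f) => f.field).join(",");
const queryByWeights = forEnglish.map((f) => f.weight).join(",");
```

For a Typesense adapter the two joined strings would be candidates for `query_by` and `query_by_weights`; an Elasticsearch adapter could turn the same list into `field^boost` entries. Either way nothing here appears in the GraphQL schema.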


v1 scope (decided)

In – core contract

  • Free-text query (folded server-side), Prisma where (eq/in/gt/gte/lt/lte/exists + AND/OR/NOT), orderBy, Relay first/after + totalCount.
  • Facets with counts and inline locale-ordered [LanguageString!]! labels (the headline requirement above), disjunctive multi-select behaviour, facet-value search/limit, numeric/date ranges.
  • Localization end-to-end via Accept-Language.
  • Relevance as a named enum (RELEVANCE | RECENT | POPULAR | …) – no raw boost knobs.
  • Highlighting / snippets on results.
  • Autocomplete / suggest (prefix + did-you-mean) as a separate field/endpoint.
  • Schema room for vector / semantic / hybrid – reserve the shape, no implementation yet.
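
To show how an opaque cursor keeps the pagination mechanism below the boundary, here is a minimal in-memory sketch of `first` / `after` with `totalCount`. Everything in it is an assumption for illustration: the cursor payload carries an offset (a search_after-based adapter would carry something else behind the same string), and hex encoding is used only to keep the example dependency-free.

```typescript
// Hypothetical cursor payload: this sketch pages by offset; an adapter using
// search_after would put a different payload behind the same opaque string.
interface CursorPayload {
  offset: number;
}

// Hex-encode the JSON payload so clients treat the cursor as opaque.
function encodeCursor(payload: CursorPayload): string {
  return Array.from(JSON.stringify(payload))
    .map((char) => char.charCodeAt(0).toString(16).padStart(2, "0"))
    .join("");
}

function decodeCursor(cursor: string): CursorPayload {
  const json = (cursor.match(/../g) ?? [])
    .map((hex) => String.fromCharCode(parseInt(hex, 16)))
    .join("");
  return JSON.parse(json) as CursorPayload;
}

interface Page<T> {
  edges: { node: T; cursor: string }[];
  pageInfo: { hasNextPage: boolean; endCursor: string | null };
  totalCount: number;
}

// Resolve `first` / `after` against an in-memory list of hits.
function paginate<T>(hits: T[], first: number, after?: string): Page<T> {
  const start = after ? decodeCursor(after).offset + 1 : 0;
  const edges = hits.slice(start, start + first).map((node, i) => ({
    node,
    cursor: encodeCursor({ offset: start + i }),
  }));
  return {
    edges,
    pageInfo: {
      hasNextPage: start + first < hits.length,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null,
    },
    totalCount: hits.length,
  };
}
```

A client pages by passing the previous `endCursor` back as `after`, and never needs to know whether the server counted offsets or used a sort key.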

Below the boundary – generated/configured, never in the schema

query_by field set, field weights & locale weighting (from a new search:boost annotation), typo tolerance, exact-match/token-position priority, synonyms, decay/recency tuning, the folding step, sidecar label collections, and the offset-vs-search_after pagination mechanism.

Deferred

  • Editorial curation / pin-hide (revisit when a real admin surface exists).
  • Grouping / collapse.
  • Analytics / click tracking.

Schema sketch

type Query {
  datasets(
    query: String
    where: DatasetWhere
    orderBy: [DatasetOrder!]      # relevance (default) | title | modified | …
    first: Int, after: String      # Relay
    facets: [DatasetFacetName!]    # which facets to compute
  ): DatasetConnection!
}

type DatasetConnection {
  edges: [DatasetEdge!]!
  pageInfo: PageInfo!
  totalCount: Int!
  facets: [Facet!]!                # buckets with inline labels
}

type Facet {
  name: String!
  buckets: [FacetBucket!]!
}

type FacetBucket {
  value: String!                   # IRI (as plain string)
  count: Int!
  label: [LanguageString!]!        # inline, locale-ordered, resolved below the boundary
}

input DatasetWhere {
  AND: [DatasetWhere!]
  OR: [DatasetWhere!]
  NOT: DatasetWhere
  publisher: UriFilter             # { eq, in, exists }
  size: IntFilter                  # { eq, gt, gte, lt, lte }
  modified: DateFilter
}
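
The intended semantics of `DatasetWhere` can be pinned down with an engine-free reference evaluator, which is also a convenient oracle when testing adapters. The sketch below is an assumption-laden illustration: the `ScalarFilter` shape merges the operators listed in the comments above, dates are modelled as ISO 8601 strings, and a missing value is taken to fail every comparison.

```typescript
interface ScalarFilter<T> {
  eq?: T;
  in?: T[];
  gt?: T;
  gte?: T;
  lt?: T;
  lte?: T;
  exists?: boolean;
}

interface Dataset {
  publisher?: string; // IRI as plain string
  size?: number;
  modified?: string; // ISO 8601 date, which compares correctly as a string
}

interface DatasetWhere {
  AND?: DatasetWhere[];
  OR?: DatasetWhere[];
  NOT?: DatasetWhere;
  publisher?: ScalarFilter<string>;
  size?: ScalarFilter<number>;
  modified?: ScalarFilter<string>;
}

// Evaluate one field filter; all operators present on the filter must hold.
function matchesFilter<T extends string | number>(value: T | undefined, f: ScalarFilter<T>): boolean {
  if (f.exists !== undefined && (value !== undefined) !== f.exists) return false;
  const { eq, in: oneOf, gt, gte, lt, lte } = f;
  const constrained = [eq, oneOf, gt, gte, lt, lte].some((c) => c !== undefined);
  if (value === undefined) return !constrained; // a missing value fails any comparison
  return (
    (eq === undefined || value === eq) &&
    (oneOf === undefined || oneOf.includes(value)) &&
    (gt === undefined || value > gt) &&
    (gte === undefined || value >= gte) &&
    (lt === undefined || value < lt) &&
    (lte === undefined || value <= lte)
  );
}

// Evaluate a where tree: sibling keys are combined with AND.
function matchesWhere(doc: Dataset, where: DatasetWhere): boolean {
  return (
    (where.AND ?? []).every((w) => matchesWhere(doc, w)) &&
    (where.OR === undefined || where.OR.some((w) => matchesWhere(doc, w))) &&
    (where.NOT === undefined || !matchesWhere(doc, where.NOT)) &&
    (where.publisher === undefined || matchesFilter(doc.publisher, where.publisher)) &&
    (where.size === undefined || matchesFilter(doc.size, where.size)) &&
    (where.modified === undefined || matchesFilter(doc.modified, where.modified))
  );
}
```

For example, `{ AND: [{ size: { gte: 100 } }], NOT: { publisher: { eq: "https://example.org/a" } } }` selects datasets of at least 100 that are not published by that (hypothetical) IRI.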

Note: search:boost is a new annotation

Field + locale weighting needs a home, and the cleanest one is a new search:boost annotation on the projection/SHACL source. This is the only part of this scope that reaches down into the existing packages: @lde/search’s spec vocabulary today is langText/facet/number/date with no weight concept. The addition is additive, not breaking, but it is the one piece that lands outside the new API layer.

Relation to current state / interim

  • The write/populate side is done and stays in LDE: @lde/search-typesense’s blue/green rebuild populates both the dataset and the sidecar label collections; the domain projection (label documents, schema, locale fallback) lives in the consumer (dataset-register, search-indexer).
  • Until this API lands, consumers resolve facet/publisher labels themselves as throwaway client-side code (a fetch-all-and-cache of the bounded label collection). This issue supersedes that: once labels come back inline, that interim lookup is deleted.
  • Context for the interim design: Typesense-backed full-text search for the dataset browser netwerk-digitaal-erfgoed/dataset-register#2085.

Out of scope

  • The REST/OpenAPI variant (same generator, separate tracking).
  • Any label-aware surface in @lde/search or @lde/search-typesense – these stay engine-mechanics only; label resolution belongs in the API layer.
  • Editorial curation (pin/hide), grouping/collapse, and analytics/click tracking – deferred from v1 (see above).
