feat(index): support array object paths in JSON FTS#7377

Draft
wirybeaver wants to merge 1 commit into lance-format:main from wirybeaver:lance-json

@wirybeaver wirybeaver commented Jun 19, 2026

Summary:

  • Emit legacy, wildcard, and exact indexed path tokens for JSON arrays.
  • Normalize JSON query triplets with JSONPath-style $ roots, wildcards, and indexes.
  • Add tokenizer and dataset regressions for array-of-object JSON FTS.

Example:
Given this JSON document:

{
  "addresses": [
    {"country": "us", "street": "main"},
    {"country": "ca", "street": "second"}
  ]
}

The JSON FTS tokenizer now emits path variants that let callers search both array values and exact array positions:

addresses.country,str,us        // legacy path, preserves existing behavior
addresses..country,str,us       // wildcard path for addresses[*].country
addresses[0].country,str,us     // exact indexed path
addresses.$index,number,0       // array element position
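
As a rough illustration of how these four token shapes could come out of a single walk over the document, here is a minimal sketch. It is not the PR's tokenizer: the `Json` enum, the `Paths` struct, and the `tokenize`, `walk`, `emit`, and `example_doc` functions are all hypothetical names made up for this example. It also assumes that `$index` tokens hang off the legacy path and that identical tokens are emitted once.

```rust
/// Minimal stand-in for a JSON value (an assumption for this sketch;
/// the real tokenizer operates on actual JSON input).
#[allow(dead_code)]
enum Json {
    Str(String),
    Num(f64),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// The three path spellings tracked in parallel while walking the document.
struct Paths {
    legacy: String,   // array levels are invisible: addresses.country
    wildcard: String, // array levels add an empty segment: addresses..country
    exact: String,    // array levels add the index: addresses[0].country
}

fn join(base: &str, key: &str) -> String {
    if base.is_empty() { key.to_string() } else { format!("{base}.{key}") }
}

fn push_unique(out: &mut Vec<String>, token: String) {
    if !out.contains(&token) {
        out.push(token);
    }
}

/// Emit one `path,type,value` triplet per distinct path spelling.
fn emit(p: &Paths, ty: &str, value: &str, out: &mut Vec<String>) {
    for path in [&p.legacy, &p.wildcard, &p.exact] {
        push_unique(out, format!("{path},{ty},{value}"));
    }
}

fn walk(value: &Json, p: &Paths, out: &mut Vec<String>) {
    match value {
        Json::Object(fields) => {
            for (key, child) in fields {
                let next = Paths {
                    legacy: join(&p.legacy, key),
                    wildcard: join(&p.wildcard, key),
                    exact: join(&p.exact, key),
                };
                walk(child, &next, out);
            }
        }
        Json::Array(items) => {
            for (i, child) in items.iter().enumerate() {
                // Assumption: the element position is recorded under the legacy path.
                push_unique(out, format!("{}.$index,number,{}", p.legacy, i));
                let next = Paths {
                    legacy: p.legacy.clone(),
                    wildcard: format!("{}.", p.wildcard),
                    exact: format!("{}[{}]", p.exact, i),
                };
                walk(child, &next, out);
            }
        }
        Json::Str(s) => emit(p, "str", s, out),
        Json::Num(n) => emit(p, "number", &n.to_string(), out),
    }
}

fn tokenize(doc: &Json) -> Vec<String> {
    let root = Paths { legacy: String::new(), wildcard: String::new(), exact: String::new() };
    let mut out = Vec::new();
    walk(doc, &root, &mut out);
    out
}

/// The example document from this description.
fn example_doc() -> Json {
    let address = |country: &str, street: &str| {
        Json::Object(vec![
            ("country".to_string(), Json::Str(country.to_string())),
            ("street".to_string(), Json::Str(street.to_string())),
        ])
    };
    Json::Object(vec![(
        "addresses".to_string(),
        Json::Array(vec![address("us", "main"), address("ca", "second")]),
    )])
}
```

Calling `tokenize(&example_doc())` in this sketch yields the four tokens listed above for the first address, plus the corresponding tokens for `street` and for the second element. How the real tokenizer treats arrays of scalars or arrays at the document root is not covered here.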

This means:

  • addresses[*].country,str,us matches rows with any address whose country is us.
  • addresses[0].country,str,us matches only rows whose first address has country us.
  • Mixed nested paths are normalized too, e.g. addresses[*].types[1] becomes addresses..types[1].
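
The query-side normalization described in these bullets can be approximated with plain string rewriting. The sketch below is an assumption-laden illustration, not the PR's code: `normalize_path` and `normalize_triplet` are hypothetical names, and it assumes that a leading `$` or `$.` root is simply dropped and that every `[*]` is rewritten to an empty path segment while numeric indexes are left untouched.

```rust
/// Hypothetical helper: rewrite a JSONPath-style path into the indexed form.
fn normalize_path(path: &str) -> String {
    // Drop an optional root marker: "$.a.b" -> "a.b", "$a" -> "a".
    let without_root = path
        .strip_prefix("$.")
        .or_else(|| path.strip_prefix('$'))
        .unwrap_or(path);
    // "[*]" becomes an empty segment, so "a[*].b" -> "a..b".
    // Exact indexes such as "[1]" are kept as written.
    without_root.replace("[*]", ".")
}

/// Hypothetical helper: normalize the path part of a `path,type,value` triplet.
/// Returns None when the input does not have three comma-separated parts.
fn normalize_triplet(triplet: &str) -> Option<String> {
    // splitn(3, ..) keeps any further commas inside the value part.
    let mut parts = triplet.splitn(3, ',');
    let path = parts.next()?;
    let ty = parts.next()?;
    let value = parts.next()?;
    Some(format!("{},{},{}", normalize_path(path), ty, value))
}
```

Under these assumptions, `normalize_triplet("$.addresses[*].country,str,us")` produces `addresses..country,str,us`, and `normalize_path("addresses[*].types[1]")` produces `addresses..types[1]`, matching the bullets above. A real implementation would likely parse the path rather than rely on substring replacement.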

Pinot references:

Test Plan:

  • cargo fmt --all --check
  • cargo test -p lance-index --features "protoc lance-encoding/protoc lance-file/protoc lance-table/protoc lance-datafusion/protoc" scalar::inverted::tokenizer::document_tokenizer::tests -- --nocapture
  • PROTOC=/home/user/lance-json/target/debug/build/protobuf-src-0bbb645409253599/out/bin/protoc-27.2.0 cargo test -p lance --features "protoc lance-encoding/protoc lance-file/protoc lance-index/protoc lance-table/protoc lance-datafusion/protoc" test_json_inverted_ -- --nocapture
  • PROTOC=/home/user/lance-json/target/debug/build/protobuf-src-0bbb645409253599/out/bin/protoc-27.2.0 cargo clippy --all --tests --benches -- -D warnings

github-actions bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels on Jun 19, 2026