feat(search): try to improve search performance through some creative mechanisms...#9034
eliandoran merged 35 commits into main
Conversation
Summary of Changes (Gemini Code Assist): This pull request focuses on significantly enhancing search performance, particularly for large note bases and autocomplete scenarios. It introduces extensive profiling tools to identify and measure performance bottlenecks, and implements several key optimizations: a new cached flat-text index for faster note scanning, a dedicated fast path for single-token autocomplete, and more efficient content snippet extraction and fuzzy search logic. The changes aim to provide a smoother and more responsive search experience for users.
Code Review
This pull request introduces several significant performance improvements to the search functionality, primarily by adding a flatTextIndex cache to avoid repeated computations and introducing a fast path for single-token searches like autocomplete. The snippet extraction logic has also been optimized. The inclusion of comprehensive integration and unit-level profiling tests is excellent for validating these performance gains. The changes are well-implemented, but I found one critical syntax error that needs to be addressed.
…l fallback. For operators `=`, `!=`, and `*=*`, the search now tries the FTS5 index first via `searchViaFts()`. If FTS is unavailable or fails, it falls back to the original sequential scan. The flat-text attribute search is extracted into its own `searchFlatTextAttributes()` method and runs after both paths.
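A minimal sketch of that FTS-first-with-fallback dispatch (every name except `searchViaFts()` is a hypothetical stand-in for the real implementation):

```typescript
// Sketch of the FTS-first dispatch described above. searchViaFts() returns
// null when the FTS index is unavailable; any thrown error also falls
// through to the original sequential scan.
type NoteId = string;

function searchViaFts(query: string): NoteId[] | null {
    // Pretend FTS is unavailable in this sketch.
    return null;
}

function sequentialScan(query: string): NoteId[] {
    // Placeholder for the original full-scan path.
    return ["note1", "note2"];
}

function searchContent(query: string): NoteId[] {
    try {
        const ftsResults = searchViaFts(query);
        if (ftsResults !== null) {
            return ftsResults;
        }
    } catch {
        // Swallow FTS errors and fall back to the sequential scan.
    }
    return sequentialScan(query);
}
```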
Collect rows before inserting — iterateRows() holds an open cursor that conflicts with writes on the same connection.
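The collect-then-insert pattern can be sketched as follows (the iterator here is a hypothetical stand-in for a statement cursor on the shared connection):

```typescript
// Hypothetical illustration: materialize all rows from the open cursor
// before performing any writes, so reads and writes never interleave on
// the same connection.
interface Row { noteId: string; title: string; }

function* iterateRows(rows: Row[]): Generator<Row> {
    // Stands in for an iterator that holds an open read cursor.
    yield* rows;
}

function buildIndex(source: Row[], insert: (r: Row) => void): void {
    // Collect first: inserting inside this loop would write while the
    // cursor is still open.
    const collected: Row[] = [];
    for (const row of iterateRows(source)) {
        collected.push(row);
    }
    // Cursor is exhausted and closed; now it is safe to write.
    for (const row of collected) {
        insert(row);
    }
}
```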
Adds real SQLite benchmarks showing FTS5 is 15-33x faster for the raw content query, though end-to-end improvement is masked by JS pipeline overhead (scoring, snippets, path walking).
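For reference, a minimal `timed()` helper of the shape used in the profiling specs later in this review (the signature is assumed from those snippets) looks like:

```typescript
// Returns the callback's result alongside the elapsed wall-clock time in
// milliseconds, matching the `const [, ms] = timed(...)` usage pattern.
function timed<T>(fn: () => T): [T, number] {
    const start = performance.now();
    const result = fn();
    return [result, performance.now() - start];
}

// Example: time a trivial computation.
const [value, ms] = timed(() => [1, 2, 3].reduce((a, b) => a + b, 0));
```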
- Remove redundant `toLowerCase()` before `normalizeSearchText()` in `search_result.ts` (`normalizeSearchText` already lowercases)
- Pre-normalize tokens once in `addScoreForStrings` instead of per-chunk
- Skip edit-distance computation entirely when fuzzy matching is disabled
- Move `removeDiacritic()` outside the regex while-loop in highlighting
- Cache normalized parent titles per search execution in `note_flat_text.ts`
- Use a `Set` for token lookup in `searchPathTowardsRoot` (O(1) vs O(n))
- Remove redundant `toLowerCase` in `fuzzyMatchWordWithResult` (inputs from `smartMatch` are already normalized)
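The `Set`-based token lookup mentioned above can be sketched like this (names are hypothetical):

```typescript
// Membership tests drop from O(n) Array.prototype.includes scans to O(1)
// hash lookups by building the Set once per search execution.
const tokens = ["alpha", "beta", "gamma"];
const tokenSet = new Set(tokens);

function pathContainsToken(pathSegments: string[]): boolean {
    // O(1) per segment, instead of tokens.includes(segment) which is O(n).
    return pathSegments.some(segment => tokenSet.has(segment));
}
```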
FTS5 query was 32x faster in isolation, but the content scan is only 1-7% of total search time. The JS pipeline (scoring, snippets, highlighting, tree walk) dominates. The in-memory optimizations in this PR provide the real gains. Removes: migration, fts_index service, event wiring, UI option, integration test. Keeps all in-memory performance optimizations.
The function has multiple callers (not just smartMatch) so it must normalize inputs itself. Removing toLowerCase broke fuzzy matching for the two-phase search path.
All numbers re-measured on the same machine/session after the scoring, highlighting, and tree walk optimizations. Multi-token autocomplete now shows 50-70% improvement over main.
Adds end-to-end full search (fastSearch=false) comparison tables for both fuzzy ON and OFF, plus long queries and realistic typo recovery benchmarks. Full search multi-token shows 45-65% improvement.
Consolidated from 12 sections to 4. Leads with the e2e results a reviewer cares about, follows with scaling data, then lists what changed and known limitations. Removed redundant tables and internal-only details.
📚 Documentation preview is ready! 🔗 Preview URL: https://pr-9034.trilium-docs.pages.dev ✅ All checks passed. This preview will be updated automatically with new commits.
Resolved conflicts: - search_result.ts: Keep optimized index-based token iteration - search.ts: Merge OCR text representation support with perf optimizations Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
Adds user-configurable controls for fuzzy matching (full search vs autocomplete) and implements several server-side search performance optimizations (indexing, caching, and cheaper snippet/highlighting paths) to address UX concerns from #9009.
Changes:
- Introduces new synced options to toggle fuzzy matching globally and separately for autocomplete, including UI + translations and server option initialization/whitelisting.
- Optimizes search execution paths (flat-text scanning index with incremental updates, autocomplete single-token fast path, reduced per-result normalization work).
- Adds new profiling/benchmark spec suites intended to measure search performance characteristics.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/commons/src/lib/options_interface.ts | Adds new option types for fuzzy matching toggles. |
| CLAUDE.md | Updates contributor guidance for adding user options (whitelisting + UI + i18n). |
| apps/server/src/services/search/utils/text_utils.ts | Adjusts fuzzy matching helper behavior for performance. |
| apps/server/src/services/search/services/types.ts | Adds autocomplete flag to search params. |
| apps/server/src/services/search/services/search.ts | Wires fuzzy options into search/autocomplete, optimizes snippet extraction and highlighting. |
| apps/server/src/services/search/services/search_profiling.spec.ts | Adds detailed in-memory profiling tests for search pipeline. |
| apps/server/src/services/search/services/search_benchmark.spec.ts | Adds comprehensive benchmark suite across scenarios/scales. |
| apps/server/src/services/search/search_result.ts | Reduces redundant normalization and limits fuzzy scoring work. |
| apps/server/src/services/search/search_context.ts | Initializes fuzzy matching behavior based on user options; adds autocomplete flag. |
| apps/server/src/services/search/expressions/note_flat_text.ts | Adds title caching, attribute normalization usage, autocomplete fast path, and flat-text index scanning. |
| apps/server/src/services/options_init.ts | Adds defaults for the new search options. |
| apps/server/src/routes/api/options.ts | Whitelists new search options for API updates. |
| apps/server/src/becca/entities/bnote.ts | Marks flat-text index entry dirty when note caches are invalidated. |
| apps/server/src/becca/entities/battribute.ts | Adds pre-normalized attribute name/value fields for faster searching. |
| apps/server/src/becca/becca-interface.ts | Implements a flat-text index with incremental updates for fast scanning. |
| apps/server/spec/search_profiling.spec.ts | Adds integration-level profiling that seeds a large dataset into SQLite. |
| apps/client/src/widgets/type_widgets/options/other.tsx | Adds Search settings UI toggles for the new options. |
| apps/client/src/translations/en/translation.json | Adds English strings for the new Search settings section. |
```diff
 // Exact match check first (most common case)
 if (normalizedText.includes(normalizedToken)) {
-    // Find the exact match in the original text to preserve case
-    const exactMatch = text.match(new RegExp(escapeRegExp(token), 'i'));
-    return exactMatch ? exactMatch[0] : token;
+    return token;
 }
```
fuzzyMatchWordWithResult() docstring says it returns the matched word (and later returns originalWord to preserve case), but the exact-match fast path now returns token instead. That can break callers that expect the returned value to match the original text (case/diacritics) for highlighting. Consider returning the actual matched substring from text (case-preserved) in the exact-match branch, or updating the contract/comments and all call sites accordingly.
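A case-preserving exact-match branch along the lines the reviewer suggests could look like this (a sketch, not the project's actual code):

```typescript
// Escape regex metacharacters so the token is matched literally.
function escapeRegExp(s: string): string {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Locate the token in the original text case-insensitively and return the
// matched substring, preserving its original casing for highlighting.
function exactMatchPreservingCase(text: string, token: string): string | null {
    const m = text.match(new RegExp(escapeRegExp(token), "i"));
    return m ? m[0] : null;
}
```

This keeps the documented contract: callers receive a substring that actually appears in `text`, not the lowercased search token.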
```ts
// Pre-compute normalized forms for search (avoids repeated normalize() calls in hot loops)
this.normalizedName = normalize(this.name);
this.normalizedValue = normalize(this.value);

return this;
}
```
normalizedName / normalizedValue are computed only in update(), but attributes are frequently modified by setting attr.value = ... and then calling attr.save() (e.g. BNote.setAttribute). In that flow, these precomputed fields become stale, causing search/token matching to miss updated attributes. Recompute normalizedName / normalizedValue in beforeSaving() (after sanitizeAttributeName() and value normalization) and/or ensure they are updated whenever name/value changes.
Suggested change:

```diff
-    // Pre-compute normalized forms for search (avoids repeated normalize() calls in hot loops)
-    this.normalizedName = normalize(this.name);
-    this.normalizedValue = normalize(this.value);
+    this.recomputeNormalizedFields();
     return this;
 }
+
+private recomputeNormalizedFields() {
+    // Pre-compute normalized forms for search (avoids repeated normalize() calls in hot loops)
+    this.normalizedName = normalize(this.name || "");
+    this.normalizedValue = normalize(this.value || "");
+}
+
+override beforeSaving(_opts?: SavingOpts) {
+    this.name = sanitizeAttributeName(this.name);
+    this.value = this.value || "";
+    this.recomputeNormalizedFields();
+}
```
```diff
 // Check if any search token matches the attribute name or value
 const hasMatch = searchTokens.some(token => {
-    const normalizedToken = normalizeString(token.toLowerCase());
+    const normalizedToken = normalize(token);
     return attrName.includes(normalizedToken) || attrValue.includes(normalizedToken);
 });
```
normalizedToken uses normalize() (lowercase + diacritics removal), but attrName/attrValue are only lowercased. This makes diacritic-insensitive searches (e.g. searching "resume" for an attribute containing "résumé") fail to detect matches for attribute snippets/highlighting. Normalize the attribute name/value too (or use attr.normalizedName / attr.normalizedValue when available) when performing the includes() checks.
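The asymmetry the reviewer describes can be demonstrated with a sketch of `normalize()` (assuming it lowercases and strips combining marks via Unicode NFD decomposition):

```typescript
// Lowercase plus diacritic removal: decompose to NFD, then drop the
// combining-mark range U+0300..U+036F.
function normalize(s: string): string {
    return s.toLowerCase().normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const attrValue = "Résumé 2024";
const token = "resume";

// Normalizing only the token misses the match ("é" !== "e"):
const lowercasedOnly = attrValue.toLowerCase().includes(normalize(token));

// Normalizing both sides finds it:
const bothNormalized = normalize(attrValue).includes(normalize(token));
```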
```ts
describe("Comprehensive Search Benchmark", () => {
```
This benchmark suite builds datasets up to 20K notes, runs many iterations, and prints extensive console output. With the current Vitest config (apps/server/vite.config.mts includes {src,spec}/**/*.spec.*), this will run in CI by default and likely make pnpm server:test extremely slow/flaky. Consider gating these benchmarks behind an env flag (e.g. if (!process.env.TRILIUM_BENCHMARK) describe.skip(...)) or moving them out of the default test include set.
Suggested change:

```diff
-describe("Comprehensive Search Benchmark", () => {
+const describeBenchmark = process.env.TRILIUM_BENCHMARK ? describe : describe.skip;
+describeBenchmark("Comprehensive Search Benchmark", () => {
```
```ts
    ignoreInternalAttributes: true,
    autocomplete: true
});
ctx.enableFuzzyMatching = fuzzyEnabled;
return searchService.findResultsWithQuery(query, ctx);
```
In the autocomplete benchmark path, ctx.enableFuzzyMatching = fuzzyEnabled is set, but findResultsWithQuery() overrides searchContext.enableFuzzyMatching for autocomplete=true based on the searchAutocompleteFuzzy option. This means the benchmark may not actually be measuring fuzzy ON vs OFF as intended. Consider setting the option explicitly for the duration of the benchmark (or adding a way to bypass the override for test/benchmark code).
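A minimal model of the override described here (behavior inferred from the comment; all names and the option-store shape are hypothetical):

```typescript
// findResultsWithQuery() re-reads the searchAutocompleteFuzzy option when
// autocomplete is true, clobbering a flag set directly on the context.
const options = new Map<string, string>([["searchAutocompleteFuzzy", "false"]]);

interface Ctx { autocomplete: boolean; enableFuzzyMatching: boolean; }

function findResults(ctx: Ctx): boolean {
    if (ctx.autocomplete) {
        ctx.enableFuzzyMatching = options.get("searchAutocompleteFuzzy") === "true";
    }
    return ctx.enableFuzzyMatching;
}

// Setting the flag on the context alone is not enough for the benchmark:
const ctx: Ctx = { autocomplete: true, enableFuzzyMatching: true };
const measuredFuzzy = findResults(ctx); // overridden by the stored option

// Setting the option for the benchmark's duration measures what was intended:
options.set("searchAutocompleteFuzzy", "true");
const measuredFuzzyOn = findResults(ctx);
```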
```ts
describe("Search Profiling", () => {

    afterEach(() => {
        becca.reset();
    });
```
These profiling tests generate large synthetic datasets (up to 10K notes), run many timing loops, and write extensive output to stdout. Since Vitest includes src/**/*.spec.*, this will run in the normal test suite and can significantly slow down CI. Recommend gating behind an env var / describe.skip by default, or moving to a separate benchmark runner script.
```ts
describe("Search profiling (integration)", () => {
    beforeAll(async () => {
        config.General.noAuthentication = true;
        const buildApp = (await import("../src/app.js")).default;
        app = await buildApp();
    });

    it("large-scale profiling (50K notes)", async () => {
        const sql = (await import("../src/services/sql.js")).default;
```
This integration profiling test seeds 50K notes into the in-memory test DB and runs long profiling loops (with a 10-minute timeout). With the current Vitest include pattern (apps/server/vite.config.mts), it will run in CI by default and can also affect subsequent integration tests by mutating the shared DB state. Please gate it behind an explicit env flag or mark it skipped by default.
```ts
const [, ms] = timed(() =>
    searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: true, enableFuzzyMatching: false }))
);
```
SearchContext constructor accepts SearchParams (which doesn't include enableFuzzyMatching), so passing { fastSearch: true, enableFuzzyMatching: false } will fail TypeScript excess-property checks. Construct the context first and then set ctx.enableFuzzyMatching = false, or extend SearchParams if this is intended to be supported.
Suggested change:

```diff
-const [, ms] = timed(() =>
-    searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: true, enableFuzzyMatching: false }))
-);
+const [, ms] = timed(() => {
+    const ctx = new SearchContext({ fastSearch: true });
+    ctx.enableFuzzyMatching = false;
+    return searchService.findResultsWithQuery("test", ctx);
+});
```
Closes #9009.