
feat(search): try to improve search performance through some creative mechanisms... #9034

Merged
eliandoran merged 35 commits into main from feat/search-perf-take1 on Apr 13, 2026
Conversation

@perfectra1n (Member)

@perfectra1n perfectra1n commented Mar 12, 2026

Closes #9009.

@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 12, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on significantly enhancing search performance, particularly for large note bases and autocomplete scenarios. It introduces extensive profiling tools to identify and measure performance bottlenecks, and implements several key optimizations. These include a new cached flat text index for faster note scanning, a dedicated fast path for single-token autocomplete, and more efficient content snippet extraction and fuzzy search logic. The changes aim to provide a smoother and more responsive search experience for users.

Highlights

  • Search Performance Profiling: Introduced comprehensive integration and unit tests for search profiling, allowing for detailed measurement of performance bottlenecks across various search pipeline stages and at large scales (up to 50,000 notes).
  • Optimized Flat Text Scanning: Implemented a flatTextIndex in the Becca cache, which pre-builds parallel arrays of notes and their normalized flat texts. This significantly reduces overhead during candidate note scanning by avoiding per-note property access and getFlatText() calls for every note.
  • Autocomplete Fast Path: Added a specialized fast path for single-token autocomplete searches. This optimization bypasses the expensive recursive parent walk and directly uses getBestNotePath(), improving responsiveness for user typing.
  • Content Snippet Extraction Improvements: Refactored content snippet extraction to use a faster regex for HTML stripping instead of the striptags library, and streamlined whitespace normalization. The logic for multi-line snippets was also enhanced to prioritize lines containing search tokens.
  • Refined Fuzzy Search Logic: Optimized the two-phase exact/fuzzy search mechanism to skip the fuzzy fallback for limited searches (like autocomplete), as users are typically refining their query and exact matches are sufficient.
  • Reduced Normalization Overhead: Replaced normalizeString with removeDiacritic in highlighting functions and streamlined normalization in content snippet extraction, reducing redundant processing.
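The cached flat-text index described above can be sketched roughly as follows. This is an illustration only: the Note shape, the normalize() helper, and the NoteCache class are simplified stand-ins for Trilium's actual Becca types, not the real implementation.

```typescript
// Minimal sketch of a lazily built flat-text index with parallel arrays.
// All names below are simplified stand-ins, not Trilium's actual API.
interface Note {
    noteId: string;
    getFlatText(): string;
}

function normalize(s: string): string {
    // lowercase + strip combining diacritics (NFKD decomposition)
    return s.toLowerCase().normalize("NFKD").replace(/[\u0300-\u036f]/g, "");
}

class NoteCache {
    notes = new Map<string, Note>();
    private flatTextIndex: { notes: Note[]; flatTexts: string[] } | null = null;

    // Invalidate whenever any note's content or attributes change.
    dirtyFlatTextIndex() {
        this.flatTextIndex = null;
    }

    // Build parallel arrays once; subsequent scans iterate plain arrays
    // instead of calling getFlatText()/normalize() per note per search.
    getFlatTextIndex() {
        let index = this.flatTextIndex;
        if (!index) {
            const notes: Note[] = [];
            const flatTexts: string[] = [];
            for (const note of this.notes.values()) {
                notes.push(note);
                flatTexts.push(normalize(note.getFlatText()));
            }
            index = { notes, flatTexts };
            this.flatTextIndex = index;
        }
        return index;
    }
}
```

The design trades memory for scan speed: the index is rebuilt in full on the next search after any invalidation, which is cheap compared to re-normalizing on every candidate scan.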


Changelog
  • apps/server/spec/search_profiling.spec.ts
    • Added a new integration test file for large-scale search profiling.
    • Implemented helper functions for timing, random ID/word generation, and content generation.
    • Seeded 50,000 notes with a hierarchical structure and a common keyword for testing.
    • Profiled various stages of the search process, including getCandidateNotes scan, findResultsWithQuery (fast and exact), SearchResult construction, computeScore, getNoteTitleForPath, extractContentSnippet, highlightSearchResults, getBestNotePath, and full autocomplete.
    • Included profiling for full search (non-fast) and raw SQL content scanning.
    • Provided a summary of the profiling results.
  • apps/server/src/becca/becca-interface.ts
    • Added flatTextIndex property to Becca class to store pre-built parallel arrays of notes and their flat texts.
    • Modified constructor to initialize flatTextIndex to null.
    • Updated dirtyNoteSetCache to also set flatTextIndex to null, ensuring cache invalidation.
    • Added getFlatTextIndex method to build and return the flatTextIndex if not already cached, normalizing flat texts for fast scanning.
  • apps/server/src/becca/entities/bnote.ts
    • Modified invalidateThisCache to also dirty the becca.flatTextIndex when a note's content or attributes change, ensuring cache consistency.
  • apps/server/src/services/search/expressions/note_flat_text.ts
    • Implemented a fast path for single-token searches with a limit (e.g., autocomplete) to skip expensive recursive parent walks and use getBestNotePath directly.
    • Updated getCandidateNotes to utilize becca.getFlatTextIndex() for efficient iteration over notes and their pre-normalized flat texts, avoiding per-note property access overhead.
    • Introduced a check for isFullSet within getCandidateNotes to optimize membership checks when the input noteSet is not the full set of notes.
  • apps/server/src/services/search/services/search.ts
    • Removed normalizeString and striptags imports, replacing them with more efficient alternatives.
    • Imported removeDiacritic from utils.js for improved text normalization.
    • Added a limit: 200 to the SearchContext for searchNotesForAutocomplete to restrict the number of results.
    • Introduced a fast path in findResultsWithExpression to skip the two-phase fuzzy fallback for limited searches (e.g., autocomplete).
    • Optimized extractContentSnippet by replacing striptags with a faster regex for HTML stripping and streamlining whitespace normalization.
    • Adjusted snippet extraction logic for multi-line content to prioritize lines containing search tokens.
    • Updated extractAttributeSnippet to use normalize instead of normalizeString(token.toLowerCase()).
    • Modified highlightSearchResults to use removeDiacritic instead of normalizeString for highlighting in note path titles, content snippets, and attribute snippets.
  • apps/server/src/services/search/services/search_profiling.spec.ts
    • Added a new unit test file dedicated to granular search profiling.
    • Implemented helper functions for random word generation, HTML content generation, and timing reports.
    • Created a buildDataset function to generate synthetic notes with varying content sizes, match fractions, and hierarchy depths for testing.
    • Monkeypatched note.getContent() to return synthetic content for isolated profiling.
    • Included granular profiling tests for autocomplete pipeline stages, normalizeSearchText cost, searchPathTowardsRoot cost, content snippet extraction performance, two-phase exact/fuzzy search cost, multi-token search performance, end-to-end autocomplete scaling, and comparison of fast vs. non-fast search.
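The changelog mentions replacing the striptags library with a faster regex for snippet extraction. The PR's exact regex is not shown here, but a plausible sketch looks like the following; the function name and the entity handling are assumptions, and this approach is only suitable for trusted note HTML, not general sanitization.

```typescript
// One plausible shape of regex-based HTML stripping for snippet extraction:
// drop tags with a single pass, then collapse whitespace. No parser needed.
function stripHtmlForSnippet(html: string): string {
    return html
        .replace(/<[^>]*>/g, " ")   // remove tags
        .replace(/&nbsp;/g, " ")    // common entity in note content (assumed)
        .replace(/\s+/g, " ")       // collapse runs of whitespace
        .trim();
}
```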

Comment thread on apps/server/src/services/search/services/search.ts (Fixed)

@gemini-code-assist Bot left a comment
Code Review

This pull request introduces several significant performance improvements to the search functionality, primarily by adding a flatTextIndex cache to avoid repeated computations and introducing a fast path for single-token searches like autocomplete. The snippet extraction logic has also been optimized. The inclusion of comprehensive integration and unit-level profiling tests is excellent for validating these performance gains. The changes are well-implemented, but I found one critical syntax error that needs to be addressed.

Comment on lines +207 to +208

}
critical

This closing brace is misplaced and at the wrong indentation level, which introduces a syntax error. The for loop starting on line 192 is not correctly closed. The closing brace should be properly indented and placed to close the loop.

        }

@perfectra1n perfectra1n marked this pull request as draft March 12, 2026 21:33
@perfectra1n perfectra1n marked this pull request as ready for review March 20, 2026 15:32
…l fallback

For operators =, !=, and *=*, the search now tries the FTS5 index first
via searchViaFts(). If FTS is unavailable or fails, it falls back to the
original sequential scan. The flat text attribute search is extracted
into its own searchFlatTextAttributes() method and runs after both
paths.

Collect rows before inserting — iterateRows() holds an open cursor
that conflicts with writes on the same connection.

Adds real SQLite benchmarks showing FTS5 is 15-33x faster for the
raw content query, though end-to-end improvement is masked by JS
pipeline overhead (scoring, snippets, path walking).

- Remove redundant toLowerCase() before normalizeSearchText() in
  search_result.ts (normalizeSearchText already lowercases)
- Pre-normalize tokens once in addScoreForStrings instead of per-chunk
- Skip edit distance computation entirely when fuzzy matching is
  disabled
- Move removeDiacritic() outside the regex while-loop in highlighting
- Cache normalized parent titles per search execution in
  note_flat_text.ts
- Use Set for token lookup in searchPathTowardsRoot (O(1) vs O(n))
- Remove redundant toLowerCase in fuzzyMatchWordWithResult (inputs
  from smartMatch are already normalized)
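The Set-based token lookup mentioned in the bullets above can be illustrated with a small sketch; countMatchedTokens and its inputs are hypothetical stand-ins for searchPathTowardsRoot's real logic, shown only to demonstrate the O(1) membership check.

```typescript
// Count how many words along a note path match any search token.
// Building a Set once per search makes each membership check O(1),
// versus O(n) for tokens.includes(word) on every word.
function countMatchedTokens(pathTitles: string[], tokens: string[]): number {
    const tokenSet = new Set(tokens);       // build once per search
    let matched = 0;
    for (const title of pathTitles) {
        for (const word of title.split(/\s+/)) {
            if (tokenSet.has(word)) {       // O(1) lookup
                matched++;
            }
        }
    }
    return matched;
}
```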
FTS5 query was 32x faster in isolation, but the content scan is only
1-7% of total search time. The JS pipeline (scoring, snippets,
highlighting, tree walk) dominates. The in-memory optimizations in
this PR provide the real gains.

Removes: migration, fts_index service, event wiring, UI option,
integration test. Keeps all in-memory performance optimizations.

The function has multiple callers (not just smartMatch) so it must
normalize inputs itself. Removing toLowerCase broke fuzzy matching
for the two-phase search path.

All numbers re-measured on the same machine/session after the scoring,
highlighting, and tree walk optimizations. Multi-token autocomplete
now shows 50-70% improvement over main.

Adds end-to-end full search (fastSearch=false) comparison tables
for both fuzzy ON and OFF, plus long queries and realistic typo
recovery benchmarks. Full search multi-token shows 45-65% improvement.

Consolidated from 12 sections to 4. Leads with the e2e results a
reviewer cares about, follows with scaling data, then lists what
changed and known limitations. Removed redundant tables and
internal-only details.
@github-actions (Contributor)

📚 Documentation preview is ready!

🔗 Preview URL: https://pr-9034.trilium-docs.pages.dev
📖 Production URL: https://docs.triliumnotes.org

✅ All checks passed

This preview will be updated automatically with new commits.

eliandoran and others added 4 commits April 13, 2026 12:45
Resolved conflicts:
- search_result.ts: Keep optimized index-based token iteration
- search.ts: Merge OCR text representation support with perf optimizations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Apr 13, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Adds user-configurable controls for fuzzy matching (full search vs autocomplete) and implements several server-side search performance optimizations (indexing, caching, and cheaper snippet/highlighting paths) to address UX concerns from #9009.

Changes:

  • Introduces new synced options to toggle fuzzy matching globally and separately for autocomplete, including UI + translations and server option initialization/whitelisting.
  • Optimizes search execution paths (flat-text scanning index with incremental updates, autocomplete single-token fast path, reduced per-result normalization work).
  • Adds new profiling/benchmark spec suites intended to measure search performance characteristics.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
packages/commons/src/lib/options_interface.ts Adds new option types for fuzzy matching toggles.
CLAUDE.md Updates contributor guidance for adding user options (whitelisting + UI + i18n).
apps/server/src/services/search/utils/text_utils.ts Adjusts fuzzy matching helper behavior for performance.
apps/server/src/services/search/services/types.ts Adds autocomplete flag to search params.
apps/server/src/services/search/services/search.ts Wires fuzzy options into search/autocomplete, optimizes snippet extraction and highlighting.
apps/server/src/services/search/services/search_profiling.spec.ts Adds detailed in-memory profiling tests for search pipeline.
apps/server/src/services/search/services/search_benchmark.spec.ts Adds comprehensive benchmark suite across scenarios/scales.
apps/server/src/services/search/search_result.ts Reduces redundant normalization and limits fuzzy scoring work.
apps/server/src/services/search/search_context.ts Initializes fuzzy matching behavior based on user options; adds autocomplete flag.
apps/server/src/services/search/expressions/note_flat_text.ts Adds title caching, attribute normalization usage, autocomplete fast path, and flat-text index scanning.
apps/server/src/services/options_init.ts Adds defaults for the new search options.
apps/server/src/routes/api/options.ts Whitelists new search options for API updates.
apps/server/src/becca/entities/bnote.ts Marks flat-text index entry dirty when note caches are invalidated.
apps/server/src/becca/entities/battribute.ts Adds pre-normalized attribute name/value fields for faster searching.
apps/server/src/becca/becca-interface.ts Implements a flat-text index with incremental updates for fast scanning.
apps/server/spec/search_profiling.spec.ts Adds integration-level profiling that seeds a large dataset into SQLite.
apps/client/src/widgets/type_widgets/options/other.tsx Adds Search settings UI toggles for the new options.
apps/client/src/translations/en/translation.json Adds English strings for the new Search settings section.

Comment on lines 283 to 286

    // Exact match check first (most common case)
    if (normalizedText.includes(normalizedToken)) {
-       // Find the exact match in the original text to preserve case
-       const exactMatch = text.match(new RegExp(escapeRegExp(token), 'i'));
-       return exactMatch ? exactMatch[0] : token;
+       return token;
    }
Copilot AI Apr 13, 2026

fuzzyMatchWordWithResult() docstring says it returns the matched word (and later returns originalWord to preserve case), but the exact-match fast path now returns token instead. That can break callers that expect the returned value to match the original text (case/diacritics) for highlighting. Consider returning the actual matched substring from text (case-preserved) in the exact-match branch, or updating the contract/comments and all call sites accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +68 to 74
    // Pre-compute normalized forms for search (avoids repeated normalize() calls in hot loops)
    this.normalizedName = normalize(this.name);
    this.normalizedValue = normalize(this.value);

    return this;
}

Copilot AI Apr 13, 2026

normalizedName / normalizedValue are computed only in update(), but attributes are frequently modified by setting attr.value = ... and then calling attr.save() (e.g. BNote.setAttribute). In that flow, these precomputed fields become stale, causing search/token matching to miss updated attributes. Recompute normalizedName / normalizedValue in beforeSaving() (after sanitizeAttributeName() and value normalization) and/or ensure they are updated whenever name/value changes.

Suggested change

-     // Pre-compute normalized forms for search (avoids repeated normalize() calls in hot loops)
-     this.normalizedName = normalize(this.name);
-     this.normalizedValue = normalize(this.value);
-
-     return this;
- }
+     this.recomputeNormalizedFields();
+
+     return this;
+ }
+
+ private recomputeNormalizedFields() {
+     // Pre-compute normalized forms for search (avoids repeated normalize() calls in hot loops)
+     this.normalizedName = normalize(this.name || "");
+     this.normalizedValue = normalize(this.value || "");
+ }
+
+ override beforeSaving(_opts?: SavingOpts) {
+     this.name = sanitizeAttributeName(this.name);
+     this.value = this.value || "";
+     this.recomputeNormalizedFields();
+ }

Comment on lines 602 to 606

    // Check if any search token matches the attribute name or value
    const hasMatch = searchTokens.some(token => {
-       const normalizedToken = normalizeString(token.toLowerCase());
+       const normalizedToken = normalize(token);
        return attrName.includes(normalizedToken) || attrValue.includes(normalizedToken);
    });
Copilot AI Apr 13, 2026

normalizedToken uses normalize() (lowercase + diacritics removal), but attrName/attrValue are only lowercased. This makes diacritic-insensitive searches (e.g. searching "resume" for an attribute containing "résumé") fail to detect matches for attribute snippets/highlighting. Normalize the attribute name/value too (or use attr.normalizedName / attr.normalizedValue when available) when performing the includes() checks.

Comment on lines +284 to +285
describe("Comprehensive Search Benchmark", () => {

Copilot AI Apr 13, 2026

This benchmark suite builds datasets up to 20K notes, runs many iterations, and prints extensive console output. With the current Vitest config (apps/server/vite.config.mts includes {src,spec}/**/*.spec.*), this will run in CI by default and likely make pnpm server:test extremely slow/flaky. Consider gating these benchmarks behind an env flag (e.g. if (!process.env.TRILIUM_BENCHMARK) describe.skip(...)) or moving them out of the default test include set.

Suggested change

- describe("Comprehensive Search Benchmark", () => {
+ const describeBenchmark = process.env.TRILIUM_BENCHMARK ? describe : describe.skip;
+ describeBenchmark("Comprehensive Search Benchmark", () => {

Comment on lines +225 to +229
    ignoreInternalAttributes: true,
    autocomplete: true
});
ctx.enableFuzzyMatching = fuzzyEnabled;
return searchService.findResultsWithQuery(query, ctx);
Copilot AI Apr 13, 2026

In the autocomplete benchmark path, ctx.enableFuzzyMatching = fuzzyEnabled is set, but findResultsWithQuery() overrides searchContext.enableFuzzyMatching for autocomplete=true based on the searchAutocompleteFuzzy option. This means the benchmark may not actually be measuring fuzzy ON vs OFF as intended. Consider setting the option explicitly for the duration of the benchmark (or adding a way to bypass the override for test/benchmark code).

Comment on lines +185 to +189
describe("Search Profiling", () => {
    afterEach(() => {
        becca.reset();
    });
Copilot AI Apr 13, 2026

These profiling tests generate large synthetic datasets (up to 10K notes), run many timing loops, and write extensive output to stdout. Since Vitest includes src/**/*.spec.*, this will run in the normal test suite and can significantly slow down CI. Recommend gating behind an env var / describe.skip by default, or moving to a separate benchmark runner script.

Comment on lines +54 to +62
describe("Search profiling (integration)", () => {
    beforeAll(async () => {
        config.General.noAuthentication = true;
        const buildApp = (await import("../src/app.js")).default;
        app = await buildApp();
    });

    it("large-scale profiling (50K notes)", async () => {
        const sql = (await import("../src/services/sql.js")).default;
Copilot AI Apr 13, 2026

This integration profiling test seeds 50K notes into the in-memory test DB and runs long profiling loops (with a 10-minute timeout). With the current Vitest include pattern (apps/server/vite.config.mts), it will run in CI by default and can also affect subsequent integration tests by mutating the shared DB state. Please gate it behind an explicit env flag or mark it skipped by default.

Comment on lines +191 to +193
const [, ms] = timed(() =>
    searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: true, enableFuzzyMatching: false }))
);
Copilot AI Apr 13, 2026

SearchContext constructor accepts SearchParams (which doesn't include enableFuzzyMatching), so passing { fastSearch: true, enableFuzzyMatching: false } will fail TypeScript excess-property checks. Construct the context first and then set ctx.enableFuzzyMatching = false, or extend SearchParams if this is intended to be supported.

Suggested change

- const [, ms] = timed(() =>
-     searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: true, enableFuzzyMatching: false }))
- );
+ const [, ms] = timed(() => {
+     const ctx = new SearchContext({ fastSearch: true });
+     ctx.enableFuzzyMatching = false;
+     return searchService.findResultsWithQuery("test", ctx);
+ });

@eliandoran eliandoran merged commit ad864cf into main Apr 13, 2026
11 of 12 checks passed
@eliandoran eliandoran deleted the feat/search-perf-take1 branch April 13, 2026 11:17
@eliandoran eliandoran added this to the v0.103.0 milestone Apr 13, 2026

Labels

size:L This PR changes 100-499 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Option to easily toggle the fuzzy search on/off

4 participants