TreeSearch: Structure-Aware Document Retrieval

TreeSearch is a structure-aware document retrieval library. No vector embeddings. No chunk splitting. SQLite FTS5 keyword matching over document tree structures. Supports Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++ etc.), HTML, XML, JSON, CSV, PDF, and DOCX.

Millisecond-latency search over tens of thousands of documents and large codebases, with structure preservation.

Installation

Python Library

Use this when you want to call TreeSearch from Python code, such as scripts, backend services, or data pipelines:

pip install -U pytreesearch

Then use it in code:

from treesearch import TreeSearch

Python CLI

Use this when you're already in a Python environment but want a command-line tool right away:

pip install -U pytreesearch
treesearch --help

Common commands:

treesearch "How does auth work?" src/ docs/
treesearch index --paths src/ docs/
treesearch search --db ./index.db --query "auth"

Wildcard shortcuts supported in the Python CLI:

auth* for prefix matching
*auth* for contains-style regex matching
Other wildcard shapes currently fall back to regular query parsing
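
The three cases above can be sketched as a small classifier. This is only an illustration of the described behaviour, not TreeSearch's actual query parser; the function name classify_wildcard and its return format are assumptions made for this example.

```python
import re

def classify_wildcard(query: str) -> tuple[str, str]:
    """Return (kind, pattern), where kind is "prefix", "contains", or "plain".

    Hypothetical helper: it mirrors the documented shortcut shapes only.
    """
    core = query.strip("*")
    # A shortcut needs a non-empty core with no wildcard inside it.
    if not core or "*" in core:
        return ("plain", query)
    if query == f"*{core}*":
        # *auth* -> regex matching the escaped literal anywhere
        return ("contains", re.escape(core))
    if query == f"{core}*":
        # auth* -> prefix token, kept in FTS5-style form
        return ("prefix", query)
    # Anything else (e.g. *auth) falls back to regular parsing.
    return ("plain", query)

print(classify_wildcard("auth*"))
print(classify_wildcard("*auth*"))
print(classify_wildcard("a*th"))
```

Shapes such as *auth or a*th fall through to the "plain" branch, matching the fallback rule in the list.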

Explicit query controls:

treesearch --regex "o?auth" src/ to treat the query as raw regex
treesearch search --query "o?auth" --regex for indexed search with regex
treesearch --fts-expression "auth*" src/ to pass a raw FTS5 expression
treesearch search --fts-expression "auth*" for indexed search with raw FTS5 syntax

Rust CLI

Use this when you want a standalone CLI without depending on a Python runtime. The Rust binary name is ts.

Option 1: Homebrew (macOS / Linux, recommended)

brew tap shibing624/tap
brew install treesearch
ts --help

Option 2: Cargo (any platform with Rust toolchain)

cargo install treesearch
ts --help

Option 3: Prebuilt binary download

Download from the current stable release: v1.0.8.

macOS Apple Silicon (M1 / M2 / M3 / M4): aarch64-apple-darwin
macOS Intel: x86_64-apple-darwin
Linux x86_64: x86_64-unknown-linux-gnu
Windows x86_64: x86_64-pc-windows-msvc

After extracting the archive, run ts directly.

Wildcard shortcuts supported in the Rust CLI:

auth* for prefix matching
*auth* for contains-style regex matching
Other wildcard shapes currently fall back to regular query parsing

Explicit query controls:

ts --regex "o?auth" . to treat the query as raw regex
ts search --regex "o?auth" for indexed search with regex
ts --fts-expression "auth*" . to pass a raw FTS5 expression
ts search --fts-expression "auth*" for indexed search with raw FTS5 syntax
Invalid regex patterns now raise an explicit error instead of silently returning no results

Quick Start

from treesearch import TreeSearch

# Just pass directories — auto-discovers all supported files
ts = TreeSearch("project_root/", "docs/")
results = ts.search("How does auth work?")
for doc in results["documents"]:
    for node in doc["nodes"]:
        print(f"[{node['score']:.2f}] {node['title']}")
        print(f"  {node['text'][:200]}")
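
If you only want the best few nodes across all documents, the nested result can be flattened. The sketch below works on a plain dict shaped like the result printed above; top_nodes is a hypothetical helper, and it assumes that a higher score means a better match.

```python
def top_nodes(results: dict, k: int = 5) -> list[dict]:
    """Flatten results["documents"][*]["nodes"] and keep the k best by score."""
    flat = [
        node
        for doc in results.get("documents", [])
        for node in doc.get("nodes", [])
    ]
    # Assumption: larger score = more relevant.
    return sorted(flat, key=lambda n: n["score"], reverse=True)[:k]

# Hand-written sample in the same shape as the search result.
sample = {
    "documents": [
        {"nodes": [{"score": 0.4, "title": "Login", "text": "..."},
                   {"score": 0.9, "title": "Auth overview", "text": "..."}]},
        {"nodes": [{"score": 0.7, "title": "Tokens", "text": "..."}]},
    ]
}
best = top_nodes(sample, k=2)
print([n["title"] for n in best])
```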

Directories are walked recursively with smart defaults:

Auto-discovers .py, .md, .json, .jsonl, .java, .go, .ts, .pdf, .docx, etc.
Skips .git, node_modules, __pycache__, .venv, dist, build, etc.
Respects .gitignore when pathspec is installed (pip install pathspec)
Safety cap of 10,000 files per directory (configurable via max_files)
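
To make the defaults above concrete, here is a minimal directory walker using only the standard library. It is a sketch of the idea, not TreeSearch's discovery code: SKIP_DIRS, EXTENSIONS, and discover_files are names invented for this example, the lists are abbreviated, and .gitignore handling is left out.

```python
import os
import tempfile

SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "dist", "build"}
EXTENSIONS = {".py", ".md", ".json", ".jsonl", ".java", ".go", ".ts", ".pdf", ".docx"}

def discover_files(root: str, max_files: int = 10_000) -> list[str]:
    """Walk root recursively, skipping noise directories, up to max_files."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in EXTENSIONS:
                found.append(os.path.join(dirpath, name))
                if len(found) >= max_files:
                    return found
    return found

# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    os.makedirs(os.path.join(root, "node_modules"))
    for rel in ("src/app.py", "README.md", "node_modules/dep.py", "logo.bin"):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("x")
    found = [os.path.relpath(p, root) for p in discover_files(root)]
    capped = discover_files(root, max_files=1)

print(found)
```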

You can also mix directories, files, and glob patterns freely:

# All three input types work together
ts = TreeSearch("src/", "docs/*.md", "README.md")
results = ts.search("authentication")

In-Memory Mode

For quick searches, scripts, or ephemeral use cases, set db_path=None to skip writing any .db file to disk:

# In-memory mode — no index.db file, all indexes kept in memory
ts = TreeSearch("docs/", db_path=None)
results = ts.search("voice calls")

Queries stay fast even with thousands of documents (under 10 ms for 5,000 docs). The trade-off is that indexes are lost when the process exits. For persistent, incremental indexing, use the default db_path or set it to a file path.

Tree Mode (Best for Papers & Documents)

For academic papers, long documents, and technical docs with deep heading hierarchy, use tree mode to get structure-aware best-first search:

from treesearch import TreeSearch

# Tree mode: anchor retrieval → tree walk → path aggregation
ts = TreeSearch("papers/", "docs/")
results = ts.search("experimental methodology", search_mode="tree")

# Tree mode returns ranked nodes (same as flat mode)
for doc in results["documents"]:
    for node in doc["nodes"]:
        print(f"[{node['score']:.2f}] {node['title']}")

# Plus: tree traversal paths showing how results connect
for path in results["paths"]:
    chain = " > ".join(p["title"] for p in path["path"])
    print(f"[{path['score']:.2f}] {chain}")
    print(f"  {path['snippet'][:200]}")

When to use which mode?

Mode	Best For	MRR Advantage
`"auto"` (default)	Auto-selects based on document type	Best overall — zero config
`"tree"`	Academic papers, technical docs with heading hierarchy	Best on QASPER (MRR 0.50, +25% vs FTS5)
`"flat"`	Code search, keyword-heavy queries	Best on CodeSearchNet (MRR 0.91)

Auto Mode (search_mode="auto", default): Intelligently selects tree vs flat using a three-layer strategy:

Type mapping — Each source_type has an explicit tree-benefit flag (_TREE_BENEFIT)
Depth verification — Only docs with actual tree depth ≥ 2 count as hierarchical
Proportion threshold — If ≥ 30% of docs truly benefit from tree → tree mode; otherwise → flat

This avoids the old "1 markdown among 50 code files → tree for everything" problem.
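
The three layers can be sketched in a few lines. This is an illustration under assumptions, not the library's code: TREE_BENEFIT, choose_mode, and the per-document dict shape (source_type, depth) are hypothetical stand-ins for the internal structures mentioned above.

```python
# Layer 1: explicit per-type flag (abbreviated, hypothetical table).
TREE_BENEFIT = {"markdown": True, "json": True, "code": False, "pdf": False}

def choose_mode(docs: list[dict], threshold: float = 0.30) -> str:
    """Pick "tree" or "flat" for a corpus of {"source_type", "depth"} dicts."""
    if not docs:
        return "flat"
    benefiting = sum(
        1
        for d in docs
        # Layer 2: the type must benefit AND the doc must really be deep.
        if TREE_BENEFIT.get(d["source_type"], False) and d["depth"] >= 2
    )
    # Layer 3: proportion threshold over the whole corpus.
    return "tree" if benefiting / len(docs) >= threshold else "flat"

one_md_among_code = [{"source_type": "markdown", "depth": 3}] + [
    {"source_type": "code", "depth": 1} for _ in range(50)
]
print(choose_mode(one_md_among_code))
```

With one deep Markdown file among 50 code files, the benefiting share is about 2%, so the corpus stays in flat mode.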

Document Type	Tree Benefit?	Depth Check	Auto Mode
Markdown (.md)	✅ Yes	Must have headings (depth ≥ 2)	`tree` if deep
JSON (.json)	✅ Yes	Must have nesting (depth ≥ 2)	`tree` if nested
Code (.py/.js/.go...)	❌ No	—	`flat`
PDF (.pdf)	❌ No	—	`flat`
DOCX (.docx)	❌ No	—	`flat`
CSV (.csv)	❌ No	—	`flat`
Text (.txt)	❌ No	—	`flat`
JSONL (.jsonl)	❌ No	—	`flat`
Unknown types	❌ No (safe default)	—	`flat`

Why TreeSearch?

Traditional RAG systems split documents into fixed-size chunks and retrieve by vector similarity. This destroys document structure, loses heading hierarchy, and misses reasoning-dependent queries.

TreeSearch takes a fundamentally different approach — parse documents into tree structures based on their natural heading hierarchy, then search with FTS5 keyword matching (zero-cost, no API key needed).
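
As a rough picture of what "parse headings, build tree" means, the sketch below turns Markdown headings into nested nodes with a stack. It is a simplified stand-in, not the library's parser: headings_to_tree and the node dict layout are assumptions, and it ignores details such as fenced code blocks that contain # characters.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def headings_to_tree(text: str) -> dict:
    """Build a nested dict tree from ATX-style Markdown headings."""
    root = {"title": "ROOT", "level": 0, "text": "", "children": []}
    stack = [root]  # path from the root to the current section
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = {"title": match.group(2), "level": level, "text": "", "children": []}
            # Close sections at the same or deeper level before attaching.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Body text belongs to the innermost open section.
            stack[-1]["text"] += line + "\n"
    return root

doc = "# Guide\nintro\n## Install\npip\n## Usage\nrun\n# FAQ\n"
tree = headings_to_tree(doc)
print([child["title"] for child in tree["children"]])
```

Each section keeps its own text and its children, so a hit can be returned as a whole section together with its position in the hierarchy.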

	Traditional RAG	TreeSearch
Preprocessing	Chunk splitting + embedding	Parse headings → build tree
Retrieval	Vector similarity search	FTS5 keyword matching (no LLM needed)
Multi-doc	Needs vector DB for routing	FTS5 cross-doc scoring
Structure	Lost after chunking	Fully preserved as tree hierarchy
Dependencies	Vector DB + embedding model	SQLite only (no embedding, no vector DB)

Key Advantages

No vector embeddings — No embedding model to train, deploy, or pay for
No chunk splitting — Documents retain their natural heading structure
No vector DB — No Pinecone, Milvus, or Chroma to manage
Tree-aware retrieval — Heading hierarchy guides search, not arbitrary chunk boundaries
SQLite FTS5 engine — Persistent inverted index with WAL mode, incremental updates, CJK support, and SQL aggregation

Features

Smart directory discovery — ts.index("src/") recursively discovers all supported files; skips .git/node_modules/__pycache__; respects .gitignore
FTS5 search — Zero LLM calls, millisecond-level FTS5 keyword matching, no API key needed
SQLite FTS5 engine — Persistent inverted index, WAL mode, incremental updates, MD structure-aware columns (title/summary/body/code/front_matter), column weighting, CJK tokenization
Tree-structured indexing — Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++/PHP), HTML, XML, JSON, CSV, PDF, and DOCX are parsed into hierarchical trees
Ripgrep-accelerated GrepFilter — Auto-uses system rg for fast line-level matching with transparent native Python fallback; hit-count-based scoring ranks multi-match nodes higher
Parser registry — Extensible ParserRegistry with built-in parsers auto-registered; custom parsers via ParserRegistry.register()
Python AST parsing — ast module extracts classes/functions with full signatures (parameters, return types); regex fallback for syntax errors
PDF/DOCX/HTML parsers — Optional parsers via PyMuPDF, python-docx, beautifulsoup4 (install with pip install pytreesearch[all])
GrepFilter — Exact literal/regex matching for precise symbol and keyword search across tree nodes
Source-type routing — Automatic pre-filter selection based on file type (e.g., code files use GrepFilter + FTS5)
Chinese + English — Built-in jieba tokenization for Chinese and regex tokenization for English
Batch indexing — build_index() supports glob patterns, files, and directories for concurrent multi-file processing
Async-first — All core functions are async with sync wrappers available
Config-driven defaults — search() and build_index() read defaults from get_config(), overridable per-call
CLI included — treesearch "query" path/ for instant search; treesearch index and treesearch search for advanced workflows
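
The Python AST parsing feature listed above can be illustrated with the standard ast module. The sketch below only shows the general technique of extracting classes and functions with their signatures; outline is a hypothetical helper, it needs Python 3.9+ for ast.unparse, and it does not reproduce TreeSearch's parser or its regex fallback.

```python
import ast

def outline(source: str) -> list[str]:
    """List classes and functions with signatures, indented by nesting depth."""
    lines = []

    def visit(node: ast.AST, depth: int) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                lines.append("  " * depth + f"class {child.name}")
                visit(child, depth + 1)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                returns = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                signature = f"def {child.name}({ast.unparse(child.args)}){returns}"
                lines.append("  " * depth + signature)
                visit(child, depth + 1)

    visit(ast.parse(source), 0)
    return lines

code = '''
class UserAuthenticator:
    def login(self, username: str, password: str) -> bool:
        return True

    def verify_token(self, token):
        pass
'''
for line in outline(code):
    print(line)
```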

FTS5 Standalone

from treesearch import FTS5Index, Document, load_index, save_index, md_to_tree
import asyncio

# Option 1: Build from Markdown and save to DB
result = asyncio.run(md_to_tree(md_path="docs/voice-call.md", if_add_node_summary=True))
save_index(result, "indexes/voice-call.db")

# Option 2: Load a previously saved document from DB
doc = load_index("indexes/voice-call.db")  # returns a Document object

# Create FTS5 index and search
fts = FTS5Index(db_path="indexes/fts.db")  # persistent, or omit for in-memory
fts.index_documents([doc])

# Simple keyword search
results = fts.search("authentication config", top_k=5)
for r in results:
    print(f"[{r['fts_score']:.4f}] {r['title']}")

# Advanced FTS5 query syntax
results = fts.search("auth", fts_expression='title:auth AND body:config', top_k=5)

# Per-document aggregation
agg = fts.search_with_aggregation("authentication", group_by_doc=True)
for doc_agg in agg:
    print(f"{doc_agg['doc_name']}: {doc_agg['hit_count']} hits, best={doc_agg['best_score']:.4f}")

CLI

# Default mode: one command does everything (lazy index + search)
treesearch "How does auth work?" src/ docs/
treesearch "configure Redis" project/

# With options
treesearch "auth" src/ --max-nodes 10 --db ./my_index.db

# Advanced: build index separately (for large codebases)
treesearch index --paths src/ docs/ --add-description
treesearch index --paths "docs/*.md" "src/**/*.py" --add-description

# Advanced: search a pre-built index
treesearch search --index_dir ./indexes/ --query "How does auth work?"

How It Works

Input Documents (MD/TXT/Code/JSON/CSV/HTML/XML/PDF/DOCX)
        │
        ▼
   ┌──────────┐
   │  Indexer  │  ParserRegistry dispatch → parse structure → build tree → generate summaries
   └────┬─────┘    (build_index supports glob for batch processing)
        │  SQLite DB (FTS5)
        ▼
   ┌──────────┐
   │  search   │  FTS5/Grep pre-filter → cross-doc scoring → ranked results
   └────┬─────┘
        │  dict result
        ▼
  Ranked nodes with scores and text

Flat Mode: FTS5Index uses a SQLite FTS5 inverted index with structure-aware columns (title/summary/body/code/front_matter) and column weighting for fast scoring. Results are immediate and no LLM is needed. (The default search_mode is "auto", which selects flat or tree per corpus.)
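
Column weighting can be tried directly with Python's built-in sqlite3 module, provided your SQLite build includes FTS5. The sketch below uses a two-column table and FTS5's bm25() auxiliary function; the table name, the sample rows, and the weights are assumptions for illustration and are not TreeSearch's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE nodes USING fts5(title, body)")
conn.executemany(
    "INSERT INTO nodes (title, body) VALUES (?, ?)",
    [
        ("Overview", "auth is described here"),
        ("Auth setup", "configure the service"),
        ("Logging", "rotate files every day"),
        ("Caching", "redis keeps hot keys"),
        ("Deploy", "ship the build artifact"),
    ],
)

def weighted_search(query: str, title_weight: float = 5.0, body_weight: float = 1.0) -> list[str]:
    """Return matching titles, best first, with per-column bm25 weights."""
    rows = conn.execute(
        # bm25() returns lower values for better matches, so sort ascending.
        "SELECT title FROM nodes WHERE nodes MATCH ? ORDER BY bm25(nodes, ?, ?)",
        (query, title_weight, body_weight),
    ).fetchall()
    return [title for (title,) in rows]

print(weighted_search("auth"))
```

Raising the title weight pushes the node whose title contains the term ahead of the one that only mentions it in the body; swapping the weights reverses the order.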

Tree Mode: Best-first search over document trees — FTS5 finds anchor nodes, then walks the tree (parent/child/sibling) with heuristic scoring (title match, term overlap, IDF weighting, generic section demotion) to find optimal paths through the document hierarchy.
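
A generic best-first walk from anchor nodes looks roughly like the sketch below. It is a toy model under stated assumptions, not TreeSearch's scoring: the tree layout, the decay factor, and the title-overlap bonus are invented for this example, and siblings are reached through the shared parent rather than directly.

```python
import heapq

def best_first(tree: dict, anchors: dict, query_terms: set, budget: int = 10, decay: float = 0.5) -> list:
    """Visit nodes in descending score order, starting from anchor hits.

    tree maps node_id -> {"title": str, "parent": id or None, "children": [ids]}.
    anchors maps node_id -> initial score (for example from a keyword index).
    """
    best = dict(anchors)
    heap = [(-score, node_id) for node_id, score in anchors.items()]
    heapq.heapify(heap)
    visited, seen = [], set()
    while heap and len(visited) < budget:
        neg_score, node_id = heapq.heappop(heap)
        if node_id in seen:
            continue
        seen.add(node_id)
        visited.append((node_id, -neg_score))
        node = tree[node_id]
        neighbours = list(node["children"])
        if node["parent"] is not None:
            neighbours.append(node["parent"])
        for nb in neighbours:
            if nb in seen:
                continue
            # Heuristic: decay the score per hop, boost on title term overlap.
            overlap = len(set(tree[nb]["title"].lower().split()) & query_terms)
            score = -neg_score * decay * (1 + overlap)
            if score > best.get(nb, 0.0):
                best[nb] = score
                heapq.heappush(heap, (-score, nb))
    return visited

paper = {
    0: {"title": "Paper", "parent": None, "children": [1, 2]},
    1: {"title": "Methods", "parent": 0, "children": [3]},
    2: {"title": "Results", "parent": 0, "children": []},
    3: {"title": "Experimental design", "parent": 1, "children": []},
}
walk = best_first(paper, anchors={3: 1.0}, query_terms={"experimental", "methods"})
print(walk)
```

Starting from the anchor "Experimental design", the walk climbs to "Methods" first because its title overlaps the query, then reaches the root and the unrelated "Results" section with lower scores.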

Source-Type Routing: For code files, GrepFilter + FTS5 are combined automatically for precise symbol matching. The pre-filter is selected based on file type via PREFILTER_ROUTING.

Use Cases

Use Case 1: Technical Documentation QA (Best Scenario)

Problem: Your company has 100+ technical docs (API docs, design docs, RFCs), and traditional search can't find the right answers.

import asyncio

from treesearch import build_index, search

async def main():
    # 1. Build index: just pass directories (run once)
    docs = await build_index(
        paths=["docs/", "specs/"],
        output_dir="./indexes"
    )

    # 2. Search: millisecond response
    result = await search(
        query="How to configure Redis cluster?",
        documents=docs,
    )

    # 3. Results: complete sections, not fragments
    for doc in result["documents"]:
        print(f"Doc: {doc['doc_name']}")
        for node in doc["nodes"]:
            print(f"  Section: {node['title']}")
            print(f"  Content: {node['text'][:200]}...")

# build_index and search are coroutines, so run them inside an event loop
asyncio.run(main())

Why better than traditional RAG?

Finds complete sections, not fragments
Includes section titles as context anchors
Supports hierarchical navigation (parent/child sections)

Use Case 2: Codebase Search

Problem: Want to search for "login-related classes and methods" in a large codebase, but grep only finds lines without structure.

import asyncio

from treesearch import build_index, search

async def main():
    # Index entire directories: auto-discovers .py, .java, .go, etc.
    docs = await build_index(
        paths=["src/", "lib/"],
        output_dir="./code_indexes"
    )

    # Search: auto-detects code files, uses AST parsing + GrepFilter (ripgrep-accelerated)
    return await search(
        query="user login authentication",
        documents=docs,
    )

result = asyncio.run(main())

# Results example:
# Doc: auth_service.py
#   class UserAuthenticator
#     def login(username, password)
#     def verify_token(token)

Why better than grep/IDE search?

Semantic understanding: Not just keyword matching, understands "login" = "authentication"
Structure-aware: Finds complete classes/methods with docstrings
Precise location: Directly locates to code line numbers

Use Case 3: Long Document QA (Papers/Books)

Problem: Have a 50-page paper, want to ask "What experimental methods are mentioned in Chapter 3?"

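import asyncio

from treesearch import build_index, search

async def main():
    docs = await build_index(paths=["paper.pdf"])
    return await search(
        query="experimental methodology",
        documents=docs,
    )

result = asyncio.run(main())
# Automatically finds "3.2 Experimental Design" section content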

Why better than Ctrl+F?

Semantic matching: Finds synonymous paragraphs for "experimental methods"
Section location: Tells you which chapter and section
Scalable to multi-doc: Search 10 papers simultaneously

Real Case Comparison

Case: Find "How to request GPU machines" in company docs

Traditional way (Ctrl+F):

Search "GPU" → Found 47 matches → Manual review → 10 minutes

TreeSearch way:

# Run inside an async function; docs is the list returned by build_index()
result = await search("How to request GPU machines", docs)
# Directly returns "Resource Guide > GPU Request Process" section
# Time: < 100ms

Efficiency gain: 100x

Comparison with Other Solutions

Solution	Pros	Cons	Best For
Ctrl+F	Simple	No semantic understanding, fragmented results	Known keywords
Vector DB	Similarity search	Requires embedding preprocessing, high cost	Large-scale semantic retrieval
TreeSearch	Preserves structure + Fast + Zero cost	Requires structured documents	Tech docs/Codebase

Benchmark

Document Retrieval (QASPER)

Evaluated on QASPER dataset (50 queries, academic papers):

Metric	Embedding (zhipu-embedding-3)	FTS5	Tree	Auto
MRR	0.4235	0.4033	0.5046	0.5046
R@5	0.4259	0.5337	0.5812	0.5812
R@10	0.6075	0.8372	0.8674	0.8674
Hit@5	0.6383	0.7021	0.7660	0.7660
Hit@10	0.8085	0.9574	0.9787	0.9787
Index Time	0.0s	0.1s	0.1s	0.1s
Avg Query Time	154.8ms	0.7ms	1.0ms	1.0ms

Key Findings:

🏆 Tree / Auto wins MRR (0.50 vs 0.42 Embedding) — Structure-aware tree walk boosts ranking quality
R@5: Tree 0.58 vs Embedding 0.43 — +35% recall
Auto routes to Tree (markdown deep hierarchy) — zero performance loss, 155x faster vs Embedding
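
For readers unfamiliar with the metrics in these tables, the sketch below computes MRR, Hit@k, and R@k using their standard textbook definitions. It is not taken from the benchmark scripts, which may differ in detail; the function names and the toy data are assumptions for illustration.

```python
def mrr(ranked: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant item per query."""
    total = 0.0
    for results, gold in zip(ranked, relevant):
        for rank, item in enumerate(results, start=1):
            if item in gold:
                total += 1.0 / rank
                break
    return total / len(ranked)

def hit_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of queries with at least one relevant item in the top k."""
    hits = sum(1 for results, gold in zip(ranked, relevant) if gold & set(results[:k]))
    return hits / len(ranked)

def recall_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Average fraction of relevant items found in the top k."""
    per_query = [len(gold & set(results[:k])) / len(gold) for results, gold in zip(ranked, relevant)]
    return sum(per_query) / len(ranked)

# Two toy queries: the first finds its answer at rank 2, the second misses.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]
print(mrr(ranked, relevant), hit_at_k(ranked, relevant, 2), recall_at_k(ranked, relevant, 2))
```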

Financial Document Retrieval (FinanceBench)

Evaluated on FinanceBench dataset (50 queries, SEC filings):

Metric	Embedding (zhipu-embedding-3)	FTS5	Tree	Auto
MRR	0.2206	0.2420	0.2512	0.2420
R@5	0.2782	0.2067	0.2344	0.2067
Index Time	406.0s	0.24s	0.24s	0.24s
Avg Query Time	154.3ms	5.7ms	23.5ms	5.4ms

Key Findings:

Tree mode has the highest MRR on financial docs (0.2512): parent context boost lifts low-scoring child nodes. Embedding still leads on R@5 (0.2782 vs 0.2344 for Tree)
Auto routes to FTS5 (flat PDF structure): fastest query at 5.4ms, with the same MRR and R@5 as FTS5
TreeSearch 1692x faster indexing — No embedding API calls

Code Retrieval (CodeSearchNet)

Evaluated on CodeSearchNet dataset (100 queries, Python corpus):

Metric	Embedding (zhipu-embedding-3)	FTS5	Tree	Auto
MRR	0.8483	0.9050	0.2833	0.9100
R@5	0.9400	0.9200	0.3000	0.9200
Index Time	33.8s	2.8s	2.8s	2.8s
Avg Query Time	166.0ms	4.5ms	30.2ms	4.5ms

Key Findings:

🏆 Auto wins MRR (0.91 vs 0.85 Embedding), even edges out FTS5 (0.91 vs 0.905)
Auto routes to FTS5 (code is flat, no hierarchy) — completely avoids Tree's severe degradation on code (MRR 0.28)
TreeSearch 37x faster queries — Milliseconds vs hundreds of milliseconds

Multi-Hop Reasoning (HotpotQA)

Evaluated on HotpotQA dataset (50 queries, multi-hop QA):

Metric	FTS5	Tree	Auto
MRR	0.9712	0.9115	1.0000
SP-Recall@3	0.9939	0.9879	1.0000
2-hop-Cov@3	0.9939	0.9879	1.0000
SP-Recall@5	1.0000	1.0000	1.0000
Avg Latency	6ms	3ms	13ms

Key Findings:

🏆 Auto achieves perfect MRR 1.0 — routes to FTS5 (shallow text), covers all multi-hop questions
Tree slightly lower than FTS5 on flat documents (expected: no structural signal, reranking adds noise)
Auto completely avoids Tree's degradation on flat/shallow documents

Summary

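Auto Mode is the recommended choice for production: it identifies document types automatically and picks tree or flat search per corpus with zero config. In the benchmarks above it matches or beats the best single mode on MRR everywhere except FinanceBench, where Tree is slightly higher (0.2512 vs 0.2420).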

Benchmark	Best Mode	MRR	vs Embedding	Query Speed
QASPER (Academic Papers)	Auto = Tree	0.5046	+19%	155x faster
FinanceBench (SEC Filings)	Auto = FTS5	0.2420	+10%	29x faster
CodeSearchNet (Python)	Auto = FTS5	0.9100	+7%	37x faster
HotpotQA (Multi-hop)	Auto = FTS5	1.0000	—	ultra-fast

Run the benchmarks yourself:

# Document retrieval (QASPER)
python examples/benchmark/qasper_benchmark.py --max-samples 50 --max-papers 20 --with-embedding

# Financial document retrieval (FinanceBench)
python examples/benchmark/financebench_benchmark.py --max-samples 50 --with-embedding

# Code retrieval (CodeSearchNet)
python examples/benchmark/codesearchnet_benchmark.py --max-samples 50 --max-corpus 500 --with-embedding

# Multi-hop reasoning (HotpotQA)
python examples/benchmark/hotpotqa_benchmark.py --max-samples 50

Documentation

Architecture — Design principles and architecture
API Reference — Complete API documentation

Community

GitHub Issues — Submit an issue
WeChat Group — Add WeChat ID xuming624, note "nlp", to join the tech group

Citation

If you use TreeSearch in your research, please cite:

@software{xu2026treesearch,
  author = {Xu, Ming},
  title = {TreeSearch: Structure-Aware Document Retrieval Without Embeddings},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/shibing624/TreeSearch}
}

License

Apache License 2.0

Contributing

Contributions are welcome! Please submit a Pull Request.

Acknowledgements

SQLite FTS5 — The full-text search engine powering TreeSearch
VectifyAI/PageIndex — Inspiration for structure-aware indexing and retrieval
