A local document indexer MCP (Model Context Protocol) server written in Rust. Enables semantic search over PDF, Excel, SQL/PL-SQL, Markdown, and HTML files using Qdrant vector database and Voyage AI embeddings. Designed for integration with Claude Code CLI and other MCP-compatible tools.
- PDF Parsing: Uses `pdftotext` (poppler) for text extraction with full Unicode support
- Excel Parsing: Native Rust parsing via `calamine` (`.xlsx`, `.xls`, `.xlsm`, `.ods`)
- SQL/PL-SQL Parsing: Extracts procedures, functions, packages, and triggers
- Markdown Parsing: Section-aware chunking for documentation
- HTML Parsing: Extracts UI text from web application snapshots
- Vector Search: Qdrant vector database for semantic similarity search
- Embeddings: Voyage AI or OpenAI-compatible embeddings API
- MCP Protocol: Full MCP server implementation using `rmcp` 0.13
- Fully Configurable: All settings via environment variables
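As a sketch of what "fully configurable" means in practice, settings can be read from the environment with the standard library alone. `env_or` is a hypothetical helper shown for illustration; the actual loading code is assumed to live in `src/config.rs`.

```rust
use std::env;

/// Read an environment variable, falling back to a default when unset.
/// Variable names follow the keys documented in the Configuration section.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

/// Parse a numeric setting, falling back to the default on bad input.
fn env_usize(key: &str, default: usize) -> usize {
    env_or(key, &default.to_string())
        .parse()
        .unwrap_or(default)
}
```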
- Rust 2024 Edition (rustc 1.85+)
- Qdrant vector database
- pdftotext (from poppler-utils) for PDF parsing
- Voyage AI API Key (or OpenAI-compatible endpoint)
```bash
# macOS
brew install poppler

# Download Qdrant (macOS ARM64)
curl -LO https://github.com/qdrant/qdrant/releases/download/v1.14.0/qdrant-aarch64-apple-darwin.tar.gz
tar xzf qdrant-aarch64-apple-darwin.tar.gz
```

All settings are configurable via environment variables. Copy `.env.example` to `.env`:
```bash
# Embedding API Configuration
VOYAGE_API_KEY=your-voyage-api-key
EMBEDDING_MODEL=voyage-3-large

# Vector Database Configuration
QDRANT_URL=http://localhost:6334
QDRANT_COLLECTION=doc_index

# Document Paths Configuration
DOCS_PATH=/path/to/your/documents
INDEX_SUBDIRS=docs

# Chunk Settings
PDF_CHUNK_SIZE=1000
PDF_CHUNK_OVERLAP=200
EXCEL_ROWS_PER_CHUNK=50
SQL_MAX_CHUNK_SIZE=4000

# Search Settings
SEARCH_TOP_K=10

# Logging
RUST_LOG=info
```

Recommended chunk sizes by document type:

| Document Type | Language | Recommended Size |
|---|---|---|
| | Japanese | 600-800 chars |
| | English | 1000-1500 chars |
| Test Specifications | Any | 1200-1500 chars |
| SQL Code | Any | 4000 chars |
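To illustrate how `PDF_CHUNK_SIZE` and `PDF_CHUNK_OVERLAP` interact, here is a minimal character-based chunker. The real parsers are section-aware, so `chunk_text` below is a simplified, hypothetical stand-in, not the actual implementation.

```rust
/// Split text into chunks of `size` characters, repeating `overlap`
/// characters between consecutive chunks so context is not lost at
/// chunk boundaries. Hypothetical helper for illustration only.
fn chunk_text(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        // Step back by `overlap` so the next chunk re-reads the tail
        // of this one.
        start = end - overlap;
    }
    chunks
}
```

With the defaults above (size 1000, overlap 200), a 2500-character document yields three chunks covering 0..1000, 800..1800, and 1600..2500.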
```bash
# Development build
cargo build

# Release build (optimized)
cargo build --release
```

```bash
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test module
cargo test parsers::pdf::tests
cargo test parsers::excel::tests
cargo test parsers::sql::tests
```

- Start Qdrant:

```bash
./qdrant
```

- Run the MCP server:

```bash
cargo run --release
```

The server communicates via stdio following the MCP protocol.
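For a sense of what arrives on stdin, a `tools/call` request for `search_documents` can be built as a single JSON-RPC line. The field names follow the MCP specification's JSON-RPC framing; `search_request` is a hypothetical helper for illustration, and real code would use a JSON library to escape the query properly.

```rust
/// Build a one-line JSON-RPC 2.0 tools/call request for the
/// search_documents tool. Assumes `query` contains no characters that
/// need JSON escaping; production code should use serde_json instead.
fn search_request(query: &str, id: u64) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"search_documents","arguments":{{"query":"{query}"}}}}}}"#
    )
}
```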
| Tool | Description |
|---|---|
| `index_document` | Index a single document file |
| `index_directory` | Recursively index all supported files in configured subdirectories |
| `search_documents` | Semantic search across indexed documents |
| `delete_document` | Remove a document from the index |
| `get_stats` | Get index statistics |
| Extension | Parser | Notes |
|---|---|---|
| `.pdf` | pdftotext | Full Unicode support |
| `.xlsx`, `.xls`, `.xlsm`, `.ods` | calamine | All sheets parsed |
| `.sql`, `.pls`, `.pks`, `.pkb` | SQL Parser | PL/SQL object extraction |
| `.md`, `.markdown` | Markdown Parser | Section-aware chunking |
| `.html`, `.htm` | HTML Parser | UI text extraction |
```bash
cd /path/to/doc-indexer-mcp
cargo build --release
```

Add the MCP server to your Claude Code configuration file `~/.claude.json`:
```json
{
  "mcpServers": {
    "doc-indexer": {
      "command": "/path/to/doc-indexer-mcp/target/release/doc-indexer-mcp",
      "env": {
        "VOYAGE_API_KEY": "your-voyage-api-key",
        "EMBEDDING_MODEL": "voyage-3-large",
        "QDRANT_URL": "http://localhost:6334",
        "QDRANT_COLLECTION": "doc_index",
        "DOCS_PATH": "/path/to/your/documents",
        "INDEX_SUBDIRS": "docs",
        "PDF_CHUNK_SIZE": "1000",
        "PDF_CHUNK_OVERLAP": "200",
        "RUST_LOG": "info"
      }
    }
  }
}
```

Use the `/mcp` command in Claude Code to test your MCP server:
```
claude
> /mcp
```

This will show all available MCP tools. You can then test individual tools:

```
> Search for "user authentication" in the indexed documents
> Index all documents in the docs folder
```
Create a `settings.json` in your project root for project-specific permissions:

```json
{
  "permissions": {
    "allow": [
      "mcp__doc-indexer__index_document",
      "mcp__doc-indexer__index_directory",
      "mcp__doc-indexer__search_documents",
      "mcp__doc-indexer__get_stats",
      "mcp__doc-indexer__delete_document"
    ]
  }
}
```

Organize your documents in the configured subdirectories:
```
/your/docs/path/
├── docs/                  # Design documents, specifications
│   ├── design_spec.pdf
│   ├── test_spec.pdf
│   └── schema.md
└── sql/                   # SQL and PL/SQL files
    ├── procedures.sql
    └── packages.pkb
```
```
src/
├── main.rs                # Entry point
├── config.rs              # Configuration from environment
├── embedding/
│   └── client.rs          # Embeddings API client (Voyage AI)
├── mcp/
│   ├── server.rs          # MCP server setup
│   └── tools.rs           # Tool implementations
├── parsers/
│   ├── mod.rs             # Parser trait and common types
│   ├── pdf.rs             # PDF parser (pdftotext)
│   ├── excel.rs           # Excel parser (calamine)
│   ├── sql.rs             # SQL/PL-SQL parser
│   ├── markdown.rs        # Markdown parser
│   └── html.rs            # HTML parser
└── vector_store/
    └── qdrant.rs          # Qdrant vector database client
```
Each parser in src/parsers/ implements intelligent chunking for its document type. You can customize the chunking behavior by modifying the section markers and patterns.
The PDF parser uses section markers to split documents into logical chunks:
```rust
// Major section markers - customize for your document format
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【On Display】", "【On Save】",
    // Add your own section markers here
];

// Sub-section headers
const SUB_SECTION_HEADERS: &[&str] = &[
    "Action Definition", "Screen Definition", "Error Check",
    // Add your own sub-section patterns
];
```

Key functions to customize:

- `classify_line()` - Determines line type (section header, content, etc.)
- `should_start_new_block()` - Decides chunk boundaries
- `split_into_blocks()` - Main chunking logic
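A minimal sketch of how `classify_line()` and `should_start_new_block()` can work together, using the marker lists above. The `LineType` enum here is a simplified stand-in for the real types in `src/parsers/pdf.rs`.

```rust
/// Simplified stand-in for the parser's internal line classification.
#[derive(Debug, PartialEq)]
enum LineType {
    MajorSection,
    SubSection,
    Content,
}

/// Classify a line by matching it against the configured markers.
fn classify_line(line: &str) -> LineType {
    const MAJOR: &[&str] = &["【Initial Display】", "【On Display】", "【On Save】"];
    const SUB: &[&str] = &["Action Definition", "Screen Definition", "Error Check"];
    let trimmed = line.trim();
    if MAJOR.iter().any(|m| trimmed.starts_with(m)) {
        LineType::MajorSection
    } else if SUB.iter().any(|s| trimmed.starts_with(s)) {
        LineType::SubSection
    } else {
        LineType::Content
    }
}

/// A new chunk starts whenever a section or sub-section header appears.
fn should_start_new_block(line: &str) -> bool {
    classify_line(line) != LineType::Content
}
```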
The Excel parser handles structured documents with tables and nested sections:
```rust
// Bracketed section markers
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【Data Items】", "【Conditions】",
    // Add markers matching your Excel templates
];

// Row type classification
enum RowType {
    BracketedSection, // 【Section】
    MajorSection,     // 1. Section
    SubSection,       // 1.1. Sub Section
    TableHeader,      // No | Item Name | ...
    // Add custom row types
}
```

Key functions to customize:

- `classify_row()` - Classifies Excel rows by type
- `should_start_new_block()` - Determines chunk boundaries
- `rows_to_markdown()` - Converts rows to searchable text
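One possible shape for `rows_to_markdown()`, assuming each row arrives as a `Vec<String>` of cell values; the signature is an assumption, not the real API.

```rust
/// Render the first row as a Markdown table header and the remaining
/// rows as data rows, so Excel content becomes plain searchable text.
/// Hypothetical sketch; the real function lives in src/parsers/excel.rs.
fn rows_to_markdown(rows: &[Vec<String>]) -> String {
    let mut out = String::new();
    if let Some(header) = rows.first() {
        out.push_str(&format!("| {} |\n", header.join(" | ")));
        // Markdown separator row: one "---" cell per header column.
        out.push_str(&format!("|{}\n", "---|".repeat(header.len())));
        for row in rows.iter().skip(1) {
            out.push_str(&format!("| {} |\n", row.join(" | ")));
        }
    }
    out
}
```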
The HTML parser extracts UI text from web application snapshots:
```rust
// CSS class patterns to extract text from
let patterns = [
    ("title", "ui-dialog-title"),
    ("button", "a-Button-label"),
    ("column", "a-GV-headerLabel"),
    // Add patterns matching your UI framework
];
```

Key functions to customize:

- `detect_component_type()` - Identifies UI component types
- `extract_texts()` - Extracts text by CSS class patterns
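A simplified sketch of what `detect_component_type()` can look like, driven by the pattern table above; the real signature and return type may differ.

```rust
/// Map a CSS class attribute to a component label using a pattern table.
/// Each pattern pairs a component kind with the CSS class that marks it.
fn detect_component_type(class_attr: &str) -> Option<&'static str> {
    let patterns: &[(&str, &str)] = &[
        ("title", "ui-dialog-title"),
        ("button", "a-Button-label"),
        ("column", "a-GV-headerLabel"),
    ];
    patterns
        .iter()
        // Match whole class names, not substrings, within the attribute.
        .find(|(_, class)| class_attr.split_whitespace().any(|c| c == *class))
        .map(|(kind, _)| *kind)
}
```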
The SQL parser extracts PL/SQL objects (procedures, functions, packages):
Key functions to customize:
- Object detection patterns for your database schema
- Package/procedure boundary detection
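As an illustration of object detection, here is a line-oriented sketch that spots `CREATE` statements and captures the object kind and name. `detect_object` is a hypothetical helper; the real parser in `src/parsers/sql.rs` also tracks package bodies and `END` boundaries.

```rust
/// Detect a PL/SQL object declaration on a single line, returning the
/// object kind and name. Requires the line to begin with CREATE so that
/// procedures nested inside package bodies are not double-counted.
/// Assumes ASCII keywords, so uppercasing preserves byte offsets.
fn detect_object(line: &str) -> Option<(String, String)> {
    let upper = line.to_uppercase();
    if !upper.trim_start().starts_with("CREATE") {
        return None;
    }
    // Check "PACKAGE BODY" before "PACKAGE" so the longer match wins.
    for kind in ["PROCEDURE", "FUNCTION", "PACKAGE BODY", "PACKAGE", "TRIGGER"] {
        if let Some(pos) = upper.find(kind) {
            let rest = line[pos + kind.len()..].trim_start();
            let name: String = rest
                .chars()
                .take_while(|c| c.is_alphanumeric() || *c == '_' || *c == '.')
                .collect();
            if !name.is_empty() {
                return Some((kind.to_string(), name));
            }
        }
    }
    None
}
```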
- Create a new file in `src/parsers/` (e.g., `xml.rs`)
- Implement the `DocumentParser` trait:

```rust
#[async_trait::async_trait]
impl DocumentParser for XmlParser {
    async fn parse(&self, file_path: &str) -> Result<Vec<DocumentChunk>> {
        // Your parsing logic here
    }

    fn supported_extensions(&self) -> Vec<&'static str> {
        vec!["xml"]
    }
}
```

- Register in `src/parsers/mod.rs`
- Add to `src/mcp/tools.rs` in `get_parser()`
The MCP server logs to stderr, which may not be visible in the Claude Code CLI. To debug:

- Set `RUST_LOG=debug` in your configuration
- Run the server manually to see logs:

```bash
RUST_LOG=debug ./target/release/doc-indexer-mcp
```
Ensure Qdrant is running on the configured port (default: 6334 for gRPC):

```bash
./qdrant
# Check via the REST API (port 6333 by default):
# curl http://localhost:6333/collections
```

Ensure `pdftotext` is installed:

```bash
which pdftotext
# If not found: brew install poppler
```

Use Claude Code's `/mcp` command to verify the server is connected:
```
claude
> /mcp
```

This will list all available MCP servers and their tools.
MIT License - see LICENSE file.