doc-indexer-mcp

A local document indexer MCP (Model Context Protocol) server written in Rust. It enables semantic search over PDF, Excel, SQL/PL-SQL, Markdown, and HTML files using the Qdrant vector database and Voyage AI embeddings, and is designed for integration with Claude Code CLI and other MCP-compatible tools.

Features

PDF Parsing: Uses pdftotext (poppler) for text extraction with full Unicode support
Excel Parsing: Native Rust parsing via calamine (.xlsx, .xls, .xlsm, .ods)
SQL/PL-SQL Parsing: Extracts procedures, functions, packages, and triggers
Markdown Parsing: Section-aware chunking for documentation
HTML Parsing: Extracts UI text from web application snapshots
Vector Search: Qdrant vector database for semantic similarity search
Embeddings: Voyage AI or OpenAI-compatible embeddings API
MCP Protocol: Full MCP server implementation using rmcp 0.13
Fully Configurable: All settings via environment variables
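
To illustrate what "semantic similarity search" means in the list above, here is a minimal in-memory sketch. It assumes cosine similarity as the distance metric; the metric this project configures in Qdrant is not stated here, and in practice Qdrant performs the ranking server-side. The function names are hypothetical.

```rust
/// Cosine similarity between two vectors. Returns 0.0 for mismatched
/// lengths or zero-length (zero-norm) input.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Indices of the `k` stored vectors most similar to `query`, best first.
fn top_k(query: &[f32], stored: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = stored
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // Sort by descending score.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

Here `query` stands in for the embedding of the search text and `stored` for the embeddings of indexed chunks; `k` plays the role of SEARCH_TOP_K.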

Prerequisites

Rust 2024 Edition (rustc 1.85+)
Qdrant vector database
pdftotext (from poppler-utils) for PDF parsing
Voyage AI API Key (or OpenAI-compatible endpoint)

Installing Dependencies

# macOS
brew install poppler

# Download Qdrant (macOS ARM64)
curl -LO https://github.com/qdrant/qdrant/releases/download/v1.14.0/qdrant-aarch64-apple-darwin.tar.gz
tar xzf qdrant-aarch64-apple-darwin.tar.gz

Configuration

All settings are configurable via environment variables. Copy .env.example to .env and adjust the values:

# Embedding API Configuration
VOYAGE_API_KEY=your-voyage-api-key
EMBEDDING_MODEL=voyage-3-large

# Vector Database Configuration
QDRANT_URL=http://localhost:6334
QDRANT_COLLECTION=doc_index

# Document Paths Configuration
DOCS_PATH=/path/to/your/documents
INDEX_SUBDIRS=docs

# Chunk Settings
PDF_CHUNK_SIZE=1000
PDF_CHUNK_OVERLAP=200
EXCEL_ROWS_PER_CHUNK=50
SQL_MAX_CHUNK_SIZE=4000

# Search Settings
SEARCH_TOP_K=10

# Logging
RUST_LOG=info
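
As a rough illustration of how settings like these can be read with fallbacks, here is a minimal std-only sketch. The struct, the helper names, and the choice of defaults (copied from the example values above) are assumptions; the actual loading code lives in src/config.rs and may differ.

```rust
use std::env;
use std::str::FromStr;

/// Parse an optional raw value, falling back to `default` when the value
/// is missing or does not parse.
fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|value| value.trim().parse().ok())
        .unwrap_or(default)
}

/// Read `key` from the environment with a typed default.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    parse_or(env::var(key).ok(), default)
}

/// Hypothetical subset of the settings listed above.
#[derive(Debug)]
struct Settings {
    qdrant_url: String,
    pdf_chunk_size: usize,
    pdf_chunk_overlap: usize,
    excel_rows_per_chunk: usize,
    search_top_k: usize,
}

impl Settings {
    /// Defaults mirror the example values; they are not confirmed to be
    /// the server's built-in defaults.
    fn from_env() -> Self {
        Self {
            qdrant_url: env_or("QDRANT_URL", "http://localhost:6334".to_string()),
            pdf_chunk_size: env_or("PDF_CHUNK_SIZE", 1000),
            pdf_chunk_overlap: env_or("PDF_CHUNK_OVERLAP", 200),
            excel_rows_per_chunk: env_or("EXCEL_ROWS_PER_CHUNK", 50),
            search_top_k: env_or("SEARCH_TOP_K", 10),
        }
    }
}
```

Calling `Settings::from_env()` once at startup and passing the result around keeps environment access in one place.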

Chunk Size Recommendations

| Document Type | Language | Recommended Size |
| --- | --- | --- |
| PDF | Japanese | 600-800 chars |
| PDF | English | 1000-1500 chars |
| Test Specifications | Any | 1200-1500 chars |
| SQL Code | Any | 4000 chars |
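
To make the interaction between chunk size and overlap concrete, here is a minimal sliding-window sketch. It is not the server's chunker (which is section-aware); `chunk_text` is a hypothetical helper. It counts characters rather than bytes, which matters for Japanese text where one character occupies several bytes.

```rust
/// Split `text` into chunks of at most `size` characters. Each chunk starts
/// `size - overlap` characters after the previous one, so consecutive chunks
/// share `overlap` characters.
fn chunk_text(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With PDF_CHUNK_SIZE=1000 and PDF_CHUNK_OVERLAP=200, an 1800-character text yields two chunks covering characters 0-999 and 800-1799.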

Building

# Development build
cargo build

# Release build (optimized)
cargo build --release

Testing

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test module
cargo test parsers::pdf::tests
cargo test parsers::excel::tests
cargo test parsers::sql::tests

Running

Start Qdrant:

./qdrant

Run the MCP server:

cargo run --release

The server communicates with MCP clients over stdio.

MCP Tools

| Tool | Description |
| --- | --- |
| `index_document` | Index a single document file |
| `index_directory` | Recursively index all supported files in configured subdirectories |
| `search_documents` | Semantic search across indexed documents |
| `delete_document` | Remove a document from the index |
| `get_stats` | Get index statistics |
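
For orientation, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request over the server's stdin. The sketch below builds such a request for `search_documents` using only the standard library. The argument name `query` is an assumption, since the tool's input schema is not documented here; check the schema reported by the server before relying on it.

```rust
/// Escape characters that must not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a `tools/call` request for the `search_documents` tool.
/// The `query` argument name is hypothetical.
fn search_request(id: u64, query: &str) -> String {
    let q = json_escape(query);
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"search_documents","arguments":{{"query":"{q}"}}}}}}"#
    )
}
```

Claude Code builds these requests for you; the sketch is mainly useful when debugging the server by hand.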

Supported File Types

| Extension | Parser | Notes |
| --- | --- | --- |
| `.pdf` | pdftotext | Full Unicode support |
| `.xlsx`, `.xls`, `.xlsm`, `.ods` | calamine | All sheets parsed |
| `.sql`, `.pls`, `.pks`, `.pkb` | SQL Parser | PL/SQL object extraction |
| `.md`, `.markdown` | Markdown Parser | Section-aware chunking |
| `.html`, `.htm` | HTML Parser | UI text extraction |
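
The mapping in the table can be expressed as a small dispatch function. This is a sketch only: `ParserKind` and `parser_for` are hypothetical names, and treating extensions case-insensitively is an assumption rather than documented behavior.

```rust
use std::path::Path;

/// Hypothetical tag for the parser that would handle a file.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ParserKind {
    Pdf,
    Excel,
    Sql,
    Markdown,
    Html,
}

/// Pick a parser from the file extension, or `None` for unsupported files.
fn parser_for(path: &str) -> Option<ParserKind> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(ParserKind::Pdf),
        "xlsx" | "xls" | "xlsm" | "ods" => Some(ParserKind::Excel),
        "sql" | "pls" | "pks" | "pkb" => Some(ParserKind::Sql),
        "md" | "markdown" => Some(ParserKind::Markdown),
        "html" | "htm" => Some(ParserKind::Html),
        _ => None,
    }
}
```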

Integration with Claude Code CLI

Step 1: Build the server

cd /path/to/doc-indexer-mcp
cargo build --release

Step 2: Configure Claude Code CLI

Add the MCP server to your Claude Code configuration file ~/.claude.json:

{
  "mcpServers": {
    "doc-indexer": {
      "command": "/path/to/doc-indexer-mcp/target/release/doc-indexer-mcp",
      "env": {
        "VOYAGE_API_KEY": "your-voyage-api-key",
        "EMBEDDING_MODEL": "voyage-3-large",
        "QDRANT_URL": "http://localhost:6334",
        "QDRANT_COLLECTION": "doc_index",
        "DOCS_PATH": "/path/to/your/documents",
        "INDEX_SUBDIRS": "docs",
        "PDF_CHUNK_SIZE": "1000",
        "PDF_CHUNK_OVERLAP": "200",
        "RUST_LOG": "info"
      }
    }
  }
}

Step 3: Test with Claude Code

Use the /mcp command in Claude Code to test your MCP server:

claude
> /mcp

This will show all available MCP tools. You can then test individual tools:

> Search for "user authentication" in the indexed documents
> Index all documents in the docs folder

Step 4: Project-specific settings (optional)

Create a .claude/settings.json file in your project root for project-specific permissions:

{
  "permissions": {
    "allow": [
      "mcp__doc-indexer__index_document",
      "mcp__doc-indexer__index_directory",
      "mcp__doc-indexer__search_documents",
      "mcp__doc-indexer__get_stats",
      "mcp__doc-indexer__delete_document"
    ]
  }
}

Directory Structure for DOCS_PATH

Organize your documents in the configured subdirectories:

/your/docs/path/
├── docs/                    # Design documents, specifications
│   ├── design_spec.pdf
│   ├── test_spec.pdf
│   └── schema.md
└── sql/                     # SQL and PL/SQL files
    ├── procedures.sql
    └── packages.pkb
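
A recursive walk over the configured subdirectories might look like the sketch below. It assumes INDEX_SUBDIRS is a comma-separated list (the example above only shows a single value) and uses hypothetical function names; the real `index_directory` tool may behave differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

const SUPPORTED: &[&str] = &[
    "pdf", "xlsx", "xls", "xlsm", "ods", "sql", "pls", "pks", "pkb", "md", "markdown", "html",
    "htm",
];

/// True when the file extension is one of the supported types.
fn is_supported(path: &Path) -> bool {
    path.extension()
        .and_then(|e| e.to_str())
        .map(|e| SUPPORTED.iter().any(|s| s.eq_ignore_ascii_case(e)))
        .unwrap_or(false)
}

/// Depth-first walk that collects supported files under `dir`.
fn walk(dir: &Path, found: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            walk(&path, found)?;
        } else if is_supported(&path) {
            found.push(path);
        }
    }
    Ok(())
}

/// Collect supported files from each subdirectory named in `index_subdirs`
/// (assumed comma-separated). Missing subdirectories are skipped.
fn collect_documents(docs_path: &Path, index_subdirs: &str) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for sub in index_subdirs.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        let dir = docs_path.join(sub);
        if dir.is_dir() {
            walk(&dir, &mut found)?;
        }
    }
    found.sort();
    Ok(found)
}
```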

Architecture

src/
├── main.rs              # Entry point
├── config.rs            # Configuration from environment
├── embedding/
│   └── client.rs        # Embeddings API client (Voyage AI)
├── mcp/
│   ├── server.rs        # MCP server setup
│   └── tools.rs         # Tool implementations
├── parsers/
│   ├── mod.rs           # Parser trait and common types
│   ├── pdf.rs           # PDF parser (pdftotext)
│   ├── excel.rs         # Excel parser (calamine)
│   ├── sql.rs           # SQL/PL-SQL parser
│   ├── markdown.rs      # Markdown parser
│   └── html.rs          # HTML parser
└── vector_store/
    └── qdrant.rs        # Qdrant vector database client

Customizing Chunking Logic

Each parser in src/parsers/ implements chunking tailored to its document type. You can customize where chunks begin and end by modifying the section markers and patterns described in the following subsections.

PDF Parser (`src/parsers/pdf.rs`)

The PDF parser uses section markers to split documents into logical chunks:

// Major section markers - customize for your document format
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【On Display】", "【On Save】",
    // Add your own section markers here
];

// Sub-section headers
const SUB_SECTION_HEADERS: &[&str] = &[
    "Action Definition", "Screen Definition", "Error Check",
    // Add your own sub-section patterns
];

Key functions to customize:

classify_line() - Determines line type (section header, content, etc.)
should_start_new_block() - Decides chunk boundaries
split_into_blocks() - Main chunking logic
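
To show how marker-based splitting can work, here is a simplified, self-contained sketch. It reuses the function names listed above, but the bodies are assumptions and not the code in src/parsers/pdf.rs. In this version only major section markers start a new block; sub-section headers are classified but stay inside the current block.

```rust
const MAJOR_SECTION_MARKERS: &[&str] = &["【Initial Display】", "【On Display】", "【On Save】"];
const SUB_SECTION_HEADERS: &[&str] = &["Action Definition", "Screen Definition", "Error Check"];

/// Hypothetical line categories.
#[derive(Debug, PartialEq, Eq)]
enum LineType {
    MajorSection,
    SubSection,
    Blank,
    Content,
}

/// Classify a line by checking whether it starts with a known marker.
fn classify_line(line: &str) -> LineType {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        LineType::Blank
    } else if MAJOR_SECTION_MARKERS.iter().any(|m| trimmed.starts_with(*m)) {
        LineType::MajorSection
    } else if SUB_SECTION_HEADERS.iter().any(|h| trimmed.starts_with(*h)) {
        LineType::SubSection
    } else {
        LineType::Content
    }
}

/// Group lines into blocks, starting a new block at each major section.
fn split_into_blocks(text: &str) -> Vec<String> {
    let mut blocks: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in text.lines() {
        let kind = classify_line(line);
        if kind == LineType::MajorSection && !current.trim().is_empty() {
            blocks.push(current.trim_end().to_string());
            current = String::new();
        }
        if kind != LineType::Blank {
            current.push_str(line.trim());
            current.push('\n');
        }
    }
    if !current.trim().is_empty() {
        blocks.push(current.trim_end().to_string());
    }
    blocks
}
```

Adding a marker to `MAJOR_SECTION_MARKERS` is enough to change where blocks are cut in this sketch.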

Excel Parser (`src/parsers/excel.rs`)

The Excel parser handles structured documents with tables and nested sections:

// Bracketed section markers
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【Data Items】", "【Conditions】",
    // Add markers matching your Excel templates
];

// Row type classification
enum RowType {
    BracketedSection,    // 【Section】
    MajorSection,        // 1. Section
    SubSection,          // 1.1. Sub Section
    TableHeader,         // No | Item Name | ...
    // Add custom row types
}

Key functions to customize:

classify_row() - Classifies Excel rows by type
should_start_new_block() - Determines chunk boundaries
rows_to_markdown() - Converts rows to searchable text
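
The row classification described above could be sketched as follows. The heuristics are assumptions inferred from the comments in the `RowType` enum (a bracketed title, "1." and "1.1." numbering, and a header row whose first cell is "No"); the `Data` variant is a hypothetical fallback, and the real rules in src/parsers/excel.rs may differ.

```rust
#[derive(Debug, PartialEq, Eq)]
enum RowType {
    BracketedSection, // 【Section】
    MajorSection,     // 1. Section
    SubSection,       // 1.1. Sub Section
    TableHeader,      // No | Item Name | ...
    Data,             // hypothetical fallback for everything else
}

/// Number of dot-separated numeric parts in a leading token such as "1.1.".
/// Returns 0 when the cell does not start with that kind of numbering.
fn numbering_depth(cell: &str) -> usize {
    let token = cell.split_whitespace().next().unwrap_or("");
    if !token.ends_with('.') {
        return 0;
    }
    let parts: Vec<&str> = token.trim_end_matches('.').split('.').collect();
    let numeric = parts
        .iter()
        .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()));
    if numeric { parts.len() } else { 0 }
}

/// Classify a row by looking at its first non-empty cell.
fn classify_row(cells: &[&str]) -> RowType {
    let first = cells
        .iter()
        .map(|c| c.trim())
        .find(|c| !c.is_empty())
        .unwrap_or("");
    if first.starts_with('【') && first.contains('】') {
        return RowType::BracketedSection;
    }
    match numbering_depth(first) {
        0 => {}
        1 => return RowType::MajorSection,
        _ => return RowType::SubSection,
    }
    if first == "No" && cells.len() > 1 {
        return RowType::TableHeader;
    }
    RowType::Data
}

/// Render rows as pipe-separated lines so they embed as readable text.
fn rows_to_markdown(rows: &[Vec<&str>]) -> String {
    rows.iter()
        .map(|row| {
            let cells: Vec<&str> = row.iter().map(|c| c.trim()).collect();
            format!("| {} |", cells.join(" | "))
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```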

HTML Parser (`src/parsers/html.rs`)

The HTML parser extracts UI text from web application snapshots:

// CSS class patterns to extract text from
let patterns = [
    ("title", "ui-dialog-title"),
    ("button", "a-Button-label"),
    ("column", "a-GV-headerLabel"),
    // Add patterns matching your UI framework
];

Key functions to customize:

detect_component_type() - Identifies UI component types
extract_texts() - Extracts text by CSS class patterns
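
As a rough idea of class-based text extraction, the sketch below scans for opening tags whose `class` attribute contains a given class name and collects the text that directly follows. It is deliberately naive string scanning with hypothetical signatures, not a real HTML parser and not the implementation in src/parsers/html.rs; it ignores single-quoted attributes, nested markup inside the element, and entities.

```rust
/// True when the tag's double-quoted `class` attribute lists `class_name`.
fn has_class(tag: &str, class_name: &str) -> bool {
    let Some(pos) = tag.find("class=\"") else {
        return false;
    };
    let value = &tag[pos + 7..]; // skip past `class="`
    let value = &value[..value.find('"').unwrap_or(value.len())];
    value.split_whitespace().any(|c| c == class_name)
}

/// Collect the text immediately following each opening tag with `class_name`.
fn extract_texts(html: &str, class_name: &str) -> Vec<String> {
    let mut texts = Vec::new();
    let mut rest = html;
    while let Some(tag_start) = rest.find('<') {
        let after_lt = &rest[tag_start..];
        let Some(tag_end) = after_lt.find('>') else {
            break;
        };
        let tag = &after_lt[..tag_end];
        let body = &after_lt[tag_end + 1..];
        if has_class(tag, class_name) {
            let text_end = body.find('<').unwrap_or(body.len());
            let text = body[..text_end].trim();
            if !text.is_empty() {
                texts.push(text.to_string());
            }
        }
        rest = body;
    }
    texts
}

/// Apply (label, css_class) patterns and tag each text with its label.
fn extract_labeled(html: &str, patterns: &[(&str, &str)]) -> Vec<(String, String)> {
    patterns
        .iter()
        .flat_map(|(label, class)| {
            extract_texts(html, class)
                .into_iter()
                .map(move |text| (label.to_string(), text))
        })
        .collect()
}
```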

SQL Parser (`src/parsers/sql.rs`)

The SQL parser extracts PL/SQL objects (procedures, functions, packages, and triggers):

Key areas to customize:

Object detection patterns for your database schema
Package/procedure boundary detection
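
Object detection can be approximated with a line-based scan for `CREATE [OR REPLACE]` headers, as in the sketch below. `SqlObject` and `detect_objects` are hypothetical names and this is not the parser in src/parsers/sql.rs. It only finds top-level objects whose header starts a line, reports names in upper case, and does not handle modifiers such as EDITIONABLE or procedures nested inside a package body.

```rust
/// Hypothetical record describing one detected object.
#[derive(Debug, PartialEq, Eq)]
struct SqlObject {
    kind: String,
    name: String,
    line: usize, // 1-based line number of the CREATE statement
}

/// Find top-level PL/SQL object declarations in `source`.
fn detect_objects(source: &str) -> Vec<SqlObject> {
    // "PACKAGE BODY" must be checked before "PACKAGE".
    const KINDS: &[&str] = &["PACKAGE BODY", "PACKAGE", "PROCEDURE", "FUNCTION", "TRIGGER"];
    let mut objects = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        // Normalize case and whitespace so "create  or replace" also matches.
        let words: Vec<String> = line
            .split_whitespace()
            .map(|w| w.to_ascii_uppercase())
            .collect();
        let normalized = words.join(" ");
        let rest = match normalized
            .strip_prefix("CREATE OR REPLACE ")
            .or_else(|| normalized.strip_prefix("CREATE "))
        {
            Some(rest) => rest,
            None => continue,
        };
        for kind in KINDS {
            let prefix = format!("{kind} ");
            if let Some(after) = rest.strip_prefix(prefix.as_str()) {
                // The name ends at whitespace, an argument list, or a semicolon.
                let name: String = after
                    .chars()
                    .take_while(|c| !c.is_whitespace() && *c != '(' && *c != ';')
                    .collect();
                if !name.is_empty() {
                    objects.push(SqlObject {
                        kind: kind.to_string(),
                        name,
                        line: idx + 1,
                    });
                }
                break;
            }
        }
    }
    objects
}
```

The detected line numbers can then serve as chunk boundaries: each object runs from its own line to the line before the next object.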

Adding a New Parser

Create a new file in src/parsers/ (e.g., xml.rs)
Implement the DocumentParser trait:

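// Example parser type; the name is illustrative
pub struct XmlParser;

#[async_trait::async_trait]
impl DocumentParser for XmlParser {
    async fn parse(&self, file_path: &str) -> Result<Vec<DocumentChunk>> {
        // Your parsing logic here: read `file_path` and build the chunks.
        // Returning an empty list keeps this skeleton compilable.
        let _ = file_path;
        Ok(Vec::new())
    }

    // Extensions (without the leading dot) handled by this parser
    fn supported_extensions(&self) -> Vec<&'static str> {
        vec!["xml"]
    }
}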

Register in src/parsers/mod.rs
Add to src/mcp/tools.rs in get_parser()
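
The registration steps can be pictured with the simplified sketch below. It uses synchronous stand-ins so that it compiles without external crates: `DocumentChunk`, the `parse_str` method, and this `get_parser` body are assumptions, not the project's actual definitions (the real trait is async and takes a file path).

```rust
/// Simplified stand-in for the project's chunk type (assumption).
#[derive(Debug, PartialEq)]
struct DocumentChunk {
    content: String,
    source: String,
}

/// Simplified, synchronous stand-in for the parser trait (assumption).
trait DocumentParser {
    fn parse_str(&self, source: &str, text: &str) -> Vec<DocumentChunk>;
    fn supported_extensions(&self) -> Vec<&'static str>;
}

struct XmlParser;

impl DocumentParser for XmlParser {
    fn parse_str(&self, source: &str, text: &str) -> Vec<DocumentChunk> {
        // One chunk per non-empty line; a real parser would walk the XML tree.
        text.lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .map(|line| DocumentChunk {
                content: line.to_string(),
                source: source.to_string(),
            })
            .collect()
    }

    fn supported_extensions(&self) -> Vec<&'static str> {
        vec!["xml"]
    }
}

/// Look up a parser by extension. New parsers are added to the list here.
fn get_parser(extension: &str) -> Option<Box<dyn DocumentParser>> {
    let parsers: Vec<Box<dyn DocumentParser>> = vec![Box::new(XmlParser)];
    parsers.into_iter().find(|p| {
        p.supported_extensions()
            .iter()
            .any(|ext| ext.eq_ignore_ascii_case(extension))
    })
}
```

Keeping the extension list on the parser itself means the lookup does not need to change when a parser gains a new extension.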

Troubleshooting

No logs visible in Claude Code CLI

The MCP server logs to stderr, which may not be visible in Claude Code CLI. To debug:

Set RUST_LOG=debug in your configuration

Run the server manually to see logs:

RUST_LOG=debug ./target/release/doc-indexer-mcp

Qdrant connection issues

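Ensure Qdrant is running and that QDRANT_URL points at its gRPC port (default: 6334). Qdrant's REST API, which the curl check uses, listens on port 6333 by default:

./qdrant
# Check: curl http://localhost:6333/collections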
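
If you prefer to check connectivity from Rust, a plain TCP probe against the host and port in QDRANT_URL is enough to tell whether anything is listening. The helper names below are hypothetical, and the sketch only verifies that the port accepts connections, not that the service behind it is Qdrant. IPv6 literals are not handled.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Extract host and port from a URL such as "http://localhost:6334".
/// Returns `None` when no explicit port is present.
fn host_port(url: &str) -> Option<(String, u16)> {
    let without_scheme = url.split_once("://").map(|(_, rest)| rest).unwrap_or(url);
    let authority = without_scheme.split('/').next()?;
    let (host, port) = authority.rsplit_once(':')?;
    if host.is_empty() {
        return None;
    }
    Some((host.to_string(), port.parse().ok()?))
}

/// True when a TCP connection to the URL's host and port succeeds in time.
fn is_reachable(url: &str, timeout: Duration) -> bool {
    let Some((host, port)) = host_port(url) else {
        return false;
    };
    let Ok(addrs) = (host.as_str(), port).to_socket_addrs() else {
        return false;
    };
    addrs
        .into_iter()
        .any(|addr| TcpStream::connect_timeout(&addr, timeout).is_ok())
}
```

For example, `is_reachable("http://localhost:6334", Duration::from_secs(2))` returns `false` when Qdrant has not been started.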

PDF parsing errors

Ensure pdftotext is installed:

which pdftotext
# If not found: brew install poppler
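
The same check can be done programmatically before indexing. The sketch below tests whether a program can be spawned and then runs pdftotext with `-enc UTF-8 -layout` and `-` as the output file, which sends the text to stdout. Which flags this project actually passes is not documented here, so treat the argument list and the helper names as assumptions.

```rust
use std::io;
use std::process::{Command, Stdio};

/// True when `program` can be spawned, i.e. it is installed and on PATH.
fn tool_available(program: &str) -> bool {
    Command::new(program)
        .arg("-v")
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .is_ok()
}

/// Extract text from a PDF by running pdftotext and capturing stdout.
fn extract_text(pdf_path: &str) -> io::Result<String> {
    let output = Command::new("pdftotext")
        .args(["-enc", "UTF-8", "-layout", pdf_path, "-"])
        .output()?;
    if !output.status.success() {
        // Surface pdftotext's own error message to the caller.
        let message = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, message));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Calling `tool_available("pdftotext")` at startup lets the server report a missing dependency up front instead of failing on the first PDF.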

Testing MCP connection

Use Claude Code's /mcp command to verify the server is connected:

claude
> /mcp

This will list all available MCP servers and their tools.

License

MIT License - see LICENSE file.
