eztakesin/doc-indexer-mcp

doc-indexer-mcp

Rust MCP License: MIT

A local document indexer MCP (Model Context Protocol) server written in Rust. Enables semantic search over PDF, Excel, SQL/PL-SQL, Markdown, and HTML files using Qdrant vector database and Voyage AI embeddings. Designed for integration with Claude Code CLI and other MCP-compatible tools.

Features

  • PDF Parsing: Uses pdftotext (poppler) for text extraction with full Unicode support
  • Excel Parsing: Native Rust parsing via calamine (.xlsx, .xls, .xlsm, .ods)
  • SQL/PL-SQL Parsing: Extracts procedures, functions, packages, and triggers
  • Markdown Parsing: Section-aware chunking for documentation
  • HTML Parsing: Extracts UI text from web application snapshots
  • Vector Search: Qdrant vector database for semantic similarity search
  • Embeddings: Voyage AI or OpenAI-compatible embeddings API
  • MCP Protocol: Full MCP server implementation using rmcp 0.13
  • Fully Configurable: All settings via environment variables

Prerequisites

  1. Rust 2024 Edition (rustc 1.85+)
  2. Qdrant vector database
  3. pdftotext (from poppler-utils) for PDF parsing
  4. Voyage AI API Key (or OpenAI-compatible endpoint)

Installing Dependencies

# macOS
brew install poppler

# Download Qdrant (macOS ARM64)
curl -LO https://github.com/qdrant/qdrant/releases/download/v1.14.0/qdrant-aarch64-apple-darwin.tar.gz
tar xzf qdrant-aarch64-apple-darwin.tar.gz

Configuration

All settings are configurable via environment variables. Copy .env.example to .env:

# Embedding API Configuration
VOYAGE_API_KEY=your-voyage-api-key
EMBEDDING_MODEL=voyage-3-large

# Vector Database Configuration
QDRANT_URL=http://localhost:6334
QDRANT_COLLECTION=doc_index

# Document Paths Configuration
DOCS_PATH=/path/to/your/documents
INDEX_SUBDIRS=docs

# Chunk Settings
PDF_CHUNK_SIZE=1000
PDF_CHUNK_OVERLAP=200
EXCEL_ROWS_PER_CHUNK=50
SQL_MAX_CHUNK_SIZE=4000

# Search Settings
SEARCH_TOP_K=10

# Logging
RUST_LOG=info
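
As an illustration of how these variables might be consumed, here is a minimal sketch that reads a subset of them with fallbacks. The `Settings` struct, the helper names, and the choice of the example values above as defaults are assumptions made for this sketch; the project's actual `src/config.rs` may differ.

```rust
use std::collections::HashMap;

/// Hypothetical settings struct mirroring a subset of the variables above.
#[derive(Debug, PartialEq)]
pub struct Settings {
    pub qdrant_url: String,
    pub qdrant_collection: String,
    pub pdf_chunk_size: usize,
    pub pdf_chunk_overlap: usize,
    pub search_top_k: usize,
}

/// Parse a numeric variable, falling back to `default` when it is missing
/// or not a valid number.
fn number_or(vars: &HashMap<String, String>, key: &str, default: usize) -> usize {
    vars.get(key)
        .and_then(|v| v.trim().parse().ok())
        .unwrap_or(default)
}

/// Read a string variable, falling back to `default` when it is missing.
fn string_or(vars: &HashMap<String, String>, key: &str, default: &str) -> String {
    vars.get(key).cloned().unwrap_or_else(|| default.to_string())
}

/// Build settings from any key/value source; taking a map keeps this testable.
/// The defaults are assumed to match the example values shown above.
pub fn settings_from(vars: &HashMap<String, String>) -> Settings {
    Settings {
        qdrant_url: string_or(vars, "QDRANT_URL", "http://localhost:6334"),
        qdrant_collection: string_or(vars, "QDRANT_COLLECTION", "doc_index"),
        pdf_chunk_size: number_or(vars, "PDF_CHUNK_SIZE", 1000),
        pdf_chunk_overlap: number_or(vars, "PDF_CHUNK_OVERLAP", 200),
        search_top_k: number_or(vars, "SEARCH_TOP_K", 10),
    }
}

/// Build settings from the real process environment.
pub fn settings_from_env() -> Settings {
    let vars: HashMap<String, String> = std::env::vars().collect();
    settings_from(&vars)
}
```

Calling `settings_from_env()` at startup would pick up whatever is exported in the shell or passed through the `env` block of an MCP client configuration.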

Chunk Size Recommendations

Document Type       | Language | Recommended Size
PDF                 | Japanese | 600-800 chars
PDF                 | English  | 1000-1500 chars
Test Specifications | Any      | 1200-1500 chars
SQL Code            | Any      | 4000 chars
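
To make the interaction between chunk size and overlap concrete, the following is a minimal sketch of fixed-size windowing. It assumes sizes are counted in Unicode characters rather than bytes (which matters for Japanese text); `chunk_chars` is a hypothetical helper, not a function from this repository, and the real parsers additionally split on section markers.

```rust
/// Split `text` into windows of at most `size` characters. Each window
/// starts `size - overlap` characters after the previous one, so
/// consecutive chunks share `overlap` characters of context.
pub fn chunk_chars(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");
    // Collect chars first so slicing never cuts a multi-byte character.
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // the last window already reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With `PDF_CHUNK_SIZE=1000` and `PDF_CHUNK_OVERLAP=200`, each window under this scheme would begin 800 characters after the previous one.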

Building

# Development build
cargo build

# Release build (optimized)
cargo build --release

Testing

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test module
cargo test parsers::pdf::tests
cargo test parsers::excel::tests
cargo test parsers::sql::tests

Running

  1. Start Qdrant:
./qdrant
  2. Run the MCP server:
cargo run --release

The server communicates via stdio following the MCP protocol.
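
For orientation, the MCP stdio transport exchanges newline-delimited JSON-RPC 2.0 messages, and after the initialization handshake a client invokes a tool with a `tools/call` request. The sketch below builds such a request line by hand. The `query` argument name is an assumption for illustration; the actual input schema of `search_documents` is whatever the server reports via `tools/list`.

```rust
/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a single-line JSON-RPC `tools/call` request.
/// The "query" argument is hypothetical; check the tool's real schema.
pub fn tools_call_request(id: u64, tool: &str, query: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":"{}","arguments":{{"query":"{}"}}}}}}"#,
        id,
        json_escape(tool),
        json_escape(query)
    )
}
```

Each message must fit on one line, which is why newlines inside arguments are escaped. In practice an MCP client such as Claude Code produces these messages for you.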

MCP Tools

Tool             | Description
index_document   | Index a single document file
index_directory  | Recursively index all supported files in configured subdirectories
search_documents | Semantic search across indexed documents
delete_document  | Remove a document from the index
get_stats        | Get index statistics
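
Conceptually, semantic search embeds the query and returns the stored chunks whose vectors are most similar to it. The server delegates this ranking to Qdrant; the in-memory sketch below only illustrates the idea. Cosine similarity is assumed here as the distance metric, and `cosine` and `top_k` are hypothetical names rather than functions from this repository.

```rust
/// Cosine similarity between two vectors; returns 0.0 for zero vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Rank `(id, embedding)` pairs by similarity to `query`, highest first,
/// and keep the best `k` (the role SEARCH_TOP_K plays in the configuration).
pub fn top_k<'a>(
    query: &[f32],
    docs: &'a [(&'a str, Vec<f32>)],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = docs
        .iter()
        .map(|(id, vector)| (*id, cosine(query, vector)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A vector database performs the same ranking with approximate nearest-neighbor indexes, so it does not need to scan every stored vector.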

Supported File Types

Extension                | Parser          | Notes
.pdf                     | pdftotext       | Full Unicode support
.xlsx, .xls, .xlsm, .ods | calamine        | All sheets parsed
.sql, .pls, .pks, .pkb   | SQL Parser      | PL/SQL object extraction
.md, .markdown           | Markdown Parser | Section-aware chunking
.html, .htm              | HTML Parser     | UI text extraction

Integration with Claude Code CLI

Step 1: Build the server

cd /path/to/doc-indexer-mcp
cargo build --release

Step 2: Configure Claude Code CLI

Add the MCP server to your Claude Code configuration file ~/.claude.json:

{
  "mcpServers": {
    "doc-indexer": {
      "command": "/path/to/doc-indexer-mcp/target/release/doc-indexer-mcp",
      "env": {
        "VOYAGE_API_KEY": "your-voyage-api-key",
        "EMBEDDING_MODEL": "voyage-3-large",
        "QDRANT_URL": "http://localhost:6334",
        "QDRANT_COLLECTION": "doc_index",
        "DOCS_PATH": "/path/to/your/documents",
        "INDEX_SUBDIRS": "docs",
        "PDF_CHUNK_SIZE": "1000",
        "PDF_CHUNK_OVERLAP": "200",
        "RUST_LOG": "info"
      }
    }
  }
}

Step 3: Test with Claude Code

Use the /mcp command in Claude Code to test your MCP server:

claude
> /mcp

This will show all available MCP tools. You can then test individual tools:

> Search for "user authentication" in the indexed documents
> Index all documents in the docs folder

Step 4: Project-specific settings (optional)

Create .claude/settings.json in your project root for project-specific permissions:

{
  "permissions": {
    "allow": [
      "mcp__doc-indexer__index_document",
      "mcp__doc-indexer__index_directory",
      "mcp__doc-indexer__search_documents",
      "mcp__doc-indexer__get_stats",
      "mcp__doc-indexer__delete_document"
    ]
  }
}

Directory Structure for DOCS_PATH

Organize your documents in the configured subdirectories:

/your/docs/path/
├── docs/                    # Design documents, specifications
│   ├── design_spec.pdf
│   ├── test_spec.pdf
│   └── schema.md
└── sql/                     # SQL and PL/SQL files
    ├── procedures.sql
    └── packages.pkb

Architecture

src/
├── main.rs              # Entry point
├── config.rs            # Configuration from environment
├── embedding/
│   └── client.rs        # Embeddings API client (Voyage AI)
├── mcp/
│   ├── server.rs        # MCP server setup
│   └── tools.rs         # Tool implementations
├── parsers/
│   ├── mod.rs           # Parser trait and common types
│   ├── pdf.rs           # PDF parser (pdftotext)
│   ├── excel.rs         # Excel parser (calamine)
│   ├── sql.rs           # SQL/PL-SQL parser
│   ├── markdown.rs      # Markdown parser
│   └── html.rs          # HTML parser
└── vector_store/
    └── qdrant.rs        # Qdrant vector database client

Customizing Chunking Logic

Each parser in src/parsers/ implements intelligent chunking for its document type. You can customize the chunking behavior by modifying the section markers and patterns.

PDF Parser (src/parsers/pdf.rs)

The PDF parser uses section markers to split documents into logical chunks:

// Major section markers - customize for your document format
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【On Display】", "【On Save】",
    // Add your own section markers here
];

// Sub-section headers
const SUB_SECTION_HEADERS: &[&str] = &[
    "Action Definition", "Screen Definition", "Error Check",
    // Add your own sub-section patterns
];

Key functions to customize:

  • classify_line() - Determines line type (section header, content, etc.)
  • should_start_new_block() - Decides chunk boundaries
  • split_into_blocks() - Main chunking logic
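
To show how these three functions could fit together, here is a simplified sketch. The function names come from the list above, but the signatures, the `LineType` enum, and the boundary rule are assumptions for illustration; the implementation in src/parsers/pdf.rs is likely more elaborate.

```rust
const MAJOR_SECTION_MARKERS: &[&str] = &["【Initial Display】", "【On Display】", "【On Save】"];
const SUB_SECTION_HEADERS: &[&str] = &["Action Definition", "Screen Definition", "Error Check"];

/// Hypothetical line categories used by this sketch.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum LineType {
    MajorSection,
    SubSection,
    Content,
    Blank,
}

/// Classify a line by comparing its trimmed start against the marker lists.
pub fn classify_line(line: &str) -> LineType {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        LineType::Blank
    } else if MAJOR_SECTION_MARKERS.iter().any(|m| trimmed.starts_with(*m)) {
        LineType::MajorSection
    } else if SUB_SECTION_HEADERS.iter().any(|h| trimmed.starts_with(*h)) {
        LineType::SubSection
    } else {
        LineType::Content
    }
}

/// Assumed rule: any section header closes the block collected so far.
pub fn should_start_new_block(kind: LineType, current_is_empty: bool) -> bool {
    !current_is_empty && matches!(kind, LineType::MajorSection | LineType::SubSection)
}

/// Group lines into blocks, starting a new block at each section header.
pub fn split_into_blocks(text: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in text.lines() {
        let kind = classify_line(line);
        if kind == LineType::Blank {
            continue; // blank lines carry no searchable text
        }
        if should_start_new_block(kind, current.is_empty()) {
            blocks.push(current.join("\n"));
            current.clear();
        }
        current.push(line.trim());
    }
    if !current.is_empty() {
        blocks.push(current.join("\n"));
    }
    blocks
}
```

Keeping the marker lists as constants means adapting to a new document template is mostly a matter of editing those two arrays.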

Excel Parser (src/parsers/excel.rs)

The Excel parser handles structured documents with tables and nested sections:

// Bracketed section markers
const MAJOR_SECTION_MARKERS: &[&str] = &[
    "【Initial Display】", "【Data Items】", "【Conditions】",
    // Add markers matching your Excel templates
];

// Row type classification
enum RowType {
    BracketedSection,    // 【Section】
    MajorSection,        // 1. Section
    SubSection,          // 1.1. Sub Section
    TableHeader,         // No | Item Name | ...
    // Add custom row types
}

Key functions to customize:

  • classify_row() - Classifies Excel rows by type
  • should_start_new_block() - Determines chunk boundaries
  • rows_to_markdown() - Converts rows to searchable text
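
The sketch below illustrates one way row classification and Markdown conversion could work on rows that have already been read into strings. The `Data` variant, the numbering heuristic, and the signatures are assumptions for this example; the real parser reads cells through calamine and may classify rows differently.

```rust
/// Row categories from the excerpt above, plus an assumed `Data` fallback.
#[derive(Debug, PartialEq)]
pub enum RowType {
    BracketedSection, // 【Section】
    MajorSection,     // 1. Section
    SubSection,       // 1.1. Sub Section
    TableHeader,      // No | Item Name | ...
    Data,
}

/// Count leading "N." groups in a heading such as "1. " or "1.1. ".
/// Returns 0 unless the numbering is followed by a space.
fn numbering_depth(s: &str) -> usize {
    let mut depth = 0;
    let mut rest = s;
    loop {
        let digits = rest.chars().take_while(|c| c.is_ascii_digit()).count();
        if digits == 0 || !rest[digits..].starts_with('.') {
            break;
        }
        depth += 1;
        rest = &rest[digits + 1..];
    }
    if rest.starts_with(' ') { depth } else { 0 }
}

/// Classify a row by its first non-empty cell.
pub fn classify_row(cells: &[String]) -> RowType {
    let first = cells
        .iter()
        .map(|c| c.trim())
        .find(|c| !c.is_empty())
        .unwrap_or("");
    if first.starts_with('【') {
        return RowType::BracketedSection;
    }
    match numbering_depth(first) {
        0 if first == "No" => RowType::TableHeader,
        0 => RowType::Data,
        1 => RowType::MajorSection,
        _ => RowType::SubSection,
    }
}

/// Render rows as Markdown table lines, adding a rule under header rows.
pub fn rows_to_markdown(rows: &[Vec<String>]) -> String {
    let mut out = String::new();
    for row in rows {
        let cells: Vec<&str> = row.iter().map(|c| c.trim()).collect();
        out.push_str(&format!("| {} |\n", cells.join(" | ")));
        if classify_row(row) == RowType::TableHeader {
            let rule = vec!["---"; cells.len()].join(" | ");
            out.push_str(&format!("| {} |\n", rule));
        }
    }
    out
}
```

Rendering tables as Markdown keeps column relationships visible in the text that is later embedded.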

HTML Parser (src/parsers/html.rs)

The HTML parser extracts UI text from web application snapshots:

// CSS class patterns to extract text from
let patterns = [
    ("title", "ui-dialog-title"),
    ("button", "a-Button-label"),
    ("column", "a-GV-headerLabel"),
    // Add patterns matching your UI framework
];

Key functions to customize:

  • detect_component_type() - Identifies UI component types
  • extract_texts() - Extracts text by CSS class patterns
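
As a rough illustration of class-based extraction, the following sketch scans the markup for each class name and collects the text that follows the enclosing tag. It is a deliberately naive string scan written for this example, with an assumed signature; it does not handle nested markup or class names that appear in body text, and the actual parser may rely on a proper HTML library.

```rust
/// For each `(kind, css_class)` pattern, find elements whose markup mentions
/// the class and collect the text up to the next tag.
/// Naive sketch: assumes the text node directly follows the opening tag.
pub fn extract_texts(html: &str, patterns: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut found = Vec::new();
    for (kind, class) in patterns {
        let mut rest = html;
        while let Some(pos) = rest.find(*class) {
            let after = &rest[pos + class.len()..];
            // The text starts after the '>' that closes the opening tag
            // and ends at the next '<'.
            if let Some(gt) = after.find('>') {
                let body = &after[gt + 1..];
                let end = body.find('<').unwrap_or(body.len());
                let text = body[..end].trim();
                if !text.is_empty() {
                    found.push((kind.to_string(), text.to_string()));
                }
            }
            rest = after;
        }
    }
    found
}
```

Results are grouped by pattern order rather than document order, which is usually acceptable when each text is indexed together with its component kind.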

SQL Parser (src/parsers/sql.rs)

The SQL parser extracts PL/SQL objects (procedures, functions, packages, and triggers).

Key areas to customize:

  • Object detection patterns for your database schema
  • Package/procedure boundary detection
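
A minimal sketch of object detection is shown below. It recognizes `CREATE [OR REPLACE]` headers that fit on a single line and normalizes names to upper case; `SqlObject` and `detect_objects` are hypothetical names, and the detection in src/parsers/sql.rs may use different patterns.

```rust
/// A detected PL/SQL object header (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct SqlObject {
    pub kind: String,
    pub name: String,
    pub line: usize, // 1-based line number of the CREATE statement
}

/// Scan line by line for CREATE headers of procedures, functions,
/// packages, package bodies, and triggers.
pub fn detect_objects(sql: &str) -> Vec<SqlObject> {
    let mut objects = Vec::new();
    for (idx, line) in sql.lines().enumerate() {
        let upper = line.to_uppercase();
        let mut words = upper.split_whitespace().peekable();
        if words.next() != Some("CREATE") {
            continue;
        }
        // Skip optional modifiers between CREATE and the object kind.
        while matches!(
            words.peek(),
            Some(&"OR") | Some(&"REPLACE") | Some(&"EDITIONABLE") | Some(&"NONEDITIONABLE")
        ) {
            words.next();
        }
        let kind = match words.next() {
            Some("PACKAGE") => {
                if words.peek() == Some(&"BODY") {
                    words.next();
                    "PACKAGE BODY"
                } else {
                    "PACKAGE"
                }
            }
            Some(k @ ("PROCEDURE" | "FUNCTION" | "TRIGGER")) => k,
            _ => continue, // tables, views, etc. are ignored here
        };
        if let Some(raw) = words.next() {
            // Drop a parameter list or trailing semicolon glued to the name.
            let name = raw.split('(').next().unwrap_or(raw).trim_end_matches(';');
            if !name.is_empty() {
                objects.push(SqlObject {
                    kind: kind.to_string(),
                    name: name.to_string(),
                    line: idx + 1,
                });
            }
        }
    }
    objects
}
```

The recorded line numbers could serve as chunk boundaries: each object would run from its header to the line before the next one.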

Adding a New Parser

  1. Create a new file in src/parsers/ (e.g., xml.rs)
  2. Implement the DocumentParser trait:
#[async_trait::async_trait]
impl DocumentParser for XmlParser {
    async fn parse(&self, file_path: &str) -> Result<Vec<DocumentChunk>> {
        // Your parsing logic here
    }

    fn supported_extensions(&self) -> Vec<&'static str> {
        vec!["xml"]
    }
}
  3. Register in src/parsers/mod.rs
  4. Add to src/mcp/tools.rs in get_parser()
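
The registration steps amount to making the new parser discoverable by file extension. The sketch below shows that dispatch with a simplified, synchronous stand-in for the trait so that it compiles without extra crates. The real `DocumentParser` is async and returns a `Result`; the `DocumentChunk` fields, `parse_text`, and this `get_parser` signature are assumptions for illustration.

```rust
use std::path::Path;

/// Simplified stand-in for the project's chunk type (fields are assumed).
#[derive(Debug, PartialEq)]
pub struct DocumentChunk {
    pub source: String,
    pub text: String,
}

/// Synchronous stand-in for the async DocumentParser trait shown above.
pub trait DocumentParser {
    fn parse_text(&self, source: &str, content: &str) -> Vec<DocumentChunk>;
    fn supported_extensions(&self) -> Vec<&'static str>;
}

pub struct XmlParser;

impl DocumentParser for XmlParser {
    fn parse_text(&self, source: &str, content: &str) -> Vec<DocumentChunk> {
        // Placeholder logic: treat the whole file as a single chunk.
        vec![DocumentChunk {
            source: source.to_string(),
            text: content.trim().to_string(),
        }]
    }

    fn supported_extensions(&self) -> Vec<&'static str> {
        vec!["xml"]
    }
}

/// Pick the first registered parser whose extensions match the file,
/// comparing case-insensitively.
pub fn get_parser<'a>(
    parsers: &'a [Box<dyn DocumentParser>],
    file_path: &str,
) -> Option<&'a dyn DocumentParser> {
    let ext = Path::new(file_path).extension()?.to_str()?.to_ascii_lowercase();
    for parser in parsers {
        if parser.supported_extensions().iter().any(|e| *e == ext) {
            return Some(&**parser);
        }
    }
    None
}
```

With this shape, adding a format means pushing one more boxed parser into the list; the lookup code stays unchanged.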

Troubleshooting

No logs visible in Claude Code CLI

The MCP server logs to stderr, which may not be visible in Claude Code CLI. To debug:

  1. Set RUST_LOG=debug in your configuration
  2. Run the server manually to see logs:
    RUST_LOG=debug ./target/release/doc-indexer-mcp

Qdrant connection issues

Ensure Qdrant is running. QDRANT_URL points at the gRPC port (default: 6334); the REST API that curl can query listens on port 6333 by default:

./qdrant
# Check: curl http://localhost:6333/collections
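
Port 6334 is Qdrant's gRPC port, so it does not answer plain HTTP requests. A TCP-level probe is a simple way to confirm that something is listening there. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper and not part of this project.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Return true when a TCP connection to `addr` (for example
/// "localhost:6334") succeeds within `timeout`. `timeout` must be non-zero.
pub fn port_is_open(addr: &str, timeout: Duration) -> bool {
    match addr.to_socket_addrs() {
        // A host name may resolve to several addresses; any success counts.
        Ok(mut addrs) => addrs.any(|a| TcpStream::connect_timeout(&a, timeout).is_ok()),
        Err(_) => false,
    }
}
```

A successful probe only shows that the port accepts connections, not that the collection exists or that the API key is valid.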

PDF parsing errors

Ensure pdftotext is installed:

which pdftotext
# If not found: brew install poppler
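
For context, a parser that shells out to pdftotext can turn a missing binary into a readable error. The sketch below shows one way to do that with `std::process::Command`. The flags used (`-layout`, `-enc UTF-8`, and `-` to write to stdout) are standard pdftotext options, but which options this project actually passes is an assumption, as are the function names.

```rust
use std::io;
use std::process::Command;

/// Arguments for a text extraction that writes UTF-8 to stdout.
/// The flag selection is an assumption, not taken from the project.
pub fn pdftotext_args(pdf_path: &str) -> Vec<String> {
    vec![
        "-layout".to_string(),
        "-enc".to_string(),
        "UTF-8".to_string(),
        pdf_path.to_string(),
        "-".to_string(), // "-" sends the extracted text to stdout
    ]
}

/// Run pdftotext and return the extracted text, reporting a missing
/// binary with an installation hint.
pub fn extract_pdf_text(pdf_path: &str) -> io::Result<String> {
    let output = Command::new("pdftotext")
        .args(pdftotext_args(pdf_path))
        .output()
        .map_err(|e| {
            if e.kind() == io::ErrorKind::NotFound {
                io::Error::new(
                    io::ErrorKind::NotFound,
                    "pdftotext not found; install poppler (brew install poppler)",
                )
            } else {
                e
            }
        })?;
    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, stderr));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Separating argument construction from process execution keeps the flag choice easy to unit test without a PDF on disk.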

Testing MCP connection

Use Claude Code's /mcp command to verify the server is connected:

claude
> /mcp

This will list all available MCP servers and their tools.

License

MIT License - see LICENSE file.
