Data Indexing & RAG

Motivation

OPAA's value comes from having access to organizational knowledge. This feature describes how documents from diverse sources (wikis, email, file systems) are discovered, processed, and made searchable through semantic embeddings. In addition to organizational data sources discovered by connectors, OPAA also supports direct document uploads by individual users, enabling personal knowledge to enter the RAG pipeline.

The Retrieval-Augmented Generation (RAG) pipeline ensures answers are grounded in actual organizational documents, with full attribution and traceability.


Overview

The Data Indexing & RAG system consists of three phases:

  1. Source Discovery, Upload & Ingestion — Finding documents in various sources or receiving them via user uploads
  2. Document Processing — Extracting, chunking, and embedding documents
  3. Retrieval & Ranking — Finding relevant documents for user questions

Documents enter the pipeline through two paths:

  • Connector-Based Ingestion: OPAA pulls documents from configured data sources (Confluence, email, file systems) on schedule or via events.
  • User Upload Ingestion: Users push documents directly into OPAA through frontends (Web UI, Chat, REST API).

Supported Data Sources

Source Categories

OPAA connects to multiple source types:

1. Knowledge Management Systems

  • Confluence — Wiki pages, spaces, attachments
  • Notion — Pages, databases, wikis
  • MediaWiki — Wikipedia-style wikis
  • Custom Wikis — Via REST API

2. Email Archives

  • Email Servers — IMAP/SMTP (Gmail, Office 365, on-premises Exchange)
  • Email Exports — MBOX, PST files
  • Email Services — Gmail API, Microsoft Graph API

3. File Systems & Cloud Storage

  • Local File Systems — On-premises servers
  • HTTP Directory Listings — Apache mod_autoindex / nginx autoindex servers (see below)
  • Cloud Storage — S3, Azure Blob, Google Cloud Storage, Google Drive, Dropbox
  • Network Drives — SMB/CIFS shares
  • Git Repositories — Documentation in GitHub/GitLab

4. Issue Trackers & Project Management

  • Jira — Issues, comments, attachments
  • GitHub Issues / GitLab Issues — Issues, discussions, pull requests
  • Custom Issue Trackers — Via REST API

5. Document Formats

Automatically detected and processed:

  • Markdown (.md)
  • AsciiDoc (.adoc)
  • PDF (.pdf) — text extracted via OCR if needed
  • Microsoft Office (.docx, .xlsx, .pptx)
  • Plain Text (.txt)
  • HTML (.html)
  • Structured Data (.json, .csv, .xml)

6. APIs & Custom Sources

  • REST APIs — Any system with documented API
  • Webhooks — Push updates to OPAA
  • Custom Connectors — Extensible plugin system

HTTP Directory Listings

OPAA can crawl and index documents from HTTP servers that expose Apache mod_autoindex (or compatible) directory listings. This is useful for accessing document repositories hosted on internal web servers without requiring specialized connectors.

How it works:

  1. OPAA crawls the HTML directory listing at the given URL recursively
  2. Discovers all files across subdirectories
  3. Downloads each file to a temporary location for processing
  4. Uses the lastModified timestamp from the directory listing to skip downloads of unchanged files (bandwidth optimization)
  5. After download, computes a SHA-256 checksum on the file for content-based deduplication (detects renames, ensures content integrity)
  6. Processes each file through the standard pipeline (extraction, chunking, embedding)
  7. Cleans up temporary files after processing
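
The timestamp-skip and checksum steps (4–5) can be sketched as follows. This is an illustrative sketch, not OPAA's actual API; the function names are assumptions:

```python
import hashlib
from datetime import datetime, timezone

def should_download(remote_last_modified, cached_last_modified):
    """Step 4: skip the download when the listing's lastModified timestamp
    is not newer than the one recorded on the previous indexing run."""
    if cached_last_modified is None:   # never seen before
        return True
    return remote_last_modified > cached_last_modified

def content_checksum(data: bytes) -> str:
    """Step 5: SHA-256 over the downloaded bytes. Identical content under a
    new name hashes to the same value, so renames are detected and the file
    is not re-indexed as a new document."""
    return hashlib.sha256(data).hexdigest()
```

The checksum is computed after download because the directory listing only exposes names and timestamps, not content hashes.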

Supported features:

  • Basic authentication (username:password)
  • HTTP proxy support (host:port)
  • Insecure SSL mode (skip certificate verification for self-signed certificates)
  • Recursive directory traversal
  • Robust HTML parser that handles various Apache/nginx autoindex output formats

Triggering URL-based indexing:

Via Admin UI: Open the Admin drawer, expand "URL Source (optional)", enter the URL and optional proxy/credentials, then click "Index Documents".

Via API:

curl -X POST http://localhost:8080/api/v1/indexing/trigger \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://files.example.com/documents/",
    "proxy": "proxy.example.com:8080",
    "credentials": "user:password",
    "insecureSsl": false
  }'

When no URL is provided, the standard filesystem-based indexing is triggered instead.

Connector Model and Workspace Mapping

A connector defines the type and shared configuration (credentials, schedule). Each connector has one or more sources, each of which can be mapped to one or more OPAA workspaces. Connectors and sources are configured independently of workspaces — Workspace-Admins then choose which available sources to include in their workspace. Only System-Admins can create connectors and define source mappings.

Some connector types have a natural instance level with sub-units (e.g., Confluence server with spaces). Others have no shared instance — each source is standalone (e.g., individual file paths or URLs).

Example 1: Confluence (instance with sub-units)
  Connector: "Confluence Production"
    Type: confluence
    URL: https://wiki.company.com
    Credentials: service-account / API-token
    Schedule: Daily 2 AM
    Sources:
      Space "ENG"  → Workspaces: ["Engineering"]
      Space "MKT"  → Workspaces: ["Marketing"]
      Space "HR"   → Workspaces: ["HR", "Onboarding"]
      Space "ALL"  → Workspaces: ["Company"]

Example 2: File System / Network Drive (one path per source)
  Connector: "Network Drive Engineering"
    Type: filesystem
    Schedule: Daily 3 AM
    Sources:
      Path "//fileserver/engineering/docs" → Workspaces: ["Engineering"]

Example 3: HTTP Directory (one URL per source)
  Connector: "Docs Server Engineering"
    Type: http
    Schedule: Daily 4 AM
    Sources:
      URL "https://docs.internal/engineering/" → Workspaces: ["Engineering", "Phoenix"]

Connector Types and Their Sources

Connector Type               Shared Config (Connector)   Source (one or more per connector)
Confluence                   Server URL, Credentials     Space key
Jira                         Server URL, Credentials     Project key
Email (IMAP)                 Server URL, Credentials     Folder / Label
File System / Network Drive  optionally Schedule         Path (local or UNC)
HTTP Directory               optionally Proxy, Auth      URL
Git                          optionally Credentials      Repository URL + Branch

Mapping Rules

  • 1:N — Each source can be mapped to one or more OPAA workspaces. Documents from that source are indexed into all mapped workspaces (their chunks receive all corresponding workspace_ids).
  • Unmapped sub-units are ignored (e.g., Confluence spaces without a mapping are not indexed)
  • Multiple connectors can index into the same workspace (e.g., Confluence space "ENG" + network drive path both → "Engineering")
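
The mapping rules above can be sketched as a simple lookup; the in-memory shape and function name are assumptions for illustration:

```python
# Hypothetical in-memory form of the source → workspace mapping
# (values taken from the Confluence example above).
MAPPINGS = {
    ("confluence", "ENG"): ["Engineering"],
    ("confluence", "HR"):  ["HR", "Onboarding"],
}

def workspace_ids_for(source_type: str, sub_unit: str) -> list:
    """Return every workspace a document from this source is indexed into.
    An empty list means the sub-unit is unmapped and must be skipped,
    per the rule that unmapped sub-units are ignored."""
    return MAPPINGS.get((source_type, sub_unit), [])
```

A document's chunks would receive the full returned list as their workspace_ids, which is what makes 1:N mapping work without duplicating the document.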

Source Filtering

Each source can optionally define include/exclude patterns:

Source: Confluence Space "ENG" → Workspace "Engineering"
Filtering:
  - Include patterns: ["public/*", "team/*"]
  - Exclude patterns: ["draft/*", "archive/*"]
Incremental: Only new/changed documents
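
A minimal sketch of the include/exclude evaluation, using shell-style wildcards (the actual pattern syntax OPAA uses is not specified here, so `fnmatch` semantics are an assumption):

```python
from fnmatch import fnmatch

def is_included(path: str, include: list, exclude: list) -> bool:
    """A document passes the source filter when it matches no exclude
    pattern and at least one include pattern (or no includes are set).
    Exclude wins over include, which is the common convention."""
    if exclude and any(fnmatch(path, p) for p in exclude):
        return False
    if include:
        return any(fnmatch(path, p) for p in include)
    return True
```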

User Document Upload

Concept

In addition to connector-based ingestion, users can upload documents directly into OPAA through any frontend (Web UI, Chat, REST API). Uploaded documents are stored on a configurable storage backend and processed through the same document processing pipeline as connector-sourced documents.

How It Differs from Connectors

Aspect      Connectors                       User Upload
Direction   OPAA pulls from sources          User pushes to OPAA
Trigger     Scheduled or event-based         On-demand (user action)
Scope       Organizational data sources      Individual user documents
Workspace   Configured per connector         User's personal workspace (default)
Storage     Original stays in source system  Stored on OPAA's storage backend

Upload Flow

  1. User selects file(s) through frontend (Web UI drag-and-drop, chat attachment, or API multipart upload)
  2. File is validated (format, size limits, virus scan)
  3. File is stored on the configured storage backend (S3, network drive, local FS)
  4. Document enters the standard processing pipeline (extraction, chunking, embedding, vector storage)
  5. Document is indexed into the user's personal workspace by default
  6. User can optionally share/publish into other workspaces they have access to

Storage Backend Abstraction

Uploaded files are stored on a pluggable storage backend, chosen at deployment time. This is separate from the vector database — the storage backend holds the original uploaded files (PDF, DOCX, etc.) for download and re-processing, while the vector database holds the embeddings and chunk text for search.

Option 1: S3-Compatible Object Storage

  • AWS S3, MinIO, or any S3-compatible store
  • Best for cloud and hybrid deployments
  • Built-in redundancy and lifecycle management

Option 2: Network Drive (SMB/NFS)

  • Shared file system mount
  • Best for on-premises deployments with existing file servers
  • Familiar to operations teams

Option 3: Local Filesystem

  • Direct disk storage on OPAA server
  • Simplest option for small deployments and development
  • Requires separate backup strategy

Storage Backend Configuration:

storage:
  backend: "s3"  # or "network-drive" or "local"
  s3:
    endpoint: "https://s3.company.com"
    bucket: "opaa-uploads"
    region: "eu-central-1"
  network-drive:
    path: "//fileserver/opaa-uploads"
  local:
    path: "/data/opaa/uploads"
  limits:
    max_file_size: "50MB"
    allowed_formats: ["pdf", "docx", "md", "txt", "pptx", "xlsx"]
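
The pluggable backend behind the `storage.backend` setting can be sketched as a small interface. The class and method names below are illustrative, not OPAA's actual types; S3, network-drive, and local implementations would all satisfy the same contract:

```python
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    """Hypothetical interface selected at deployment time via config."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> str:
        """Store the original uploaded file; return its storage path."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Retrieve the original file for download or re-processing."""

class LocalBackend(StorageBackend):
    """Backend for `backend: "local"`: direct disk storage under one root."""

    def __init__(self, root: Path):
        self.root = root

    def put(self, key: str, data: bytes) -> str:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        return str(target)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()
```

Because only original files live here (embeddings live in the vector database), switching backends means migrating files, not re-indexing.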

Supported Upload Formats

Same document formats as connector-sourced documents (see Document Formats section above), with these additions for the upload context:

  • Maximum file size configurable (default: 50 MB)
  • Batch upload support (multiple files at once)
  • Drag-and-drop in Web UI
  • File attachment in chat platforms

Upload Metadata

Each uploaded document stores:

{
  "document_id": "upload-456",
  "filename": "design-review-q1.pdf",
  "uploaded_by": "user-123",
  "uploaded_at": "2026-02-16T10:30:00Z",
  "workspace_id": "personal-user-123",
  "storage_backend": "s3",
  "storage_path": "s3://opaa-uploads/user-123/design-review-q1.pdf",
  "file_size_bytes": 2048576,
  "content_type": "application/pdf",
  "source_type": "user_upload"
}

Document Processing Pipeline

Step 1: Discovery & Extraction

For each source, OPAA:

  • Connects to source system
  • Lists all available documents
  • Checks modification timestamp against last index
  • Downloads new/modified documents
  • Extracts text content (handles binary formats like PDF)

For user uploads: The discovery step is replaced by the upload event itself. The uploaded file is retrieved from the storage backend and enters the pipeline at the extraction phase. All subsequent steps (chunking, embedding, storage) are identical to connector-sourced documents.

Error Handling:

  • Skips documents that can't be extracted
  • Logs failures for admin review
  • Retries failed documents on next run

Step 2: Chunking

Large documents are broken into smaller chunks:

  • Strategy: Semantic chunking (split on natural boundaries)
  • Chunk Size: 512-1024 tokens (configurable)
  • Overlap: 10% overlap between chunks to preserve context
  • Metadata: Each chunk preserves:
    • Source document ID
    • Document title
    • Chunk position
    • Timestamp

Example:

Document: "Enterprise Architecture Guide" (15,000 words)
↓
Chunks:
  1. "Introduction & Principles" (chunk 0)
  2. "Infrastructure Layer" (chunk 1)
  3. "Application Architecture" (chunk 2)
  ...
  15. "Appendix & References" (chunk 14)
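
The size/overlap arithmetic of Step 2 can be sketched as below. Real semantic chunking splits on natural boundaries (headings, paragraphs) rather than fixed offsets, so this fixed-size version is a simplification:

```python
def chunk_tokens(tokens: list, size: int = 512, overlap_frac: float = 0.10):
    """Split a token sequence into chunks of at most `size` tokens with
    ~10% overlap between consecutive chunks to preserve context."""
    step = max(1, int(size * (1 - overlap_frac)))  # advance per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks
```

Each resulting chunk would then be stored with its source document ID, title, position, and timestamp as listed above.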

Step 3: Embedding Generation

Each chunk is converted to a semantic embedding:

  • Model Choice: Configurable (OpenAI, open-source alternatives)
  • Dimension: 1536 for OpenAI, configurable for others
  • Caching: Embeddings cached to avoid re-computing
  • Batching: Processed in batches for efficiency
  • Error Recovery: Failed embeddings logged for retry

Cost Consideration: Embedding generation has minimal cost compared to LLM inference. Organizations can use cheaper embedding models.
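
The caching and batching behavior of Step 3 can be sketched as follows; `embed_fn` stands in for whatever embedding model the deployment configures, and the class is illustrative rather than OPAA's actual implementation:

```python
import hashlib

class CachingEmbedder:
    """Cache embeddings keyed by a hash of the chunk text, so unchanged
    chunks are never re-embedded; send cache misses in one batched call."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # batch function: list[str] -> list[vector]
        self.cache = {}
        self.model_calls = 0       # for observing batching behavior

    def embed_batch(self, texts):
        missing = [t for t in texts if self._key(t) not in self.cache]
        if missing:
            self.model_calls += 1  # one model call per batch of misses
            for text, vec in zip(missing, self.embed_fn(missing)):
                self.cache[self._key(text)] = vec
        return [self.cache[self._key(t)] for t in texts]

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
```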

Step 4: Storage in Vector Database

Processed chunks stored with:

  • Embedding vector
  • Chunk text
  • Metadata (source, document ID, timestamp, chunk index)
  • Document URL (for retrieval)
  • Workspace IDs (for multi-tenancy; will also support cross-workspace sharing in the future)

Metadata Stored:

{
  "chunk_id": "doc-123-chunk-5",
  "document_id": "doc-123",
  "document_title": "Enterprise Architecture Guide",
  "workspace_ids": ["workspace-eng"],
  "source": "confluence",
  "source_type": "connector",
  "source_url": "https://wiki.company.com/pages/view/123456",
  "chunk_index": 5,
  "chunk_text": "...",
  "embedding": [0.123, -0.456, ...],
  "indexed_at": "2024-02-16T14:30:00Z"
}

Note: workspace_ids is an array. A document can appear in multiple workspaces (e.g., when a source is mapped to multiple workspaces, or in the future via cross-workspace sharing). Permission enforcement uses this field as a metadata filter in the vector search (see Access Control — Query-Time Permission Enforcement).
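
The permission semantics of the workspace_ids filter reduce to a set intersection; a minimal sketch (the function name is illustrative):

```python
def chunk_visible(chunk_workspace_ids, user_workspace_ids):
    """A chunk is searchable for a user when at least one of its
    workspace_ids is among the workspaces the user is a member of.
    In the real system this runs as a metadata filter inside the
    vector search, not as Python post-processing."""
    return bool(set(chunk_workspace_ids) & set(user_workspace_ids))
```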

Step 5: Index Updates

Incremental processing:

  • Only new/modified documents processed
  • Changed chunks updated in vector store
  • Deleted documents removed from index
  • Full re-index available (force option)

Supported Vector Databases

OPAA supports multiple vector database backends. Organizations choose based on:

  • Infrastructure constraints (on-premises vs. cloud)
  • Scale requirements
  • Cost considerations
  • Integration with existing systems

Option 1: Elasticsearch with Vector Search

  • Self-hosted or managed
  • Hybrid search (vector + keyword)
  • Advanced filtering and aggregation
  • Familiar to many ops teams

Option 2: PostgreSQL + pgvector

  • Lightweight, runs in existing database
  • No additional infrastructure
  • Good for small to mid-size deployments
  • SQL-native integration

Option 3: Milvus

  • Open-source vector database
  • Designed for large-scale similarity search
  • Self-hosted, horizontally scalable
  • Optimized for high throughput

Option 4: Cloud Vector Databases

  • Pinecone, Weaviate, Qdrant (managed)
  • Easy managed option
  • Scalability built-in
  • Can be combined with on-premises fallback

Implementation Detail

The vector database is chosen at deployment time, not at application design time, so there is no vendor lock-in: switching databases requires re-indexing but no code changes.

OPAA uses Spring AI's VectorStore abstraction for all indexing and retrieval operations. Embedding generation, storage, and similarity search are delegated to the VectorStore interface, making the vector database backend interchangeable via configuration.


Retrieval & Ranking

Retrieval Process

When a user asks a question:

  1. Workspace-IDs: Load all workspace IDs the user is a member of
  2. Embedding Generation: Question converted to embedding (same model as documents)
  3. Vector Search with Workspace Filter: Find top-K similar chunks, filtering by workspace_ids — only chunks whose workspace_ids include at least one of the user's workspace IDs are searched. The permission filter is part of the vector search itself, not a post-processing step.
  4. Deduplication: Remove duplicate information from the same document
  5. Source Deduplication: When multiple chunks originate from the same file, only the chunk with the highest relevance score is kept as source reference (implemented in QueryService.mapSources())
  6. Re-ranking: Score results by relevance
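
Step 5 (source deduplication) can be sketched as follows; the text attributes the real implementation to QueryService.mapSources(), so this Python version is only an illustration of the rule "keep the highest-scoring chunk per file":

```python
def dedupe_sources(chunks):
    """Keep only the best-scoring chunk per document as the source
    reference, then order the remaining sources by relevance."""
    best = {}
    for chunk in chunks:
        doc = chunk["document_id"]
        if doc not in best or chunk["score"] > best[doc]["score"]:
            best[doc] = chunk
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)
```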

Retrieval Configuration

Retrieval:
  similarity_threshold: 0.6
  top_k: 20
  apply_permissions: true
  chunk_recency_boost: true
  source_diversity: true

Re-ranking Strategy

After initial retrieval, results scored by:

  • Semantic Similarity: How close the chunk's embedding is to the question's
  • Document Recency: Newer documents ranked higher (optional)
  • Source Trust Score: Frequently updated sources ranked higher (optional)
  • Keyword Overlap: Exact phrase matches in document (optional)

Score Combination:

final_score = (
  0.6 * semantic_similarity +
  0.2 * recency_boost +
  0.1 * source_trust +
  0.1 * keyword_overlap
)
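
As runnable code, the combination above is a straight weighted sum; the weights are the defaults shown and would be configurable in practice:

```python
def final_score(semantic_similarity, recency_boost=0.0,
                source_trust=0.0, keyword_overlap=0.0):
    """Weighted re-ranking score; optional signals default to 0 when
    the corresponding boost is disabled in the retrieval config."""
    return (0.6 * semantic_similarity
            + 0.2 * recency_boost
            + 0.1 * source_trust
            + 0.1 * keyword_overlap)
```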

Confidence Scoring

System provides confidence for each retrieved document:

  • High (> 0.85): Definitely relevant to question
  • Medium (0.6 - 0.85): Probably relevant
  • Low (< 0.6): Questionable relevance, marked as uncertain

Users see scores and can filter by confidence.
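
The band thresholds above map directly to a small classifier (band labels lowercased here for illustration):

```python
def confidence_band(score: float) -> str:
    """Map a retrieval score to the confidence bands defined above:
    high (> 0.85), medium (0.6 - 0.85), low (< 0.6)."""
    if score > 0.85:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"
```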


Advanced Features

Multi-Language Support

Documents in different languages indexed and searched:

  • Each document tagged with language
  • Embedding model must support language
  • Queries in any language matched to documents
  • Results returned in original language

Document Metadata Extraction

From each document, system automatically extracts:

  • Title
  • Author (if available)
  • Creation/modification date
  • Document type (report, meeting notes, policy, etc.)
  • Key topics/tags (via NLP)

This metadata enables:

  • Better search filtering
  • Trustworthiness signals
  • Related document discovery

Semantic Caching

Frequently asked questions cached:

  • Same question asked within N hours returns cached answer
  • Cache aware of document updates (invalidates on source change)
  • Reduces embedding & LLM calls
  • User can force fresh answer

Document Expiry & Archival

Documents can be marked:

  • Active: Included in searches
  • Archived: Searchable but flagged as older
  • Expired: Removed from searches (but kept for audit)
  • Sensitive: Restricted by permissions

Indexing Status & Monitoring

Admin Visibility

Admins can see:

  • Which sources are active, when last indexed
  • Total documents in each source
  • Failed documents and error logs
  • Indexing queue status
  • Resource usage (CPU, memory, disk)

Indexing Alerts

System alerts admins on:

  • Source connection failures (3 failed attempts)
  • Large number of processing errors (> 10% of documents)
  • Indexing taking longer than expected (> 2 hours)
  • Vector database storage nearly full

Indexing Triggers

Indexing can start:

  • On schedule (daily, hourly, etc.)
  • On demand (manual admin trigger)
  • Via webhook (source system pings OPAA)
  • On document change (streaming if supported)
  • On user upload (immediate processing when user uploads a file)

Permissions & Multi-Tenancy

Workspace-Based Permissions

Every indexed document belongs to exactly one home workspace (determined by the connector's source mapping or the upload target). Permissions are enforced at the workspace level:

  • Users can only find documents in workspaces they are members of
  • The workspace filter is integrated into the vector search (not a post-filter)
  • Search results never leak across workspaces

Cross-Workspace Sharing (Future Feature)

Cross-workspace document sharing is planned as a future feature. When implemented, shared documents' chunks would gain additional workspace_ids entries, making them searchable in multiple workspaces without duplication. See Document Sharing for the current concept and open questions.

User-Uploaded Document Permissions

Documents uploaded by users follow a specific permission model:

  • Default: Private to the uploading user (in their personal workspace)
  • Direct upload to team workspace: Users with Editor role can upload directly to a team workspace — the document's home workspace is then the team workspace (see Access Control — Upload to Team Workspace)
  • Owner: The uploading user is always the document owner
  • Upload quotas: Configurable per user with a global default
  • Cross-workspace sharing: Planned as a future feature — see Document Sharing

Connector Document Permissions

Connector-indexed documents inherit their workspace(s) from the source mapping:

  • Each source sub-unit (e.g., Confluence space) can be mapped to one or more workspaces
  • All documents from that source are indexed into all mapped workspaces
  • Workspace Admins can exclude individual documents from the index (see Access Control — Exclude Mechanism)

Duplicate Detection

When a user uploads a document, OPAA performs a similarity check against existing documents the user has access to. If similar documents are found, the user is notified before the upload completes — helping prevent duplicate indexing (e.g., two users uploading the same meeting notes).
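
One plausible way to realize this check is a cosine comparison of the new document's embedding against documents the user can see; the text does not specify the method or threshold, so both are assumptions here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_near_duplicates(new_vec, accessible_docs, threshold=0.95):
    """Return IDs of accessible documents whose embeddings are very close
    to the new upload. The 0.95 threshold is an assumed value, not a
    documented default."""
    return [doc_id for doc_id, vec in accessible_docs.items()
            if cosine(new_vec, vec) >= threshold]
```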


Performance & Scalability

Indexing Performance

  • Small organization (100 documents): 5-10 minutes
  • Mid-size (10,000 documents): 30-60 minutes
  • Large (100,000+ documents): Parallel processing, as needed

Query Performance

  • Vector search (incl. workspace filter): < 500ms for typical queries
  • Re-ranking: + 50-100ms
  • Total retrieval time: < 1 second

Note: Permission filtering is integrated into the vector search via metadata filter on workspace_ids and does not add a separate processing step.

Scalability

System scales to:

  • Millions of documents (via horizontal scaling)
  • Thousands of concurrent users (via distributed vector DB)
  • Multiple data sources simultaneously
  • Large chunks or small chunks (configurable)

Integration Points

  • User Frontends: Provide retrieved documents and answers
  • LLM Integration: Feed retrieved documents to LLM
  • Access Control: Enforce workspace/document permissions
  • Deployment Infrastructure: Storage configuration, resource allocation

Resolved Questions

  • Storage quotas: Yes, for manual uploads. Upload limit is configurable per user with a global default.
  • Document versioning: Yes, ideally. Additionally, similar documents visible to the user are shown during upload to detect duplicates (see Duplicate Detection above).

Open Questions / Future Enhancements

  • Should we support real-time indexing (as documents change) vs. scheduled batch?
  • Should re-ranking use a learned model or simple scoring?
  • Should we support document clustering (for discovering related docs automatically)?
  • Should we offer semantic deduplication (remove redundant documents automatically)? (Note: basic source-reference deduplication by file name is already implemented — see Issue #42)
  • How to handle very large documents (100K+ pages)?
  • Should we support hybrid retrieval (vector + keyword search together)?
  • Should we support bulk import from a user's local drive?

Success Metrics

  • Indexing Completeness: % of source documents successfully indexed
  • Retrieval Latency: P95 search time < 500ms
  • Relevance: % of retrieved documents actually used in final answer
  • Coverage: Average # of relevant documents returned per query
  • Freshness: Median time between document change and re-indexing