OPAA's value comes from having access to organizational knowledge. This feature describes how documents from diverse sources (wikis, email, file systems) are discovered, processed, and made searchable through semantic embeddings. In addition to organizational data sources discovered by connectors, OPAA also supports direct document uploads by individual users, enabling personal knowledge to enter the RAG pipeline.
The Retrieval-Augmented Generation (RAG) pipeline ensures answers are grounded in actual organizational documents, with full attribution and traceability.
The Data Indexing & RAG system consists of three phases:
- Source Discovery, Upload & Ingestion — Finding documents in various sources or receiving them via user uploads
- Document Processing — Extracting, chunking, and embedding documents
- Retrieval & Ranking — Finding relevant documents for user questions
Documents enter the pipeline through two paths:
- Connector-Based Ingestion: OPAA pulls documents from configured data sources (Confluence, email, file systems) on schedule or via events.
- User Upload Ingestion: Users push documents directly into OPAA through frontends (Web UI, Chat, REST API).
OPAA connects to multiple source types:
- Confluence — Wiki pages, spaces, attachments
- Notion — Pages, databases, wikis
- MediaWiki — Wikipedia-style wikis
- Custom Wikis — Via REST API
- Email Servers — IMAP/SMTP (Gmail, Office 365, on-premises Exchange)
- Email Exports — MBOX, PST files
- Email Services — Gmail API, Microsoft Graph API
- Local File Systems — On-premises servers
- HTTP Directory Listings — Apache mod_autoindex / nginx autoindex servers (see below)
- Cloud Storage — S3, Azure Blob, Google Cloud Storage, Google Drive, Dropbox
- Network Drives — SMB/CIFS shares
- Git Repositories — Documentation in GitHub/GitLab
- Jira — Issues, comments, attachments
- GitHub Issues / GitLab Issues — Issues, discussions, pull requests
- Custom Issue Trackers — Via REST API
Automatically detected and processed:
- Markdown (.md)
- AsciiDoc (.adoc)
- PDF (.pdf) — text extracted via OCR if needed
- Microsoft Office (.docx, .xlsx, .pptx)
- Plain Text (.txt)
- HTML (.html)
- Structured Data (.json, .csv, .xml)
- REST APIs — Any system with documented API
- Webhooks — Push updates to OPAA
- Custom Connectors — Extensible plugin system
OPAA can crawl and index documents from HTTP servers that expose Apache mod_autoindex (or compatible) directory listings. This is useful for accessing document repositories hosted on internal web servers without requiring specialized connectors.
How it works:
- OPAA crawls the HTML directory listing at the given URL recursively
- Discovers all files across subdirectories
- Downloads each file to a temporary location for processing
- Uses the lastModified timestamp from the directory listing to skip downloads of unchanged files (bandwidth optimization)
- After download, computes a SHA-256 checksum of the file for content-based deduplication (detects renames, ensures content integrity)
- Processes each file through the standard pipeline (extraction, chunking, embedding)
- Cleans up temporary files after processing
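The skip-and-deduplicate steps above can be sketched as follows. This is a minimal illustration; `plan_downloads` and `content_checksum` are hypothetical names, not OPAA's actual API.

```python
import hashlib

def plan_downloads(entries, last_index):
    # entries: (url, last_modified) pairs parsed from the autoindex HTML;
    # last_index maps url -> last_modified recorded at the previous crawl.
    # Files whose timestamp is unchanged are skipped to save bandwidth.
    return [url for url, mtime in entries if last_index.get(url) != mtime]

def content_checksum(data):
    # SHA-256 over the downloaded bytes: content-based deduplication
    # that survives renames and verifies integrity.
    return hashlib.sha256(data).hexdigest()
```

The timestamp check avoids downloads entirely, while the checksum catches the case where the same content reappears under a different name.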
Supported features:
- Basic authentication (username:password)
- HTTP proxy support (host:port)
- Insecure SSL mode (skip certificate verification for self-signed certificates)
- Recursive directory traversal
- Robust HTML parser that handles various Apache/nginx autoindex output formats
Triggering URL-based indexing:
Via Admin UI: Open the Admin drawer, expand "URL Source (optional)", enter the URL and optional proxy/credentials, then click "Index Documents".
Via API:
curl -X POST http://localhost:8080/api/v1/indexing/trigger \
-H "Content-Type: application/json" \
-d '{
"url": "https://files.example.com/documents/",
"proxy": "proxy.example.com:8080",
"credentials": "user:password",
"insecureSsl": false
}'

When no URL is provided, the standard filesystem-based indexing is triggered instead.
A connector defines the type and shared configuration (credentials, schedule). Each connector has one or more sources, each of which can be mapped to one or more OPAA workspaces. Connectors and sources are configured independently of workspaces — Workspace-Admins then choose which available sources to include in their workspace. Only System-Admins can create connectors and define source mappings.
Some connector types have a natural instance level with sub-units (e.g., Confluence server with spaces). Others have no shared instance — each source is standalone (e.g., individual file paths or URLs).
Example 1: Confluence (instance with sub-units)
Connector: "Confluence Production"
Type: confluence
URL: https://wiki.company.com
Credentials: service-account / API-token
Schedule: Daily 2 AM
Sources:
Space "ENG" → Workspaces: ["Engineering"]
Space "MKT" → Workspaces: ["Marketing"]
Space "HR" → Workspaces: ["HR", "Onboarding"]
Space "ALL" → Workspaces: ["Company"]
Example 2: File System / Network Drive (one path per source)
Connector: "Network Drive Engineering"
Type: filesystem
Schedule: Daily 3 AM
Sources:
Path "//fileserver/engineering/docs" → Workspaces: ["Engineering"]
Example 3: HTTP Directory (one URL per source)
Connector: "Docs Server Engineering"
Type: http
Schedule: Daily 4 AM
Sources:
URL "https://docs.internal/engineering/" → Workspaces: ["Engineering", "Phoenix"]
| Connector Type | Shared Config (Connector) | Source (one or more per connector) |
|---|---|---|
| Confluence | Server URL, Credentials | Space key |
| Jira | Server URL, Credentials | Project key |
| Email (IMAP) | Server URL, Credentials | Folder / Label |
| File System / Network Drive | optionally Schedule | Path (local or UNC) |
| HTTP Directory | optionally Proxy, Auth | URL |
| Git | optionally Credentials | Repository URL + Branch |
- 1:N — Each source can be mapped to one or more OPAA workspaces. Documents from that source are indexed into all mapped workspaces (their chunks receive all corresponding workspace_ids).
- Unmapped sub-units are ignored (e.g., Confluence spaces without a mapping are not indexed)
- Multiple connectors can index into the same workspace (e.g., Confluence space "ENG" + network drive path both → "Engineering")
Each source can optionally define include/exclude patterns:
Source: Confluence Space "ENG" → Workspace "Engineering"
Filtering:
- Include patterns: ["public/*", "team/*"]
- Exclude patterns: ["draft/*", "archive/*"]
Incremental: Only new/changed documents
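Include/exclude filtering of this kind could be implemented with simple glob matching. A sketch (the `is_included` helper is illustrative, not OPAA's actual API), assuming a document is indexed when it matches at least one include pattern (or no includes are configured) and no exclude pattern:

```python
from fnmatch import fnmatch

def is_included(path, include_patterns, exclude_patterns):
    # Exclude patterns take precedence over includes.
    if any(fnmatch(path, p) for p in exclude_patterns):
        return False
    # An empty include list means "include everything not excluded".
    return not include_patterns or any(fnmatch(path, p) for p in include_patterns)
```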
In addition to connector-based ingestion, users can upload documents directly into OPAA through any frontend (Web UI, Chat, REST API). Uploaded documents are stored on a configurable storage backend and processed through the same document processing pipeline as connector-sourced documents.
| Aspect | Connectors | User Upload |
|---|---|---|
| Direction | OPAA pulls from sources | User pushes to OPAA |
| Trigger | Scheduled or event-based | On-demand (user action) |
| Scope | Organizational data sources | Individual user documents |
| Workspace | Configured per connector | User's personal workspace (default) |
| Storage | Original stays in source system | Stored on OPAA's storage backend |
- User selects file(s) through frontend (Web UI drag-and-drop, chat attachment, or API multipart upload)
- File is validated (format, size limits, virus scan)
- File is stored on the configured storage backend (S3, network drive, local FS)
- Document enters the standard processing pipeline (extraction, chunking, embedding, vector storage)
- Document is indexed into the user's personal workspace by default
- User can optionally share/publish into other workspaces they have access to
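The validation step in the flow above might look like the following sketch. The function name is hypothetical; the format list and 50 MB limit mirror the configurable defaults described in this section (virus scanning is omitted here).

```python
ALLOWED_FORMATS = {"pdf", "docx", "md", "txt", "pptx", "xlsx"}
MAX_BYTES = 50 * 1024 * 1024  # default limit of 50 MB, configurable

def validate_upload(filename, size_bytes,
                    allowed=ALLOWED_FORMATS, max_bytes=MAX_BYTES):
    # Check the extension against the allowed format list.
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in allowed:
        return (False, f"format .{ext} not allowed")
    # Enforce the configured size limit.
    if size_bytes > max_bytes:
        return (False, "file exceeds size limit")
    return (True, "ok")
```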
Uploaded files are stored on a pluggable storage backend, chosen at deployment time. This is separate from the vector database — the storage backend holds the original uploaded files (PDF, DOCX, etc.) for download and re-processing, while the vector database holds the embeddings and chunk text for search.
- AWS S3, MinIO, or any S3-compatible store
- Best for cloud and hybrid deployments
- Built-in redundancy and lifecycle management
- Shared file system mount
- Best for on-premises deployments with existing file servers
- Familiar to operations teams
- Direct disk storage on OPAA server
- Simplest option for small deployments and development
- Requires separate backup strategy
Storage Backend Configuration:
storage:
backend: "s3" # or "network-drive" or "local"
s3:
endpoint: "https://s3.company.com"
bucket: "opaa-uploads"
region: "eu-central-1"
network-drive:
path: "//fileserver/opaa-uploads"
local:
path: "/data/opaa/uploads"
limits:
max_file_size: "50MB"
allowed_formats: ["pdf", "docx", "md", "txt", "pptx", "xlsx"]

Same document formats as connector-sourced documents (see Document Formats section above), with these additions for the upload context:
- Maximum file size configurable (default: 50 MB)
- Batch upload support (multiple files at once)
- Drag-and-drop in Web UI
- File attachment in chat platforms
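The pluggable-backend idea can be illustrated with a minimal local-filesystem implementation. Names are hypothetical sketches, not OPAA's actual classes; the S3 and network-drive backends would implement the same put/get interface behind the `backend` config key.

```python
import os

class LocalStorage:
    # Direct-disk backend: stores original uploads under a base path.
    def __init__(self, path):
        self.base = path
        os.makedirs(path, exist_ok=True)

    def put(self, key, data):
        full = os.path.join(self.base, key)
        os.makedirs(os.path.dirname(full) or self.base, exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)
        return full

    def get(self, key):
        with open(os.path.join(self.base, key), "rb") as f:
            return f.read()

def storage_from_config(cfg):
    # Backend is chosen at deployment time from the configuration above.
    if cfg["backend"] == "local":
        return LocalStorage(cfg["local"]["path"])
    raise NotImplementedError(cfg["backend"])  # s3 / network-drive analogous
```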
Each uploaded document stores:
{
"document_id": "upload-456",
"filename": "design-review-q1.pdf",
"uploaded_by": "user-123",
"uploaded_at": "2026-02-16T10:30:00Z",
"workspace_id": "personal-user-123",
"storage_backend": "s3",
"storage_path": "s3://opaa-uploads/user-123/design-review-q1.pdf",
"file_size_bytes": 2048576,
"content_type": "application/pdf",
"source_type": "user_upload"
}

For each source, OPAA:
- Connects to source system
- Lists all available documents
- Checks modification timestamp against last index
- Downloads new/modified documents
- Extracts text content (handles binary formats like PDF)
For user uploads: The discovery step is replaced by the upload event itself. The uploaded file is retrieved from the storage backend and enters the pipeline at the extraction phase. All subsequent steps (chunking, embedding, storage) are identical to connector-sourced documents.
Error Handling:
- Skips documents that can't be extracted
- Logs failures for admin review
- Retries failed documents on next run
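The error-handling behavior can be sketched as follows (an illustrative helper, not OPAA code): documents that fail extraction are skipped, logged for admin review, and naturally retried on the next run because they remain un-indexed.

```python
def extract_all(documents, extract, failure_log):
    # documents: (doc_id, raw_bytes) pairs; extract raises on failure.
    extracted, failed = [], []
    for doc_id, raw in documents:
        try:
            extracted.append((doc_id, extract(raw)))
        except Exception as exc:
            # Skip the document, record the failure for admin review;
            # it will be retried on the next indexing run.
            failed.append(doc_id)
            failure_log.append((doc_id, str(exc)))
    return extracted, failed
```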
Large documents are broken into smaller chunks:
- Strategy: Semantic chunking (split on natural boundaries)
- Chunk Size: 512-1024 tokens (configurable)
- Overlap: 10% overlap between chunks to preserve context
- Metadata: Each chunk preserves:
- Source document ID
- Document title
- Chunk position
- Timestamp
Example:
Document: "Enterprise Architecture Guide" (15,000 words)
↓
Chunks:
1. "Introduction & Principles" (chunk 0)
2. "Infrastructure Layer" (chunk 1)
3. "Application Architecture" (chunk 2)
...
15. "Appendix & References" (chunk 14)
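A token-based version of this chunking strategy might look like the sketch below. It is illustrative only: OPAA's semantic chunking also splits on natural boundaries (headings, paragraphs), which a fixed-size splitter omits.

```python
def chunk_tokens(tokens, size=512, overlap_frac=0.10):
    # Consecutive chunks share ~10% of their tokens so context at the
    # chunk boundary is preserved.
    step = max(1, int(size * (1 - overlap_frac)))
    chunks, i = [], 0
    while i < len(tokens):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):
            break
        i += step
    return chunks
```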
Each chunk is converted to a semantic embedding:
- Model Choice: Configurable (OpenAI, open-source alternatives)
- Dimension: 1536 for OpenAI, configurable for others
- Caching: Embeddings cached to avoid re-computing
- Batching: Processed in batches for efficiency
- Error Recovery: Failed embeddings logged for retry
Cost Consideration: Embedding generation has minimal cost compared to LLM inference. Organizations can use cheaper embedding models.
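The caching and batching behavior could be wrapped around any embedding model as in this sketch (`CachingEmbedder` is a hypothetical name; the actual model call is injected):

```python
import hashlib

class CachingEmbedder:
    # Caches embeddings keyed on chunk text, so unchanged chunks are
    # never re-embedded; cache misses are sent to the model in one batch.
    def __init__(self, embed_batch):
        self.embed_batch = embed_batch  # callable: list[str] -> list[vector]
        self.cache = {}

    def _key(self, text):
        return hashlib.sha256(text.encode()).hexdigest()

    def embed(self, texts):
        misses = [t for t in texts if self._key(t) not in self.cache]
        if misses:
            for t, vec in zip(misses, self.embed_batch(misses)):
                self.cache[self._key(t)] = vec
        return [self.cache[self._key(t)] for t in texts]
```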
Processed chunks stored with:
- Embedding vector
- Chunk text
- Metadata (source, document ID, timestamp, chunk index)
- Document URL (for retrieval)
- Workspace IDs (for multi-tenancy; will also support cross-workspace sharing in the future)
Metadata Stored:
{
"chunk_id": "doc-123-chunk-5",
"document_id": "doc-123",
"document_title": "Enterprise Architecture Guide",
"workspace_ids": ["workspace-eng"],
"source": "confluence",
"source_type": "connector",
"source_url": "https://wiki.company.com/pages/view/123456",
"chunk_index": 5,
"chunk_text": "...",
"embedding": [0.123, -0.456, ...],
"indexed_at": "2024-02-16T14:30:00Z"
}

Note: workspace_ids is an array. A document can appear in multiple workspaces (e.g., when a source is mapped to multiple workspaces, or in the future via cross-workspace sharing). Permission enforcement uses this field as a metadata filter in the vector search (see Access Control — Query-Time Permission Enforcement).
Incremental processing:
- Only new/modified documents processed
- Changed chunks updated in vector store
- Deleted documents removed from index
- Full re-index available (force option)
OPAA supports multiple vector database backends. Organizations choose based on:
- Infrastructure constraints (on-premises vs. cloud)
- Scale requirements
- Cost considerations
- Integration with existing systems
- Self-hosted or managed
- Hybrid search (vector + keyword)
- Advanced filtering and aggregation
- Familiar to many ops teams
- Lightweight, runs in existing database
- No additional infrastructure
- Good for small to mid-size deployments
- SQL-native integration
- Open-source vector database
- Designed for large-scale similarity search
- Self-hosted, horizontally scalable
- Optimized for high throughput
- Pinecone, Weaviate, Qdrant (managed)
- Easy managed option
- Scalability built-in
- Can be combined with on-premises fallback
The vector database choice is made at deployment time, not at application design time. No vendor lock-in: switching databases requires re-indexing, but no code changes.
OPAA uses Spring AI's VectorStore abstraction for all indexing and retrieval operations. Embedding generation, storage, and similarity search are delegated to the VectorStore interface, making the vector database backend interchangeable via configuration.
When a user asks a question:
- Workspace-IDs: Load all workspace IDs the user is a member of
- Embedding Generation: Question converted to embedding (same model as documents)
- Vector Search with Workspace Filter: Find top-K similar chunks, filtering by workspace_ids — only chunks whose workspace_ids include at least one of the user's workspace IDs are searched. The permission filter is part of the vector search itself, not a post-processing step.
- Deduplication: Remove duplicate information from the same document
- Source Deduplication: When multiple chunks originate from the same file, only the chunk with the highest relevance score is kept as the source reference (implemented in QueryService.mapSources())
- Re-ranking: Score results by relevance
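The workspace-filtered search can be illustrated with a brute-force sketch. In OPAA the search is delegated to the vector database via Spring AI's VectorStore abstraction; this toy version only shows that the permission filter is applied inside the search loop, never as a post-step.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, chunks, user_workspaces, top_k=20, threshold=0.6):
    hits = []
    for chunk in chunks:
        # Permission filter inside the search: chunks outside the
        # user's workspaces are never even scored.
        if not set(chunk["workspace_ids"]) & set(user_workspaces):
            continue
        score = cosine(query_vec, chunk["embedding"])
        if score >= threshold:
            hits.append((score, chunk["chunk_id"]))
    hits.sort(reverse=True)
    return hits[:top_k]
```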
Retrieval:
similarity_threshold: 0.6
top_k: 20
apply_permissions: true
chunk_recency_boost: true
source_diversity: true
After initial retrieval, results scored by:
- Semantic Similarity: How close the chunk embedding is to the question embedding
- Document Recency: Newer documents ranked higher (optional)
- Source Trust Score: Frequently updated sources ranked higher (optional)
- Keyword Overlap: Exact phrase matches in document (optional)
Score Combination:
final_score = (
0.6 * semantic_similarity +
0.2 * recency_boost +
0.1 * source_trust +
0.1 * keyword_overlap
)
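The score combination above is directly runnable; optional signals default to zero when disabled (the function name is illustrative):

```python
def final_score(semantic_similarity, recency_boost=0.0,
                source_trust=0.0, keyword_overlap=0.0):
    # Weighted combination from the spec; weights sum to 1.0.
    return (0.6 * semantic_similarity
            + 0.2 * recency_boost
            + 0.1 * source_trust
            + 0.1 * keyword_overlap)
```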
System provides confidence for each retrieved document:
- High (> 0.85): Definitely relevant to question
- Medium (0.6 - 0.85): Probably relevant
- Low (< 0.6): Questionable relevance, marked as uncertain
Users see scores and can filter by confidence.
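The confidence bands map directly onto score thresholds, e.g. (sketch; band labels and boundary handling at exactly 0.85 are assumptions):

```python
def confidence(score):
    # Bands from the spec: high > 0.85, medium 0.6-0.85, low < 0.6.
    if score > 0.85:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"
```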
Documents in different languages indexed and searched:
- Each document tagged with language
- Embedding model must support language
- Queries in any language matched to documents
- Results returned in original language
From each document, system automatically extracts:
- Title
- Author (if available)
- Creation/modification date
- Document type (report, meeting notes, policy, etc.)
- Key topics/tags (via NLP)
This metadata enables:
- Better search filtering
- Trustworthiness signals
- Related document discovery
Frequently asked questions cached:
- Same question asked within N hours returns cached answer
- Cache aware of document updates (invalidates on source change)
- Reduces embedding & LLM calls
- User can force fresh answer
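A minimal sketch of such a cache (hypothetical class, not OPAA's implementation): entries expire after a TTL and are invalidated when any of their source documents changes; forcing a fresh answer simply bypasses `get`.

```python
import time

class AnswerCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        # question -> (answer, source doc_ids, stored_at)
        self.entries = {}

    def get(self, question, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(question)
        if entry and now - entry[2] < self.ttl:
            return entry[0]
        return None  # miss or expired: recompute the answer

    def put(self, question, answer, doc_ids, now=None):
        now = time.time() if now is None else now
        self.entries[question] = (answer, doc_ids, now)

    def invalidate_doc(self, doc_id):
        # Source document changed: drop every cached answer built on it.
        self.entries = {q: e for q, e in self.entries.items()
                        if doc_id not in e[1]}
```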
Documents can be marked:
- Active: Included in searches
- Archived: Searchable but flagged as older
- Expired: Removed from searches (but kept for audit)
- Sensitive: Restricted by permissions
Admins can see:
- Which sources are active, when last indexed
- Total documents in each source
- Failed documents and error logs
- Indexing queue status
- Resource usage (CPU, memory, disk)
System alerts admins on:
- Source connection failures (3 failed attempts)
- Large number of processing errors (> 10% of documents)
- Indexing taking longer than expected (> 2 hours)
- Vector database storage nearly full
Indexing can start:
- On schedule (daily, hourly, etc.)
- On demand (manual admin trigger)
- Via webhook (source system pings OPAA)
- On document change (streaming if supported)
- On user upload (immediate processing when user uploads a file)
Every indexed document belongs to exactly one home workspace (determined by the connector's source mapping or the upload target). Permissions are enforced at the workspace level:
- Users can only find documents in workspaces they are members of
- The workspace filter is integrated into the vector search (not a post-filter)
- Search results never leak across workspaces
Cross-workspace document sharing is planned as a future feature. When implemented, shared documents' chunks would gain additional workspace_ids entries, making them searchable in multiple workspaces without duplication. See Document Sharing for the current concept and open questions.
Documents uploaded by users follow a specific permission model:
- Default: Private to the uploading user (in their personal workspace)
- Direct upload to team workspace: Users with Editor role can upload directly to a team workspace — the document's home workspace is then the team workspace (see Access Control — Upload to Team Workspace)
- Owner: The uploading user is always the document owner
- Upload quotas: Configurable per user with a global default
- Cross-workspace sharing: Planned as a future feature — see Document Sharing
Connector-indexed documents inherit their workspace(s) from the source mapping:
- Each source sub-unit (e.g., Confluence space) can be mapped to one or more workspaces
- All documents from that source are indexed into all mapped workspaces
- Workspace Admins can exclude individual documents from the index (see Access Control — Exclude Mechanism)
When a user uploads a document, OPAA performs a similarity check against existing documents the user has access to. If similar documents are found, the user is notified before the upload completes — helping prevent duplicate indexing (e.g., two users uploading the same meeting notes).
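The similarity check could compare the new document's embedding against documents the user can access, as in this sketch (the 0.9 threshold and function name are assumptions, not specified by OPAA):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def find_similar(new_doc_vec, accessible_docs, threshold=0.9):
    # accessible_docs: (doc_id, embedding) pairs the user may see.
    # Matches are surfaced to the user before the upload completes.
    return [doc_id for doc_id, vec in accessible_docs
            if cosine(new_doc_vec, vec) >= threshold]
```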
- Small organization (100 documents): 5-10 minutes
- Mid-size (10,000 documents): 30-60 minutes
- Large (100,000+ documents): Parallel processing, as needed
- Vector search (incl. workspace filter): < 500ms for typical queries
- Re-ranking: + 50-100ms
- Total retrieval time: < 1 second
Note: Permission filtering is integrated into the vector search via metadata filter on workspace_ids and does not add a separate processing step.
System scales to:
- Millions of documents (via horizontal scaling)
- Thousands of concurrent users (via distributed vector DB)
- Multiple data sources simultaneously
- Large chunks or small chunks (configurable)
- User Frontends: Provide retrieved documents and answers
- LLM Integration: Feed retrieved documents to LLM
- Access Control: Enforce workspace/document permissions
- Deployment Infrastructure: Storage configuration, resource allocation
- Storage quotas: Yes, for manual uploads. Upload limit is configurable per user with a global default.
- Document versioning: Yes, ideally. Additionally, similar documents visible to the user are shown during upload to detect duplicates (see Duplicate Detection above).
- Should we support real-time indexing (as documents change) vs. scheduled batch?
- Should re-ranking use a learned model or simple scoring?
- Should we support document clustering (for discovering related docs automatically)?
- Should we offer semantic deduplication (remove redundant documents automatically)? (Note: basic source-reference deduplication by file name is already implemented — see Issue #42)
- How to handle very large documents (100K+ pages)?
- Should we support hybrid retrieval (vector + keyword search together)?
- Should we support bulk import from a user's local drive?
- Indexing Completeness: % of source documents successfully indexed
- Retrieval Latency: P95 search time < 500ms
- Relevance: % of retrieved documents actually used in final answer
- Coverage: Average # of relevant documents returned per query
- Freshness: Median time between document change and re-indexing