Local multimodal memory for AI agents — where retrieval meets intelligence.
Current version: v0.2.0
Status: Active development (solo/small team)
This roadmap is honest about where RecallForge is, where it's going, and what's explicitly not planned. We prioritize depth over breadth, local-first over cloud-everything, and correctness over speed.
Immediate gaps and quick wins that build on existing foundations.
- HTTP MCP transport — Move from stdio to a shared long-lived HTTP server to eliminate repeated model loading and enable persistent connections.
- SSE streaming for MCP — Add Server-Sent Events support for real-time result streaming and progress updates.
- Explain mode for search — Add an `explain=True` flag that returns per-stage contribution scores (vector, keyword, rerank) so users understand why results rank where they do.
- Debug logging — Structured JSON logs at each pipeline stage for local tracing and performance analysis.
- Multimodal query expansion — When searching with image/video + text, expand the text portion through the query expander. Long-term: VLM-based expansion that captions visual content into searchable text.
- Intent-aware query steering — Allow an optional `intent` parameter that pre-weights retrieval strategies (e.g., `intent="exact_lookup"` boosts BM25, `intent="semantic"` prioritizes vector search).
- Collection management API — Add, remove, and rename collections without rebuilding the entire index.
- Homebrew formula — `brew install recallforge` for macOS users.
- PyPI package polish — Better metadata, type stubs, and a documented API surface.
- Claude plugin marketplace — List RecallForge as an MCP plugin once HTTP transport is stable.
Bigger features that expand what RecallForge can do and who can use it.
- JS/TS SDK — Node.js/Bun-native client with full type safety. Brings RecallForge to the TypeScript agent ecosystem.
- REST API — Language-agnostic HTTP interface for polyglot environments.
- Audio modality — Ingest and retrieve audio via embeddings (Whisper-style or dedicated audio encoders).
- OCR pipeline — Native scanned document support using a lightweight OCR stage before embedding extraction.
- Context tree for collections — Hierarchical metadata attached to collections, returned with results. Enables agent memory with parent/child relationships.
- Multi-get batch retrieval — Retrieve multiple documents by glob pattern or doc ID in a single call.
- REC-111: Cross-modal search mode — Skip the reranker/BM25 stages for image results; benchmarks show the reranker drops text→image R@1 from 30% to 0%. Requires content-type-aware pipeline routing.
- REC-112: `search_batch` — Parallel multi-query with RRF merge. Single MCP call, multiple queries, fused results. Reduces round trips for agentic patterns.
- REC-113: `search_iterative` — Server-side multi-round retrieval. Strategies: refine (per-round query refinement), expand (parallel variants), multi_query (decompose complex queries). Inspired by NVIDIA NeMo agentic retrieval (Apache 2.0).
- REC-114: `explain_results` — Per-document relevance explanations (BM25/vector/reranker score breakdown). Transparency for agentic retrieval.
- REC-115: Semantic query cache — Embedding-similarity cache (threshold 0.95) for cross-session acceleration.
- REC-116: Model backend swappability — Configurable embedding models with MRL dimension truncation, model registry, migration tooling. Current Qwen3-VL stack stays default.
- Conversation history memory — Built-in support for thread-aware retrieval with automatic turn-based indexing.
- Entity extraction & memory — Automatic named entity recognition and entity-centric retrieval paths.
- Index sharding — Horizontal split of large collections across multiple SQLite files for local scale-out.
- Incremental background indexing — Queue-based ingestion that doesn't block queries.
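The RRF merge named in REC-112 (Reciprocal Rank Fusion) combines ranked lists from multiple queries by summing reciprocal ranks. A minimal sketch — the function name and list shape are assumptions; `k=60` is the conventional constant from the original RRF formulation:

```python
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank)) over
    every ranked list it appears in. Ranks are 1-based; k=60 is the
    conventional smoothing constant."""
    scores: dict[str, float] = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc_a" appears (second) in both query variants, so it outranks
# documents that top only one list.
merged = rrf_merge([["doc_x", "doc_a"], ["doc_y", "doc_a"]])
```

Because RRF only needs ranks, not comparable scores, it fuses results from heterogeneous retrievers (BM25, vector, per-variant queries) without score normalization.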
Vision-level items that require significant R&D or architectural shifts.
- ONNX Runtime backend — Lighter alternative to PyTorch for environments where torch is too heavy.
- Core ML backend — Native Apple Neural Engine acceleration via Core ML alongside existing MLX support.
- Plugin system — Third-party extensions for custom embedders, rerankers, and storage backends.
- Prebuilt model packs — Downloadable "knowledge bases" (legal, medical, code) with optimized embedders.
- Temporal memory — Time-decayed retrieval weights and "forgetting" curves for long-lived agent memory.
- Cross-session persistence patterns — Standards for agent memory portability across restarts and devices.
Things we've considered and decided against, with reasons.
- Managed cloud service — RecallForge is explicitly local-first. We won't compete with Pinecone, Weaviate Cloud, or similar. Self-hosting is the point.
- Multi-tenant SaaS — Out of scope. The architecture assumes single-user, single-machine deployment.
- Distributed/clustered mode — Scaling to billions of vectors is a non-goal. If you need that scale, use a cloud-native vector DB. RecallForge targets thousands to low millions of documents.
- Native Windows support — Runs on macOS and Linux. Windows via WSL.
- Built-in LLM — We embed and retrieve; we don't generate. Bring your own LLM.
- Web UI — Out of scope. RecallForge is a library and MCP server, not a standalone application.
This is a solo/small-team project with ambitious technical goals. Issues and PRs welcome, but check the roadmap first. We prioritize:
- Correctness over speed
- Local-first over cloud
- API stability over feature count
Questions? Open an issue or reach out.
Last updated: 2026-03-13