Skip to content

Latest commit

 

History

History
38 lines (19 loc) · 1.41 KB

File metadata and controls

38 lines (19 loc) · 1.41 KB

Planned Feature

Image Embedding via OpenAI CLIP

Image embedding using OpenAI CLIP will be added alongside the current natural language descriptions generated by Gemini Pro Vision.

Since the project integrates multiple APIs, all new components will be thoroughly tested to ensure stability before release.

This update is scheduled for an upcoming version.

Full local pipeline support (no API key needed)

Currently, some components (e.g. OpenAI, MathPix) rely on external APIs. The final goal is to replace all of them with local alternatives. Planned replacements include:

• Tesseract or TrOCR for general OCR

• Pix2Struct, Donut, or DocTR for document layout analysis

• CLIP or similar models for image-text semantic alignment

• LLaMA, Gemma, Mistral, Phi, etc. for reasoning and QA

Prompt injection prevention & hallucination mitigation

To reduce risks from prompt injection and hallucinations common in LLMs, the system will adopt structured improvements:

• Input/output validation with JSON Schema or Pydantic

• Isolated inference per module and context separation

• Fact-checking pass to detect and filter hallucinated output

• Structural prompt design separating instruction from data

• Offline-friendly deployment

A fully self-contained version with all models and dependencies bundled will be released, allowing secure use in air-gapped or sensitive environments.