This project is a sophisticated Multimodal Recommender System that bridges the gap between natural language processing and computer vision. It allows users to describe a specific "vibe" or color palette for a bouquet, retrieves semantically similar flower images using vector search, and then utilizes a Large Multimodal Model (LMM) to generate expert floristry advice based on those specific visual inputs.
The system follows a Retrieval-Augmented Generation (RAG) pattern, specifically tailored for multimodal data.
- Data Layer: Utilizes the `huggan/flowers-102-categories` dataset from Hugging Face.
- Vector Database: Employs ChromaDB as a persistent local vector store.
- Embedding Engine: Uses OpenCLIP (Open Contrastive Language-Image Pre-training) to map both text queries and images into the same high-dimensional vector space.
- Reasoning Engine: Leverages GPT-4o (via OpenRouter) as the Multimodal LLM to "see" images and generate structured advice.
- Interface: A Streamlit web application for real-time user interaction.
Unlike traditional databases that search for exact keywords, this system uses Semantic Search.
- Persistent Storage: `PersistentClient(path="./data/flower_v2.db")` ensures the database remains saved on disk between sessions.
- OpenCLIP Embedding: This is the "magic" that allows a text query like "vibrant yellow" to find images of sunflowers or daffodils by understanding the visual concept of "vibrant" and "yellow" together.
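The core idea behind this semantic search can be shown with a toy sketch: text and images live in the same vector space, and retrieval is just nearest-neighbor search by cosine similarity. The 3-dimensional vectors below are hypothetical stand-ins for real 512-dimensional OpenCLIP embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, image_vecs: dict[str, np.ndarray], k: int = 2) -> list[str]:
    """Return the k image ids whose embeddings are closest to the query."""
    ranked = sorted(image_vecs,
                    key=lambda name: cosine_similarity(query_vec, image_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings standing in for OpenCLIP vectors of indexed images.
images = {
    "sunflower": np.array([0.9, 0.1, 0.0]),
    "daffodil":  np.array([0.8, 0.2, 0.1]),
    "rose":      np.array([0.1, 0.9, 0.2]),
}
query = np.array([1.0, 0.0, 0.0])  # toy embedding of the text "vibrant yellow"
print(retrieve(query, images))     # ['sunflower', 'daffodil']
```

In the real system, ChromaDB performs this ranking internally over the stored OpenCLIP embeddings; the sketch only illustrates why a text query can surface visually matching images.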
The system features an intelligent "cold start" mechanism:
- It checks `flower_collection.count()`.
- If the collection is empty, it automatically downloads images from Hugging Face, saves them locally to `./data/flower_images`, and indexes them.

This creates a bridge between cloud datasets and local high-speed retrieval.
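The cold-start pattern can be sketched as a small guard function. The `StubCollection` below is a hypothetical in-memory stand-in mimicking only the `count()`/`add()` surface of a ChromaDB collection, and `fake_download` stands in for pulling and embedding the Hugging Face dataset; the real system would pass the collection obtained from `PersistentClient`.

```python
class StubCollection:
    """Minimal in-memory stand-in for a ChromaDB collection (count/add only)."""
    def __init__(self):
        self._ids, self._embeddings = [], []

    def count(self) -> int:
        return len(self._ids)

    def add(self, ids, embeddings):
        self._ids += ids
        self._embeddings += embeddings

def ensure_indexed(collection, download_and_embed):
    """Cold-start guard: download and index only when the store is empty."""
    if collection.count() == 0:
        ids, embeddings = download_and_embed()
        collection.add(ids=ids, embeddings=embeddings)
    return collection.count()

def fake_download():
    # Stand-in for downloading huggan/flowers-102-categories and embedding with OpenCLIP.
    return ["img_0", "img_1"], [[0.1, 0.2], [0.3, 0.4]]

col = StubCollection()
print(ensure_indexed(col, fake_download))  # 2 — indexed on first run
print(ensure_indexed(col, fake_download))  # 2 — download skipped on later runs
```

Because the check is based on `count()`, re-running the app never re-downloads the dataset once the persistent store is populated.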
The core of the "Expert Florist" persona lies in the ChatPromptTemplate. It is structured to handle three distinct inputs simultaneously:
- Text Input: The user's original request (e.g., "romance").
- Image 1: The first visually retrieved flower.
- Image 2: The second visually retrieved flower.
The prompt instructs the model to use Visual Grounding, forcing it to reference the specific visual context provided rather than giving generic advice.
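The shape of such a three-input message can be sketched as an OpenAI-style multimodal payload: one text block followed by the retrieved images as inline data URIs. The exact instruction wording and helper name here are assumptions, not the project's actual template.

```python
def build_florist_message(user_request: str, image_b64_list: list[str]) -> dict:
    """Assemble one OpenAI-style multimodal user message: text plus N inline images."""
    content = [{
        "type": "text",
        "text": (
            "You are an expert florist. Using ONLY the flowers shown in the "
            f"attached images, design a bouquet for this request: {user_request!r}. "
            "Reference what you actually see in each image."
        ),
    }]
    for b64 in image_b64_list:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"role": "user", "content": content}

msg = build_florist_message("romance", ["AAAA", "BBBB"])
print(len(msg["content"]))  # 3: one text block + two images
```

Putting the "use only the attached images" constraint directly in the text block is what enforces the visual grounding described above.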
| Step | Action | Technology |
|---|---|---|
| 1. Input | User enters a natural language description. | Streamlit UI |
| 2. Retrieval | Text query is converted to a vector; ChromaDB finds the 4 most similar images. | OpenCLIP + ChromaDB |
| 3. Processing | Top images are converted to Base64 strings for transmission to the LLM. | Base64 Encoding |
| 4. Reasoning | LLM analyzes the text query + images to design a cohesive bouquet. | GPT-4o (OpenRouter) |
| 5. Output | Professional florist advice is rendered with Markdown formatting. | Streamlit Markdown |
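Step 3 of the workflow, the Base64 conversion, is a few lines of standard library code. This is a minimal sketch assuming JPEG images; the real pipeline would pass the bytes of each image saved under `./data/flower_images`.

```python
import base64

def image_bytes_to_data_uri(raw: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI an LMM API can consume inline."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"

uri = image_bytes_to_data_uri(b"\xff\xd8\xff\xe0...")  # truncated JPEG bytes for illustration
print(uri.startswith("data:image/jpeg;base64,"))       # True
```

Inlining images this way avoids hosting them at a public URL: the LLM receives the pixels directly inside the request body.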
- Python 3.10+
- OpenRouter API Key (stored in a `.env` file as `OPENROUTER_API_KEY`)
```bash
pip install streamlit datasets chromadb langchain-openai langchain-core open-clip-torch pillow python-dotenv
```
```bash
streamlit run your_filename.py
```
The system is designed for high readability, mimicking a professional blog or florist guide:
- Columnar Layout: Retrieved images are displayed in a clean grid using `st.columns`.
- Markdown Sections: The output is structured with specific headers: `Complementary Colors`, `Monochromatic Scheme`, and `Tips for Arranging`.
- Real-time Feedback: Spinner components (`st.spinner`) keep the user informed during the heavy embedding and LLM inference tasks.
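Because the UI expects those three specific headers, a small validator can confirm the LLM output is well-structured before rendering it. This helper is an illustrative sketch, not part of the project's source.

```python
REQUIRED_SECTIONS = ["Complementary Colors", "Monochromatic Scheme", "Tips for Arranging"]

def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section headers absent from the generated advice."""
    return [s for s in REQUIRED_SECTIONS if s not in markdown_text]

advice = "## Complementary Colors\n...\n## Tips for Arranging\n..."
print(missing_sections(advice))  # ['Monochromatic Scheme']
```

If the list is non-empty, the app could retry the generation or fall back to rendering the raw text.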
We employ a Three-Tier Evaluation Framework to ensure the system is both technically accurate and semantically helpful.
Script: `evaluate_retriever.py`
Metric: Hit Rate @ 4
Goal: Does the system find the exact images defined in our `ground_truth.json`?
- Current Status: 100% Pass. This confirms that our indexing IDs and retrieval logic are perfectly synchronized.
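Hit Rate @ k itself is a one-liner: the fraction of queries whose top-k results contain at least one ground-truth image. The sketch below assumes the query-to-ids mapping shape; the actual `ground_truth.json` schema may differ.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  ground_truth: dict[str, list[str]],
                  k: int = 4) -> float:
    """Fraction of queries whose top-k retrieved ids overlap the expected ids."""
    hits = sum(
        1 for query, expected in ground_truth.items()
        if set(retrieved.get(query, [])[:k]) & set(expected)
    )
    return hits / len(ground_truth)

ground_truth = {"vibrant yellow": ["img_12"], "deep red": ["img_7"]}
retrieved = {"vibrant yellow": ["img_12", "img_3"], "deep red": ["img_9", "img_7"]}
print(hit_rate_at_k(retrieved, ground_truth))  # 1.0
```

A score of 1.0 (100%) means every query surfaced at least one expected image within its top four results.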
Script: `evaluate_visual_judge.py`
Metric: Semantic Accuracy (Score 1-5)
Goal: Does a "judge" LLM (GPT-4o) agree that the retrieved images match the user's color/vibe request?
- Result: Average 4/5. The judge confirmed high relevance but successfully flagged a white flower retrieved for a yellow query, identifying a "Semantic Gap" for future optimization.
Script: `evaluate_generator.py`
Metric: Faithfulness & Answer Relevance
Goal: Does the Florist AI give advice based only on the images found, or does it hallucinate?
- Result: Pass. The judge verified that the generated text specifically referenced flower species present in the retrieved images.
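A cheap lexical complement to the LLM judge is to flag any species named in the advice that was not among the retrieved images. This is only a proxy for faithfulness, sketched here with a hypothetical species list; the document's actual check is performed by the judge LLM.

```python
def unfaithful_mentions(advice: str,
                        retrieved_species: list[str],
                        known_species: list[str]) -> list[str]:
    """Species named in the advice that were NOT among the retrieved images."""
    advice_lower = advice.lower()
    mentioned = [s for s in known_species if s.lower() in advice_lower]
    allowed = {s.lower() for s in retrieved_species}
    return [s for s in mentioned if s.lower() not in allowed]

known = ["sunflower", "rose", "tulip", "daffodil"]
advice = "Pair the sunflower's bold petals with a single rose for contrast."
print(unfaithful_mentions(advice, ["sunflower", "daffodil"], known))  # ['rose']
```

Any non-empty result is a potential hallucination worth escalating to the judge LLM for confirmation.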
To maintain system integrity after any code changes, run the following suite:
```bash
python .\evaluate_retriever.py     # Check technical matching
python .\evaluate_visual_judge.py  # Check visual relevance
python .\evaluate_generator.py     # Check advice faithfulness
```