This project is a sophisticated Multimodal Recommender System that bridges the gap between natural language processing and computer vision. It allows users to describe a specific "vibe" or color palette for a bouquet, retrieves semantically similar flower images using vector search, and then utilizes a Large Multimodal Model (LMM) to generate expert floristry advice based on those specific visual inputs.
The system follows a Retrieval-Augmented Generation (RAG) pattern, specifically tailored for multimodal data.
- Data Layer: Utilizes the `huggan/flowers-102-categories` dataset from Hugging Face.
- Vector Database: Employs ChromaDB as a persistent local vector store.
- Embedding Engine: Uses OpenCLIP (Open Contrastive Language-Image Pre-training) to map both text queries and images into the same high-dimensional vector space.
- Reasoning Engine: Leverages GPT-4o (via OpenRouter) as the Multimodal LLM to "see" images and generate structured advice.
- Interface: A Streamlit web application for real-time user interaction.
Unlike traditional databases that search for exact keywords, this system uses Semantic Search.
- Persistent Storage: `PersistentClient(path="./data/flower_v2.db")` ensures the database remains saved on disk between sessions.
- OpenCLIP Embedding: This is the "magic" that allows a text query like "vibrant yellow" to find images of sunflowers or daffodils by understanding the visual concept of "vibrant" and "yellow" together.
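The core idea behind this semantic search can be shown with a toy sketch: text and images live in the same vector space, and retrieval is just nearest-neighbor search by cosine similarity. The 3-dimensional vectors below are hypothetical stand-ins for real 512-dimensional OpenCLIP embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, image_vecs: dict[str, np.ndarray], k: int = 2) -> list[str]:
    """Return the k image ids whose embeddings are closest to the query."""
    ranked = sorted(image_vecs,
                    key=lambda name: cosine_similarity(query_vec, image_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings standing in for OpenCLIP vectors of indexed images.
images = {
    "sunflower": np.array([0.9, 0.1, 0.0]),
    "daffodil":  np.array([0.8, 0.2, 0.1]),
    "rose":      np.array([0.1, 0.9, 0.2]),
}
query = np.array([1.0, 0.0, 0.0])  # toy embedding of the text "vibrant yellow"
print(retrieve(query, images))     # ['sunflower', 'daffodil']
```

In the real system, ChromaDB performs this ranking internally over the stored OpenCLIP embeddings; the sketch only illustrates why a text query can surface visually matching images.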
The system features an intelligent "cold start" mechanism:
- It checks `flower_collection.count()`.
- If the collection is empty, it automatically downloads images from Hugging Face, saves them locally to `./data/flower_images`, and indexes them.

This creates a bridge between cloud datasets and local high-speed retrieval.
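The cold-start pattern can be sketched as a small guard function. The `StubCollection` below is a hypothetical in-memory stand-in mimicking only the `count()`/`add()` surface of a ChromaDB collection, and `fake_download` stands in for pulling and embedding the Hugging Face dataset; the real system would pass the collection obtained from `PersistentClient`.

```python
class StubCollection:
    """Minimal in-memory stand-in for a ChromaDB collection (count/add only)."""
    def __init__(self):
        self._ids, self._embeddings = [], []

    def count(self) -> int:
        return len(self._ids)

    def add(self, ids, embeddings):
        self._ids += ids
        self._embeddings += embeddings

def ensure_indexed(collection, download_and_embed):
    """Cold-start guard: download and index only when the store is empty."""
    if collection.count() == 0:
        ids, embeddings = download_and_embed()
        collection.add(ids=ids, embeddings=embeddings)
    return collection.count()

def fake_download():
    # Stand-in for downloading huggan/flowers-102-categories and embedding with OpenCLIP.
    return ["img_0", "img_1"], [[0.1, 0.2], [0.3, 0.4]]

col = StubCollection()
print(ensure_indexed(col, fake_download))  # 2 — indexed on first run
print(ensure_indexed(col, fake_download))  # 2 — download skipped on later runs
```

Because the check is based on `count()`, re-running the app never re-downloads the dataset once the persistent store is populated.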
The core of the "Expert Florist" persona lies in the ChatPromptTemplate. It is structured to handle three distinct inputs simultaneously:
- Text Input: The user's original request (e.g., "romance").
- Image 1: The first visually retrieved flower.
- Image 2: The second visually retrieved flower.
The prompt instructs the model to use Visual Grounding, forcing it to reference the specific visual context provided rather than giving generic advice.
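The shape of such a three-input message can be sketched as an OpenAI-style multimodal payload: one text block followed by the retrieved images as inline data URIs. The exact instruction wording and helper name here are assumptions, not the project's actual template.

```python
def build_florist_message(user_request: str, image_b64_list: list[str]) -> dict:
    """Assemble one OpenAI-style multimodal user message: text plus N inline images."""
    content = [{
        "type": "text",
        "text": (
            "You are an expert florist. Using ONLY the flowers shown in the "
            f"attached images, design a bouquet for this request: {user_request!r}. "
            "Reference what you actually see in each image."
        ),
    }]
    for b64 in image_b64_list:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"role": "user", "content": content}

msg = build_florist_message("romance", ["AAAA", "BBBB"])
print(len(msg["content"]))  # 3: one text block + two images
```

Putting the "use only the attached images" constraint directly in the text block is what enforces the visual grounding described above.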
| Step | Action | Technology |
|---|---|---|
| 1. Input | User enters a natural language description. | Streamlit UI |
| 2. Retrieval | Text query is converted to a vector; ChromaDB finds the 4 most similar images. | OpenCLIP + ChromaDB |
| 3. Processing | Top images are converted to Base64 strings for transmission to the LLM. | Base64 Encoding |
| 4. Reasoning | LLM analyzes the text query + images to design a cohesive bouquet. | GPT-4o (OpenRouter) |
| 5. Output | Professional florist advice is rendered with Markdown formatting. | Streamlit Markdown |
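Step 3 of the workflow, the Base64 conversion, is a few lines of standard library code. This is a minimal sketch assuming JPEG images; the real pipeline would pass the bytes of each image saved under `./data/flower_images`.

```python
import base64

def image_bytes_to_data_uri(raw: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI an LMM API can consume inline."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"

uri = image_bytes_to_data_uri(b"\xff\xd8\xff\xe0...")  # truncated JPEG bytes for illustration
print(uri.startswith("data:image/jpeg;base64,"))       # True
```

Inlining images this way avoids hosting them at a public URL: the LLM receives the pixels directly inside the request body.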
- Python 3.10+
- OpenRouter API Key (stored in a `.env` file as `OPENROUTER_API_KEY`)
```bash
pip install streamlit datasets chromadb langchain-openai langchain-core open-clip-torch pillow python-dotenv
```
```bash
streamlit run your_filename.py
```
The system is designed for high readability, mimicking a professional blog or florist guide:
- Columnar Layout: Retrieved images are displayed in a clean grid using `st.columns`.
- Markdown Sections: The output is structured with specific headers: `Complementary Colors`, `Monochromatic Scheme`, and `Tips for Arranging`.
- Real-time Feedback: Spinner components (`st.spinner`) keep the user informed during the heavy embedding and LLM inference tasks.
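Because the UI expects those three specific headers, a small validator can confirm the LLM output is well-structured before rendering it. This helper is an illustrative sketch, not part of the project's source.

```python
REQUIRED_SECTIONS = ["Complementary Colors", "Monochromatic Scheme", "Tips for Arranging"]

def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section headers absent from the generated advice."""
    return [s for s in REQUIRED_SECTIONS if s not in markdown_text]

advice = "## Complementary Colors\n...\n## Tips for Arranging\n..."
print(missing_sections(advice))  # ['Monochromatic Scheme']
```

If the list is non-empty, the app could retry the generation or fall back to rendering the raw text.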
We employ a Three-Tier Evaluation Framework to ensure the system is both technically accurate and semantically helpful.
Script: `evaluate_retriever.py`
Metric: Hit Rate @ 4
Goal: Does the system find the exact images defined in our `ground_truth.json`?
- Current Status: 100% Pass. This confirms that our indexing IDs and retrieval logic are perfectly synchronized.
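Hit Rate @ k itself is a one-liner: the fraction of queries whose top-k results contain at least one ground-truth image. The sketch below assumes the query-to-ids mapping shape; the actual `ground_truth.json` schema may differ.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  ground_truth: dict[str, list[str]],
                  k: int = 4) -> float:
    """Fraction of queries whose top-k retrieved ids overlap the expected ids."""
    hits = sum(
        1 for query, expected in ground_truth.items()
        if set(retrieved.get(query, [])[:k]) & set(expected)
    )
    return hits / len(ground_truth)

ground_truth = {"vibrant yellow": ["img_12"], "deep red": ["img_7"]}
retrieved = {"vibrant yellow": ["img_12", "img_3"], "deep red": ["img_9", "img_7"]}
print(hit_rate_at_k(retrieved, ground_truth))  # 1.0
```

A score of 1.0 (100%) means every query surfaced at least one expected image within its top four results.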
Script: `evaluate_visual_judge.py`
Metric: Semantic Accuracy (Score 1-5)
Goal: Does a "judge" LLM (GPT-4o) agree that the retrieved images match the user's color/vibe request?
- Result: Average 4/5. The judge confirmed high relevance but successfully flagged a white flower retrieved for a yellow query, identifying a "Semantic Gap" for future optimization.
Script: `evaluate_generator.py`
Metric: Faithfulness & Answer Relevance
Goal: Does the Florist AI give advice based only on the images found, or does it hallucinate?
- Result: Pass. The judge verified that the generated text specifically referenced flower species present in the retrieved images.
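A cheap lexical complement to the LLM judge is to flag any species named in the advice that was not among the retrieved images. This is only a proxy for faithfulness, sketched here with a hypothetical species list; the document's actual check is performed by the judge LLM.

```python
def unfaithful_mentions(advice: str,
                        retrieved_species: list[str],
                        known_species: list[str]) -> list[str]:
    """Species named in the advice that were NOT among the retrieved images."""
    advice_lower = advice.lower()
    mentioned = [s for s in known_species if s.lower() in advice_lower]
    allowed = {s.lower() for s in retrieved_species}
    return [s for s in mentioned if s.lower() not in allowed]

known = ["sunflower", "rose", "tulip", "daffodil"]
advice = "Pair the sunflower's bold petals with a single rose for contrast."
print(unfaithful_mentions(advice, ["sunflower", "daffodil"], known))  # ['rose']
```

Any non-empty result is a potential hallucination worth escalating to the judge LLM for confirmation.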
To maintain system integrity after any code changes, run the following suite:
```bash
python .\evaluate_retriever.py     # Check technical matching
python .\evaluate_visual_judge.py  # Check visual relevance
python .\evaluate_generator.py     # Check advice faithfulness
```