grascya/Flower-Arrangement-Bouquet-Designer


💐 Multimodal-Flower-Recommender-System

INTRODUCTION

This project is a sophisticated Multimodal Recommender System that bridges the gap between natural language processing and computer vision. It allows users to describe a specific "vibe" or color palette for a bouquet, retrieves semantically similar flower images using vector search, and then utilizes a Large Multimodal Model (LMM) to generate expert floristry advice based on those specific visual inputs.


1. System Architecture

The system follows a Retrieval-Augmented Generation (RAG) pattern, specifically tailored for multimodal data.

  • Data Layer: Utilizes the huggan/flowers-102-categories dataset from Hugging Face.
  • Vector Database: Employs ChromaDB as a persistent local vector store.
  • Embedding Engine: Uses OpenCLIP (Open Contrastive Language-Image Pre-training) to map both text queries and images into the same high-dimensional vector space.
  • Reasoning Engine: Leverages GPT-4o (via OpenRouter) as the Multimodal LLM to "see" images and generate structured advice.
  • Interface: A Streamlit web application for real-time user interaction.

2. Key Components & Implementation

A. Multimodal Vector Search

Unlike traditional databases that search for exact keywords, this system uses Semantic Search.

  • Persistent Storage: PersistentClient(path="./data/flower_v2.db") ensures the database remains saved on disk between sessions.
  • OpenCLIP Embedding: This is the "magic" that allows a text query like "vibrant yellow" to find images of sunflowers or daffodils by understanding the visual concept of "vibrant" and "yellow" together.
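
The shared-space idea can be illustrated with a toy nearest-neighbor search (plain Python, not real CLIP; the 2-D vectors and filenames are made up for readability — real OpenCLIP embeddings are hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend image embeddings (2-D for readability)
image_vectors = {
    "sunflower.jpg": [0.9, 0.1],  # lives in the "vibrant yellow" region
    "bluebell.jpg":  [0.1, 0.9],
}
# Pretend embedding of the text query "vibrant yellow" — because CLIP
# maps text and images into the SAME space, it lands near the sunflower.
query_vector = [0.85, 0.2]

best = max(image_vectors, key=lambda k: cosine(query_vector, image_vectors[k]))
print(best)  # → sunflower.jpg
```

ChromaDB performs this same nearest-vector lookup at scale, with OpenCLIP supplying the real embeddings for both the query text and the stored images.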

B. Automated Data Ingestion

The system features an intelligent "cold start" mechanism:

  • It checks flower_collection.count().
  • If empty, it automatically downloads images from Hugging Face, saves them locally to ./data/flower_images, and indexes them. This creates a bridge between cloud datasets and local high-speed retrieval.
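
The cold-start check can be sketched as follows (a minimal pattern, not the repo's exact code; `ensure_index` and the `download_and_index` callback are hypothetical names standing in for the real Hugging Face download and ChromaDB indexing logic):

```python
import os

def ensure_index(collection, download_and_index, image_dir="./data/flower_images"):
    """Index the dataset only when the vector store is empty."""
    if collection.count() == 0:           # cold start: nothing indexed yet
        os.makedirs(image_dir, exist_ok=True)
        download_and_index(image_dir)     # pull images, embed, add to ChromaDB
        return True                       # ingestion ran
    return False                          # warm start: reuse persisted vectors
```

Because the ChromaDB client is persistent, the expensive download-and-embed pass runs once; every later session takes the warm-start branch.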

C. Multimodal Prompt Engineering

The core of the "Expert Florist" persona lies in the ChatPromptTemplate. It is structured to handle three distinct inputs simultaneously:

  1. Text Input: The user's original request (e.g., "romance").
  2. Image 1: The first visually retrieved flower.
  3. Image 2: The second visually retrieved flower.

The prompt instructs the model to use Visual Grounding, forcing it to reference the specific visual context provided rather than giving generic advice.
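
In the OpenAI-style chat format that GPT-4o consumes, the three inputs are interleaved in a single user turn. Below is a hedged sketch of that message shape (the system text paraphrases the "Expert Florist" instructions; the exact wording and helper name in the repo may differ):

```python
def build_florist_messages(user_request, image1_b64, image2_b64):
    """Assemble one multimodal chat payload: system persona + text + 2 images."""
    return [
        {"role": "system",
         "content": ("You are an expert florist. Ground every recommendation "
                     "in the two flower images provided; do not give generic "
                     "advice.")},
        {"role": "user",
         "content": [
             {"type": "text", "text": f"Design a bouquet for: {user_request}"},
             {"type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{image1_b64}"}},
             {"type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{image2_b64}"}},
         ]},
    ]
```

Keeping the images in the same user turn as the text is what lets the model ground its answer in the specific retrieved flowers rather than its generic floristry knowledge.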


3. The Multimodal Pipeline (Step-by-Step)

| Step | Action | Technology |
|------|--------|------------|
| 1. Input | User enters a natural language description. | Streamlit UI |
| 2. Retrieval | The text query is converted to a vector; ChromaDB finds the 4 most similar images. | OpenCLIP + ChromaDB |
| 3. Processing | Top images are converted to Base64 strings for transmission to the LLM. | Base64 encoding |
| 4. Reasoning | The LLM analyzes the text query plus images to design a cohesive bouquet. | GPT-4o (OpenRouter) |
| 5. Output | Professional florist advice is rendered with Markdown formatting. | Streamlit Markdown |
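
Step 3 (the Base64 conversion) is standard-library work; a minimal sketch, assuming JPEG images on disk:

```python
import base64

def image_to_base64(path):
    """Read an image file and return its Base64 string for a data: URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

The resulting string is embedded as `data:image/jpeg;base64,<string>` in the multimodal message, so the LLM receives the pixels inline rather than a URL it would have to fetch.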

4. Setup & Usage

Prerequisites

  • Python 3.10+
  • OpenRouter API Key (stored in a .env file as OPENROUTER_API_KEY)

Installation

```bash
pip install streamlit datasets chromadb langchain-openai langchain-core open-clip-torch pillow python-dotenv
```

Running the System

```bash
streamlit run your_filename.py
```


5. Visual Formatting Features

The system is designed for high readability, mimicking a professional blog or florist guide:

  • Columnar Layout: Retrieved images are displayed in a clean grid using st.columns.
  • Markdown Sections: The output is structured with specific headers: Complementary Colors, Monochromatic Scheme, and Tips for Arranging.
  • Real-time Feedback: Spinner components (st.spinner) keep the user informed during the heavy embedding and LLM inference tasks.

6. Evaluation (EVALS)

We employ a Three-Tier Evaluation Framework to ensure the system is both technically accurate and semantically helpful.

Tier 1: Technical Retrieval (Hit Rate)

Script: evaluate_retriever.py

Metric: Hit Rate @ 4

Goal: Does the system find the exact images defined in our ground_truth.json?

  • Current Status: 100% Pass. This confirms that our indexing IDs and retrieval logic are perfectly synchronized.
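
Hit Rate @ 4 is simple to compute; a sketch of the metric (the dict shapes here — query string to expected image ID, and query string to top-4 retrieved IDs — are assumptions about how `ground_truth.json` is laid out, not the repo's exact schema):

```python
def hit_rate_at_k(ground_truth, retrieved, k=4):
    """Fraction of queries whose expected image ID appears in the top-k results."""
    hits = sum(
        1 for query, expected_id in ground_truth.items()
        if expected_id in retrieved.get(query, [])[:k]
    )
    return hits / len(ground_truth)
```

A score of 1.0 means every ground-truth image was found within the top four results, which is what the 100% pass above reports.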

Tier 2: Visual Audit (LLM-as-a-Judge)

Script: evaluate_visual_judge.py

Metric: Semantic Accuracy (Score 1-5)

Goal: Does a "judge" LLM (GPT-4o) agree that the retrieved images match the user's color/vibe request?

  • Result: Average 4/5. The judge confirmed high relevance but successfully flagged a white flower retrieved for a yellow query, identifying a "Semantic Gap" for future optimization.

Tier 3: Generation Faithfulness

Script: evaluate_generator.py

Metric: Faithfulness & Answer Relevance

Goal: Does the Florist AI give advice based only on the images found, or does it hallucinate?

  • Result: Pass. The judge verified that the generated text specifically referenced flower species present in the retrieved images.
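
The core faithfulness question — does the advice name flowers that were not actually retrieved? — can be illustrated with a toy string check (the real `evaluate_generator.py` uses an LLM judge; this function and its `vocabulary` parameter are only an illustration of the idea):

```python
def hallucinated_species(advice_text, retrieved_species, vocabulary):
    """Species from a known vocabulary that the advice mentions
    but that were NOT among the retrieved images."""
    advice = advice_text.lower()
    mentioned = {s for s in vocabulary if s.lower() in advice}
    return mentioned - set(retrieved_species)
```

An empty result means every species the florist AI named was visually present in the retrieved context — the "Pass" condition above.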

7. Usage & Maintenance

Running Evaluations

To maintain system integrity after any code changes, run the following suite:

```bash
python .\evaluate_retriever.py     # Check technical matching
python .\evaluate_visual_judge.py  # Check visual relevance
python .\evaluate_generator.py     # Check advice faithfulness
```
