This project presents a complete pipeline that takes an image as input, generates a descriptive caption in English, and then translates that caption into Farsi.
It serves as a practical example of combining state-of-the-art computer vision and natural language processing models.
This pipeline is built upon two powerful deep learning models:
- Image Captioning (ClipCap) – Uses the ClipCap architecture, which connects the visual understanding of OpenAI's CLIP model with the text-generation capabilities of a GPT-2 language model. It translates the image's content into a meaningful prefix that guides the language model to generate a relevant caption.
- Translation (SeamlessM4T v2) – Leverages Meta AI's SeamlessM4T v2, a multilingual, multitask model that is highly effective at translating text between numerous languages. Here, it converts the generated English captions into Farsi.
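The captioning half can be illustrated with a toy version of ClipCap's mapping network. The dimensions below (512-d CLIP embedding, 768-d GPT-2 token embeddings, prefix length 10) follow the original ClipCap paper's COCO configuration; this is a sketch of the idea, not this project's actual implementation.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Toy ClipCap-style mapping network: projects a single CLIP image
    embedding into a sequence of pseudo-token embeddings that GPT-2
    consumes as a prefix before generating the caption."""

    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768, prefix_length: int = 10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        hidden = (clip_dim + gpt_dim * prefix_length) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        out = self.mlp(clip_embedding)
        return out.view(-1, self.prefix_length, self.gpt_dim)

mapper = PrefixMapper()
fake_clip_embedding = torch.randn(1, 512)  # stands in for a real CLIP feature
prefix = mapper(fake_clip_embedding)
print(tuple(prefix.shape))  # (1, 10, 768)
```

GPT-2 then attends to these 10 prefix embeddings exactly as if they were ordinary token embeddings, which is what lets a frozen language model describe an image.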
The process is orchestrated by the main script and can be broken down into the following steps:
- Input – The user provides an image via a command-line argument (URL or local file path).
- Image Loading – The script fetches and loads the image into a format suitable for processing.
- Caption Generation – The ImageCaptioner extracts visual features using CLIP, passes them through the ClipCap projection network, and generates an English caption with GPT-2.
- Translation – The TranslationModel uses SeamlessM4T to translate the English caption into Farsi.
- Output – Both captions are printed to the console.
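The image-loading step (step 2) can be sketched as follows. This is a minimal illustration assuming Pillow; `load_image` is a hypothetical helper name, not necessarily what `main.py` uses.

```python
from io import BytesIO
from urllib.request import urlopen

from PIL import Image

def load_image(source: str) -> Image.Image:
    """Open `source` as an RGB PIL image, whether it is a URL or a local path."""
    if source.startswith(("http://", "https://")):
        data = urlopen(source).read()  # fetch the remote image bytes
        return Image.open(BytesIO(data)).convert("RGB")
    return Image.open(source).convert("RGB")
```

Converting to RGB up front matters because CLIP's preprocessing expects three-channel input, while downloaded images may arrive as grayscale or RGBA.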
Follow these steps to get the project running on your local machine.
```bash
git clone https://github.com/zedsharifi/Farsi-Image-Captioner-Translator.git
cd Farsi-Image-Captioner-Translator
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Run the script from your terminal.
The first time you run it:
- The captioner weights (coco_weights.pkl, ~235MB) will be downloaded.
- The translation model will also be downloaded automatically by the transformers library.
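The first-run download logic can be sketched like this. The URL below is a placeholder, as the real download location is defined inside the project code.

```python
import os
import urllib.request

WEIGHTS_PATH = "coco_weights.pkl"
WEIGHTS_URL = "https://example.com/coco_weights.pkl"  # placeholder, not the real URL

def ensure_weights(path: str = WEIGHTS_PATH, url: str = WEIGHTS_URL) -> str:
    """Download the captioner weights once; later runs reuse the local copy."""
    if os.path.exists(path):
        print(f"Model weights already exist at {path}.")
        return path
    print(f"Downloading model weights to {path}...")
    urllib.request.urlretrieve(url, path)
    return path
```

Caching the ~235MB file locally is what makes the second and subsequent runs start much faster than the first.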
```bash
python main.py "https://i.ytimg.com/vi/vEyP6J61H4s/maxresdefault.jpg"
python main.py "./images/my_photo.jpg"
```

To prevent the script from opening a window that displays the input image, pass `--no-display`:

```bash
python main.py "path/to/your/image.jpg" --no-display
```

Example session:

```
> python main.py "https://i.ytimg.com/vi/vEyP6J61H4s/maxresdefault.jpg"
Loading image from: https://i.ytimg.com/vi/vEyP6J61H4s/maxresdefault.jpg
Model weights already exist at coco_weights.pkl.
Loading translation model...
Translation model loaded.
Generating caption...
[English Caption]: a cat sitting on a couch with a remote control
Translating caption to Farsi...
[Farsi Translation]: یه گربه روی مبل با کنترل از راه دور نشسته
```

Here are 8 examples of the pipeline in action.
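The translation step behind the output above can be reproduced directly with the transformers library. This is a sketch assuming the `facebook/seamless-m4t-v2-large` checkpoint and the `pes` (Western Persian) target-language code; the project's own TranslationModel may wrap this differently, and the checkpoint is several gigabytes, so the function loads it lazily.

```python
def translate_en_to_fa(text: str, checkpoint: str = "facebook/seamless-m4t-v2-large") -> str:
    """Translate English text to Farsi with SeamlessM4T v2 (text-to-text).

    Imports and loads the model lazily, since the checkpoint is very large.
    """
    from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = SeamlessM4Tv2ForTextToText.from_pretrained(checkpoint)
    inputs = processor(text=text, src_lang="eng", return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang="pes")  # "pes" = Western Persian
    return processor.decode(tokens[0].tolist(), skip_special_tokens=True)
```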
This project is released under the MIT License.