This repository contains a plugin for OCRmyPDF that allows using the Google Cloud Vision API as the OCR (Optical Character Recognition) engine instead of the default Tesseract engine.
Status: This plugin is functional but should be considered experimental. It relies on external components and specific configurations. Use at your own discretion.
Origin: This repository was originally forked from kkrell2016/ocrmypdf_plugin_GoogleVision. Significant modifications have been made since the fork to adapt to newer OCRmyPDF plugin interfaces, improve coordinate handling, and add features like baseline/font-size hints. Changes made here are specific to this repository and are not merged back into the original kkrell2016 repository.
Based On: This work also incorporates code adapted from dinosauria123/gcv2hocr for converting Google Cloud Vision API output to the hOCR format needed by OCRmyPDF.
- Performs OCR using Google Cloud Vision API’s DOCUMENT_TEXT_DETECTION.
- Integrates with OCRmyPDF via its plugin system (--plugin argument).
- Generates searchable PDF text layers based on GCV results.
- Attempts to map Tesseract language codes (used by OCRmyPDF's -l flag) to Google Cloud Vision language hints.
- Includes baseline and font size hints in the generated hOCR to improve text placement.
- Tesseract Still Required: This plugin currently relies on a separate installation of the Tesseract OCR engine for auxiliary functions like page orientation detection and deskewing. OCRmyPDF must be able to find your Tesseract installation for these features to work correctly. If Tesseract is not found, orientation/deskew steps will be skipped, potentially leading to incorrect page rotation or skewed text layers.
- hOCR Conversion Quality: Uses the included gcv2hocr2.py script for conversion. The accuracy of the final text placement depends heavily on the quality of this conversion and how well OCRmyPDF's renderer interprets the generated hOCR (including bounding boxes, baselines, and font sizes). While functional, further refinements might be needed for complex layouts or unusual fonts. Text placement may not be perfect.
- Cost: Using the Google Cloud Vision API incurs costs based on usage, although Google Cloud offers a free tier which may cover limited use. Please review the Vision API Pricing page.
- Compatibility: Developed and tested against OCRmyPDF v16.10.0. Compatibility with significantly older or newer versions is not guaranteed due to potential changes in OCRmyPDF's plugin API.
Before installing and using this plugin, ensure you have the following:
- OCRmyPDF: A working installation of OCRmyPDF (v16.10.0 or compatible recommended). If you don't have it, follow the official OCRmyPDF installation guide.
- Python: Python 3.10 or newer (check with python3 --version).
- Tesseract: The Tesseract OCR engine must be installed and findable in your system’s PATH. This is required for orientation and deskew functions used by the plugin. See OCRmyPDF Documentation - Installing Tesseract.
- Google Cloud Account & Setup:
- A Google Cloud Platform (GCP) project. How-to Guide: Creating and Managing Projects.
- The Cloud Vision API must be enabled for your project. How-to Guide: Enabling and Disabling APIs (Search for "Cloud Vision API"). Direct link to enable: Enable Vision API.
- Billing must be enabled for your project. How-to Guide: Modify a Project's Billing Settings.
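The local prerequisites above (Python version, Tesseract on PATH, an ocrmypdf command) can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the plugin; the function name check_prerequisites is hypothetical, and it cannot verify anything on the Google Cloud side.

```python
import shutil
import sys

# Minimum Python version stated in the prerequisites list.
MIN_PYTHON = (3, 10)


def check_prerequisites() -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    # shutil.which searches PATH the same way the shell would.
    for tool in ("tesseract", "ocrmypdf"):
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for line in issues or ["Local prerequisites look fine."]:
        print(line)
```

An empty result only means the tools are discoverable; it says nothing about their versions or about your GCP project setup.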
- Clone the Repository: Open your terminal or command prompt and run:
    git clone https://github.com/grantbarrett/ocrmypdf_plugin_GoogleVision.git
    cd ocrmypdf_plugin_GoogleVision
- Create & Activate Virtual Environment (Highly Recommended): This isolates dependencies and avoids conflicts with system packages.
    # Navigate into the cloned directory first if you haven't already
    python3 -m venv venv
    source venv/bin/activate  # On macOS/Linux
    # Or: venv\Scripts\activate.bat  # On Windows CMD
    # Or: .\venv\Scripts\Activate.ps1  # On Windows PowerShell
  Your terminal prompt should now start with (venv).
- Install Dependencies: Install the required Python libraries into your active virtual environment:
    pip install -U pip
    pip install ocrmypdf google-cloud-vision Pillow reportlab
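To confirm that the four packages actually landed in the active environment, you can query their installed versions from Python. The sketch below uses only importlib.metadata from the standard library; the helper name installed_versions is hypothetical and not something the plugin provides.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install step.
REQUIRED = ("ocrmypdf", "google-cloud-vision", "Pillow", "reportlab")


def installed_versions(packages=REQUIRED) -> dict:
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Run it with the virtual environment active; a package reported as missing usually means pip installed into a different interpreter.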
The plugin needs credentials to access the Google Cloud Vision API. Choose one of the following methods:
Method 1: Application Default Credentials (ADC) - Recommended
This is generally the easiest and most secure method for local use and many deployment types.
- Install Google Cloud CLI (gcloud): If you haven't already, follow the official instructions: Install the gcloud CLI.
- Log in and Set ADC: Run this command in your terminal (make sure your virtual environment from Step 2 above is still active) and follow the browser prompts to authenticate with the Google account linked to your GCP project:
    gcloud auth application-default login
The plugin (gvision.py) will automatically detect and use these credentials when you run ocrmypdf.
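If you want to see whether an ADC file is present before running OCR, the following sketch looks in the places the Google auth libraries conventionally check: the GOOGLE_APPLICATION_CREDENTIALS environment variable first, then the gcloud well-known file. It is an approximation under stated assumptions: find_adc_file is a hypothetical helper, it ignores overrides such as CLOUDSDK_CONFIG and metadata-server credentials, and it does not validate the credentials themselves.

```python
import os
from pathlib import Path


def find_adc_file(env=None, home=None):
    """Return the path of a likely ADC file, or None if none is found.

    env and home are injectable for testing; by default the real
    environment and home directory are used.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # An explicit environment variable takes precedence over the gcloud file.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        path = Path(explicit)
        return path if path.is_file() else None

    # Well-known location written by `gcloud auth application-default login`.
    if os.name == "nt" and env.get("APPDATA"):
        base = Path(env["APPDATA"]) / "gcloud"
    else:
        base = home / ".config" / "gcloud"
    candidate = base / "application_default_credentials.json"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    print(find_adc_file() or "No ADC file found in the usual locations.")
```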
Method 2: Service Account Key File
Use this method if ADC is not suitable for your environment (e.g., some automated systems).
- Create a Service Account Key:
- Go to the Google Cloud Console -> IAM & Admin -> Service Accounts.
- Select your GCP project.
- Choose an existing service account or create a new one.
- Ensure the service account has the necessary permissions to use the Vision API (e.g., the predefined Cloud Vision AI User role). How-to Guide: Granting Roles to Service Accounts.
- Create a JSON key for the service account and download it to a secure location on your computer. How-to Guide: Creating and managing service account keys. Treat this file like a password; do not commit it to your repository.
- Use the Key File: When running ocrmypdf, use the --gcv-keyfile argument (added by this plugin) and provide the full path to your downloaded JSON key file.
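A quick structural check of the downloaded key file can catch a wrong path or a wrong kind of JSON file before you pass it to --gcv-keyfile. This is a sketch only: check_keyfile is a hypothetical helper, the field list covers a few fields that service account keys normally contain, and passing the check does not prove the key is valid or has Vision API permissions.

```python
import json
from pathlib import Path

# A few fields normally present in a service account key file.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_keyfile(path) -> list:
    """Return a list of problems found in the key file; empty means it looks plausible."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"not a file: {key_path}"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        return [f"cannot read JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in data]
    if data.get("type") != "service_account":
        problems.append('"type" is not "service_account"')
    return problems


if __name__ == "__main__":
    import sys

    for line in check_keyfile(sys.argv[1]) or ["Key file looks structurally fine."]:
        print(line)
```

The script never prints the key contents, which keeps the private key out of terminal scrollback.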
- Make sure your virtual environment (e.g., venv) is activated.
- Run ocrmypdf from your terminal.
- Use the --plugin argument, providing the full path to the gvision.py script within the cloned repository directory.
- Use the -l LANG1[+LANG2...] argument to specify the language(s) in the document using Tesseract's 3-letter codes (e.g., eng, deu, ara, chi_sim). Separate multiple languages with +. The plugin will attempt to map these to appropriate Google Vision language hints for the API call.
- Recommended: Use --pdf-renderer hocr. This explicitly tells OCRmyPDF to use its hOCR-specific rendering pipeline. This seems necessary for reliable text placement with the hOCR generated by this plugin, especially compared to the default sandwich renderer, which might be chosen automatically for certain languages (like RTL).
- Optional: If using a service account key file (Method 2 for Authentication), add the --gcv-keyfile /path/to/your/keyfile.json argument.
- Optional: Use --force-ocr if your input PDF might already contain some text, to ensure OCR is performed anyway.
Example Command (using ADC):
ocrmypdf \
--plugin /path/to/cloned/repo/gvision.py \
-l eng+fra \
--pdf-renderer hocr \
my_document.pdf \
my_document_ocr.pdf \
--force-ocr
Example Command (using Service Account Key):
ocrmypdf \
--plugin /path/to/cloned/repo/gvision.py \
--gcv-keyfile /secure/path/to/my-gcp-key.json \
-l ara+eng \
--pdf-renderer hocr \
arabic_doc.pdf \
arabic_doc_ocr.pdf \
--force-ocr
(Remember to replace /path/to/cloned/repo/ and /secure/path/to/my-gcp-key.json with your actual paths.)
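To make the -l handling more concrete, here is a rough sketch of how a value such as ara+eng could be split and translated into language hints. The lookup table and the function name tesseract_to_gcv_hints are illustrative assumptions, not the plugin's actual mapping; the real table in gvision.py may cover more codes and treat unknown ones differently.

```python
# Illustrative table only: Tesseract 3-letter codes to ISO 639-1 style hints.
# The plugin's real mapping is not reproduced here.
ILLUSTRATIVE_HINTS = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "ara": "ar",
}


def tesseract_to_gcv_hints(lang_arg: str, table=ILLUSTRATIVE_HINTS) -> list:
    """Split a '+'-separated -l value and map each known code to a hint.

    Unknown codes are silently dropped in this sketch, and duplicates
    are removed while preserving the order given on the command line.
    """
    hints = []
    for code in lang_arg.split("+"):
        hint = table.get(code.strip())
        if hint and hint not in hints:
            hints.append(hint)
    return hints


if __name__ == "__main__":
    print(tesseract_to_gcv_hints("ara+eng"))
```

Dropping unknown codes rather than failing is a design choice for the sketch: an empty hint list simply leaves language detection to the API.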
- OCRmyPDF starts processing the input PDF.
- When it needs to perform OCR on a page image, it calls the GVisionOcrEngine provided by the gvision.py plugin (because you specified --plugin).
- The plugin authenticates with Google Cloud (using ADC or the key file).
- It sends the page image to the Google Cloud Vision API (document_text_detection), including mapped language hints based on your -l argument.
- It receives a detailed JSON response containing the recognized text and its coordinates (as pixel vertices).
- The plugin uses the included gcv2hocr2.py script to:
- Detect the image DPI using the Pillow library.
- Convert the GCV pixel coordinates to PDF points (1/72 inch) using the detected DPI.
- Transform Y-coordinates to a bottom-left origin system suitable for PDF/hOCR.
- Generate an hOCR file (HTML format) embedding the text and its position information (bounding boxes in points, calculated baseline hints, estimated font size hints).
- OCRmyPDF's rendering pipeline (specifically the hocr renderer, when selected via --pdf-renderer hocr) reads this hOCR file.
- The renderer creates an invisible text layer in the output PDF, attempting to match the position and scale specified in the hOCR.
- Auxiliary steps like orientation detection and deskewing are delegated to the installed Tesseract engine.
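The coordinate steps in the list above come down to simple arithmetic: one pixel at a given DPI is 72/DPI points, and moving to a bottom-left origin means measuring Y from the bottom edge of the image. The sketch below illustrates that arithmetic under stated assumptions; px_to_pt and vertices_to_bbox_pt are hypothetical names, and the actual calculations in gcv2hocr2.py may differ in detail.

```python
def px_to_pt(value_px: float, dpi: float) -> float:
    """Convert a pixel distance to PDF points (1 point = 1/72 inch)."""
    return value_px * 72.0 / dpi


def vertices_to_bbox_pt(vertices, dpi: float, image_height_px: float):
    """Turn GCV-style (x, y) pixel vertices into (x0, y0, x1, y1) in points.

    Input vertices use a top-left origin (y grows downward); the result
    uses a bottom-left origin (y grows upward), so y is flipped against
    the image height before scaling.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    x0 = px_to_pt(min(xs), dpi)
    x1 = px_to_pt(max(xs), dpi)
    y0 = px_to_pt(image_height_px - max(ys), dpi)  # bottom edge
    y1 = px_to_pt(image_height_px - min(ys), dpi)  # top edge
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    # A word box on a 300 DPI scan of an 11-inch-tall page (3300 px).
    word = [(300, 300), (900, 300), (900, 360), (300, 360)]
    print(vertices_to_bbox_pt(word, dpi=300, image_height_px=3300))
```

Taking the min and max of the vertices yields an axis-aligned box, which loses any rotation present in the original quadrilateral; that is one plausible source of the placement drift described in the limitations.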
- google.auth.exceptions.DefaultCredentialsError: Your Application Default Credentials are not set up correctly or cannot be found.
  - Ensure you have run gcloud auth application-default login in the same terminal session where you are running ocrmypdf (and where your virtual environment is active).
  - Make sure you authenticated with the Google account linked to the correct GCP project (the one with Vision API enabled).
  - Alternatively, switch to using the --gcv-keyfile method.
- ValueError: GCV key file not found: The path provided to --gcv-keyfile is incorrect or the file is not readable. Double-check the path and file permissions.
- Text Layer Missing in Output PDF:
  - Verify the Google Cloud Vision API call succeeded. Check the console output for any errors reported by gvision.py or messages starting with google.api_core.exceptions.
  - Run with --keep-temporary-files and check the temporary directory. Inside the subdirectories for each page (e.g., page_001), ensure both an .hocr file (e.g., ocr.hocr) and a .txt file (e.g., 000001_ocr_tess.txt) were created.
  - Open the .hocr file. Does it contain valid HTML with ocr_page, ocr_line, and ocrx_word elements? Does it include the recognized text?
  - Confirm you are using the --pdf-renderer hocr argument when running ocrmypdf. The sandwich renderer may not correctly process the hOCR from this plugin.
- Text Misaligned / Incorrect Position:
  - This is the most common known issue with the current version. The text layer exists, but highlighting it shows it doesn't precisely overlay the text in the image.
  - Ensure --pdf-renderer hocr is used.
  - The misalignment likely stems from inaccuracies in converting the precise geometry (bounding boxes, font size, baseline) from the GCV response into the hOCR format, and how OCRmyPDF's renderer interprets these hints. The calculations in gcv2hocr2.py are heuristics and may not perfectly match the original font metrics or layout.
  - Check the console output for any "invalid line box" warnings during the run. If these reappear, there might still be coordinate calculation issues in gcv2hocr2.py.
  - Further improvements would likely require more sophisticated analysis of the GCV response or adjustments to the hOCR generation in gcv2hocr2.py.
- Tesseract Errors (Orientation/Deskew):
  - Ensure Tesseract is installed correctly and its executable is in your system's PATH environment variable. OCRmyPDF (and this plugin) needs to be able to run the tesseract command.
  - The plugin logs errors if it cannot find or execute Tesseract for these steps. The main OCR process will still use Google Vision, but pages might not be correctly rotated or deskewed.
- Plugin Not Found / Import Errors:
  - Make sure your virtual environment is active when running ocrmypdf.
  - Ensure all dependencies (ocrmypdf, google-cloud-vision, Pillow, reportlab) were installed correctly within the active virtual environment (pip list).
  - Verify the path provided to --plugin points correctly to the gvision.py file.
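For the hOCR checks mentioned under "Text Layer Missing" and "Text Misaligned", a small script can count the expected element classes and flag bounding boxes with zero or negative width or height. This is a diagnostic sketch using only the standard library; HocrInspector and inspect_hocr are hypothetical names, and whether a flagged box corresponds to OCRmyPDF's "invalid line box" warning is an assumption, not something verified against its source.

```python
import re
from html.parser import HTMLParser

# Matches "bbox x0 y0 x1 y1" in an hOCR title attribute (integers or decimals).
_NUM = r"(-?\d+(?:\.\d+)?)"
BBOX_RE = re.compile(r"bbox\s+" + r"\s+".join([_NUM] * 4))
WANTED = ("ocr_page", "ocr_line", "ocrx_word")


class HocrInspector(HTMLParser):
    """Collect class counts and degenerate bounding boxes from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.counts = {name: 0 for name in WANTED}
        self.bad_boxes = []  # list of (element id, (x0, y0, x1, y1))

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        for name in WANTED:
            if name in classes:
                self.counts[name] += 1
        match = BBOX_RE.search(attr.get("title") or "")
        if match:
            x0, y0, x1, y1 = map(float, match.groups())
            if x1 <= x0 or y1 <= y0:
                self.bad_boxes.append((attr.get("id"), (x0, y0, x1, y1)))


def inspect_hocr(text: str):
    """Return (class counts, list of degenerate boxes) for an hOCR document."""
    inspector = HocrInspector()
    inspector.feed(text)
    inspector.close()
    return inspector.counts, inspector.bad_boxes


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        counts, bad = inspect_hocr(handle.read())
    print(counts)
    for element_id, box in bad:
        print("degenerate bbox:", element_id, box)
```

Point it at an .hocr file kept by --keep-temporary-files; zero counts suggest the conversion produced no usable elements, while flagged boxes point at geometry worth inspecting by hand.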