A GTK4 application that converts document images to organized PDFs using OCR technology. It automatically detects page numbers, organizes documents, and allows manual corrections when OCR fails.
- Parallel OCR Processing: Uses multiple CPU cores for faster image processing
- Automatic Page Detection: Extracts page numbers using Tesseract OCR
- Manual Correction: Interactive dialog for correcting OCR failures
- Smart Organization: Automatically organizes PDFs by page numbers
- Cache System: Skips already processed images to avoid reprocessing
- Modern UI: Built with GTK4 and Libadwaita for a native Linux experience
- Real-time Logs: Live monitoring of processing status and errors
- Configurable Settings: Adjustable maximum pages and processing threads
- Linux operating system
- Python 3.8 or higher
- GTK4 development libraries
- Tesseract OCR engine
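Before installing anything, it can help to see which prerequisites are already present. The following is a minimal sketch, not part of the project: `missing_prerequisites` is a hypothetical helper that checks only the Python version and whether a `tesseract` executable is on `PATH`. It does not verify the GTK4 development libraries.

```python
import shutil
import sys


def missing_prerequisites(which=shutil.which, version=sys.version_info):
    """Return readable names of prerequisites that appear to be missing.

    `which` and `version` are injectable so the check can be exercised
    without touching the real system (an illustrative design choice).
    """
    missing = []
    # The requirements list asks for Python 3.8 or higher.
    if tuple(version[:2]) < (3, 8):
        missing.append("Python 3.8+")
    # The Tesseract engine normally installs a `tesseract` executable on PATH.
    if which("tesseract") is None:
        missing.append("Tesseract OCR engine")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All checked prerequisites found.")
```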
Debian/Ubuntu:
sudo apt update
sudo apt install python3 python3-pip tesseract-ocr tesseract-ocr-por libgtk-4-dev libadwaita-1-dev
Fedora:
sudo dnf install python3 python3-pip tesseract tesseract-langpack-por gtk4-devel libadwaita-devel
Arch Linux:
sudo pacman -S python python-pip tesseract tesseract-data-por gtk4 libadwaita
- Clone the repository:
git clone https://github.com/EmanuProds/ncx-book-organizer.git
cd ncx-book-organizer
- Create a virtual environment (recommended):
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install Python dependencies:
pip install pytesseract pillow pygobject
- Activate the virtual environment (if created):
source venv/bin/activate
- Run the application:
python main.py
- Select Input Directory: Choose the folder containing your document images (JPG/JPEG)
- Select Output Directory: Choose where the organized PDFs will be saved
- Configure Settings (optional):
  - Maximum pages: Set the total number of pages in your document
  - Number of processes: Adjust parallel processing (0 = auto-detect)
- Start Processing: Click "Start Processing" and monitor progress in the Logs tab
- Manual Corrections: If OCR fails, the app will prompt for manual page number input
The application creates organized PDFs with the following naming convention:
- FL. 001.pdf, FL. 002.pdf, etc. - Regular pages
- FL. 001-verso.pdf - Back sides of pages
- TERMO DE ABERTURA.pdf - Opening terms
- TERMO DE ENCERRAMENTO.pdf - Closing terms
- ERRO_OCR_filename.pdf - Files that couldn't be processed
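The naming convention can be illustrated with a small helper. This is a sketch under assumptions rather than the project's actual code: `PageKind` and `output_name` are hypothetical names, and the three-digit zero padding is inferred from the examples in the list.

```python
from enum import Enum


class PageKind(Enum):
    """Hypothetical categories matching the file names listed above."""
    REGULAR = "regular"
    VERSO = "verso"
    OPENING_TERM = "opening"
    CLOSING_TERM = "closing"
    OCR_ERROR = "error"


def output_name(kind, page=None, source_stem=None):
    """Build the PDF file name for one processed image."""
    if kind is PageKind.OPENING_TERM:
        return "TERMO DE ABERTURA.pdf"
    if kind is PageKind.CLOSING_TERM:
        return "TERMO DE ENCERRAMENTO.pdf"
    if kind is PageKind.OCR_ERROR:
        if not source_stem:
            raise ValueError("an OCR error file needs the source file name")
        return f"ERRO_OCR_{source_stem}.pdf"
    if page is None or page < 1:
        raise ValueError("regular and verso pages need a positive page number")
    # Pad to three digits, as in "FL. 001.pdf" (assumed from the examples).
    suffix = "-verso" if kind is PageKind.VERSO else ""
    return f"FL. {page:03d}{suffix}.pdf"
```

For example, `output_name(PageKind.VERSO, page=12)` yields `FL. 012-verso.pdf`.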
- Language: Portuguese (por)
- PSM Mode: 6 (Uniform block of text)
- ROI: Configurable region of interest for page number detection
- Maximum Pages: Default 300 pages
- Parallel Processes: Default 4 workers
- Cache System: Automatically detects and skips already processed files
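As a rough illustration of how these settings might map onto a pytesseract call, consider the sketch below. It is not the project's implementation: `OcrSettings`, `parse_page_number`, and `read_page_number` are hypothetical names, the ROI format (a Pillow crop box) is an assumption, and taking the first in-range integer as the page number is a simplification.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OcrSettings:
    language: str = "por"   # Portuguese, as listed above
    psm: int = 6            # Tesseract page segmentation mode 6
    # Assumed ROI format: (left, upper, right, lower) in pixels; None = whole image.
    roi: Optional[Tuple[int, int, int, int]] = None
    max_pages: int = 300    # default maximum listed above


def parse_page_number(text, max_pages):
    """Return the first integer in `text` within 1..max_pages, else None."""
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if 1 <= value <= max_pages:
            return value
    return None


def read_page_number(image_path, settings=OcrSettings()):
    """Run OCR on the configured region and parse a page number from it."""
    # Imported lazily so parse_page_number works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        region = image.crop(settings.roi) if settings.roi else image
        text = pytesseract.image_to_string(
            region, lang=settings.language, config=f"--psm {settings.psm}"
        )
    return parse_page_number(text, settings.max_pages)
```

A `None` result from `read_page_number` would correspond to the case where the application asks for manual page number input.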
The application follows a modern, service-oriented architecture with clear separation of concerns:
src/
├── models.py # Data models and domain entities (dataclasses & enums)
├── exceptions.py # Custom exception hierarchy
├── config.py # Application configuration
├── core.py # Legacy processing logic (backward compatibility)
├── services/ # Modern service layer
│ ├── file_service.py # File operations and caching
│ ├── ocr_service.py # OCR processing and image manipulation
│ └── processing_service.py # Main processing coordination
├── interface/ # GTK4 UI layer
│ ├── entrypoint.py # Application initialization
│ ├── gui.py # Main window and navigation
│ ├── home.py # Processing interface
│ ├── pref.py # Preferences/settings page
│ ├── logs.py # Logging interface
│ └── about.py # About dialog
├── ocr.py # Legacy OCR functions (deprecated)
└── __init__.py # Package initialization
- Use SSD storage for faster I/O
- Increase parallel processes for multi-core systems
- Process images in batches for better cache utilization
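The tips on parallelism and batching can be sketched with a few helpers. These are illustrative assumptions, not the project's code: `resolve_workers`, `pending_images`, and `batched` are hypothetical names, and the cache is modeled simply as a set of already processed file names.

```python
import os
from pathlib import Path


def resolve_workers(requested):
    """Map the 'number of processes' setting to a worker count (0 = auto-detect)."""
    if requested < 0:
        raise ValueError("process count must be zero or positive")
    # os.cpu_count() may return None, so fall back to a single worker.
    return requested or (os.cpu_count() or 1)


def pending_images(input_dir, processed_names):
    """List JPG/JPEG files whose names are not in the cache, in a stable order."""
    suffixes = {".jpg", ".jpeg"}
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.suffix.lower() in suffixes and path.name not in processed_names
    )


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch could then be handed to a `concurrent.futures.ProcessPoolExecutor` created with `max_workers=resolve_workers(n)`, where `n` is the configured number of processes.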
This project is licensed under the MIT License.