A Python script that converts PDF files to text using the docling library. This tool is designed to batch process PDF files, making it easy to extract text content from multiple documents at once.
- Automatically processes all PDF files in the assets directory
- Creates text output files in the output directory
- Maintains the original filename (with .txt extension)
- Handles conversion errors gracefully
- Reports conversion progress and status
- Python 3.x
- docling library
- Virtual environment (included in the repository)
docling_converter/
├── assets/ # Place PDF files here
├── output/ # Converted text files appear here
├── env/ # Virtual environment
├── main.py # Main conversion script
└── README.md # This file
- Ensure Python 3.x is installed on your system
- Activate the virtual environment:
  - Windows: .\env\Scripts\activate
  - Unix/macOS: source env/bin/activate
- Place your PDF files in the assets directory
- Run the script: python main.py
- Find the converted text files in the output directory
The script uses the docling library's DocumentConverter to process PDF files. For each PDF file in the assets directory, it:
- Creates an output directory if it doesn't exist
- Converts the PDF content to text
- Saves the text with the same filename (but .txt extension) in the output directory
- Reports success or any errors encountered
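The steps above can be sketched roughly as follows. This is not the actual main.py, only a minimal illustration: the function names make_docling_to_text and convert_directory are hypothetical, and export_to_markdown() is used because it is the export shown in docling's quickstart, so the real script may call a different export method. The docling import is deferred so the loop itself can be tried with any callable that maps a path to a string.

```python
from pathlib import Path


def make_docling_to_text():
    """Build a path -> text callable backed by docling (hypothetical helper)."""
    # Imported lazily so the directory loop below can run without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # created once and reused for every file

    def to_text(pdf_path):
        result = converter.convert(str(pdf_path))
        # Assumption: main.py may use a different export method.
        return result.document.export_to_markdown()

    return to_text


def convert_directory(assets_dir, output_dir, to_text=None):
    """Write one .txt file per PDF found in assets_dir; return the written paths."""
    if to_text is None:
        to_text = make_docling_to_text()
    assets, output = Path(assets_dir), Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)  # create output/ if it is missing
    written = []
    for pdf_path in sorted(assets.glob("*.pdf")):
        target = output / (pdf_path.stem + ".txt")  # same name, .txt extension
        target.write_text(to_text(pdf_path), encoding="utf-8")
        print(f"Converted {pdf_path.name} -> {target.name}")
        written.append(target)
    return written
```

With docling installed, calling convert_directory("assets", "output") would process every PDF in assets. Passing a custom to_text callable makes it possible to exercise the loop without docling.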
The script includes error handling for individual file conversions. If a file fails to convert, the script:
- Prints an error message with the specific file name
- Continues processing remaining files
- Maintains a log of any conversion failures
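One way to get this behaviour is to wrap each conversion in its own try/except and collect the failures. The sketch below is an assumption about the approach, not the script's actual code: convert_all is a hypothetical name, to_text stands for whatever callable performs the conversion, and the failure log is kept as an in-memory list because the README does not say where the real log is stored.

```python
from pathlib import Path


def convert_all(pdf_paths, output_dir, to_text):
    """Convert each PDF independently; return a list of (filename, error) failures."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    failures = []  # assumption: the failure log is a simple in-memory list
    for pdf_path in map(Path, pdf_paths):
        try:
            text = to_text(pdf_path)
            (output / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")
            print(f"Converted {pdf_path.name}")
        except Exception as exc:
            # Report the failing file by name, record it, and keep going.
            print(f"Error converting {pdf_path.name}: {exc}")
            failures.append((pdf_path.name, str(exc)))
    return failures
```

Catching Exception per file keeps one corrupt PDF from aborting the whole batch, and the returned list can be printed as a summary once the run finishes.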