The project was turned into a library without abandoning the original flow. The architecture keeps the implementation small and easy to understand.

- `config.py`
  - Holds `ProcessorConfig` and `DatabaseConfig`.
  - Used by both the API and the CLI.
- `parser.py`
  - Reads PDF text from the first page.
  - Applies regex extraction for invoice number and date.
- `processor.py`
  - Discovers files.
  - Coordinates parsing.
  - Handles per-file failures without aborting the whole batch.
  - Delegates Excel export and optional database persistence.
- `excel.py`
  - Converts the records into an `.xlsx` file.
- `database.py`
  - Manages the MySQL connection and inserts records.
- `cli.py`: terminal entry point.
- `__init__.py`: public package API.
- `legacy.py`: compatibility helper for the original project style.
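The regex step described for `parser.py` can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the patterns, the `InvoiceFields` dataclass, and the `extract_fields` function are all hypothetical names, and the input is assumed to be first-page text that has already been extracted from the PDF.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns; the project's real expressions may differ.
INVOICE_NUMBER_RE = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)
INVOICE_DATE_RE = re.compile(
    r"Date\s*:?\s*(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE
)


@dataclass
class InvoiceFields:
    """Assumed record shape: the two fields the parser is said to extract."""
    number: Optional[str]
    date: Optional[str]


def extract_fields(first_page_text: str) -> InvoiceFields:
    """Pull the invoice number and date out of already-extracted page text."""
    number = INVOICE_NUMBER_RE.search(first_page_text)
    date = INVOICE_DATE_RE.search(first_page_text)
    # A missing match yields None instead of raising, so the caller decides
    # whether an incomplete record counts as a failure.
    return InvoiceFields(
        number=number.group(1) if number else None,
        date=date.group(1) if date else None,
    )
```

Keeping the patterns as module-level constants is one way to make the regex an obvious extension point: a caller could swap in a different compiled pattern without touching the extraction logic.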

    CLI or Python API
            |
            v
    ProcessorConfig / DatabaseConfig
            |
            v
    InvoiceProcessor.process()
            |
            +--> discover PDF files
            +--> parse each PDF
            +--> optionally insert into MySQL
            +--> export records to Excel
            v
    ProcessingResult

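The flow in the diagram could be approximated by the sketch below. It is an assumption-laden outline rather than the real `InvoiceProcessor`: the `process` function, the fields of `ProcessingResult`, and the injected `parse`, `export`, and `insert` callables are hypothetical stand-ins for the parser, Excel, and database modules.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class ProcessingResult:
    # Field names are assumptions; the source only names the class.
    records: list = field(default_factory=list)
    failures: list = field(default_factory=list)  # (path, error message) pairs


def process(
    input_dir: Path,
    parse: Callable[[Path], dict],
    export: Callable[[list], None],
    insert: Optional[Callable[[list], None]] = None,
) -> ProcessingResult:
    result = ProcessingResult()
    # Discover PDF files (non-recursive here, sorted for a stable order).
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        try:
            # Parse each PDF; a failure is recorded and the loop continues.
            result.records.append(parse(pdf_path))
        except Exception as exc:
            result.failures.append((pdf_path, str(exc)))
    # Optionally persist to the database, then export, as in the diagram.
    if insert is not None:
        insert(result.records)
    export(result.records)
    return result
```

Passing the collaborators in as callables keeps the sketch runnable without a PDF library, Excel writer, or MySQL server. The `try`/`except` around the parse call is what keeps one bad file from aborting the batch.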
- Keep batch processing resilient: one bad file should not stop the others.
- Keep parsing simple: first page + regex, in the spirit of the original project.
- Keep extension points obvious: regex, recursion, database persistence, and public helper functions.
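As an illustration of the recursion extension point, file discovery might be toggled with a single flag. The `discover_pdfs` function and its `recursive` parameter are hypothetical names chosen for this sketch; the project may expose the option differently.

```python
from pathlib import Path


def discover_pdfs(root: Path, recursive: bool = False) -> list:
    """Return PDF paths under root, sorted for a stable processing order."""
    pattern = "*.pdf"
    # rglob descends into subdirectories; glob stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```

With `recursive=False` only PDFs directly inside `root` are returned. With `recursive=True` PDFs in nested folders are included as well.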