PaddleOCR-VL-1.5 is a unified vision-language OCR model that supports multiple document understanding tasks through a single model architecture. It handles diverse OCR scenarios, including text recognition, table extraction, formula recognition, chart understanding, seal recognition, and text spotting with bounding boxes.
Here, we will show you how to use PaddleOCR-VL-1.5 in X-AnyLabeling to perform various OCR and document understanding tasks.
Let's get started!
PaddleOCR-VL-1.5 supports six distinct tasks:
| Task | Description | Output |
|---|---|---|
| OCR | Optical Character Recognition for text extraction | Text content |
| Table Recognition | Extract table structure and content | HTML/Markdown table |
| Formula Recognition | Recognize mathematical formulas | LaTeX format |
| Chart Recognition | Extract information from charts and graphs | Structured data |
| Text Spotting | Detect and recognize text with bounding boxes | Polygon shapes with text |
| Seal Recognition | Recognize seal stamps and chop marks | Text content |
When the PPOCR panel uses PP-DocLayoutV3 for layout detection and then sends each cropped block to PaddleOCR-VL-1.5, labels are routed as follows:
| Routed Task | PP-DocLayoutV3 Labels |
|---|---|
| OCR | doc_title, paragraph_title, header, footer, content, reference, reference_content, text, vertical_text, aside_text, abstract, footnote, vision_footnote, figure_title, number |
| Table Recognition | table |
| Formula Recognition | inline_formula, display_formula, formula_number, algorithm |
| Chart Recognition | chart |
| Seal Recognition | seal |
| Image Only (no text task) | image, header_image, footer_image |
Note
PP-DocLayoutV3 officially provides 25 labels. X-AnyLabeling also routes one additional compatibility label, formula, to Formula Recognition when it appears in the layout output.
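The routing above can be written as a plain lookup table. The following is a minimal sketch that transcribes the labels from the table. The names ROUTES, LABEL_TO_TASK, and route_label are illustrative and are not taken from the X-AnyLabeling source.

```python
# Routing table transcribed from the label list above; all names are illustrative.
ROUTES = {
    "OCR": [
        "doc_title", "paragraph_title", "header", "footer", "content",
        "reference", "reference_content", "text", "vertical_text",
        "aside_text", "abstract", "footnote", "vision_footnote",
        "figure_title", "number",
    ],
    "Table Recognition": ["table"],
    "Formula Recognition": [
        "inline_formula", "display_formula", "formula_number", "algorithm",
        "formula",  # extra compatibility label, not one of the 25 official ones
    ],
    "Chart Recognition": ["chart"],
    "Seal Recognition": ["seal"],
    "Image Only": ["image", "header_image", "footer_image"],
}

# Invert the table so each layout label maps to exactly one task.
LABEL_TO_TASK = {
    label: task for task, labels in ROUTES.items() for label in labels
}


def route_label(label):
    """Return the task for a layout label, or None if the label is unknown."""
    return LABEL_TO_TASK.get(label)


if __name__ == "__main__":
    for label in ("table", "footnote", "formula", "header_image"):
        print(label, "->", route_label(label))
```

Blocks routed to "Image Only" are kept as regions but are not sent to a text task.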
You'll need to get X-AnyLabeling-Server up and running first. Check out the installation guide for the details. Make sure you're running at least v0.0.7 of the server and v3.3.9 of the X-AnyLabeling client, otherwise you might run into compatibility issues.
Important
PaddleOCR-VL-1.5 requires transformers>=5.0.0. Install it with:
    python -m pip install "transformers>=5.0.0"
For more details, see the official model page.
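To confirm that the installed version satisfies this requirement before starting the server, a small check along the following lines may help. It is a sketch using only the standard library. The helper names parse_release and meets_minimum are made up for illustration, and the comparison ignores pre-release suffixes such as rc1.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (5, 0, 0)  # from the requirement transformers>=5.0.0


def parse_release(version_string):
    """Extract the leading numeric release, e.g. '5.0.0.dev0' -> (5, 0, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    # Missing components (as in '5' or '5.1') are treated as zero.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version_string, minimum=MINIMUM):
    """Compare numeric release parts only; pre-release tags are ignored."""
    return parse_release(version_string) >= minimum


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"transformers {installed}: {status}")
```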
Tip
We highly recommend installing flash-attn to boost performance and reduce memory usage:
    pip install flash-attn --no-build-isolation
Once that's done, head over to configs/models.yaml and enable paddleocr_vl_1_5. There's an example config you can reference if you're not sure how to set it up.
You can tweak the settings in paddleocr_vl_1_5.yaml to fit your needs.
| Parameter | Default | Description |
|---|---|---|
| model_path | PaddlePaddle/PaddleOCR-VL-1.5 | HuggingFace model path |
| device | cuda:0 | Device for inference |
| torch_dtype | bfloat16 | Model precision |
| max_new_tokens | 512 | Maximum tokens for generation |
| max_pixels | 1003520 | Maximum pixels for text tasks (1280×28×28) |
| spotting_max_pixels | 1605632 | Maximum pixels for spotting task (2048×28×28) |
| spotting_upscale_threshold | 1500 | Threshold (in pixels) for image upscaling in spotting |
Note
If inference times out, try lowering max_new_tokens, max_pixels, spotting_max_pixels, or spotting_upscale_threshold to fit your GPU memory.
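The pixel limits are described as a count multiplied by a 28×28 area. The sketch below reproduces that arithmetic and estimates how much an oversized image would need to shrink to fit a given limit. The 28-pixel side is taken from the descriptions in the table. The downscale rule is an assumption for illustration and may differ from how the server actually resizes images.

```python
import math

PATCH_SIDE = 28  # assumed from the "×28×28" factors in the parameter descriptions


def pixel_budget(patches, patch_side=PATCH_SIDE):
    """Pixel limit for a given number of patch_side x patch_side patches."""
    return patches * patch_side * patch_side


def downscale_factor(width, height, max_pixels):
    """Scale factor (at most 1.0) that brings width * height within max_pixels.

    Hypothetical helper: it only illustrates the size of the reduction.
    """
    area = width * height
    if area <= max_pixels:
        return 1.0
    # Both sides shrink by the same factor, so the area shrinks by its square.
    return math.sqrt(max_pixels / area)


if __name__ == "__main__":
    print("1280 patches:", pixel_budget(1280))
    print("2048 patches:", pixel_budget(2048))
    factor = downscale_factor(4000, 3000, pixel_budget(2048))
    print("4000x3000 scale factor:", round(factor, 3))
```

Halving a pixel limit cuts the number of visual tokens roughly in half, which is usually the quickest way to reduce both memory use and latency.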
Launch the X-AnyLabeling client, then press Ctrl+A or click the AI button in the left menu bar to open the auto-labeling panel. In the model dropdown list, select Remote-Server, then choose PaddleOCR-VL-1.5.
The OCR task extracts text content from images.
Usage:
- Select the "OCR" task from the task dropdown
- Click the "Run" button to extract text
- The recognized text will be displayed in the description field
The Table Recognition task extracts table structure and content from document images.
Usage:
- Select the "Table Recognition" task from the task dropdown
- Click the "Run" button to extract table content
- The result will be formatted as an HTML or Markdown table
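If the result comes back as HTML, it can be turned into rows of cell text for further processing. The sketch below uses only the standard library. The SAMPLE string is a made-up example rather than real model output, and the class and function names are illustrative.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical input for illustration only.
SAMPLE = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Pen</td><td>2</td></tr></table>"
)

if __name__ == "__main__":
    for row in html_table_to_rows(SAMPLE):
        print(row)
```

This sketch does not expand merged cells (rowspan or colspan), so tables that use them need extra handling.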
The Formula Recognition task recognizes mathematical formulas and converts them to LaTeX format.
Usage:
- Select the "Formula Recognition" task from the task dropdown
- Click the "Run" button to recognize formulas
- The result will be in LaTeX format
The Chart Recognition task extracts information from charts and graphs.
Usage:
- Select the "Chart Recognition" task from the task dropdown
- Click the "Run" button to analyze the chart
- The extracted data will be displayed in a structured format
The Text Spotting task detects text regions and recognizes their content with polygon bounding boxes.
Usage:
- Select the "Text Spotting" task from the task dropdown
- Click the "Run" button to detect and recognize text
- Polygon shapes with recognized text will be created on the canvas
Tip
For small images (width and height both less than 1500 pixels), the model automatically upscales the image by 2x for better detection accuracy. You can adjust this threshold via spotting_upscale_threshold.
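The upscaling rule described in this tip can be sketched as follows. The function names are illustrative. The strict less-than comparison follows the wording "both less than 1500 pixels" and may not match the implementation at the exact boundary.

```python
def spotting_scale(width, height, threshold=1500, factor=2):
    """Return the resize factor: `factor` when both sides are under the threshold, else 1."""
    if width < threshold and height < threshold:
        return factor
    return 1


def upscaled_size(width, height, threshold=1500):
    """Return the (width, height) passed on to spotting after the upscaling rule."""
    scale = spotting_scale(width, height, threshold)
    return width * scale, height * scale


if __name__ == "__main__":
    for size in ((800, 600), (1600, 600), (1499, 1499)):
        print(size, "->", upscaled_size(*size))
```

Lowering the threshold means fewer images are doubled in size, which trades some detection accuracy on small text for faster inference.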
The Seal Recognition task recognizes text from seal stamps and chop marks.
Usage:
- Select the "Seal Recognition" task from the task dropdown
- Click the "Run" button to recognize seal text
- The recognized text will be displayed in the description field
Tip
All tasks support batch processing. You can run inference on the entire dataset with a single click using Ctrl+M or the batch processing feature in X-AnyLabeling.
- PP-DocLayoutV3: Document layout analysis model that works well with PaddleOCR-VL-1.5