feat!: improve speed of pages/sec and scalability and decoupling compute with materialization in pydantic objects #278

Open
PeterStaar-IBM wants to merge 8 commits into main from feat/performance-optimizations


PeterStaar-IBM (Member) commented May 29, 2026

The original tests showed poor thread scaling when rendering, caused by incorrect use of the Blend2D library; parsing on its own already scaled well.

Original code with parse only

./run_scaling.exe /Users/taa/Documents/projects/_data/bo767/pdf --threads=1,4,8 --mode=parse --keep-char-cells=true

  Benchmark: 767 documents, 54730 total pages
  Mode: parse
  Thread counts to test: 1,4,8
  Max concurrent results: 64

  Decode config:
  parameter                                       value
  ----------------------------------------------------------------
  page_boundary                                   crop_box
  do_sanitization                                 true
  keep_char_cells                                 true
  keep_shapes                                     false
  keep_bitmaps                                    false
  max_num_lines                                   -1
  max_num_bitmaps                                 -1
  create_word_cells                               false
  create_line_cells                               false
  enforce_same_font                               true
  horizontal_cell_tolerance                       1
  word_space_width_factor_for_merge               0.33
  line_space_width_factor_for_merge               1
  line_space_width_factor_for_merge_with_space    0.33
  populate_json_objects                           false
  release_native_memory_every_n_pages             0
  keep_glyphs                                     false
  keep_qpdf_warnings                              false
  materialize_bitmap_bytes                        false

  === PARSE (decode only) ===
  backend              threads     wall_time (s)    vs threaded(1)     pages/sec     ms/page    errors
  ----------------------------------------------------------------------------------------------------
  docling threaded           1           683.076             1.00x          80.1       12.48         0
  docling threaded           4           183.119             3.73x         298.9        3.35         0
  docling threaded           8            95.967             7.12x         570.3        1.75         0
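
The derived columns in the table above follow directly from the wall times and the total page count. A minimal sketch of that arithmetic, assuming the single-thread run is the baseline; the function name `scaling_rows` and the extra `efficiency` column are illustrative additions, not part of `run_scaling.exe`:

```python
# Values copied from the parse-only table above.
TOTAL_PAGES = 54730
wall_time_s = {1: 683.076, 4: 183.119, 8: 95.967}


def scaling_rows(wall_times, total_pages):
    """Derive speedup, efficiency, pages/sec and ms/page per thread count.

    Assumes the smallest thread count in `wall_times` is 1 and uses it
    as the baseline (the "vs threaded(1)" column).
    """
    base = wall_times[min(wall_times)]
    rows = []
    for threads in sorted(wall_times):
        t = wall_times[threads]
        speedup = base / t
        rows.append({
            "threads": threads,
            "speedup": speedup,
            # Fraction of ideal linear scaling that was achieved.
            "efficiency": speedup / threads,
            "pages_per_sec": total_pages / t,
            "ms_per_page": 1000.0 * t / total_pages,
        })
    return rows


for row in scaling_rows(wall_time_s, TOTAL_PAGES):
    print(
        f"{row['threads']:>2} threads: {row['speedup']:.2f}x, "
        f"efficiency {row['efficiency']:.2f}, "
        f"{row['pages_per_sec']:.1f} pages/sec, {row['ms_per_page']:.2f} ms/page"
    )
```

Parallel efficiency (speedup divided by thread count) stays at roughly 0.89 for 8 threads in this parse-only run, which is the "good scaling" reference point for the render numbers.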

Original code with parse and render

taa@Munlochy build % ./run_scaling.exe /Users/taa/Documents/projects/_data/bo767/pdf --threads=1,4,8 --mode=render --keep-char-cells=true

Benchmark: 767 documents, 54730 total pages
Mode: render
Thread counts to test: 1,4,8
Max concurrent results: 64
Render scale: 1

Decode config:
parameter                                       value
----------------------------------------------------------------
page_boundary                                   crop_box
do_sanitization                                 true
keep_char_cells                                 true
keep_shapes                                     false
keep_bitmaps                                    false
max_num_lines                                   -1
max_num_bitmaps                                 -1
create_word_cells                               false
create_line_cells                               false
enforce_same_font                               true
horizontal_cell_tolerance                       1
word_space_width_factor_for_merge               0.33
line_space_width_factor_for_merge               1
line_space_width_factor_for_merge_with_space    0.33
populate_json_objects                           false
release_native_memory_every_n_pages             0
keep_glyphs                                     false
keep_qpdf_warnings                              false
materialize_bitmap_bytes                        false

Render config:
parameter                       value
--------------------------------------------
render_text                     1
draw_text_bbox                  0
draw_text_basepoint             0
fit_glyph_bbox_to_target        0
resolve_fonts                   1
font_similarity_cutoff          0.75
scale                           1
canvas_width                    -1
canvas_height                   -1

backend              threads     wall_time (s)    vs threaded(1)     pages/sec     ms/page    errors
----------------------------------------------------------------------------------------------------
docling threaded           1          1370.575             1.00x          39.9       25.04         0
docling threaded           4           915.237             1.50x          59.8       16.72         0
docling threaded           8          1085.766             1.26x          50.4       19.84         0
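
One way to quantify how poorly the original render path scaled is the Karp-Flatt metric, which estimates the experimentally determined serial fraction from a measured speedup. This is a generic diagnostic applied here for illustration, not something the benchmark tool reports; the function name `karp_flatt` is an assumption of this sketch:

```python
# Wall times (seconds) copied from the original parse-and-render table above.
render_wall_s = {1: 1370.575, 4: 915.237, 8: 1085.766}


def karp_flatt(speedup, threads):
    """Experimentally determined serial fraction e for a measured speedup.

    e = (1/S - 1/p) / (1 - 1/p), defined for p >= 2 threads.
    Values near 0 mean near-linear scaling; values near 1 mean the
    workload behaves as if it were almost entirely serial.
    """
    if threads < 2:
        raise ValueError("Karp-Flatt metric needs at least 2 threads")
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


base = render_wall_s[1]
for threads in (4, 8):
    speedup = base / render_wall_s[threads]
    e = karp_flatt(speedup, threads)
    print(f"{threads} threads: speedup {speedup:.2f}x, serial fraction ~{e:.2f}")
```

For these numbers the estimated serial fraction is above 0.5 at 4 threads and rises further at 8, which is what contention on a shared resource typically looks like, in line with the Blend2D usage problem described at the top.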

New code with parse and render

./run_scaling.exe /Users/taa/Documents/projects/_data/bo767/pdf --threads=1,2,4,8,12,16 --mode=render --keep-char-cells=true --create-word-cells=true --create-line-cells=true --keep-shapes=true --keep-bitmaps=true

Benchmark: 767 documents, 54730 total pages
Mode: render
Thread counts to test: 1,2,4,8,12,16
Max concurrent results: 64
Render scale: 1

Decode config:
parameter                                       value
----------------------------------------------------------------
page_boundary                                   crop_box
do_sanitization                                 true
keep_char_cells                                 true
keep_shapes                                     true
keep_bitmaps                                    true
max_num_lines                                   -1
max_num_bitmaps                                 -1
create_word_cells                               true
create_line_cells                               true
enforce_same_font                               true
horizontal_cell_tolerance                       1
word_space_width_factor_for_merge               0.33
line_space_width_factor_for_merge               1
line_space_width_factor_for_merge_with_space    0.33
populate_json_objects                           false
release_native_memory_every_n_pages             0
keep_glyphs                                     false
keep_qpdf_warnings                              false
materialize_bitmap_bytes                        false

Render config:
parameter                       value
--------------------------------------------
render_text                     1
draw_text_bbox                  0
draw_text_basepoint             0
fit_glyph_bbox_to_target        0
resolve_fonts                   1
font_similarity_cutoff          0.75
scale                           1
canvas_width                    -1
canvas_height                   -1

backend              threads     wall_time (s)    vs threaded(1)     pages/sec     ms/page    errors
----------------------------------------------------------------------------------------------------
docling threaded           1          1862.194             1.00x          29.4       34.03         0
docling threaded           2           951.721             1.96x          57.5       17.39         0
docling threaded           4           498.223             3.74x         109.9        9.10         0
docling threaded           8           269.480             6.91x         203.1        4.92         0
docling threaded          12           196.294             9.49x         278.8        3.59         0
docling threaded          16           189.389             9.83x         289.0        3.46         0

Same C++ code, but running under Python

uv run python ./perf/run_scaling.py \
    --threads "1,2,4,8,12,16" \
    --mode=render \
    --keep-char-cells=true \
    --create-word-cells=true \
    --create-line-cells=true \
    --keep-shapes=true \
    --keep-bitmaps=true
    

backend           threads      wall_time (s)  vs threaded(1)    vs pypdfium2 (1t)      pages/sec    ms/page
----------------  ---------  ---------------  ----------------  -------------------  -----------  ---------
pypdfium2 (1t)    -                  1085.07  1.78x             1.00x                       50.3      19.88
docling threaded  1                  1930.87  1.00x             0.56x                       28.3      35.37
docling threaded  2                  1233.64  1.57x             0.88x                       44.2      22.6
docling threaded  4                  1107.07  1.74x             0.98x                       49.3      20.28
docling threaded  8                  1024.17  1.89x             1.06x                       53.3      18.76
docling threaded  12                 1005.44  1.92x             1.08x                       54.3      18.42
docling threaded  16                  995.58  1.94x             1.09x                       54.8      18.24  
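
Comparing the C++ and Python render tables above by thread count makes the gap explicit. A small sketch, assuming both runs used the same 767-document dataset (the Python command does not show the input path); the variable and function names are illustrative:

```python
# Wall times in seconds, copied from the two render tables above.
cpp_wall_s = {1: 1862.194, 2: 951.721, 4: 498.223, 8: 269.480, 12: 196.294, 16: 189.389}
py_wall_s = {1: 1930.87, 2: 1233.64, 4: 1107.07, 8: 1024.17, 12: 1005.44, 16: 995.58}


def slowdown_by_threads(native, wrapped):
    """Ratio wrapped/native wall time for every thread count present in both."""
    return {n: wrapped[n] / native[n] for n in sorted(native) if n in wrapped}


ratios = slowdown_by_threads(cpp_wall_s, py_wall_s)
for n, ratio in ratios.items():
    print(f"{n:>2} threads: Python run takes {ratio:.2f}x the C++ wall time")
```

The ratio is close to 1 at a single thread and grows to more than 5 at 16 threads, which is consistent with a step on the Python side that does not parallelize with the C++ workers.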

After refactoring the Python and C++ code to separate the computation from the materialization/serialization of the page, we have:

taa@Munlochy docling-parse % uv run python ./perf/run_scaling.py \
    --threads "1,4,8,12" \
    --mode=render \
    --keep-char-cells=true \
    --create-word-cells=true \
    --create-line-cells=true \
    --keep-shapes=true \
    --keep-bitmaps=true \
    --materialize-char-cells=false \
    --materialize-word-cells=false \
    --materialize-line-cells=true \
    --materialize-shapes=false \
    --materialize-bitmaps=true --other ""

Mode: render
Thread counts to test: [1, 4, 8, 12]
Max concurrent results: 64
Other backends: (none)
Render scale: 1.0

Decode config:
parameter                                     value
--------------------------------------------  --------
page_boundary                                 crop_box
do_sanitization                               True
keep_char_cells                               True
keep_shapes                                   True
keep_bitmaps                                  True
max_num_lines                                 -1
max_num_bitmaps                               -1
create_word_cells                             True
create_line_cells                             True
enforce_same_font                             True
horizontal_cell_tolerance                     1.0
word_space_width_factor_for_merge             0.33
line_space_width_factor_for_merge             1.0
line_space_width_factor_for_merge_with_space  0.33
do_thread_safe                                True
release_native_memory_every_n_pages           0
keep_glyphs                                   False
keep_qpdf_warnings                            False
materialize_bitmap_bytes                      False

Materialization config:
parameter                 value
------------------------  -------
materialize_char_cells    False
materialize_word_cells    False
materialize_line_cells    True
materialize_shapes        False
materialize_bitmaps       True
materialize_bitmap_bytes  False

Render config:
parameter                   value
------------------------  -------
render_text                  1
draw_text_bbox               0
draw_text_basepoint          0
fit_glyph_bbox_to_target     0
resolve_fonts                1
font_similarity_cutoff       0.75
scale                        1
canvas_width                -1
canvas_height               -1

backend             threads    wall_time (s)  vs threaded(1)      pages/sec    ms/page
----------------  ---------  ---------------  ----------------  -----------  ---------
docling threaded          1         1912.81   1.00x                    28.5      35.04
docling threaded          4          499.067  3.83x                   109.4       9.14
docling threaded          8          271.89   7.04x                   200.8       4.98
docling threaded         12          202.108  9.46x                   270.1       3.7  
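
The idea of separating computation from materialization can be illustrated with a small stand-alone sketch. Everything below is hypothetical: `RawPage`, `MaterializationConfig` and `Cell` are made-up names, plain dataclasses stand in for the pydantic objects, and this is not the docling-parse API. It only shows how per-kind flags like the `--materialize-*` options above can skip building Python objects the caller never asked for:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MaterializationConfig:
    """Hypothetical mirror of the --materialize-* flags (subset)."""
    char_cells: bool = False
    word_cells: bool = False
    line_cells: bool = True


@dataclass
class Cell:
    """Stand-in for a per-cell Python object (a pydantic model in the PR)."""
    text: str


class RawPage:
    """Stand-in for a computed page result that holds only compact data."""

    def __init__(self, chars: List[str], words: List[str], lines: List[str]):
        self._raw: Dict[str, List[str]] = {
            "char_cells": chars,
            "word_cells": words,
            "line_cells": lines,
        }

    def materialize(self, config: MaterializationConfig) -> Dict[str, List[Cell]]:
        """Build Python objects only for the kinds enabled in `config`."""
        out: Dict[str, List[Cell]] = {}
        for kind, texts in self._raw.items():
            if getattr(config, kind):
                out[kind] = [Cell(text=t) for t in texts]
        return out


# Computation (here: just constructing RawPage) is independent of how much
# of the result is later turned into Python objects.
page = RawPage(chars=list("ab cd"), words=["ab", "cd"], lines=["ab cd"])
result = page.materialize(MaterializationConfig())
print({kind: len(cells) for kind, cells in result.items()})
```

With the defaults in this sketch only line cells become Python objects; the char and word data stay in their compact form until a caller enables them.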

After refactoring the font caching, and with word-cell materialization enabled:

docling-parse % uv run python ./perf/run_scaling.py \
    --threads "1,4,8,12" \
    --mode=render \
    --keep-char-cells=true \
    --create-word-cells=true \
    --create-line-cells=true \
    --keep-shapes=true \
    --keep-bitmaps=true \
    --materialize-char-cells=false \
    --materialize-word-cells=true \
    --materialize-line-cells=true \
    --materialize-shapes=false \
    --materialize-bitmaps=true --other "pypdfium2"

Mode: render
Thread counts to test: [1, 4, 8, 12]
Max concurrent results: 64
Other backends: ['pypdfium2']
Render scale: 1.0

Decode config:
parameter                                     value
--------------------------------------------  --------
page_boundary                                 crop_box
do_sanitization                               True
keep_char_cells                               True
keep_shapes                                   True
keep_bitmaps                                  True
max_num_lines                                 -1
max_num_bitmaps                               -1
create_word_cells                             True
create_line_cells                             True
enforce_same_font                             True
horizontal_cell_tolerance                     1.0
word_space_width_factor_for_merge             0.33
line_space_width_factor_for_merge             1.0
line_space_width_factor_for_merge_with_space  0.33
do_thread_safe                                True
release_native_memory_every_n_pages           0
keep_glyphs                                   False
keep_qpdf_warnings                            False

Materialization config:
parameter                 value
------------------------  -------
materialize_char_cells    False
materialize_word_cells    True
materialize_line_cells    True
materialize_shapes        False
materialize_bitmaps       True
materialize_bitmap_bytes  False

Render config:
parameter                   value
------------------------  -------
render_text                  1
draw_text_bbox               0
draw_text_basepoint          0
fit_glyph_bbox_to_target     0
resolve_fonts                1
font_similarity_cutoff       0.75
scale                        1
canvas_width                -1
canvas_height               -1

backend           threads      wall_time (s)  vs threaded(1)    vs pypdfium2 (1t)      pages/sec    ms/page
----------------  ---------  ---------------  ----------------  -------------------  -----------  ---------
pypdfium2 (1t)    -                 1034.8    1.74x             1.00x                       52.7      18.96
docling threaded  1                 1799.07   1.00x             0.58x                       30.3      32.96
docling threaded  4                  483.095  3.72x             2.14x                      113         8.85
docling threaded  8                  265.482  6.78x             3.90x                      205.6       4.86
docling threaded  12                 215.234  8.36x             4.81x                      253.6       3.94

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
mergify Bot (Contributor) commented May 29, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

github-actions Bot (Contributor) commented May 29, 2026

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
PeterStaar-IBM requested a review from cau-git May 29, 2026 13:23
PeterStaar-IBM changed the title from "feat: improve speed of pages/sec and scalability" to "feat!: improve speed of pages/sec and scalability and decoupling compute with materialization in pydantic objects" Jun 1, 2026