[auto_docstring] needs to be only run on __doc__ #45056

Open
ArthurZucker wants to merge 3 commits into main from fix-auto-doc

Conversation

@ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Mar 27, 2026

What does this PR do?

This took a while because I wanted to check benchmarks.
It's not a huge win, but a win is a win.

@ArthurZucker ArthurZucker marked this pull request as ready for review March 27, 2026 11:36
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker marked this pull request as draft March 27, 2026 13:09
@ArthurZucker
Collaborator Author

Benchmark Update 4 — Decoration speedup (warm process, without PyTorch)

Setup: same Python process, all imports and caches already warm (inspect signature cache, regex, auto-module). Both branches measured in the same process using explicit sys.path injection to bypass the editable install. 50 rounds × 3 real config classes.
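A minimal sketch of the kind of warm-process harness described above (the function names and structure here are my own illustration, not the actual benchmark script from this PR):

```python
import time

def best_of(fn, rounds=50):
    """Run fn over several warm rounds and keep the minimum wall time,
    which filters out scheduler and GC noise."""
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def per_class_us(decorate_all, n_classes, rounds=50):
    # Per-class decoration cost in microseconds: best total round time
    # divided by the number of classes decorated per round.
    return best_of(decorate_all, rounds) / n_classes * 1e6
```

Taking the minimum over 50 rounds is what makes the sub-microsecond branch numbers below resolvable in a warm process.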


Decoration cost per class

| `@auto_docstring` call | cost | what it does |
|---|---|---|
| branch | ~0.35 µs / class | stores a `_LazyDocClass` closure |
| main | ~1 106 µs / class | generates the full docstring eagerly |
| ratio | ~3 160× | |

```
branch: 0.001 ms / 3 classes  =  0.35 µs/class   ← just stores a closure
main:   3.317 ms / 3 classes  = 1106 µs/class    ← full generation happens here
```

Cached cls.__doc__ access after generation: ~60 ns/class on both (identical).
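For illustration, deferring generation to the first `__doc__` access can be sketched with a metaclass property. `generate_docstring`, `_LazyDocMeta`, and `lazy_docstring` are hypothetical stand-ins, not the PR's actual `_LazyDocClass` implementation:

```python
def generate_docstring(cls):
    # Stand-in for the expensive eager path (signature inspection, regex, ...).
    return f"Auto-generated docs for {cls.__name__}."

class _LazyDocMeta(type):
    @property
    def __doc__(cls):
        cached = cls.__dict__.get("_cached_doc")
        if cached is None:
            cached = generate_docstring(cls)  # the ~1 ms cost, paid once
            cls._cached_doc = cached          # later reads are ~ns
        return cached

def lazy_docstring(cls):
    # Decoration itself only rebuilds the class under the metaclass;
    # nothing expensive runs here (the cheap per-class path).
    ns = {k: v for k, v in cls.__dict__.items()
          if k not in ("__dict__", "__weakref__")}
    return _LazyDocMeta(cls.__name__, cls.__bases__, ns)
```

Reading `SomeClass.__doc__` then generates once and serves the cached string afterwards, matching the generate-once-then-cheap-reads pattern in the numbers above.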


What this means for inference / training

| | main | branch |
|---|---|---|
| `from transformers import LlamaConfig` | pays ~1 ms to generate the doc immediately | pays ~0.35 µs to store a closure |
| `model.forward(inputs)` | `__doc__` never touched | `__doc__` never touched |
| `LlamaConfig.__doc__` (explicit access) | ~0 ns (already generated) | ~1 ms (generated once, then cached) |
| `LlamaConfig.__doc__` again | ~60 ns | ~60 ns |

Inference and training never read __doc__. On main, each from transformers import Xxx pays ~1 ms to generate the docstring whether or not it is ever used. On branch, that cost is deferred and only paid if .__doc__ is explicitly accessed.


Why this does not show up in cold-process import benchmarks

The ~1 ms generation cost is negligible compared to Python startup (~200 ms) + transformers package init (~600 ms) + optional PyTorch import (~1 500 ms). The cold-process noise floor is ~50 ms, so a ~1–5 ms per-class saving is invisible there. The benefit accumulates across all decorated classes but is swamped by startup variance in single-class measurements.

@ArthurZucker ArthurZucker marked this pull request as ready for review March 27, 2026 13:47
