
gh-146393: Optimize float division operations by mutating uniquely-referenced operands in place (JIT only)#146397

Open
eendebakpt wants to merge 9 commits into python:main from eendebakpt:jit_float_truediv

Conversation

@eendebakpt
Contributor

@eendebakpt eendebakpt commented Mar 24, 2026

We optimize float divisions for the case where one of the operands is a unique reference. This is similar to #146307, but with a guard for division by zero.

  • We do not add opcodes in tier 1.
  • For tier 2 we can specialize both for the case where one of the operands is a unique reference and for the case where there are no unique references. The case _BINARY_OP_TRUEDIV_FLOAT, where there are no unique references (or we lack information about uniqueness), brings no performance improvement by itself, but it propagates types better. This opcode has guards, so the type is propagated even with input from locals.
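
As originally proposed, the tier 2 trace for an expression like `(a + b) / c` would look roughly like the sketch below (uop names taken from this PR; the exact trace shape is illustrative, not a dump from the optimizer):

```
_BINARY_OP_ADD_FLOAT               # a + b -> uniquely referenced float (NOS)
_GUARD_TOS_FLOAT                   # speculative guard for c, not yet known to be float
_BINARY_OP_TRUEDIV_FLOAT_INPLACE   # reuse the unique temporary for the quotient
```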

Micro-benchmarks (min of 3 runs, 2M iterations)

Update: this benchmark is no longer valid (see the updated numbers later in the thread).

| Pattern | main (ns/iter) | branch (ns/iter) | Speedup | Notes |
|---|---|---|---|---|
| `(a+b) * c` | 10.8 | 10.9 | -- | baseline (multiply, already optimized) |
| `(a+b) + (c+d)` | 18.0 | 18.1 | -- | baseline (add, already optimized) |
| `a / b` | 20.6 | 10.8 | 1.9x | speculative guards + truediv specialization |
| `(a+b) / c` | 26.4 | 11.0 | 2.4x | inplace LHS, guard inserted for c |
| `(2.0+x) / y` | 25.1 | 10.9 | 2.3x | inplace LHS, guard inserted for y |
| `c / (a+b)` | 26.0 | 11.2 | 2.3x | inplace RHS, guard inserted for c |
| `(a/b) / (c/d)` | 41.3 | 19.1 | 2.2x | speculative guards enable inplace chain |
| `(a/b) + (c/d)` | 29.1 | 19.0 | 1.5x | speculative guards enable inplace add |

All patterns are `total += <expr>` in a tight loop.

Benchmark script

```python
"""Benchmark for float true division tier 2 specialization.

Usage:
    ./python bench_truediv.py
"""
import timeit

N = 2_000_000
INNER = 1000


def bench(label, fn):
    iters = N // INNER
    times = [timeit.timeit(fn, number=iters) for _ in range(3)]
    t = min(times)
    print(f"  {label}: {t/N*1e9:.1f} ns/iter")


def f_chain_mul(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += (a + b) * c
    return t


def f_div(n, a, b):
    t = 0.0
    for i in range(n):
        t += a / b
    return t


def f_chain_div(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += (a + b) / c
    return t


def f_2px_div_y(n, x, y):
    t = 0.0
    for i in range(n):
        t += (2.0 + x) / y
    return t


def f_div_rhs(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += c / (a + b)
    return t


def f_ab_div_cd(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a / b) / (c / d)
    return t


def f_ab_add_cd(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a / b) + (c / d)
    return t


def f_add_chain(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a + b) + (c + d)
    return t


# Warmup
f_chain_mul(10000, 2.0, 3.0, 4.0)
f_div(10000, 10.0, 3.0)
f_chain_div(10000, 2.0, 3.0, 4.0)
f_2px_div_y(10000, 3.0, 4.0)
f_div_rhs(10000, 2.0, 3.0, 4.0)
f_ab_div_cd(10000, 10.0, 3.0, 4.0, 5.0)
f_ab_add_cd(10000, 10.0, 3.0, 4.0, 5.0)
f_add_chain(10000, 1.0, 2.0, 3.0, 4.0)

print("Float truediv benchmark (min of 3 runs):")
bench("(a+b) * c              (baseline) ", lambda: f_chain_mul(INNER, 2.0, 3.0, 4.0))
bench("(a+b) + (c+d)          (baseline) ", lambda: f_add_chain(INNER, 1.0, 2.0, 3.0, 4.0))
bench("a / b                  (spec div) ", lambda: f_div(INNER, 10.0, 3.0))
bench("(a+b) / c              (inplace L)", lambda: f_chain_div(INNER, 2.0, 3.0, 4.0))
bench("(2.0+x) / y            (inplace L)", lambda: f_2px_div_y(INNER, 3.0, 4.0))
bench("c / (a+b)              (inplace R)", lambda: f_div_rhs(INNER, 2.0, 3.0, 4.0))
bench("(a/b) / (c/d)          (spec div) ", lambda: f_ab_div_cd(INNER, 10.0, 3.0, 4.0, 5.0))
bench("(a/b) + (c/d)          (spec div) ", lambda: f_ab_add_cd(INNER, 10.0, 3.0, 4.0, 5.0))
```

Analysis

The inplace truediv kicks in when at least one operand is a uniquely-referenced float (e.g. the result of a prior add or multiply). The optimizer emits _BINARY_OP_TRUEDIV_FLOAT_INPLACE or _BINARY_OP_TRUEDIV_FLOAT_INPLACE_RIGHT, saving one PyFloat_FromDouble allocation and deallocation per iteration.
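
The mechanism can be sketched with a toy model (plain Python; `Box` and its explicit `refcnt` field are hypothetical stand-ins for `PyFloatObject` and its reference count, not real CPython API):

```python
class Box:
    """Toy stand-in for a heap-allocated float with an explicit refcount."""
    def __init__(self, val):
        self.refcnt = 1
        self.val = val


def truediv_inplace_left(lhs, rhs):
    """Divide, reusing lhs when it is the only reference to its object.

    Mirrors the idea behind _BINARY_OP_TRUEDIV_FLOAT_INPLACE: a
    uniquely referenced temporary (e.g. the result of a prior add)
    is mutated in place, skipping one allocation + deallocation.
    """
    if lhs.refcnt == 1:
        lhs.val /= rhs            # reuse: no new object allocated
        return lhs
    return Box(lhs.val / rhs)     # shared: allocate a fresh result


t = Box(6.0)                      # unique temporary, e.g. the result of (a + b)
r = truediv_inplace_left(t, 3.0)
assert r is t and r.val == 2.0    # same object was reused

t.refcnt = 2                      # simulate a second reference
r2 = truediv_inplace_left(t, 2.0)
assert r2 is not t and r2.val == 1.0   # a new object was allocated
```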

The optimization works well for several cases. For some (e.g. (a/b) + (c/d)) the performance gain is not due to an inplace division, but to better type propagation that allows the + to be specialized inplace. a / b is also faster because of better type propagation and the += in the test script.

In typical code, intermediate results are often stored in local variables. For these cases it is important to pick up (speculative) type information as early as possible.
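
As a minimal illustration (plain Python, no JIT specifics): the two functions below compute the same value, but in the second the intermediate passes through a local, so the optimizer has to re-establish its type at the division rather than inheriting it from the preceding add:

```python
def inline_form(a, b, c):
    # (a + b) is a stack temporary feeding / directly;
    # its float type (and uniqueness) flows straight into the division.
    return (a + b) / c


def local_form(a, b, c):
    # The intermediate is stored in a local; its type information
    # must be picked up again at the point of the division.
    t = a + b
    return t / c


assert inline_form(2.0, 6.0, 4.0) == local_form(2.0, 6.0, 4.0) == 2.0
```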

…izer

Add inplace float true division ops that the tier 2 optimizer emits
when at least one operand is a known float:

- _BINARY_OP_TRUEDIV_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_TRUEDIV_FLOAT_INPLACE_RIGHT (unique RHS)

The optimizer inserts _GUARD_TOS_FLOAT / _GUARD_NOS_FLOAT for
operands not yet known to be float, enabling specialization in
expressions like `(a + b) / c`.

Also marks the result of all NB_TRUE_DIVIDE operations as unique
float in the abstract interpreter, enabling downstream inplace ops
even for generic `a / b` (the `+=` can reuse the division result).

Speeds up chain division patterns by ~2.3x and simple `total += a/b`
by ~1.5x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eendebakpt and others added 5 commits March 25, 2026 00:01
Operations that always return a new float (true division, float**int,
int**negative_int, mixed int/float arithmetic) now mark their result
as PyJitRef_MakeUnique. This enables downstream operations to mutate
the result in place instead of allocating a new float.

Int results are NOT marked unique because small ints are cached/immortal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only set the result of NB_TRUE_DIVIDE to float when both operands
are known int/float. Types like Fraction and Decimal override
__truediv__ and return non-float results. The unconditional type
propagation caused _POP_TOP_FLOAT to be emitted for Fraction results,
crashing with an assertion failure.

Fixes the segfault in test_math.testRemainder and
test_random.test_binomialvariate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eendebakpt eendebakpt marked this pull request as ready for review March 25, 2026 12:02
@Fidget-Spinner
Member

Fidget-Spinner commented Mar 29, 2026

@eendebakpt we shouldn't speculatively add guards if we don't have history showing the operands are actually floats; otherwise this will cause bad perf for overloaded binary ops.

The fix is to record the binop types at trace recording time and speculate on those. Check out _RECORD_TOS_TYPE, for example. The problem, however, is that we currently only support one recording op per macro. There are two possible solutions:

  1. Support multiple _RECORD functions in a single macro, and update tooling to support it. (my preferred solution)
  2. Have one _RECORD_BINARY_OP record uop, and add another RECORD_VALUE1() macro that is rewritten to write to operand1. This is a little less nice, as it causes more code duplication.

@eendebakpt
Contributor Author

I removed the speculative guards. Updated benchmarks:

| Pattern | main (ns/iter) | branch (ns/iter) | Speedup | Notes |
|---|---|---|---|---|
| `(a+b) * c` | 10.7 | 10.8 | -- | baseline (multiply, already optimized) |
| `(a+b) + (c+d)` | 18.0 | 18.0 | -- | baseline (add, already optimized) |
| `a / b` | 20.5 | 20.8 | -- | generic (needs tier 1 or _RECORD) |
| `(a+b) / (c+d)` | 32.0 | 18.6 | 1.7x | both operands known float, inplace LHS |
| `(a+b)/(c+d)/(e+f)` | 49.8 | 25.4 | 2.0x | chained divisions, all inplace |
| `(c*d) / (a+b)` | 31.7 | 18.5 | 1.7x | both operands known float, inplace LHS |
| `(a+b)/(c+d)-(e+f)/(g+h)` | 53.9 | 31.5 | 1.7x | divisions + inplace subtract |

I will leave improving the _RECORD macros for another PR.

We could remove _BINARY_OP_TRUEDIV_FLOAT itself (leaving just the inplace variants), as its performance gain on its own is limited.
