
gh-146393: Optimize float division operations by mutating uniquely-referenced operands in place (JIT only)#146397

Open
eendebakpt wants to merge 9 commits into python:main from eendebakpt:jit_float_truediv

Conversation

@eendebakpt
Contributor

@eendebakpt eendebakpt commented Mar 24, 2026

We optimize float divisions for the case where one of the operands is a unique reference. This is similar to #146307, but with a guard for division by zero.

  • We do not add opcodes in tier 1.
  • For tier 2 we can specialize both for the case where one of the operands is a unique reference and for the case where there are no unique references. The case _BINARY_OP_TRUEDIV_FLOAT, where there are no unique references (or we lack information about uniqueness), brings no performance improvement by itself, but it propagates types better. This opcode has guards, so the type is propagated even with input from locals.
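
As originally proposed, the tier 2 trace for an expression like `(a + b) / c` would look roughly like the sketch below (uop names taken from this PR; the exact trace shape is illustrative, not a dump from the optimizer):

```
_BINARY_OP_ADD_FLOAT               # a + b -> uniquely referenced float (NOS)
_GUARD_TOS_FLOAT                   # speculative guard for c, not yet known to be float
_BINARY_OP_TRUEDIV_FLOAT_INPLACE   # reuse the unique temporary for the quotient
```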

Micro-benchmarks (min of 3 runs, 2M iterations)

Update: this benchmark is no longer valid (see the updated numbers later in the thread).

| Pattern | main (ns/iter) | branch (ns/iter) | Speedup | Notes |
|---|---|---|---|---|
| `(a+b) * c` | 10.8 | 10.9 | -- | baseline (multiply, already optimized) |
| `(a+b) + (c+d)` | 18.0 | 18.1 | -- | baseline (add, already optimized) |
| `a / b` | 20.6 | 10.8 | 1.9x | speculative guards + truediv specialization |
| `(a+b) / c` | 26.4 | 11.0 | 2.4x | inplace LHS, guard inserted for c |
| `(2.0+x) / y` | 25.1 | 10.9 | 2.3x | inplace LHS, guard inserted for y |
| `c / (a+b)` | 26.0 | 11.2 | 2.3x | inplace RHS, guard inserted for c |
| `(a/b) / (c/d)` | 41.3 | 19.1 | 2.2x | speculative guards enable inplace chain |
| `(a/b) + (c/d)` | 29.1 | 19.0 | 1.5x | speculative guards enable inplace add |

All patterns are `total += <expr>` in a tight loop.

Benchmark script

```python
"""Benchmark for float true division tier 2 specialization.

Usage:
    ./python bench_truediv.py
"""
import timeit

N = 2_000_000
INNER = 1000


def bench(label, fn):
    iters = N // INNER
    times = [timeit.timeit(fn, number=iters) for _ in range(3)]
    t = min(times)
    print(f"  {label}: {t/N*1e9:.1f} ns/iter")


def f_chain_mul(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += (a + b) * c
    return t


def f_div(n, a, b):
    t = 0.0
    for i in range(n):
        t += a / b
    return t


def f_chain_div(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += (a + b) / c
    return t


def f_2px_div_y(n, x, y):
    t = 0.0
    for i in range(n):
        t += (2.0 + x) / y
    return t


def f_div_rhs(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += c / (a + b)
    return t


def f_ab_div_cd(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a / b) / (c / d)
    return t


def f_ab_add_cd(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a / b) + (c / d)
    return t


def f_add_chain(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a + b) + (c + d)
    return t


# Warmup
f_chain_mul(10000, 2.0, 3.0, 4.0)
f_div(10000, 10.0, 3.0)
f_chain_div(10000, 2.0, 3.0, 4.0)
f_2px_div_y(10000, 3.0, 4.0)
f_div_rhs(10000, 2.0, 3.0, 4.0)
f_ab_div_cd(10000, 10.0, 3.0, 4.0, 5.0)
f_ab_add_cd(10000, 10.0, 3.0, 4.0, 5.0)
f_add_chain(10000, 1.0, 2.0, 3.0, 4.0)

print("Float truediv benchmark (min of 3 runs):")
bench("(a+b) * c              (baseline) ", lambda: f_chain_mul(INNER, 2.0, 3.0, 4.0))
bench("(a+b) + (c+d)          (baseline) ", lambda: f_add_chain(INNER, 1.0, 2.0, 3.0, 4.0))
bench("a / b                  (spec div) ", lambda: f_div(INNER, 10.0, 3.0))
bench("(a+b) / c              (inplace L)", lambda: f_chain_div(INNER, 2.0, 3.0, 4.0))
bench("(2.0+x) / y            (inplace L)", lambda: f_2px_div_y(INNER, 3.0, 4.0))
bench("c / (a+b)              (inplace R)", lambda: f_div_rhs(INNER, 2.0, 3.0, 4.0))
bench("(a/b) / (c/d)          (spec div) ", lambda: f_ab_div_cd(INNER, 10.0, 3.0, 4.0, 5.0))
bench("(a/b) + (c/d)          (spec div) ", lambda: f_ab_add_cd(INNER, 10.0, 3.0, 4.0, 5.0))
```

Analysis

The inplace truediv kicks in when at least one operand is a uniquely-referenced float (e.g. the result of a prior add or multiply). The optimizer emits _BINARY_OP_TRUEDIV_FLOAT_INPLACE or _BINARY_OP_TRUEDIV_FLOAT_INPLACE_RIGHT, saving one PyFloat_FromDouble allocation and deallocation per iteration.
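
The mechanism can be sketched with a toy model (plain Python; `Box` and its explicit `refcnt` field are hypothetical stand-ins for `PyFloatObject` and its reference count, not real CPython API):

```python
class Box:
    """Toy stand-in for a heap-allocated float with an explicit refcount."""
    def __init__(self, val):
        self.refcnt = 1
        self.val = val


def truediv_inplace_left(lhs, rhs):
    """Divide, reusing lhs when it is the only reference to its object.

    Mirrors the idea behind _BINARY_OP_TRUEDIV_FLOAT_INPLACE: a
    uniquely referenced temporary (e.g. the result of a prior add)
    is mutated in place, skipping one allocation + deallocation.
    """
    if lhs.refcnt == 1:
        lhs.val /= rhs            # reuse: no new object allocated
        return lhs
    return Box(lhs.val / rhs)     # shared: allocate a fresh result


t = Box(6.0)                      # unique temporary, e.g. the result of (a + b)
r = truediv_inplace_left(t, 3.0)
assert r is t and r.val == 2.0    # same object was reused

t.refcnt = 2                      # simulate a second reference
r2 = truediv_inplace_left(t, 2.0)
assert r2 is not t and r2.val == 1.0   # a new object was allocated
```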

The optimization works well for several cases. For some (e.g. (a/b) + (c/d)) the performance gain is not due to an inplace division, but to better type propagation that allows the + to be specialized inplace. a / b is also faster because of better type propagation and the += in the test script.

In typical code, intermediate results are often stored in local variables. For these cases it is important to pick up (speculative) type information as early as possible.
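
As a minimal illustration (plain Python, no JIT specifics): the two functions below compute the same value, but in the second the intermediate passes through a local, so the optimizer has to re-establish its type at the division rather than inheriting it from the preceding add:

```python
def inline_form(a, b, c):
    # (a + b) is a stack temporary feeding / directly;
    # its float type (and uniqueness) flows straight into the division.
    return (a + b) / c


def local_form(a, b, c):
    # The intermediate is stored in a local; its type information
    # must be picked up again at the point of the division.
    t = a + b
    return t / c


assert inline_form(2.0, 6.0, 4.0) == local_form(2.0, 6.0, 4.0) == 2.0
```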

…izer

Add inplace float true division ops that the tier 2 optimizer emits
when at least one operand is a known float:

- _BINARY_OP_TRUEDIV_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_TRUEDIV_FLOAT_INPLACE_RIGHT (unique RHS)

The optimizer inserts _GUARD_TOS_FLOAT / _GUARD_NOS_FLOAT for
operands not yet known to be float, enabling specialization in
expressions like `(a + b) / c`.

Also marks the result of all NB_TRUE_DIVIDE operations as unique
float in the abstract interpreter, enabling downstream inplace ops
even for generic `a / b` (the `+=` can reuse the division result).

Speeds up chain division patterns by ~2.3x and simple `total += a/b`
by ~1.5x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eendebakpt and others added 5 commits March 25, 2026 00:01
Operations that always return a new float (true division, float**int,
int**negative_int, mixed int/float arithmetic) now mark their result
as PyJitRef_MakeUnique. This enables downstream operations to mutate
the result in place instead of allocating a new float.

Int results are NOT marked unique because small ints are cached/immortal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only set the result of NB_TRUE_DIVIDE to float when both operands
are known int/float. Types like Fraction and Decimal override
__truediv__ and return non-float results. The unconditional type
propagation caused _POP_TOP_FLOAT to be emitted for Fraction results,
crashing with an assertion failure.

Fixes the segfault in test_math.testRemainder and
test_random.test_binomialvariate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eendebakpt eendebakpt marked this pull request as ready for review March 25, 2026 12:02
@Fidget-Spinner
Member

Fidget-Spinner commented Mar 29, 2026

@eendebakpt we shouldn't speculatively add guards if we don't have history showing the operands are actually floats; otherwise this will cause bad perf for overloaded binary ops.

The fix is to record the binop types at trace recording time and speculate on those. Check out _RECORD_TOS_TYPE, for example. The problem, however, is that we currently only support one recording op per macro. There are two possible solutions:

  1. Support multiple _RECORD functions in a single macro, and update tooling to support it. (my preferred solution)
  2. Have one _RECORD_BINARY_OP record uop, and add another RECORD_VALUE1() macro that is rewritten to write to operand1. This is a little less nice, as it causes more code duplication.

@eendebakpt
Contributor Author

I removed the speculative guards. Updated benchmarks:

| Pattern | main (ns/iter) | branch (ns/iter) | Speedup | Notes |
|---|---|---|---|---|
| `(a+b) * c` | 10.7 | 10.8 | -- | baseline (multiply, already optimized) |
| `(a+b) + (c+d)` | 18.0 | 18.0 | -- | baseline (add, already optimized) |
| `a / b` | 20.5 | 20.8 | -- | generic (needs tier 1 or _RECORD) |
| `(a+b) / (c+d)` | 32.0 | 18.6 | 1.7x | both operands known float, inplace LHS |
| `(a+b)/(c+d)/(e+f)` | 49.8 | 25.4 | 2.0x | chained divisions, all inplace |
| `(c*d) / (a+b)` | 31.7 | 18.5 | 1.7x | both operands known float, inplace LHS |
| `(a+b)/(c+d)-(e+f)/(g+h)` | 53.9 | 31.5 | 1.7x | divisions + inplace subtract |

I will leave improving the _RECORD macros for another PR.

We could remove _BINARY_OP_TRUEDIV_FLOAT itself (leaving just the inplace variants), as its performance gain on its own is limited.
