increasing precision tolerance by francesco-bertolotti · Pull Request #3060 · NVIDIA/TransformerEngine

francesco-bertolotti · 2026-05-29T17:22:11Z

Here’s a cleaner Markdown rewrite of your PR description:

Description

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

On Ampere, this divergence can reach the current tolerance threshold in rare cases, producing a spurious failure. In one observed instance, a single element out of 4096 showed an absolute difference of ~0.0107, which narrowly exceeds the existing 1e-2 tolerance.

This change slightly relaxes the fp16 tolerance for the affected configuration to make the test robust across architectures. No runtime or library code is modified.

Fixes: N/A (spurious test_kv_cache failure on sm80 / fp16 / head_dim=256)

Type of change

Checklist

I have read and followed the contributing guidelines (https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (N/A — test-only change)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works (N/A — test-only tolerance adjustment)
New and existing unit tests pass locally with my changes

Notes

Contributing guidelines and full test runs still need to be verified before final submission.
bfloat16 tolerance in this branch is unchanged, as it was not failing.
The Pyright import warnings are unrelated noise from missing stubs in the local environment.
Potential reviewer feedback: this could be further refined to apply the tolerance only conditionally by compute capability (e.g., sm < 90), rather than broadly for the branch.

Diff summary

# tests/pytorch/attention/test_kv_cache.py (get_tols)
- torch.half: (1e-2, 1e-2),
+ torch.half: (1.5e-2, 1.5e-2),

# plus explanatory comment added

greptile-apps · 2026-05-29T17:23:39Z

Greptile Summary

This PR widens the fp16 numerical tolerance in test_kv_cache from 1e-2 to 1.5e-2 for the TransformerLayer path when head_dim > 128 and the backend is not UnfusedAttention. The change is narrowly scoped and accompanied by a clear comment explaining that sm80 + older cuDNN can produce a single-element divergence of ~0.0107 due to different kernels and masking strategies between the full-sequence and incremental KV-cache paths.

The new tolerance (1.5e-2) stays below the existing UnfusedAttention ceiling (1.6e-2) for the same dtype/config, keeping the hierarchy coherent.
bfloat16 tolerances are intentionally left unchanged, consistent with the PR description stating they were not observed to fail.

Confidence Score: 5/5

Safe to merge — the change is limited to a test tolerance value with no effect on library or runtime code.

The only modification is a slight relaxation of a numerical tolerance in a test helper, accompanied by a well-reasoned comment. The new value stays below the existing tolerance for the analogous UnfusedAttention path, the narrowly scoped condition (TransformerLayer + head_dim > 128 + fused/flash backend + fp16) is correct, and bfloat16 is intentionally left untouched.

No files require special attention.

Important Files Changed

Filename	Overview
tests/pytorch/attention/test_kv_cache.py	fp16 tolerance in get_tols widened from 1e-2 to 1.5e-2 for TransformerLayer + head_dim > 128 + non-UnfusedAttention path; adds explanatory comment. No runtime code changed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_tols called] --> B{module == TransformerLayer?}
    B -- No --> C[DotProductAttention tolerances\ntorch.half: 1e-3, 1e-3]
    B -- Yes --> D{head_dim_qk <= 128?}
    D -- Yes --> E[torch.half: 5e-3, 5e-3\ntorch.bfloat16: 3.5e-2, 3.5e-2]
    D -- No --> F{backend == UnfusedAttention?}
    F -- Yes --> G[torch.half: 1.6e-2, 1.6e-2\ntorch.bfloat16: 1.2e-1, 1e-1]
    F -- No --> H["torch.half: 1.5e-2, 1.5e-2 ← CHANGED\ntorch.bfloat16: 8e-2, 7e-2 (unchanged)"]
    style H fill:#fffbe6,stroke:#f0ad4e

_{Reviews (2): Last reviewed commit: "increasing precision tolerance" | Re-trigger Greptile}

cyanguwa · 2026-05-29T19:59:15Z

/te-ci pytorch L0

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 29, 2026

cyanguwa self-requested a review May 29, 2026 19:58

increasing precision tolerance

2c76593

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>

francesco-bertolotti force-pushed the f14-increased-tolerance branch from 3695fe4 to 2c76593 Compare May 30, 2026 10:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

increasing precision tolerance#3060

increasing precision tolerance#3060
francesco-bertolotti wants to merge 1 commit into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance

francesco-bertolotti commented May 29, 2026

Uh oh!

greptile-apps Bot commented May 29, 2026 •

edited

Loading

Uh oh!

cyanguwa commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

francesco-bertolotti commented May 29, 2026

Description

Type of change

Checklist

Notes

Diff summary

Uh oh!

greptile-apps Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

cyanguwa commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 29, 2026 •

edited

Loading