Skip to content

increasing precision tolerance#3060

Open
francesco-bertolotti wants to merge 1 commit into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance
Open

increasing precision tolerance#3060
francesco-bertolotti wants to merge 1 commit into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance

Conversation

@francesco-bertolotti
Copy link
Copy Markdown
Contributor

Here’s a cleaner Markdown rewrite of your PR description:


Description

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

On Ampere, this divergence can reach the current tolerance threshold in rare cases, producing a spurious failure. In one observed instance, a single element out of 4096 showed an absolute difference of ~0.0107, which narrowly exceeds the existing 1e-2 tolerance.

This change slightly relaxes the fp16 tolerance for the affected configuration to make the test robust across architectures. No runtime or library code is modified.

Fixes: N/A (spurious test_kv_cache failure on sm80 / fp16 / head_dim=256)


Type of change

  • Documentation change
  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change
  • Infra/Build change
  • Code refactoring

Checklist

  • I have read and followed the contributing guidelines (https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (N/A — test-only change)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works (N/A — test-only tolerance adjustment)
  • New and existing unit tests pass locally with my changes

Notes

  • Contributing guidelines and full test runs still need to be verified before final submission.
  • bfloat16 tolerance in this branch is unchanged, as it was not failing.
  • The Pyright import warnings are unrelated noise from missing stubs in the local environment.
  • Potential reviewer feedback: this could be further refined to apply the tolerance only conditionally by compute capability (e.g., sm < 90), rather than broadly for the branch.

Diff summary

# tests/pytorch/attention/test_kv_cache.py (get_tols)
- torch.half: (1e-2, 1e-2),
+ torch.half: (1.5e-2, 1.5e-2),

# plus explanatory comment added

@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 29, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR widens the fp16 numerical tolerance in test_kv_cache from 1e-2 to 1.5e-2 for the TransformerLayer path when head_dim > 128 and the backend is not UnfusedAttention. The change is narrowly scoped and accompanied by a clear comment explaining that sm80 + older cuDNN can produce a single-element divergence of ~0.0107 due to different kernels and masking strategies between the full-sequence and incremental KV-cache paths.

  • The new tolerance (1.5e-2) stays below the existing UnfusedAttention ceiling (1.6e-2) for the same dtype/config, keeping the hierarchy coherent.
  • bfloat16 tolerances are intentionally left unchanged, consistent with the PR description stating they were not observed to fail.

Confidence Score: 5/5

Safe to merge — the change is limited to a test tolerance value with no effect on library or runtime code.

The only modification is a slight relaxation of a numerical tolerance in a test helper, accompanied by a well-reasoned comment. The new value stays below the existing tolerance for the analogous UnfusedAttention path, the narrowly scoped condition (TransformerLayer + head_dim > 128 + fused/flash backend + fp16) is correct, and bfloat16 is intentionally left untouched.

No files require special attention.

Important Files Changed

Filename Overview
tests/pytorch/attention/test_kv_cache.py fp16 tolerance in get_tols widened from 1e-2 to 1.5e-2 for TransformerLayer + head_dim > 128 + non-UnfusedAttention path; adds explanatory comment. No runtime code changed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_tols called] --> B{module == TransformerLayer?}
    B -- No --> C[DotProductAttention tolerances\ntorch.half: 1e-3, 1e-3]
    B -- Yes --> D{head_dim_qk <= 128?}
    D -- Yes --> E[torch.half: 5e-3, 5e-3\ntorch.bfloat16: 3.5e-2, 3.5e-2]
    D -- No --> F{backend == UnfusedAttention?}
    F -- Yes --> G[torch.half: 1.6e-2, 1.6e-2\ntorch.bfloat16: 1.2e-1, 1e-1]
    F -- No --> H["torch.half: 1.5e-2, 1.5e-2 ← CHANGED\ntorch.bfloat16: 8e-2, 7e-2 (unchanged)"]
    style H fill:#fffbe6,stroke:#f0ad4e
Loading

Reviews (2): Last reviewed commit: "increasing precision tolerance" | Re-trigger Greptile

@cyanguwa cyanguwa self-requested a review May 29, 2026 19:58
@cyanguwa
Copy link
Copy Markdown
Collaborator

/te-ci pytorch L0

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants