TEP-0167: Actionable Failure Diagnostics for TaskRuns #1266

Open
waveywaves wants to merge 1 commit into tektoncd:main from waveywaves:tep-0167-actionable-failure-diagnostics
@waveywaves
Member

Summary

Add structured failure classification and diagnostic context to TaskRun status, replacing the generic reason: Failed with specific machine-readable reasons and a new failureInfo field.

Today, every non-timeout TaskRun failure produces reason: Failed regardless of root cause: OOM, pod eviction, init container crash, CRI-O timeout, and non-zero exit code all look the same. Users must run kubectl describe pod and manually correlate container statuses. Containers that never started (kubelet-to-CRI timeout) produce no error message at all.

Key Design Points

  • 9 classified failure reasons: StepOOM, StepFailed, SidecarOOM, SidecarFailed, InitContainerOOM, InitContainerFailed, PodEvicted, ContainerCreationFailed, and Failed (fallthrough)
  • Priority-ordered classification via separate iteration passes (sidecar OOM surfaces before step OOM, as it's likely the root cause)
  • Waiting state handling: Containers that never started (CRI-O timeout) are now detected and classified as ContainerCreationFailed
  • failureInfo status field: Structured diagnostic context including the failing container, exit code, node conditions, pod events, and a human-readable suggestion
  • finally task access: $(tasks.<name>.failureInfo.reason) variable interpolation for conditional diagnostic logic in finally tasks
  • Zero overhead on the success path — failureInfo only populated on failure
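
The priority-ordered classification described above could look roughly like the following. This is a minimal sketch, not the code in the implementation PR: the `ContainerState`, `PodState`, and `FailureInfo` types are hypothetical, trimmed-down stand-ins for the real Kubernetes and Tekton types, and the only ordering taken from this description is that sidecar OOM is checked before step OOM. The position of every other pass is an assumption.

```go
package main

import "fmt"

// ContainerKind distinguishes the three container roles in a TaskRun pod.
type ContainerKind int

const (
	InitContainer ContainerKind = iota
	Sidecar
	Step
)

// ContainerState is a hypothetical, simplified view of a container status.
type ContainerState struct {
	Name       string
	Kind       ContainerKind
	Terminated bool // the container ran and exited
	ExitCode   int32
	OOMKilled  bool // the container was terminated for exceeding its memory limit
	Waiting    bool // the container never left the Waiting state
}

// PodState is a hypothetical summary of a failed pod.
type PodState struct {
	Evicted    bool
	Containers []ContainerState
}

// FailureInfo sketches the proposed status field; the field names are assumed.
type FailureInfo struct {
	Reason    string
	Container string
	ExitCode  int32
}

// pass pairs a failure reason with the predicate that detects it.
type pass struct {
	reason string
	match  func(ContainerState) bool
}

// terminatedWith matches terminated containers of one kind, either
// OOM-killed (oom == true) or exited non-zero for another cause.
func terminatedWith(kind ContainerKind, oom bool) func(ContainerState) bool {
	return func(c ContainerState) bool {
		if !c.Terminated || c.Kind != kind {
			return false
		}
		if oom {
			return c.OOMKilled
		}
		return !c.OOMKilled && c.ExitCode != 0
	}
}

// passes is ordered by priority. Only "SidecarOOM before StepOOM" comes
// from the PR description; the rest of the order is an assumption.
var passes = []pass{
	{"InitContainerOOM", terminatedWith(InitContainer, true)},
	{"InitContainerFailed", terminatedWith(InitContainer, false)},
	{"SidecarOOM", terminatedWith(Sidecar, true)},
	{"StepOOM", terminatedWith(Step, true)},
	{"StepFailed", terminatedWith(Step, false)},
	{"SidecarFailed", terminatedWith(Sidecar, false)},
	{"ContainerCreationFailed", func(c ContainerState) bool { return c.Waiting }},
}

// Classify runs one full iteration over the containers per pass, so a
// higher-priority reason wins regardless of container order in the pod.
func Classify(pod PodState) FailureInfo {
	if pod.Evicted {
		return FailureInfo{Reason: "PodEvicted"}
	}
	for _, p := range passes {
		for _, c := range pod.Containers {
			if p.match(c) {
				return FailureInfo{Reason: p.reason, Container: c.Name, ExitCode: c.ExitCode}
			}
		}
	}
	return FailureInfo{Reason: "Failed"} // fallthrough
}

func main() {
	pod := PodState{Containers: []ContainerState{
		{Name: "step-build", Kind: Step, Terminated: true, ExitCode: 137, OOMKilled: true},
		{Name: "sidecar-proxy", Kind: Sidecar, Terminated: true, ExitCode: 137, OOMKilled: true},
	}}
	// The sidecar pass runs before the step pass, so the sidecar is reported.
	fmt.Println(Classify(pod).Reason)
}
```

Running a separate pass per reason, rather than returning the first failing container found in a single loop, is what makes the result independent of the order in which the pod lists its containers.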

Related TEPs

| TEP | Relationship |
| --- | --- |
| TEP-0042 | Interactive debugging (breakpoints): complements this TEP |
| TEP-0097 | Extended breakpoints: orthogonal |
| TEP-0103 | Skipping Reason: established the pattern of specific Condition reasons |
| TEP-0149 | CLI Local Data Upload: extends debug with CLI interaction |
| TEP-0151 | Error Attribution via Conditions Status: this TEP implements the goals of TEP-0151 |
| TEP-0166 | Task Notices and Warnings: complements (notices for successful tasks, this for failed tasks) |

Implementation Status

PR #9368 implements Phase 1 (failure classification) and is approved, pending /lgtm.

Upstream Issues

  • #7396 — Primary tracking issue
  • #9718 — Debug scripts volume ReadOnly (found during research)
  • #9719 — beforeSteps validation bug (found during research)
  • #9720 — beforeSteps name validation (found during research)

/kind tep

This TEP proposes structured failure classification and diagnostic
context for TaskRun failures, replacing the generic "Failed" reason
with specific machine-readable reasons (StepOOM, PodEvicted,
InitContainerFailed, ContainerCreationFailed, etc.) and adding a
failureInfo field to TaskRun status.

/kind tep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tekton-robot added the kind/tep label (Categorizes issue or PR as related to a TEP, or needs a TEP) on Apr 2, 2026
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign chitrangpatel after the PR has been reviewed.
You can assign the PR to them by writing /assign @chitrangpatel in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 2, 2026
@tekton-robot
Contributor

The following Tekton test failed:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-community-teps-lint | c88ab3f | link | true | /test pull-community-teps-lint |

@vdemeester self-assigned this on Apr 10, 2026