Add GCP structured log labels for task_id/run_id on backend logs by camd · Pull Request #9666 · mozilla/treeherder

camd · 2026-07-05T14:47:15Z

What & why

Backend log lines (failure-line processing, Taskcluster ingestion, and everything logged in between) go to stdout as plaintext and are scraped into GCP Cloud Logging. The only queryable dimensions today are the Kubernetes resource labels (prototype / stage / deployment). Where an identifier appears at all, it's usually the internal Treeherder job_id interpolated into the message string — not the Taskcluster task_id engineers actually reason about, and not a filterable field.

This PR adds a centralized logging context that attaches task_id, run_id, job_id, and component as queryable Cloud Logging labels on those lines, so an investigator can run:

labels.task_id="V3SVuxO8TFy37En_6HcXLp"

and see every Treeherder log line emitted while processing that task/run.

How it works

- treeherder/utils/logging_context.py: a contextvars-based log_context(**labels) (context manager / decorator), get_log_labels(), a job_log_labels(job) resolver (pulls task_id/run_id from TaskclusterMetadata, omitting them for non-Taskcluster jobs), and the pure apply_context_labels(record) helper. No third-party imports.
- treeherder/utils/gcp_logging.py: ContextLabelStructuredLogHandler, a google-cloud-logging StructuredLogHandler subclass. Labels are set once at a task/ingestion boundary and propagate to every log line in scope, so no individual logger.*() call sites change.
- config/settings.py: new GCP_STRUCTURED_LOGGING env flag (default off). When on, the treeherder logger is routed through the structured handler. Local/dev keeps the current human-readable plaintext.
- Wiring: log_parser/tasks.py::parse_logs (component="log_parser") and etl/taskcluster_pulse/handler.py::handle_message (component="ingestion") are wrapped in log_context.
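
For reviewers who want the shape of the context module without opening the diff, here is a minimal sketch of the contextvars pattern described above. It is not the code in this PR: the merge order in apply_context_labels (labels already on the record win over context labels), the string coercion, and the dropping of None values are assumptions made for illustration. job_log_labels is left out because it depends on the Django models.

```python
import contextvars
import logging
from contextlib import contextmanager

# Context-local storage for the active labels. The default mapping is never
# mutated; every change installs a fresh dict.
_log_labels = contextvars.ContextVar("log_labels", default={})


def get_log_labels() -> dict:
    """Return a copy of the labels active in the current context."""
    return dict(_log_labels.get())


@contextmanager
def log_context(**labels):
    """Add labels for the duration of a block.

    Objects returned by a @contextmanager function also work as decorators,
    so this covers both `with log_context(...)` and `@log_context(...)`.
    """
    # Assumption: label values are stored as strings and None is skipped.
    cleaned = {k: str(v) for k, v in labels.items() if v is not None}
    token = _log_labels.set({**_log_labels.get(), **cleaned})
    try:
        yield
    finally:
        # Restore whatever was active before, even if the block raised.
        _log_labels.reset(token)


def apply_context_labels(record: logging.LogRecord) -> logging.LogRecord:
    """Merge the active context labels into record.labels."""
    labels = get_log_labels()
    if labels:
        existing = getattr(record, "labels", None) or {}
        # Assumption: labels set explicitly on the record take precedence.
        record.labels = {**labels, **existing}
    return record
```

Nested contexts merge, and leaving a block restores the outer labels, which is what lets a single wrapper at the entry point cover every log call made underneath it.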

Implementation note (why a handler subclass, not a filter)

StructuredLogHandler.__init__ installs its own CloudLoggingFilter, which reads record.labels before any filter added afterwards runs — so labels added by a later filter are silently dropped. The subclass stamps labels in handle(), i.e. before the filter phase, which is the only reliable point. Verified empirically.
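
The ordering problem can be reproduced with the standard library alone. The sketch below uses stand-in classes (SnapshotFilter, LateLabelFilter, CapturingHandler and StampingHandler are all hypothetical names, not google-cloud-logging classes): a handler installs a filter in its constructor that copies record.labels, a filter added later sets the labels too late, and a subclass that stamps them in handle() gets them in before the filter chain runs.

```python
import logging


class SnapshotFilter(logging.Filter):
    """Stand-in for a built-in filter that reads record.labels when it runs."""

    def filter(self, record):
        record.snapshot = dict(getattr(record, "labels", None) or {})
        return True


class LateLabelFilter(logging.Filter):
    """Sets labels, but is added after SnapshotFilter, so it runs after it."""

    def filter(self, record):
        record.labels = {"task_id": "abc"}
        return True


class CapturingHandler(logging.Handler):
    """Installs SnapshotFilter in __init__ and records what that filter saw."""

    def __init__(self):
        super().__init__()
        self.addFilter(SnapshotFilter())
        self.seen = []

    def emit(self, record):
        self.seen.append(record.snapshot)


class StampingHandler(CapturingHandler):
    def handle(self, record):
        # Handler.handle() runs the filter chain and then emit(), so labels
        # set here are visible to every filter, including the built-in one.
        record.labels = {"task_id": "abc"}
        return super().handle(record)


def make_record():
    return logging.LogRecord("demo", logging.INFO, "demo.py", 1, "msg", None, None)


late = CapturingHandler()
late.addFilter(LateLabelFilter())
late.handle(make_record())

early = StampingHandler()
early.handle(make_record())
```

After running this, late.seen holds an empty dict (the labels arrived after the snapshot was taken) while early.seen holds the task_id label.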

GCP configuration changes needed

None are required. StructuredLogHandler writes GCP-structured JSON to stdout; the GKE/Cloud Run logging agent parses JSON on stdout and promotes the reserved logging.googleapis.com/labels field to LogEntry.labels by default. The one new dependency, google-cloud-logging, is used in structured stdout mode only, so there are no new credentials, no API calls, no agent config, and no sink/index setup. labels.task_id=... becomes queryable automatically once the flag is on.
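
To make the stdout contract concrete, here is a stdlib-only sketch of a formatter that writes one JSON object per line with the reserved logging.googleapis.com/labels key. This PR uses StructuredLogHandler from google-cloud-logging rather than anything like this; GcpJsonFormatter, render and the logger name are hypothetical, and the real handler emits more fields than the three shown.

```python
import io
import json
import logging

LABELS_KEY = "logging.googleapis.com/labels"


class GcpJsonFormatter(logging.Formatter):
    """Render a record as a single JSON line with the reserved labels key."""

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            "severity": record.levelname,
            LABELS_KEY: dict(getattr(record, "labels", None) or {}),
        }
        return json.dumps(payload)


def render(message, labels):
    """Log one message through the formatter and return the emitted text."""
    stream = io.StringIO()  # stands in for stdout
    handler = logging.StreamHandler(stream)
    handler.setFormatter(GcpJsonFormatter())
    logger = logging.getLogger("gcp_json_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        # `extra` attaches the labels dict to the record as record.labels.
        logger.info(message, extra={"labels": labels})
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

Because the labels travel inside the stdout payload itself, the application never talks to the Cloud Logging API.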

Note

One thing to check before flipping the flag (not a GCP change, but an audit): any existing log-based metrics, alerting policies, or sinks that regex-match textPayload. When the flag is on, the treeherder logger's output moves from textPayload (one string) to jsonPayload (structured), so a metric/alert matching on textPayload for those lines would stop matching. It would need to point at jsonPayload.message (and can now match labels.task_id etc.). If no such log-based metrics exist over app logs, there is nothing to do.

Transition period & rollout

The flag defaults off, so merging this changes nothing. Rollout is per-deployment:

1. Merge: no behavior change anywhere (flag off).
2. Enable on stage (GCP_STRUCTURED_LOGGING=true): confirm labels.task_id is queryable, confirm the Logs Explorer still reads well, and audit any log-based metrics/alerts as above.
3. Enable on prod.

There is no dual-format period within a single deployment: when the flag flips, all treeherder-logger output switches from plaintext to structured JSON at once (the plaintext console handler is replaced, so there's no double-logging). During the staged rollout, stage emits structured JSON while prod is still plaintext — worth knowing if any query/metric spans both environments.
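
A rough sketch of how a flag like this can swap the handler rather than add a second one, which is why there is no double-logging. It is not the actual config/settings.py: the flag parsing, the "console" handler name, the "standard" formatter, and the dotted class path (inferred from the file and class names in this PR) are all assumptions.

```python
import logging
import logging.config

# Assumed dotted path, derived from the module and class named in this PR.
STRUCTURED_HANDLER = "treeherder.utils.gcp_logging.ContextLabelStructuredLogHandler"


def flag_enabled(env, name="GCP_STRUCTURED_LOGGING"):
    """Treat a few common truthy spellings as on; anything else is off."""
    return env.get(name, "").strip().lower() in {"1", "true", "yes"}


def build_logging_config(env):
    """Build a dictConfig mapping with exactly one handler for `treeherder`."""
    if flag_enabled(env):
        console = {"class": STRUCTURED_HANDLER}
    else:
        console = {"class": "logging.StreamHandler", "formatter": "standard"}
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "standard": {
                "format": "[%(asctime)s] %(levelname)s [%(name)s:%(lineno)s] %(message)s"
            }
        },
        # The same handler slot is reused, so plaintext is replaced, not doubled.
        "handlers": {"console": console},
        "loggers": {
            "treeherder": {"handlers": ["console"], "level": "INFO", "propagate": False}
        },
    }
```

In a settings module this would be called as build_logging_config(os.environ) and the result handed to logging.config.dictConfig (or assigned to Django's LOGGING).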

Will logs be unreadable with no GCP changes? What changes for the viewer?

No — logs stay readable, and in the Cloud Logging UI they arguably read better. The Logs Explorer uses jsonPayload.message as the summary line, so the human-readable message is front-and-center, now with severity, sourceLocation, and the new labels as first-class structured fields.

It is structurally different (payload type changes), but the visible message is the same text — minus the bracketed prefix, which becomes structured fields:

Before (legacy — textPayload):

[2026-07-05 12:00:00,000] INFO [treeherder.log_parser.tasks:84] Running store_failure_lines for job 12345

After (flag on — jsonPayload):

{
  "message": "Running store_failure_lines for job 12345",
  "severity": "INFO",
  "logging.googleapis.com/labels": {
    "python_logger": "treeherder.log_parser.tasks",
    "task_id": "V3SVuxO8TFy37En_6HcXLp",
    "run_id": "0",
    "job_id": "12345",
    "component": "log_parser"
  },
  "logging.googleapis.com/sourceLocation": { "file": ".../tasks.py", "line": "84", "function": "store_failure_lines" }
}

So it's not just "the old line plus extra info" — the timestamp/level/logger/line that were baked into the text string become proper structured fields, and the new labels are added. In the Logs Explorer that's a net improvement (filter by severity, jump to source, filter by task). The one genuine downside: raw kubectl logs / raw pod stdout for the treeherder logger is now JSON rather than pretty text, which is less pleasant to eyeball outside the Cloud Logging UI. request.summary (MozLog JSON) and the django/kombu loggers are unchanged.
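
For anyone who does end up reading raw pod stdout, a small local helper can fold the JSON back into something close to the old line. This is not part of the PR; humanize is a hypothetical function, and it assumes only the field names shown in the "After" example above.

```python
import json

LABELS_KEY = "logging.googleapis.com/labels"


def humanize(line: str) -> str:
    """Turn one structured JSON log line into compact text.

    Lines that are not JSON objects, or that lack the message/severity
    fields, are returned unchanged so other loggers' output passes through.
    """
    try:
        entry = json.loads(line)
    except ValueError:
        return line
    if not isinstance(entry, dict) or "message" not in entry or "severity" not in entry:
        return line
    labels = entry.get(LABELS_KEY) or {}
    logger = labels.get("python_logger", "?")
    # Show the remaining labels as key=value pairs in a stable order.
    extras = " ".join(
        f"{key}={value}" for key, value in sorted(labels.items()) if key != "python_logger"
    )
    text = f"{entry['severity']} [{logger}] {entry['message']}"
    return f"{text} ({extras})" if extras else text
```

Applied line by line to piped kubectl logs output, it leaves plaintext lines untouched and rewrites only the structured ones.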

Testing

- New unit tests for the context module, label resolver, and handler (tests/utils/test_logging_context.py, tests/utils/test_gcp_logging.py), plus a wiring test proving handle_message sets the context (tests/etl/taskcluster_pulse/test_handler.py). All written test-first.
- Affected suites green (tests/utils, tests/log_parser, tests/etl/taskcluster_pulse). One unrelated pre-existing failure in test_job_loader.py::test_ingest_pulse_jobs_with_missing_push (fails identically on a clean checkout).
- End-to-end check: the real settings LOGGING → dictConfig → handler path emits a JSON line carrying the labels.
- ruff, ruff-format, isort clean.

Attach task_id, run_id, job_id and component as queryable GCP Cloud Logging labels to backend log lines, so CI failures can be investigated per Taskcluster task/run rather than only per prototype/stage. Today most lines are plaintext with, at best, an internal job_id interpolated into the message string; task_id is not queryable.

- treeherder/utils/logging_context.py: contextvars-based log_context() context manager/decorator, get_log_labels(), a job_log_labels() resolver (task_id/run_id from TaskclusterMetadata) and the apply_context_labels() helper. No third-party imports.
- treeherder/utils/gcp_logging.py: ContextLabelStructuredLogHandler, a StructuredLogHandler subclass that stamps context labels onto each record before the handler's filter phase. google-cloud-logging's built-in CloudLoggingFilter reads record.labels, so a filter added after it (the usual pattern) is too late and the labels are dropped.
- config/settings.py: GCP_STRUCTURED_LOGGING env flag (default off) routes the treeherder logger through the structured handler. Off by default so local/dev keeps human-readable plaintext.
- log_parser/tasks.py and etl/taskcluster_pulse/handler.py: wrap the log-parsing and pulse-ingestion entry points in log_context so every related line -- including deeper failure-line processing -- inherits the labels without editing individual log calls.
- requirements: add google-cloud-logging (structured stdout mode; no GCP credentials or API calls).

codecov-commenter · 2026-07-05T14:56:16Z

Codecov Report

❌ Patch coverage is 87.92271% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.65%. Comparing base (740daac) to head (1d7592d).

Files with missing lines	Patch %	Lines
treeherder/log_parser/tasks.py	60.71%	11 Missing ⚠️
treeherder/etl/taskcluster_pulse/handler.py	66.66%	10 Missing ⚠️
tests/utils/test_gcp_logging.py	95.12%	2 Missing ⚠️
treeherder/config/settings.py	50.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #9666      +/-   ##
==========================================
+ Coverage   82.59%   82.65%   +0.06%     
==========================================
  Files         622      626       +4     
  Lines       36580    36733     +153     
  Branches     3279     3279              
==========================================
+ Hits        30213    30362     +149     
- Misses       6217     6221       +4     
  Partials      150      150


camd added the back-end label Jul 5, 2026

camd self-assigned this Jul 5, 2026

camd requested a review from Archaeopteryx July 5, 2026 14:50

camd wants to merge 1 commit into master from camd/gcp-log-task-labels
