Skip to content

Add GCP structured log labels for task_id/run_id on backend logs#9666

Open
camd wants to merge 1 commit into
masterfrom
camd/gcp-log-task-labels
Open

Add GCP structured log labels for task_id/run_id on backend logs#9666
camd wants to merge 1 commit into
masterfrom
camd/gcp-log-task-labels

Conversation

@camd

@camd camd commented Jul 5, 2026

Copy link
Copy Markdown
Collaborator

What & why

Backend log lines (failure-line processing, Taskcluster ingestion, and everything logged in between) go to stdout as plaintext and are scraped into GCP Cloud Logging. The only queryable dimensions today are the Kubernetes resource labels (prototype / stage / deployment). Where an identifier appears at all, it's usually the internal Treeherder job_id interpolated into the message string — not the Taskcluster task_id engineers actually reason about, and not a filterable field.

This PR adds a centralized logging context that attaches task_id, run_id, job_id, and component as queryable Cloud Logging labels on those lines, so an investigator can run:

labels.task_id="V3SVuxO8TFy37En_6HcXLp"

and see every Treeherder log line emitted while processing that task/run.

How it works

  • treeherder/utils/logging_context.py — a contextvars-based log_context(**labels) (context manager / decorator), get_log_labels(), a job_log_labels(job) resolver (pulls task_id/run_id from TaskclusterMetadata, omitting them for non-Taskcluster jobs), and the pure apply_context_labels(record) helper. No third-party imports.
  • treeherder/utils/gcp_logging.pyContextLabelStructuredLogHandler, a google-cloud-logging StructuredLogHandler subclass. Metadata is set once at a task/ingestion boundary and propagates to every log line in scope, so no individual logger.*() call sites change.
  • config/settings.py — new GCP_STRUCTURED_LOGGING env flag (default off). When on, the treeherder logger is routed through the structured handler. Local/dev keeps the current human-readable plaintext.
  • Wiringlog_parser/tasks.py::parse_logs (component="log_parser") and etl/taskcluster_pulse/handler.py::handle_message (component="ingestion").

Implementation note (why a handler subclass, not a filter)

StructuredLogHandler.__init__ installs its own CloudLoggingFilter, which reads record.labels before any filter added afterwards runs — so labels added by a later filter are silently dropped. The subclass stamps labels in handle(), i.e. before the filter phase, which is the only reliable point. Verified empirically.

GCP configuration changes needed

None are required. StructuredLogHandler writes GCP-structured JSON to stdout; the GKE/Cloud Run logging agent parses JSON on stdout and promotes the reserved logging.googleapis.com/labels field to LogEntry.labels by default. No new dependency credentials, no API calls, no agent config, no sink/index setup. labels.task_id=... becomes queryable automatically once the flag is on.

Note

One thing to check before flipping the flag (not a GCP change, but an audit): any existing log-based metrics, alerting policies, or sinks that regex-match textPayload. When the flag is on, the treeherder logger's output moves from textPayload (one string) to jsonPayload (structured), so a metric/alert matching on textPayload for those lines would stop matching. It would need to point at jsonPayload.message (and can now match labels.task_id etc.). If no such log-based metrics exist over app logs, there is nothing to do.

Transition period & rollout

The flag defaults off, so merging this changes nothing. Rollout is per-deployment:

  1. Merge — no behavior change anywhere (flag off).
  2. Enable on stage (GCP_STRUCTURED_LOGGING=true) — confirm labels.task_id is queryable, confirm the Logs Explorer still reads well, and audit any log-based metrics/alerts as above.
  3. Enable on prod.

There is no dual-format period within a single deployment: when the flag flips, all treeherder-logger output switches from plaintext to structured JSON at once (the plaintext console handler is replaced, so there's no double-logging). During the staged rollout, stage emits structured JSON while prod is still plaintext — worth knowing if any query/metric spans both environments.

Will logs be unreadable with no GCP changes? What changes for the viewer?

No — logs stay readable, and in the Cloud Logging UI they arguably read better. The Logs Explorer uses jsonPayload.message as the summary line, so the human-readable message is front-and-center, now with severity, sourceLocation, and the new labels as first-class structured fields.

It is structurally different (payload type changes), but the visible message is the same text — minus the bracketed prefix, which becomes structured fields:

Before (legacy — textPayload):

[2026-07-05 12:00:00,000] INFO [treeherder.log_parser.tasks:84] Running store_failure_lines for job 12345

After (flag on — jsonPayload):

{
  "message": "Running store_failure_lines for job 12345",
  "severity": "INFO",
  "logging.googleapis.com/labels": {
    "python_logger": "treeherder.log_parser.tasks",
    "task_id": "V3SVuxO8TFy37En_6HcXLp",
    "run_id": "0",
    "job_id": "12345",
    "component": "log_parser"
  },
  "logging.googleapis.com/sourceLocation": { "file": ".../tasks.py", "line": "84", "function": "store_failure_lines" }
}

So it's not just "the old line plus extra info" — the timestamp/level/logger/line that were baked into the text string become proper structured fields, and the new labels are added. In the Logs Explorer that's a net improvement (filter by severity, jump to source, filter by task). The one genuine downside: raw kubectl logs / raw pod stdout for the treeherder logger is now JSON rather than pretty text, which is less pleasant to eyeball outside the Cloud Logging UI. request.summary (MozLog JSON) and the django/kombu loggers are unchanged.

Testing

  • New unit tests for the context module, label resolver, and handler (tests/utils/test_logging_context.py, tests/utils/test_gcp_logging.py), plus a wiring test proving handle_message sets the context (tests/etl/taskcluster_pulse/test_handler.py). All written test-first.
  • Affected suites green (tests/utils, tests/log_parser, tests/etl/taskcluster_pulse). One unrelated pre-existing failure in test_job_loader.py::test_ingest_pulse_jobs_with_missing_push (fails identically on a clean checkout).
  • End-to-end check: the real settings LOGGINGdictConfig → handler path emits a JSON line carrying the labels.
  • ruff, ruff-format, isort clean.

Attach task_id, run_id, job_id and component as queryable GCP Cloud
Logging labels to backend log lines, so CI failures can be investigated
per Taskcluster task/run rather than only per prototype/stage. Today most
lines are plaintext with, at best, an internal job_id interpolated into
the message string; task_id is not queryable.

- treeherder/utils/logging_context.py: contextvars-based log_context()
  context manager/decorator, get_log_labels(), a job_log_labels() resolver
  (task_id/run_id from TaskclusterMetadata) and the apply_context_labels()
  helper. No third-party imports.
- treeherder/utils/gcp_logging.py: ContextLabelStructuredLogHandler, a
  StructuredLogHandler subclass that stamps context labels onto each record
  before the handler's filter phase. google-cloud-logging's built-in
  CloudLoggingFilter reads record.labels, so a filter added after it (the
  usual pattern) is too late and the labels are dropped.
- config/settings.py: GCP_STRUCTURED_LOGGING env flag (default off) routes
  the treeherder logger through the structured handler. Off by default so
  local/dev keeps human-readable plaintext.
- log_parser/tasks.py and etl/taskcluster_pulse/handler.py: wrap the
  log-parsing and pulse-ingestion entry points in log_context so every
  related line -- including deeper failure-line processing -- inherits the
  labels without editing individual log calls.
- requirements: add google-cloud-logging (structured stdout mode; no GCP
  credentials or API calls).
@camd camd added the back-end label Jul 5, 2026
@camd camd self-assigned this Jul 5, 2026
@camd camd requested a review from Archaeopteryx July 5, 2026 14:50
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.92271% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.65%. Comparing base (740daac) to head (1d7592d).

Files with missing lines Patch % Lines
treeherder/log_parser/tasks.py 60.71% 11 Missing ⚠️
treeherder/etl/taskcluster_pulse/handler.py 66.66% 10 Missing ⚠️
tests/utils/test_gcp_logging.py 95.12% 2 Missing ⚠️
treeherder/config/settings.py 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9666      +/-   ##
==========================================
+ Coverage   82.59%   82.65%   +0.06%     
==========================================
  Files         622      626       +4     
  Lines       36580    36733     +153     
  Branches     3279     3279              
==========================================
+ Hits        30213    30362     +149     
- Misses       6217     6221       +4     
  Partials      150      150              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants