Add GCP structured log labels for task_id/run_id on backend logs#9666
Open
camd wants to merge 1 commit into
Open
Add GCP structured log labels for task_id/run_id on backend logs#9666camd wants to merge 1 commit into
camd wants to merge 1 commit into
Conversation
Attach task_id, run_id, job_id and component as queryable GCP Cloud Logging labels to backend log lines, so CI failures can be investigated per Taskcluster task/run rather than only per prototype/stage. Today most lines are plaintext with, at best, an internal job_id interpolated into the message string; task_id is not queryable. - treeherder/utils/logging_context.py: contextvars-based log_context() context manager/decorator, get_log_labels(), a job_log_labels() resolver (task_id/run_id from TaskclusterMetadata) and the apply_context_labels() helper. No third-party imports. - treeherder/utils/gcp_logging.py: ContextLabelStructuredLogHandler, a StructuredLogHandler subclass that stamps context labels onto each record before the handler's filter phase. google-cloud-logging's built-in CloudLoggingFilter reads record.labels, so a filter added after it (the usual pattern) is too late and the labels are dropped. - config/settings.py: GCP_STRUCTURED_LOGGING env flag (default off) routes the treeherder logger through the structured handler. Off by default so local/dev keeps human-readable plaintext. - log_parser/tasks.py and etl/taskcluster_pulse/handler.py: wrap the log-parsing and pulse-ingestion entry points in log_context so every related line -- including deeper failure-line processing -- inherits the labels without editing individual log calls. - requirements: add google-cloud-logging (structured stdout mode; no GCP credentials or API calls).
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #9666 +/- ##
==========================================
+ Coverage 82.59% 82.65% +0.06%
==========================================
Files 622 626 +4
Lines 36580 36733 +153
Branches 3279 3279
==========================================
+ Hits 30213 30362 +149
- Misses 6217 6221 +4
Partials 150 150 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Backend log lines (failure-line processing, Taskcluster ingestion, and everything logged in between) go to stdout as plaintext and are scraped into GCP Cloud Logging. The only queryable dimensions today are the Kubernetes resource labels (prototype / stage / deployment). Where an identifier appears at all, it's usually the internal Treeherder
job_idinterpolated into the message string — not the Taskclustertask_idengineers actually reason about, and not a filterable field.This PR adds a centralized logging context that attaches
task_id,run_id,job_id, andcomponentas queryable Cloud Logging labels on those lines, so an investigator can run:and see every Treeherder log line emitted while processing that task/run.
How it works
treeherder/utils/logging_context.py— acontextvars-basedlog_context(**labels)(context manager / decorator),get_log_labels(), ajob_log_labels(job)resolver (pullstask_id/run_idfromTaskclusterMetadata, omitting them for non-Taskcluster jobs), and the pureapply_context_labels(record)helper. No third-party imports.treeherder/utils/gcp_logging.py—ContextLabelStructuredLogHandler, agoogle-cloud-loggingStructuredLogHandlersubclass. Metadata is set once at a task/ingestion boundary and propagates to every log line in scope, so no individuallogger.*()call sites change.config/settings.py— newGCP_STRUCTURED_LOGGINGenv flag (default off). When on, thetreeherderlogger is routed through the structured handler. Local/dev keeps the current human-readable plaintext.log_parser/tasks.py::parse_logs(component="log_parser") andetl/taskcluster_pulse/handler.py::handle_message(component="ingestion").Implementation note (why a handler subclass, not a filter)
StructuredLogHandler.__init__installs its ownCloudLoggingFilter, which readsrecord.labelsbefore any filter added afterwards runs — so labels added by a later filter are silently dropped. The subclass stamps labels inhandle(), i.e. before the filter phase, which is the only reliable point. Verified empirically.GCP configuration changes needed
None are required.
StructuredLogHandlerwrites GCP-structured JSON to stdout; the GKE/Cloud Run logging agent parses JSON on stdout and promotes the reservedlogging.googleapis.com/labelsfield toLogEntry.labelsby default. No new dependency credentials, no API calls, no agent config, no sink/index setup.labels.task_id=...becomes queryable automatically once the flag is on.Note
One thing to check before flipping the flag (not a GCP change, but an audit): any existing log-based metrics, alerting policies, or sinks that regex-match
textPayload. When the flag is on, thetreeherderlogger's output moves fromtextPayload(one string) tojsonPayload(structured), so a metric/alert matching ontextPayloadfor those lines would stop matching. It would need to point atjsonPayload.message(and can now matchlabels.task_idetc.). If no such log-based metrics exist over app logs, there is nothing to do.Transition period & rollout
The flag defaults off, so merging this changes nothing. Rollout is per-deployment:
GCP_STRUCTURED_LOGGING=true) — confirmlabels.task_idis queryable, confirm the Logs Explorer still reads well, and audit any log-based metrics/alerts as above.There is no dual-format period within a single deployment: when the flag flips, all
treeherder-logger output switches from plaintext to structured JSON at once (the plaintextconsolehandler is replaced, so there's no double-logging). During the staged rollout, stage emits structured JSON while prod is still plaintext — worth knowing if any query/metric spans both environments.Will logs be unreadable with no GCP changes? What changes for the viewer?
No — logs stay readable, and in the Cloud Logging UI they arguably read better. The Logs Explorer uses
jsonPayload.messageas the summary line, so the human-readable message is front-and-center, now withseverity,sourceLocation, and the newlabelsas first-class structured fields.It is structurally different (payload type changes), but the visible message is the same text — minus the bracketed prefix, which becomes structured fields:
Before (legacy —
textPayload):After (flag on —
jsonPayload):{ "message": "Running store_failure_lines for job 12345", "severity": "INFO", "logging.googleapis.com/labels": { "python_logger": "treeherder.log_parser.tasks", "task_id": "V3SVuxO8TFy37En_6HcXLp", "run_id": "0", "job_id": "12345", "component": "log_parser" }, "logging.googleapis.com/sourceLocation": { "file": ".../tasks.py", "line": "84", "function": "store_failure_lines" } }So it's not just "the old line plus extra info" — the timestamp/level/logger/line that were baked into the text string become proper structured fields, and the new labels are added. In the Logs Explorer that's a net improvement (filter by severity, jump to source, filter by task). The one genuine downside: raw
kubectl logs/ raw pod stdout for thetreeherderlogger is now JSON rather than pretty text, which is less pleasant to eyeball outside the Cloud Logging UI.request.summary(MozLog JSON) and thedjango/kombuloggers are unchanged.Testing
tests/utils/test_logging_context.py,tests/utils/test_gcp_logging.py), plus a wiring test provinghandle_messagesets the context (tests/etl/taskcluster_pulse/test_handler.py). All written test-first.tests/utils,tests/log_parser,tests/etl/taskcluster_pulse). One unrelated pre-existing failure intest_job_loader.py::test_ingest_pulse_jobs_with_missing_push(fails identically on a clean checkout).LOGGING→dictConfig→ handler path emits a JSON line carrying the labels.ruff,ruff-format,isortclean.