perf: fuse to_bigquery_column_spec into a single binary walk#3490
Draft
djwhitt wants to merge 6 commits into
Draft
perf: fuse to_bigquery_column_spec into a single binary walk#3490djwhitt wants to merge 6 commits into
djwhitt wants to merge 6 commits into
Conversation
The previous implementation built a full metadata flat typemap from
the body via SchemaUtils.to_typemap/1 + SchemaUtils.flatten_typemap/1,
then Map.merged it against the cached schema flat map purely to detect
type conflicts. The flat map was discarded immediately after.
Walk the body once and check each leaf's inferred type against the
schema flat map key-by-key. Build dot-delimited paths inline, raise
on type mismatch, return :ok if no conflict. No intermediate typemap,
no flatten pass, no allocation of the throwaway map.
Behavior changes / fixes exposed by the rewrite:
* validate/2 was previously passing the raw %SourceSchema{} struct
(not its schema_flat_map field) into the merge, so atom struct
keys never overlapped with string flat-map keys and type
conflicts were silently ignored. Now we extract :schema_flat_map
and actually compare. Mismatches that production used to swallow
will now surface as {:error, _} at validate time (previously they
failed downstream at BigQuery insert time anyway).
* "timestamp" is skipped at the top level because LogEvent.mapper/1
injects it as integer microseconds, while the BQ schema types it
as TIMESTAMP (:datetime). Without the skip, every event with a
cached schema would now fail validation.
* Short-circuit when schema_flat_map is empty (cold-start sources)
so we don't walk the body just to find nothing.
* try_merge / merge_flat_typemaps removed — they were internal
helpers tied to the old shape.
Test file un-failed: the @moduletag :failing was hiding pre-existing
correctness bugs (valid? was called with metadata-only maps against
full schemas, so no key ever matched). Tests rewritten to wrap with
%{"metadata" => ...} so they actually exercise the validator paths
they claim to test; the rewrite makes them all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baseline (main, 8e03624) and the single-pass walk (640151a). Bench scenarios don't populate a source_schema, so the new empty-schema short-circuit fires; production deltas with a cached schema will be smaller but on the same order, since the dominant savings come from eliminating the transient flat-map allocation. edge log: -21% wall / -41% mem / -42% reds otel trace: -36% wall / -44% mem / -42% reds Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Direct unit tests for the metadata-flavor to_typemap/1 clause: binary/atom keys, nested-map keys, Latin1 normalization, and arbitrary high-byte fallback. Adapted from PR #3487's coverage to assert the current atom-key contract. After the validate/2 rewrite this clause has no caller in lib/ (the remaining to_typemap/1 callers all take %TS{} schemas and hit the schema-flavor clause), so we previously relied on the validator tests to exercise it indirectly. Direct coverage protects the clause's contract in case future refactors touch it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lock in the observable behavior of the `:to_bigquery_column_spec` key transformation before fusing the five sub-passes into a single binary walk. Covers each rule in isolation, cumulative leading-`_` prepending across rules, suppression of leading-digit prepend when a prefix or dash already prepended, pass-through cases (atom/int/ empty-string keys, already-valid keys), PCRE byte-mode Latin-1 word classification on multibyte input, and the 128-byte field-length boundary including prefix-induced length growth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The legacy implementation piped each map key through five sequential sub-passes (strip_bq_prefix → dashes_to_underscores → alter_leading_number → alphanumeric_only → enforce_field_length). Most keys hit none of the rewrite branches but still paid all five scans, with `:re.run/3` and `:binary.match/2` showing up at ~13% combined wall time in the edge log fixture. Walk the binary once, tracking the dash/non-alnum flags and detecting the bq-reserved-prefix and leading-digit cases upfront. Each rule still contributes at most one leading underscore; the digit rule is suppressed when a prefix or dash has already prepended, preserving the order of the original pipeline. The byte classifier mirrors PCRE's byte-mode \w table (incl. Latin-1 letter ranges 192..214, 216..246, 248..255 and the lone-letter bytes 170/181/186) so non-ASCII keys produce identical column names to the prior `~r/\W/` replacement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the LogEvent.make benchmark on top of the o11y-1829 baseline (640151a). Single-pass binary walk over each ingest key replaces the five sub-pass pipeline, cutting wall time roughly in half on both the edge log and otel trace fixtures across every scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
130140f to
19c1aa6
Compare
Base automatically changed from
davidwhittington/o11y-1829-bigqueryschemachangevalidate-walk-body-once-and-validate
to
main
May 20, 2026 20:37
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the five-stage
to_bigquery_column_spec/1pipeline (strip_bq_prefix→dashes_to_underscores→alter_leading_number→alphanumeric_only→enforce_field_length) with a single binary walk that tracks the dash/non-alnum flags inline and detects the reserved-prefix and leading-digit cases up front. Each rule still contributes at most one leading underscore; the digit rule is suppressed when a prefix or dash has already prepended, preserving the order of the original pipeline.Stacks on top of #3489.
Combined perf delta (
LogEvent.makemedian wall time, edge log fixture)Wall time roughly halves on every benched scenario (edge log + otel trace, with/without copy/kv/drop transforms). Reductions stay flat to within ±5%, which is the strongest sanity check that the new walk isn't skipping work — the win is almost entirely the eliminated
:re.run/3and:binary.match/2NIF round-trips, which don't count toward reductions but consume wall-clock time. The previous bench's per-call profile attributed ~20% to the column-spec pipeline; halving the end-to-end wall says the NIF crossing overhead per key was larger thantprofcould attribute.Captured snapshot is appended to
test/profiling/log_event_make_bench.history.exsagainst SHA9760d277d.Quirky behavior preserved (worth a careful look)
The legacy
alphanumeric_onlystep used~r/\W/without theunicodemodifier. PCRE in byte mode classifies bytes using a Latin-1 word-character table, not ASCII-only[a-zA-Z0-9_]. That means:ª), 181 (µ), 186 (º) are treated as word characters and pass through unchanged×) and 247 (÷) are non-wordSo a UTF-8
"é"(bytes<<0xC3, 0xA9>>) becomes<<?_, 0xC3, ?_>>after the rule fires — only the trailing0xA9is replaced;0xC3is kept verbatim because PCRE considers it a Latin-1 letter byte. This is almost certainly not what the original author intended, but it has been the observable behavior shipping to BigQuery, so column names already exist that include those bytes. Changing it would silently rename columns and break downstream tables.The new walk replicates the PCRE table exactly (see the third
walk_bq_key/4clause) so non-ASCII keys produce identical column names. If we ever want to switch to a sensible UTF-8-aware classification (or strict ASCII), it should be a deliberate, separately benchmarked migration with a migration plan for the affected columns — not a quiet side-effect of this PR.Tests
test/logflare/logs/ingest/ingest_transformer_test.exsgains a newdescribe ":to_bigquery_column_spec fused pipeline"block (~18 tests). These were authored against the legacy implementation and pass against both — they're a strict characterization layer covering each rule in isolation, cumulative_prepending, leading-digit suppression when a prefix or dash already prepended, pass-through cases (atom/int/empty-string keys, already-valid keys), the PCRE Latin-1 word quirk on multibyte input, and the 128-byte field-length boundary including prefix-induced truncation.Test plan
mix test test/logflare/logs/ingest/ingest_transformer_test.exsmix test test/logflare/log_event_test.exsmix lint.diffmix test.typings🤖 Generated with Claude Code