Skip to content

perf: fuse to_bigquery_column_spec into a single binary walk#3490

Draft
djwhitt wants to merge 6 commits into
mainfrom
davidwhittington/o11y-1828-fuse-to_bigquery_column_spec-into-a-single-binary-walk
Draft

perf: fuse to_bigquery_column_spec into a single binary walk#3490
djwhitt wants to merge 6 commits into
mainfrom
davidwhittington/o11y-1828-fuse-to_bigquery_column_spec-into-a-single-binary-walk

Conversation

@djwhitt

@djwhitt djwhitt commented May 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Replaces the five-stage to_bigquery_column_spec/1 pipeline (strip_bq_prefixdashes_to_underscoresalter_leading_numberalphanumeric_onlyenforce_field_length) with a single binary walk that tracks the dash/non-alnum flags inline and detects the reserved-prefix and leading-digit cases up front. Each rule still contributes at most one leading underscore; the digit rule is suppressed when a prefix or dash has already prepended, preserving the order of the original pipeline.

Stacks on top of #3489.

Combined perf delta (LogEvent.make median wall time, edge log fixture)

Wall time roughly halves on every benched scenario (edge log + otel trace, with/without copy/kv/drop transforms). Reductions stay flat to within ±5%, which is the strongest sanity check that the new walk isn't skipping work — the win is almost entirely the eliminated :re.run/3 and :binary.match/2 NIF round-trips, which don't count toward reductions but consume wall-clock time. The previous bench's per-call profile attributed ~20% to the column-spec pipeline; halving the end-to-end wall says the NIF crossing overhead per key was larger than tprof could attribute.

Captured snapshot is appended to test/profiling/log_event_make_bench.history.exs against SHA 9760d277d.

Quirky behavior preserved (worth a careful look)

The legacy alphanumeric_only step used ~r/\W/ without the unicode modifier. PCRE in byte mode classifies bytes using a Latin-1 word-character table, not ASCII-only [a-zA-Z0-9_]. That means:

  • bytes 192..214, 216..246, 248..255, plus the singletons 170 (ª), 181 (µ), 186 (º) are treated as word characters and pass through unchanged
  • byte 215 (×) and 247 (÷) are non-word
  • bytes 128..169, 171..180, 182..185, 187..191 are non-word

So a UTF-8 "é" (bytes <<0xC3, 0xA9>>) becomes <<?_, 0xC3, ?_>> after the rule fires — only the trailing 0xA9 is replaced; 0xC3 is kept verbatim because PCRE considers it a Latin-1 letter byte. This is almost certainly not what the original author intended, but it has been the observable behavior shipping to BigQuery, so column names already exist that include those bytes. Changing it would silently rename columns and break downstream tables.

The new walk replicates the PCRE table exactly (see the third walk_bq_key/4 clause) so non-ASCII keys produce identical column names. If we ever want to switch to a sensible UTF-8-aware classification (or strict ASCII), it should be a deliberate, separately benchmarked migration with a migration plan for the affected columns — not a quiet side-effect of this PR.

Tests

  • test/logflare/logs/ingest/ingest_transformer_test.exs gains a new describe ":to_bigquery_column_spec fused pipeline" block (~18 tests). These were authored against the legacy implementation and pass against both — they're a strict characterization layer covering each rule in isolation, cumulative _ prepending, leading-digit suppression when a prefix or dash already prepended, pass-through cases (atom/int/empty-string keys, already-valid keys), the PCRE Latin-1 word quirk on multibyte input, and the 128-byte field-length boundary including prefix-induced truncation.

Test plan

  • mix test test/logflare/logs/ingest/ingest_transformer_test.exs
  • mix test test/logflare/log_event_test.exs
  • mix lint.diff
  • mix test.typings
  • Eyeball the PCRE Latin-1 word table comment + matching guard clauses against the byte enumeration in the PR description

🤖 Generated with Claude Code

djwhitt and others added 6 commits May 16, 2026 16:07
The previous implementation built a full metadata flat typemap from
the body via SchemaUtils.to_typemap/1 + SchemaUtils.flatten_typemap/1,
then Map.merged it against the cached schema flat map purely to detect
type conflicts. The flat map was discarded immediately after.

Walk the body once and check each leaf's inferred type against the
schema flat map key-by-key. Build dot-delimited paths inline, raise
on type mismatch, return :ok if no conflict. No intermediate typemap,
no flatten pass, no allocation of the throwaway map.

Behavior changes / fixes exposed by the rewrite:

  * validate/2 was previously passing the raw %SourceSchema{} struct
    (not its schema_flat_map field) into the merge, so atom struct
    keys never overlapped with string flat-map keys and type
    conflicts were silently ignored. Now we extract :schema_flat_map
    and actually compare. Mismatches that production used to swallow
    will now surface as {:error, _} at validate time (previously they
    failed downstream at BigQuery insert time anyway).
  * "timestamp" is skipped at the top level because LogEvent.mapper/1
    injects it as integer microseconds, while the BQ schema types it
    as TIMESTAMP (:datetime). Without the skip, every event with a
    cached schema would now fail validation.
  * Short-circuit when schema_flat_map is empty (cold-start sources)
    so we don't walk the body just to find nothing.
  * try_merge / merge_flat_typemaps removed — they were internal
    helpers tied to the old shape.

Test file un-failed: the @moduletag :failing was hiding pre-existing
correctness bugs (valid? was called with metadata-only maps against
full schemas, so no key ever matched). Tests rewritten to wrap with
%{"metadata" => ...} so they actually exercise the validator paths
they claim to test; the rewrite makes them all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baseline (main, 8e03624) and the single-pass walk (640151a).
Bench scenarios don't populate a source_schema, so the new
empty-schema short-circuit fires; production deltas with a cached
schema will be smaller but on the same order, since the dominant
savings come from eliminating the transient flat-map allocation.

edge log: -21% wall / -41% mem / -42% reds
otel trace: -36% wall / -44% mem / -42% reds

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Direct unit tests for the metadata-flavor to_typemap/1 clause:
binary/atom keys, nested-map keys, Latin1 normalization, and
arbitrary high-byte fallback. Adapted from PR #3487's coverage to
assert the current atom-key contract.

After the validate/2 rewrite this clause has no caller in lib/ (the
remaining to_typemap/1 callers all take %TS{} schemas and hit the
schema-flavor clause), so we previously relied on the validator
tests to exercise it indirectly. Direct coverage protects the
clause's contract in case future refactors touch it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lock in the observable behavior of the `:to_bigquery_column_spec`
key transformation before fusing the five sub-passes into a single
binary walk. Covers each rule in isolation, cumulative leading-`_`
prepending across rules, suppression of leading-digit prepend when
a prefix or dash already prepended, pass-through cases (atom/int/
empty-string keys, already-valid keys), PCRE byte-mode Latin-1 word
classification on multibyte input, and the 128-byte field-length
boundary including prefix-induced length growth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The legacy implementation piped each map key through five sequential
sub-passes (strip_bq_prefix → dashes_to_underscores →
alter_leading_number → alphanumeric_only → enforce_field_length).
Most keys hit none of the rewrite branches but still paid all five
scans, with `:re.run/3` and `:binary.match/2` showing up at ~13%
combined wall time in the edge log fixture.

Walk the binary once, tracking the dash/non-alnum flags and detecting
the bq-reserved-prefix and leading-digit cases upfront. Each rule
still contributes at most one leading underscore; the digit rule is
suppressed when a prefix or dash has already prepended, preserving the
order of the original pipeline. The byte classifier mirrors PCRE's
byte-mode \w table (incl. Latin-1 letter ranges 192..214, 216..246,
248..255 and the lone-letter bytes 170/181/186) so non-ASCII keys
produce identical column names to the prior `~r/\W/` replacement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the LogEvent.make benchmark on top of the o11y-1829 baseline
(640151a). Single-pass binary walk over each ingest key replaces the
five sub-pass pipeline, cutting wall time roughly in half on both the
edge log and otel trace fixtures across every scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@djwhitt djwhitt force-pushed the davidwhittington/o11y-1829-bigqueryschemachangevalidate-walk-body-once-and-validate branch 8 times, most recently from 130140f to 19c1aa6 Compare May 20, 2026 20:01
Base automatically changed from davidwhittington/o11y-1829-bigqueryschemachangevalidate-walk-body-once-and-validate to main May 20, 2026 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant