perf: fuse to_bigquery_column_spec into a single binary walk by djwhitt · Pull Request #3490 · Logflare/logflare

djwhitt · 2026-05-16T21:44:07Z

Summary

Replaces the five-stage to_bigquery_column_spec/1 pipeline (strip_bq_prefix → dashes_to_underscores → alter_leading_number → alphanumeric_only → enforce_field_length) with a single binary walk that tracks the dash/non-alnum flags inline and detects the reserved-prefix and leading-digit cases up front. Each rule still contributes at most one leading underscore; the digit rule is suppressed when a prefix or dash has already prepended, preserving the order of the original pipeline.

Stacks on top of #3489.

Combined perf delta (`LogEvent.make` median wall time, edge log fixture)

pre-perf: rewrite BigQuerySchemaChange.validate as single-pass walk #3489 baseline: 77.25 µs
after perf: rewrite BigQuerySchemaChange.validate as single-pass walk #3489 single-pass validate: 57.75 µs (-25%)
after this PR fused column-spec walk: 25.71 µs (-67% combined, -55% from the perf: rewrite BigQuerySchemaChange.validate as single-pass walk #3489 baseline)

Wall time roughly halves on every benched scenario (edge log + otel trace, with/without copy/kv/drop transforms). Reductions stay flat to within ±5%, which is the strongest sanity check that the new walk isn't skipping work — the win is almost entirely the eliminated :re.run/3 and :binary.match/2 NIF round-trips, which don't count toward reductions but consume wall-clock time. The previous bench's per-call profile attributed ~20% to the column-spec pipeline; halving the end-to-end wall says the NIF crossing overhead per key was larger than tprof could attribute.

Captured snapshot is appended to test/profiling/log_event_make_bench.history.exs against SHA 9760d277d.

Quirky behavior preserved (worth a careful look)

The legacy alphanumeric_only step used ~r/\W/ without the unicode modifier. PCRE in byte mode classifies bytes using a Latin-1 word-character table, not ASCII-only [a-zA-Z0-9_]. That means:

bytes 192..214, 216..246, 248..255, plus the singletons 170 (ª), 181 (µ), 186 (º) are treated as word characters and pass through unchanged
byte 215 (×) and 247 (÷) are non-word
bytes 128..169, 171..180, 182..185, 187..191 are non-word

So a UTF-8 "é" (bytes <<0xC3, 0xA9>>) becomes <<?_, 0xC3, ?_>> after the rule fires — only the trailing 0xA9 is replaced; 0xC3 is kept verbatim because PCRE considers it a Latin-1 letter byte. This is almost certainly not what the original author intended, but it has been the observable behavior shipping to BigQuery, so column names already exist that include those bytes. Changing it would silently rename columns and break downstream tables.

The new walk replicates the PCRE table exactly (see the third walk_bq_key/4 clause) so non-ASCII keys produce identical column names. If we ever want to switch to a sensible UTF-8-aware classification (or strict ASCII), it should be a deliberate, separately benchmarked migration with a migration plan for the affected columns — not a quiet side-effect of this PR.

Tests

test/logflare/logs/ingest/ingest_transformer_test.exs gains a new describe ":to_bigquery_column_spec fused pipeline" block (~18 tests). These were authored against the legacy implementation and pass against both — they're a strict characterization layer covering each rule in isolation, cumulative _ prepending, leading-digit suppression when a prefix or dash already prepended, pass-through cases (atom/int/empty-string keys, already-valid keys), the PCRE Latin-1 word quirk on multibyte input, and the 128-byte field-length boundary including prefix-induced truncation.

Test plan

mix test test/logflare/logs/ingest/ingest_transformer_test.exs
mix test test/logflare/log_event_test.exs
mix lint.diff
mix test.typings
Eyeball the PCRE Latin-1 word table comment + matching guard clauses against the byte enumeration in the PR description

🤖 Generated with Claude Code

The previous implementation built a full metadata flat typemap from the body via SchemaUtils.to_typemap/1 + SchemaUtils.flatten_typemap/1, then Map.merged it against the cached schema flat map purely to detect type conflicts. The flat map was discarded immediately after. Walk the body once and check each leaf's inferred type against the schema flat map key-by-key. Build dot-delimited paths inline, raise on type mismatch, return :ok if no conflict. No intermediate typemap, no flatten pass, no allocation of the throwaway map. Behavior changes / fixes exposed by the rewrite: * validate/2 was previously passing the raw %SourceSchema{} struct (not its schema_flat_map field) into the merge, so atom struct keys never overlapped with string flat-map keys and type conflicts were silently ignored. Now we extract :schema_flat_map and actually compare. Mismatches that production used to swallow will now surface as {:error, _} at validate time (previously they failed downstream at BigQuery insert time anyway). * "timestamp" is skipped at the top level because LogEvent.mapper/1 injects it as integer microseconds, while the BQ schema types it as TIMESTAMP (:datetime). Without the skip, every event with a cached schema would now fail validation. * Short-circuit when schema_flat_map is empty (cold-start sources) so we don't walk the body just to find nothing. * try_merge / merge_flat_typemaps removed — they were internal helpers tied to the old shape. Test file un-failed: the @moduletag :failing was hiding pre-existing correctness bugs (valid? was called with metadata-only maps against full schemas, so no key ever matched). Tests rewritten to wrap with %{"metadata" => ...} so they actually exercise the validator paths they claim to test; the rewrite makes them all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Baseline (main, 8e03624) and the single-pass walk (640151a). Bench scenarios don't populate a source_schema, so the new empty-schema short-circuit fires; production deltas with a cached schema will be smaller but on the same order, since the dominant savings come from eliminating the transient flat-map allocation. edge log: -21% wall / -41% mem / -42% reds otel trace: -36% wall / -44% mem / -42% reds Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Direct unit tests for the metadata-flavor to_typemap/1 clause: binary/atom keys, nested-map keys, Latin1 normalization, and arbitrary high-byte fallback. Adapted from PR #3487's coverage to assert the current atom-key contract. After the validate/2 rewrite this clause has no caller in lib/ (the remaining to_typemap/1 callers all take %TS{} schemas and hit the schema-flavor clause), so we previously relied on the validator tests to exercise it indirectly. Direct coverage protects the clause's contract in case future refactors touch it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lock in the observable behavior of the `:to_bigquery_column_spec` key transformation before fusing the five sub-passes into a single binary walk. Covers each rule in isolation, cumulative leading-`_` prepending across rules, suppression of leading-digit prepend when a prefix or dash already prepended, pass-through cases (atom/int/ empty-string keys, already-valid keys), PCRE byte-mode Latin-1 word classification on multibyte input, and the 128-byte field-length boundary including prefix-induced length growth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The legacy implementation piped each map key through five sequential sub-passes (strip_bq_prefix → dashes_to_underscores → alter_leading_number → alphanumeric_only → enforce_field_length). Most keys hit none of the rewrite branches but still paid all five scans, with `:re.run/3` and `:binary.match/2` showing up at ~13% combined wall time in the edge log fixture. Walk the binary once, tracking the dash/non-alnum flags and detecting the bq-reserved-prefix and leading-digit cases upfront. Each rule still contributes at most one leading underscore; the digit rule is suppressed when a prefix or dash has already prepended, preserving the order of the original pipeline. The byte classifier mirrors PCRE's byte-mode \w table (incl. Latin-1 letter ranges 192..214, 216..246, 248..255 and the lone-letter bytes 170/181/186) so non-ASCII keys produce identical column names to the prior `~r/\W/` replacement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the LogEvent.make benchmark on top of the o11y-1829 baseline (640151a). Single-pass binary walk over each ingest key replaces the five sub-pass pipeline, cutting wall time roughly in half on both the edge log and otel trace fixtures across every scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

djwhitt and others added 6 commits May 16, 2026 16:07

github-actions Bot assigned djwhitt May 16, 2026

djwhitt force-pushed the davidwhittington/o11y-1829-bigqueryschemachangevalidate-walk-body-once-and-validate branch 8 times, most recently from 130140f to 19c1aa6 Compare May 20, 2026 20:01

Base automatically changed from davidwhittington/o11y-1829-bigqueryschemachangevalidate-walk-body-once-and-validate to main May 20, 2026 20:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: fuse to_bigquery_column_spec into a single binary walk#3490

perf: fuse to_bigquery_column_spec into a single binary walk#3490
djwhitt wants to merge 6 commits into
mainfrom
davidwhittington/o11y-1828-fuse-to_bigquery_column_spec-into-a-single-binary-walk

djwhitt commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

djwhitt commented May 16, 2026

Summary

Combined perf delta (LogEvent.make median wall time, edge log fixture)

Quirky behavior preserved (worth a careful look)

Tests

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Combined perf delta (`LogEvent.make` median wall time, edge log fixture)