perf(eap-items): Encode RowBinary inserts in the processor#7979
Open
phacops wants to merge 7 commits into
Open
perf(eap-items): Encode RowBinary inserts in the processor#7979phacops wants to merge 7 commits into
phacops wants to merge 7 commits into
Conversation
The RowBinary consumer kept Vec<EAPItemRow> alive in every batch all the way from the processor to the writer. Each row has ~80 attribute-bucket Vecs plus per-bucket String allocations, so consumers ran with 2-3x the in-flight memory of the JSONEachRow path. The clickhouse-rs Insert builder also double-buffered the encoded bytes during writes. Serialize each EAPItemRow to RowBinary bytes inside the processor and carry only the resulting RowData<Vec<u8>> downstream. The typed struct drops as soon as the bytes exist; the writer just POSTs them with INSERT INTO ... FORMAT RowBinary, mirroring the JSON path. Vendor a small (~280 line) RowBinary serde Serializer plus the UUID adapter from clickhouse-rs, since the upstream serializer is pub(crate). With that in place, drop the clickhouse-rs dependency entirely - the integration test now talks to ClickHouse via reqwest + FORMAT JSON. The RowBinary and JSONEachRow pipelines now share the same Reduce -> Writer shape; use_row_binary just selects the processor function and the FORMAT clause. TypedInsertBatch, EstimatedSize, the BytesInsertBatch<Vec<T>> impl, make_rust_processor_row_binary, and ClickhouseRowBinaryWriterStep are all gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/39S080clqOozGCknLHK2EbqiXwoO9gjlPnmgL_y0M0Q
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 5c262e1. Configure here.
When use_row_binary=true and use_rust_processor=false, the factory's match fell through to the Python processor (which emits JSONEachRow bytes) while the writer was already configured with FORMAT RowBinary - ClickHouse would reject the batch. The pre-collapse code avoided this by early-returning from create_row_binary_pipeline, which always used the Rust processor regardless of use_rust_processor. Coalesce use_rust_processor || use_row_binary at the dispatch site so the override path is always reachable when RowBinary is requested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without a column list, ClickHouse maps the RowBinary stream to columns using the table's on-disk positional order, which does not match the EAPItemRow struct's wire order in two places: * `client_sample_rate` and `server_sample_rate` were added by migration 0048 with identical `AFTER sampling_factor`, so on disk the pair is reversed (server precedes client) while the struct keeps client first. * The initial 0024 migration places all `attributes_string_*` columns before all `attributes_float_*`, but the struct interleaves them per the `seq_attrs!` expansion. The integration test exposed this as `CANNOT_READ_ALL_DATA` deep inside the bucket maps. clickhouse-rs masked the same hazard by always emitting the column list via the `Row` derive; our vendored serializer has to opt back in. Add `EAPItemRow::COLUMN_NAMES` in struct order, thread an `Option<&'static [&'static str]>` through `ClickhouseWriterStep` and `ClickhouseClient`, and expand it as `INSERT INTO t (col1, ...) FORMAT RowBinary`. JSONEachRow stays column-list-less. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/xYvkk5ydje_vV3XmhCpsmSV44ohsxFeKHfKRg-ZUaIk
ClickhouseWriterStep::new took 7 args before the column-list parameter landed; the 8th tipped it over the clippy::too_many_arguments threshold (set repo-wide via -D warnings) and broke "Linting - Rust" in CI. The arguments map 1:1 to ClickhouseClient/ClickhouseConfig knobs the factory already owns separately, so bundling them just to satisfy the lint would add an indirection without a real abstraction. Allow the lint at the call site instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The integration test's `assert!(status.is_success())` for the read-back SELECT was hiding ClickHouse's error message — only the assertion's static "Select failed" string surfaced in CI logs. Capture the status and response body up front and include both in the assertion message so the next CI failure tells us what ClickHouse actually rejected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/fQtr5P5jc3905fJGpnXjV3rTvV3OFF6pHP9bx72Meo4
ClickHouse 25.x rejects bodyless POST requests with HTTP 411
HTTP_LENGTH_REQUIRED ("Transfer-Encoding is not chunked and there is
no Content-Length header for POST request"). reqwest only emits
Content-Length when a body is set, so POST + URL-encoded query +
no body trips the check.
GET fits the SELECT semantically anyway. For the ALTER TABLE cleanup
keep POST but attach an empty body so reqwest emits Content-Length: 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The JSON processor builds its batch via `InsertBatch::from_rows(...)?` and then sets `cogs_data` / `item_type_metrics` on the returned value. The RowBinary processor was hand-constructing the `InsertBatch` literal because there was no parallel "I already have the wire bytes" constructor. The two paths now look the same: build the batch, set the two metric fields, return. Pure refactor — no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

The RowBinary consumer kept
Vec<EAPItemRow>alive in every batch fromthe processor all the way to the writer. Each row carries ~80 attribute
buckets and their String allocations, so consumers ran with 2-3× the
in-flight memory of the JSONEachRow path and
clickhouse-rs'sInsertbuilder double-buffered the encoded bytes during the write itself.
Each row is now serialized to RowBinary bytes inside
process_message_row_binaryand the typed struct drops on the spot;the pipeline carries only
RowData { encoded_rows: Vec<u8>, num_rows }the rest of the way. The writer is the same
ClickhouseWriterSteptheJSON path uses, just with
FORMAT RowBinary.Vendored RowBinary serializer
clickhouse-rs's serializer ispub(crate)andInsert::writeties itto an HTTP request, so there is no public surface to call it as
"serialize one row into a
Vec<u8>". The newstrategies/clickhouse/rowbinary/module is a ~280-line serdeSerializercovering exactly whatEAPItemRowemits (LE primitives,LEB128 string/seq length prefixes, raw-byte UUID with the half-reversal
ClickHouse's UUID type expects), with unit tests for each case.
Pipeline collapses to one shape
use_row_binaryno longer takes a separate pipeline branch.factory_v2.rspicks the processor function and theInsertFormatonce and the standard
Reduce → ClickhouseWriterStepflow handlesboth.
TypedInsertBatch,EstimatedSize, theBytesInsertBatch<Vec<T>>impl,
make_rust_processor_row_binary, andClickhouseRowBinaryWriterStepare gone.clickhouse-rsdependency droppedThe integration test (
test_row_binary_clickhouse_insert) now exercisesthe production path: it takes the processor's encoded bytes and POSTs
them with
reqwest, then reads back viaFORMAT JSON. With that, thecrate is removed from
Cargo.tomland ~95 lines disappear fromCargo.lock.Agent transcript: https://claudescope.sentry.dev/share/ocWUXnBe7IomlfhNcOgtWgTf1Za2xq3Y5XMtqWqjSuo