chore: Refactoring all_fields iterator. Using OwnedTargetPath instead of KeyString.#25059
chore: Refactoring all_fields iterator. Using OwnedTargetPath instead of KeyString.#25059lololozhkin wants to merge 6 commits intovectordotdev:masterfrom
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
Hi @lololozhkin, please follow the instructions above and we will take a look. @codex review |
|
Codex Review: Didn't find any major issues. Chef's kiss. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Please sign the CLA and we will take a look at the PR afterwards, thanks. |
|
recheck |
|
I have read the CLA Document and I hereby sign the CLA |
|
recheck |
|
@pront done! Sorry for long wait, didn't understand where should I leave comment :) |
There was a problem hiding this comment.
Thank you for this PR @lololozhkin! With this PR, paths returned by the iterator can be used directly for lookups/inserts without re-parsing.
VRL's serialize_field wraps any field containing characters outside
[A-Za-z0-9_@] in quotes and escapes inner " / \ chars. So a field like my-field becomes "my-field" (with literal quote chars) when .to_string() is called.
This matters in three places:
-
InfluxDB sink
(logs.rs#L293-L300) calls
key.path.to_string()directly. The old code emitted the rawKeyStringwithout quoting. This is a wire-format change that would silently break
downstream queries matching on field names with special chars. -
Honeycomb sink (encoder.rs#L42) passes
log.convert_to_fields()intoserde_json::json!(), which hits the newSerializeimpl that also calls.to_string()for event-prefix keys. Same quoting issue for JSON keys. -
FieldsIterSerialize impl ([all_fields.rs#L176-L182](https://github.com/lololozhkin/vector/blob/34078ba63a38b48332cd91b0aaeda1087ab4887c/lib/
vector-core/src/event/util/log/all_fields.rs#L176-L182)) is a third place building flat string keys, with the same quoting behavior.
The New Relic sink handles this correctly with [build_unquoted_value_path](https://github.com/lololozhkin/vector/blob/34078ba63a38b48332cd91b0aaeda
1087ab4887c/src/sinks/new_relic/model.rs#L280-L294), but it's local to that sink. Since InfluxDB, Honeycomb (via Serialize), and New Relic all need the same "flat unquoted path string" logic, it would be worth centralizing this helper (the TODO in the code already suggests doing it in VRL code, but it's fine to put it a util in Vector for now) and reusing it across all three sites.
One small thing on the helper: the match only handles Field and Index. An explicit _ => unreachable!() arm would guard against future OwnedSegment variants.
| fn make_path(&mut self, component: PathComponent<'a>) -> OwnedTargetPath { | ||
| let segments = self | ||
| .path | ||
| .iter() | ||
| .chain(iter::once(&component)) | ||
| .map(|val| match val { | ||
| PathComponent::Key(key_string) => OwnedSegment::Field((*key_string).to_owned()), | ||
| PathComponent::Index(idx) => OwnedSegment::Index(*idx as isize), | ||
| }) | ||
| .collect(); | ||
|
|
||
| OwnedTargetPath { | ||
| prefix: self.path_prefix, | ||
| path: OwnedValuePath { segments }, |
There was a problem hiding this comment.
The problem
Currently make_path iterates self.path and calls .to_owned() on every ancestor field name for every leaf. For a field at depth D, that's D string clones per leaf.
Perf optimization idea
Maintain a Vec<OwnedSegment> on the iterator, pushing/popping in tandem with push() / pop(). When we hit a leaf, we can do:
self.segments.push(leaf_segment);
let path = OwnedTargetPath {
prefix: self.path_prefix,
path: OwnedValuePath { segments: self.segments.clone() },
};
self.segments.pop();Would help reduce allocation in hot paths that require iterating every field of every event.
There was a problem hiding this comment.
Wait, cannot understand. Check my logic for validity: vec.clone() will clone all the paths in the same way. It will call clone() on each element, right? So for each leaf we will copy all the vec, and all the D predecessor, right? Moreover, we will clone each key one more time in that vec.
As i believe, real economy could be achieved if we will store and return not owned versions of paths. But that's a lot of work...
Please, tell me if my thoughts are not correct, I really want to understand.
|
@pront thanks for your reply! I'll fix your comments. |
|
@pront Hi again! I have some questions about your reply. First of all, need to mention, that i've tried to not change the logic of serialization/deserialization of the fields. I've used docs and code as a source of truth. The questions are:y
|
Hey @lololozhkin, I think you are right about
I agree, let's make the quoting decision visible at the call site instead of hidden in the iterator. |
Summary
all_fields iterator now emits OwnedTargetPath instead of KeyString.
Vector configuration
--
How did you test this PR?
Run all the tests: both unit and integration. Added some more tests myself.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References