Support nested sourced_from fields end-to-end#1252
Conversation
| String encodeKey(List parts) { | ||
| StringBuilder sb = new StringBuilder(); | ||
| for (String part : parts) { | ||
| sb.append(part.length()); | ||
| sb.append(':'); | ||
| sb.append(part); | ||
| } | ||
| return sb.toString(); | ||
| } | ||
|
|
||
| // Inverse of `encodeKey`. | ||
| List decodeKey(String key) { | ||
| List parts = new ArrayList(); | ||
| int i = 0; | ||
| while (i < key.length()) { | ||
| int colonPos = key.indexOf(":", i); | ||
| int length = Integer.parseInt(key.substring(i, colonPos)); | ||
| int valueStart = colonPos + 1; | ||
| parts.add(key.substring(valueStart, valueStart + length)); | ||
| i = valueStart + length; | ||
| } | ||
| return parts; | ||
| } |
There was a problem hiding this comment.
| String encodeKey(List parts) { | |
| StringBuilder sb = new StringBuilder(); | |
| for (String part : parts) { | |
| sb.append(part.length()); | |
| sb.append(':'); | |
| sb.append(part); | |
| } | |
| return sb.toString(); | |
| } | |
| // Inverse of `encodeKey`. | |
| List decodeKey(String key) { | |
| List parts = new ArrayList(); | |
| int i = 0; | |
| while (i < key.length()) { | |
| int colonPos = key.indexOf(":", i); | |
| int length = Integer.parseInt(key.substring(i, colonPos)); | |
| int valueStart = colonPos + 1; | |
| parts.add(key.substring(valueStart, valueStart + length)); | |
| i = valueStart + length; | |
| } | |
| return parts; | |
| } | |
| String encodeKey(List parts) { | |
| StringBuilder sb = new StringBuilder(); | |
| for (String part : parts) { | |
| sb.append(part.length()).append(':').append(part).append('|'); | |
| } | |
| return sb.toString(); | |
| } | |
| List decodeKey(String key) { | |
| List parts = new ArrayList(); | |
| int i = 0; | |
| while (i < key.length()) { | |
| int colonPos = key.indexOf(":", i); | |
| int length = Integer.parseInt(key.substring(i, colonPos)); | |
| int valueStart = colonPos + 1; | |
| parts.add(key.substring(valueStart, valueStart + length)); | |
| i = valueStart + length + 1; // +1 skips the '|' terminator | |
| } | |
| return parts; | |
| } |
Consider a list of parts like [widget-42, parts, gizmo-7]. With your algorithm, it would encode as:
9:widget-425:parts7:gizmo-7
the 425 runs together and looks like one number even though the 42 is part of the widget-42 identifier and the 5 is the length of the next part. So it's hard to read/scan at a glance.
My suggestion above adds | as a separator, so that this becomes:
9:widget-42|5:parts|7:gizmo-7|
...which is way easier to read/scan. It's worth noting that | could show up with an identifier, but that still works just fine even though it looks odd to a human:
10:wid|get-42|5:parts|7:gizmo-7|
That's unlikely to happen in practice though--| isn't a common character in identifiers.
The alternative is to split/join on a character like | but that adds a lot of complexity because you have to deal with escaping and painless disable regex support by default. Using length-prefixing is super simple.
f4549d9 to
4e853ea
Compare
4e853ea to
b4b5815
Compare
myronmarston
left a comment
There was a problem hiding this comment.
This is looking good @ellisandrews-toast! Left some feedback.
| "id" => "17", | ||
| "sourcedFromNestedFields" => {}, | ||
| "sourcedFromNestedPathIdentifiers" => {}, | ||
| "sourcedFromNestedPaths" => {}, |
There was a problem hiding this comment.
It would be good to add a test in which these sourcedFromNested* params are non-empty so that the tests can't be satisfied by the implementation hard-coding {}.
|
|
||
| # Verify that it does not have a property for `size` or `options.size` | ||
| expect(mapping.fetch("properties").keys).to contain_exactly("id", "options", "__sources", "__versions", "__typename") | ||
| expect(mapping.fetch("properties").keys).to contain_exactly("id", "options", "__nested_sourced_data", "__sources", "__versions", "__typename") |
There was a problem hiding this comment.
Can you also update this spec on line 105?
| } | ||
| List parts = decodeKey(nestedElementKey); | ||
| parts.add(0, relationship); | ||
| return encodeKey(parts); |
There was a problem hiding this comment.
It's a little inefficient to decode nestedElementKey only to add a prefix to it since it requires re-encoding the full thing. To make it more efficient, can you you extract an appendKeyPart helper function? I'm imagining something like this:
diff --git a/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless b/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless
index b99baa9d..e34f6757 100644
--- a/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless
+++ b/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless
@@ -1,12 +1,17 @@
// --- Helper Functions --- //
+void appendKeyPart(StringBuider sb, String part) {
+ sb.append(part.length());
+ sb.append(':');
+ sb.append(part);
+ sb.append('|');
+}
+
// Encodes a list of strings into an unambiguous, length-prefixed string ("len:value" parts concatenated).
String encodeKey(List parts) {
StringBuilder sb = new StringBuilder();
for (String part : parts) {
- sb.append(part.length());
- sb.append(':');
- sb.append(part);
+ appendKeyPart(sb, part);
}
return sb.toString();
}
@@ -49,9 +54,11 @@ String buildVersionsKey(String relationship, String nestedElementKey) {
if (nestedElementKey.isEmpty()) {
return relationship;
}
- List parts = decodeKey(nestedElementKey);
- parts.add(0, relationship);
- return encodeKey(parts);
+
+ StringBuilder sb = new StringBuilder();
+ appendKeyPart(sb, relationship);
+ sb.append(nestedElementKey);
+ return sb.toString();
}
// Finds the element of `elements` whose `id` equals `matchValue`, or null.
| // --- Main script body --- // | ||
| Map source = ctx._source; | ||
| String id = params.id; | ||
| String relationship = params.relationship; |
There was a problem hiding this comment.
Will relationship here wind up being the qualified nested relationship for a nested sourced_from event? That seems important to get right.
| source.__nested_sourced_data = [:]; | ||
| } | ||
| if (source.__nested_sourced_data[relationship] == null) { | ||
| source.__nested_sourced_data[relationship] = [:]; |
There was a problem hiding this comment.
It's a little funny that you check if nestedElementKey is non-empty, but then put relationship in the map rather than nestedElementKey. I see why: nestedElementKey doesn't have relationship in it. But versionsKey does (it's basically nestedElementKey prefixed with relationship, right?). The result is a bit of a mismatch between __versions and __nested_sourced_data:
- Both are keyed by the relationship + nestedElementKey
- ...but
__versionsis flattened with relationship and nestedElementKey encoded into one string key where as__nested_sourced_datahas nested maps (outer map usesrelationshipkey, inner map usesnestedElementKey).
I think there's an opportunity to simplify here: can we use the combined encoded key (currently called versionsKey, but will have to be renamed) for both things?
If you did that, then this function wouldn't need the 3 separate String versionsKey, String relationship, String nestedElementKey args (something which is a bit duplicative since versionsKey contains the latter two!)--it could just be one.
That feels simpler and more consistent to me.
OTOH, the nested map approach (with outer relationship key) supports a lookup of all the nested data for a relationship under a single relationship key--something which can't efficiently be done if we use an encoded compound key. I'm not sure if we need that, though.
| List pathSegments = (List) sourcedFromNestedPaths.get(relationship); | ||
|
|
||
| if (pathSegments == null) { | ||
| continue; |
There was a problem hiding this comment.
This situation kinda feels like an error, and not something to silently ignore. Should we throw in that case? If __nested_sourced_data has data for a relationship and sourcedFromNestedPaths doesn't have the nested paths for that relationship, it seems like something has gone wrong...
Although, now that I think about it, maybe not: here's a case I can think of where this could occur:
- A
Teamindex has a nestedroster.players.statLinerelationship - At some point that relationship (and the
sourced_fromPlayer.statsfield) gets dropped from the schema - The relationship no longer exists, no data is coming in for it, and the fields aren't exposed in GraphQL, but the previously stored source data is still there.
sourcedFromNestedPathswould not have an entry for the relationship and it's correct to skip the sourced data.
Still, it would be nice to surface via a logged warning, or something...not sure if that's possible, though.
Regardless, can you document this case (and why you skip) via a comment here?
| // Finds the element of `elements` whose `id` equals `matchValue`, or null. | ||
| def findInList(List elements, String matchValue) { | ||
| for (Map element : elements) { | ||
| if (matchValue.equals(element.id)) { |
There was a problem hiding this comment.
| if (matchValue.equals(element.id)) { | |
| if (matchValue == element.id) { |
As per https://www.elastic.co/docs/reference/scripting-languages/painless/painless-operators-boolean, I think == just calls equals but IMO it's easier to read with the operator.
| } | ||
|
|
||
| if (segment.containsKey("sourceField")) { | ||
| current = (Map) findInList((List) current.get(field), (String) keyParts.get(i)); |
There was a problem hiding this comment.
I find it odd that this branch is triggered when segment has a sourceField key--but then it doesn't actually do segment.get("sourceField"). Should sourceField be used here some how? Or is this a sine that we don't actually need sourceField? If we're just using it as a signal to tell if the segment is an object or a list, we could make that more explicit via segmentType: [list or map].
|
|
||
| if (current == null) { | ||
| return null; | ||
| } |
There was a problem hiding this comment.
| } | |
| def subField = current.get(field); | |
| if (subField instanceof List) { | |
| current = (Map) findInList((List) subField, (String) keyParts.get(i)); | |
| } else if (subField instanceof Map) { | |
| current = subField; | |
| } else { | |
| current = null; | |
| } | |
| if (current == null) { | |
| return null; | |
| } |
I think this is a bit simpler? It also feels more correct to check the subField for what it's type is (Map vs List) rather than inferring it from the presence of sourceField in segment.
Thoughts?
|
|
||
| // Navigates `source` through `pathSegments` to the target nested element, or null if any hop is missing. | ||
| // `keyParts` has one entry per segment (aligned by index): a list element's matched id, or an object field name. | ||
| def navigateToNestedElement(Map source, List pathSegments, List keyParts) { |
There was a problem hiding this comment.
| def navigateToNestedElement(Map source, List pathSegments, List keyParts) { | |
| def extractNestedElement(Map source, List pathSegments, List keyParts) { |
/nit to me, "navigate" has the sense of movement--like the script context is updating itself to move to the inner nested element. But this feels like a function that just extracts and returns the inner element--not movement in a meaningful sense.
| if (source.__nested_sourced_data == null) { | ||
| source.__nested_sourced_data = [:]; | ||
| } | ||
| if (source.__nested_sourced_data[relationship] == null) { |
There was a problem hiding this comment.
Claude pointed this out:
storeNestedSourcedDataearly-returns on empty fields, but setup still created the relationship bucket
setupcreatessource.__nested_sourced_data[relationship] = [:]whenevernestedElementKeyis non-empty, butstoreNestedSourcedDatareturns early whensourcedFromNestedFields.isEmpty(). Net effect: an empty{}relationship map can be persisted into_sourcewith no entries. Harmless (just a tiny empty object), but slightly inconsistent — the bucket is created even when nothing is ever stored in it. Minor; not worth blocking.
Maybe worth addressing?
| } | ||
|
|
||
| if (current == null) { | ||
| return null; |
There was a problem hiding this comment.
🤖 (Claude here — this is my analysis, building on the instanceof suggestion above.)
Worth taking the instanceof subField suggestion for a correctness reason, not just readability. Navigation here relies on keyParts staying index-aligned with pathSegments. That alignment exists only because buildNestedElementKey adds a part for every segment (object segments contribute the field name, list segments the id).
But the Ruby side does the opposite: path_identifier_params (resolved_relationship_chain.rb:74) filter_maps and skips object segments. So there are two representations with different shapes — identifiers are list-only, the element key is all-segments — and this navigation silently depends on the all-segments one.
Keying off subField instanceof List/Map removes the dependence on segment shape entirely: navigation reads the actual data, so a future change to buildNestedElementKey (e.g. skipping object segments to mirror path_identifier_params) could not silently break index alignment here.
| ((Map) source.__nested_sourced_data[relationship]).put(nestedElementKey, sourcedFromNestedFields); | ||
| } | ||
|
|
||
| // Re-applies all buffered nested sourced data to its target elements. Runs on every event so that a |
There was a problem hiding this comment.
🤖 (Claude here — my own analysis, following from the put/deletion thread above.)
I think this isEmpty() guard is effectively dead code for nested events. sourcedFromNestedFields comes from field_params.transform_values { … } (sourced_from_nested_params.rb:43), and a nested update target always has ≥1 sourced field, so the keys are always present. An upstream deletion yields {"stats" => null} — a non-empty map with a null value — not an empty map. So for any real nested event this is never empty, and the guard never trips.
(That non-emptiness is also exactly why deletions propagate correctly: put overwrites with {"stats" => null} and the later putAll clears the field on the target. So the deletion concern raised elsewhere is a non-issue.)
If there is a scenario where this is empty I'm not seeing it — otherwise per the repo guidance to avoid defensive code for impossible cases (fail fast instead), this guard could go.
| } | ||
| } | ||
|
|
||
| void recordSource(Map source, String versionsKey, String relationship, String sourceId, long eventVersion) { |
There was a problem hiding this comment.
🤖 (Claude here — my own analysis; a lifecycle gap broader than the schema-drop case noted on applyNestedSourcedData above.)
There is no path that ever removes entries from __nested_sourced_data, which I think leads to unbounded growth in a routine case (not just the schema-drop one):
- A self-event replaces the
roster.playersarray and a previously-sourced player is dropped. - The buffer entry keyed by that player's id is now orphaned.
navigateToNestedElement→findInListreturns null →target == null→ skipped here. No crash, good.- But the orphaned entry stays in
_sourceforever and gets re-navigated on every future event.
For a document whose nested collections churn (players traded, line items added/removed), this map grows without bound and adds work to every update. At minimum worth documenting as a known limitation; ideally prune the entry when navigation returns null here (i.e. else branch removes the key).
Related: nested-element re-targeting (same source entity now pointing at a different element) won't be caught by validateSource's "relationship changed" guard, since each element gets its own versionsKey. Probably an accepted sourced_from mutation limitation, but worth confirming it's intended for the nested case.
Still WIP, will fill this in later.