Skip to content

Support nested sourced_from fields end-to-end#1252

Draft
ellisandrews-toast wants to merge 1 commit into
mainfrom
nested-sourced-from-script-and-acceptance
Draft

Support nested sourced_from fields end-to-end#1252
ellisandrews-toast wants to merge 1 commit into
mainfrom
nested-sourced-from-script-and-acceptance

Conversation

@ellisandrews-toast

Copy link
Copy Markdown
Collaborator

Still WIP, will fill this in later.

Comment on lines +4 to +26
String encodeKey(List parts) {
StringBuilder sb = new StringBuilder();
for (String part : parts) {
sb.append(part.length());
sb.append(':');
sb.append(part);
}
return sb.toString();
}

// Inverse of `encodeKey`.
List decodeKey(String key) {
List parts = new ArrayList();
int i = 0;
while (i < key.length()) {
int colonPos = key.indexOf(":", i);
int length = Integer.parseInt(key.substring(i, colonPos));
int valueStart = colonPos + 1;
parts.add(key.substring(valueStart, valueStart + length));
i = valueStart + length;
}
return parts;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String encodeKey(List parts) {
StringBuilder sb = new StringBuilder();
for (String part : parts) {
sb.append(part.length());
sb.append(':');
sb.append(part);
}
return sb.toString();
}
// Inverse of `encodeKey`.
List decodeKey(String key) {
List parts = new ArrayList();
int i = 0;
while (i < key.length()) {
int colonPos = key.indexOf(":", i);
int length = Integer.parseInt(key.substring(i, colonPos));
int valueStart = colonPos + 1;
parts.add(key.substring(valueStart, valueStart + length));
i = valueStart + length;
}
return parts;
}
String encodeKey(List parts) {
StringBuilder sb = new StringBuilder();
for (String part : parts) {
sb.append(part.length()).append(':').append(part).append('|');
}
return sb.toString();
}
List decodeKey(String key) {
List parts = new ArrayList();
int i = 0;
while (i < key.length()) {
int colonPos = key.indexOf(":", i);
int length = Integer.parseInt(key.substring(i, colonPos));
int valueStart = colonPos + 1;
parts.add(key.substring(valueStart, valueStart + length));
i = valueStart + length + 1; // +1 skips the '|' terminator
}
return parts;
}

Consider a list of parts like [widget-42, parts, gizmo-7]. With your algorithm, it would encode as:

9:widget-425:parts7:gizmo-7

the 425 runs together and looks like one number even though the 42 is part of the widget-42 identifier and the 5 is the length of the next part. So it's hard to read/scan at a glance.

My suggestion above adds | as a separator, so that this becomes:

9:widget-42|5:parts|7:gizmo-7|

...which is way easier to read/scan. It's worth noting that | could show up with an identifier, but that still works just fine even though it looks odd to a human:

10:wid|get-42|5:parts|7:gizmo-7|

That's unlikely to happen in practice though--| isn't a common character in identifiers.

The alternative is to split/join on a character like | but that adds a lot of complexity because you have to deal with escaping and painless disable regex support by default. Using length-prefixing is super simple.

@myronmarston myronmarston changed the base branch from main to myron/nested-sourced-from-prep June 12, 2026 22:35
@myronmarston myronmarston force-pushed the nested-sourced-from-script-and-acceptance branch from f4549d9 to 4e853ea Compare June 12, 2026 22:35
Base automatically changed from myron/nested-sourced-from-prep to main June 12, 2026 22:57
myronmarston added a commit that referenced this pull request Jun 12, 2026
…om fields. (#1255)

This is the parts of #1252 which are ready to merge.

Co-authored-by: ellisandrews-toast <ellis.andrews@toasttab.com>
@myronmarston myronmarston force-pushed the nested-sourced-from-script-and-acceptance branch from 4e853ea to b4b5815 Compare June 12, 2026 23:15

@myronmarston myronmarston left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good @ellisandrews-toast! Left some feedback.

"id" => "17",
"sourcedFromNestedFields" => {},
"sourcedFromNestedPathIdentifiers" => {},
"sourcedFromNestedPaths" => {},

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a test in which these sourcedFromNested* params are non-empty so that the tests can't be satisfied by the implementation hard-coding {}.


# Verify that it does not have a property for `size` or `options.size`
expect(mapping.fetch("properties").keys).to contain_exactly("id", "options", "__sources", "__versions", "__typename")
expect(mapping.fetch("properties").keys).to contain_exactly("id", "options", "__nested_sourced_data", "__sources", "__versions", "__typename")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update this spec on line 105?

it "includes a `__versions` property (so update scripts can maintain the versions) and a `__sources` property (so that we can filter on present sources)" do
mapping = index_mapping_for "my_type" do |s|
s.object_type "MyType" do |t|
t.field "id", "ID"
t.index "my_type"
end
end
expect(mapping.dig("properties")).to include({
"id" => {"type" => "keyword"},
"__versions" => {"dynamic" => "false", "type" => "object"},
"__sources" => {"type" => "keyword"}
})
end

}
List parts = decodeKey(nestedElementKey);
parts.add(0, relationship);
return encodeKey(parts);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little inefficient to decode nestedElementKey only to add a prefix to it since it requires re-encoding the full thing. To make it more efficient, can you you extract an appendKeyPart helper function? I'm imagining something like this:

diff --git a/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless b/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless
index b99baa9d..e34f6757 100644
--- a/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless
+++ b/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/scripting/scripts/update/index_data.painless
@@ -1,12 +1,17 @@
 // --- Helper Functions --- //
 
+void appendKeyPart(StringBuider sb, String part) {
+  sb.append(part.length());
+  sb.append(':');
+  sb.append(part);
+  sb.append('|');
+}
+
 // Encodes a list of strings into an unambiguous, length-prefixed string ("len:value" parts concatenated).
 String encodeKey(List parts) {
   StringBuilder sb = new StringBuilder();
   for (String part : parts) {
-    sb.append(part.length());
-    sb.append(':');
-    sb.append(part);
+    appendKeyPart(sb, part);
   }
   return sb.toString();
 }
@@ -49,9 +54,11 @@ String buildVersionsKey(String relationship, String nestedElementKey) {
   if (nestedElementKey.isEmpty()) {
     return relationship;
   }
-  List parts = decodeKey(nestedElementKey);
-  parts.add(0, relationship);
-  return encodeKey(parts);
+
+  StringBuilder sb = new StringBuilder();
+  appendKeyPart(sb, relationship);
+  sb.append(nestedElementKey);
+  return sb.toString();
 }
 
 // Finds the element of `elements` whose `id` equals `matchValue`, or null.

// --- Main script body --- //
Map source = ctx._source;
String id = params.id;
String relationship = params.relationship;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will relationship here wind up being the qualified nested relationship for a nested sourced_from event? That seems important to get right.

source.__nested_sourced_data = [:];
}
if (source.__nested_sourced_data[relationship] == null) {
source.__nested_sourced_data[relationship] = [:];

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little funny that you check if nestedElementKey is non-empty, but then put relationship in the map rather than nestedElementKey. I see why: nestedElementKey doesn't have relationship in it. But versionsKey does (it's basically nestedElementKey prefixed with relationship, right?). The result is a bit of a mismatch between __versions and __nested_sourced_data:

  • Both are keyed by the relationship + nestedElementKey
  • ...but __versions is flattened with relationship and nestedElementKey encoded into one string key where as __nested_sourced_data has nested maps (outer map uses relationship key, inner map uses nestedElementKey).

I think there's an opportunity to simplify here: can we use the combined encoded key (currently called versionsKey, but will have to be renamed) for both things?

If you did that, then this function wouldn't need the 3 separate String versionsKey, String relationship, String nestedElementKey args (something which is a bit duplicative since versionsKey contains the latter two!)--it could just be one.

That feels simpler and more consistent to me.

OTOH, the nested map approach (with outer relationship key) supports a lookup of all the nested data for a relationship under a single relationship key--something which can't efficiently be done if we use an encoded compound key. I'm not sure if we need that, though.

List pathSegments = (List) sourcedFromNestedPaths.get(relationship);

if (pathSegments == null) {
continue;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This situation kinda feels like an error, and not something to silently ignore. Should we throw in that case? If __nested_sourced_data has data for a relationship and sourcedFromNestedPaths doesn't have the nested paths for that relationship, it seems like something has gone wrong...

Although, now that I think about it, maybe not: here's a case I can think of where this could occur:

  • A Team index has a nested roster.players.statLine relationship
  • At some point that relationship (and the sourced_from Player.stats field) gets dropped from the schema
  • The relationship no longer exists, no data is coming in for it, and the fields aren't exposed in GraphQL, but the previously stored source data is still there. sourcedFromNestedPaths would not have an entry for the relationship and it's correct to skip the sourced data.

Still, it would be nice to surface via a logged warning, or something...not sure if that's possible, though.

Regardless, can you document this case (and why you skip) via a comment here?

// Finds the element of `elements` whose `id` equals `matchValue`, or null.
def findInList(List elements, String matchValue) {
for (Map element : elements) {
if (matchValue.equals(element.id)) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (matchValue.equals(element.id)) {
if (matchValue == element.id) {

As per https://www.elastic.co/docs/reference/scripting-languages/painless/painless-operators-boolean, I think == just calls equals but IMO it's easier to read with the operator.

}

if (segment.containsKey("sourceField")) {
current = (Map) findInList((List) current.get(field), (String) keyParts.get(i));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it odd that this branch is triggered when segment has a sourceField key--but then it doesn't actually do segment.get("sourceField"). Should sourceField be used here some how? Or is this a sine that we don't actually need sourceField? If we're just using it as a signal to tell if the segment is an object or a list, we could make that more explicit via segmentType: [list or map].


if (current == null) {
return null;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
def subField = current.get(field);
if (subField instanceof List) {
current = (Map) findInList((List) subField, (String) keyParts.get(i));
} else if (subField instanceof Map) {
current = subField;
} else {
current = null;
}
if (current == null) {
return null;
}

I think this is a bit simpler? It also feels more correct to check the subField for what it's type is (Map vs List) rather than inferring it from the presence of sourceField in segment.

Thoughts?


// Navigates `source` through `pathSegments` to the target nested element, or null if any hop is missing.
// `keyParts` has one entry per segment (aligned by index): a list element's matched id, or an object field name.
def navigateToNestedElement(Map source, List pathSegments, List keyParts) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def navigateToNestedElement(Map source, List pathSegments, List keyParts) {
def extractNestedElement(Map source, List pathSegments, List keyParts) {

/nit to me, "navigate" has the sense of movement--like the script context is updating itself to move to the inner nested element. But this feels like a function that just extracts and returns the inner element--not movement in a meaningful sense.

if (source.__nested_sourced_data == null) {
source.__nested_sourced_data = [:];
}
if (source.__nested_sourced_data[relationship] == null) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude pointed this out:

storeNestedSourcedData early-returns on empty fields, but setup still created the relationship bucket

setup creates source.__nested_sourced_data[relationship] = [:] whenever nestedElementKey is non-empty, but storeNestedSourcedData returns early when sourcedFromNestedFields.isEmpty(). Net effect: an empty {} relationship map can be persisted into _source with no entries. Harmless (just a tiny empty object), but slightly inconsistent — the bucket is created even when nothing is ever stored in it. Minor; not worth blocking.

Maybe worth addressing?

}

if (current == null) {
return null;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 (Claude here — this is my analysis, building on the instanceof suggestion above.)

Worth taking the instanceof subField suggestion for a correctness reason, not just readability. Navigation here relies on keyParts staying index-aligned with pathSegments. That alignment exists only because buildNestedElementKey adds a part for every segment (object segments contribute the field name, list segments the id).

But the Ruby side does the opposite: path_identifier_params (resolved_relationship_chain.rb:74) filter_maps and skips object segments. So there are two representations with different shapes — identifiers are list-only, the element key is all-segments — and this navigation silently depends on the all-segments one.

Keying off subField instanceof List/Map removes the dependence on segment shape entirely: navigation reads the actual data, so a future change to buildNestedElementKey (e.g. skipping object segments to mirror path_identifier_params) could not silently break index alignment here.

((Map) source.__nested_sourced_data[relationship]).put(nestedElementKey, sourcedFromNestedFields);
}

// Re-applies all buffered nested sourced data to its target elements. Runs on every event so that a

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 (Claude here — my own analysis, following from the put/deletion thread above.)

I think this isEmpty() guard is effectively dead code for nested events. sourcedFromNestedFields comes from field_params.transform_values { … } (sourced_from_nested_params.rb:43), and a nested update target always has ≥1 sourced field, so the keys are always present. An upstream deletion yields {"stats" => null} — a non-empty map with a null value — not an empty map. So for any real nested event this is never empty, and the guard never trips.

(That non-emptiness is also exactly why deletions propagate correctly: put overwrites with {"stats" => null} and the later putAll clears the field on the target. So the deletion concern raised elsewhere is a non-issue.)

If there is a scenario where this is empty I'm not seeing it — otherwise per the repo guidance to avoid defensive code for impossible cases (fail fast instead), this guard could go.

}
}

void recordSource(Map source, String versionsKey, String relationship, String sourceId, long eventVersion) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 (Claude here — my own analysis; a lifecycle gap broader than the schema-drop case noted on applyNestedSourcedData above.)

There is no path that ever removes entries from __nested_sourced_data, which I think leads to unbounded growth in a routine case (not just the schema-drop one):

  • A self-event replaces the roster.players array and a previously-sourced player is dropped.
  • The buffer entry keyed by that player's id is now orphaned.
  • navigateToNestedElementfindInList returns null → target == null → skipped here. No crash, good.
  • But the orphaned entry stays in _source forever and gets re-navigated on every future event.

For a document whose nested collections churn (players traded, line items added/removed), this map grows without bound and adds work to every update. At minimum worth documenting as a known limitation; ideally prune the entry when navigation returns null here (i.e. else branch removes the key).

Related: nested-element re-targeting (same source entity now pointing at a different element) won't be caught by validateSource's "relationship changed" guard, since each element gets its own versionsKey. Probably an accepted sourced_from mutation limitation, but worth confirming it's intended for the nested case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants