Ingest: Add post-indexing content date resolution #3112
HashedBulkUpdate uses bulk update actions (scripted upserts), which skip Elasticsearch ingest pipelines, so content_last_updated was never set during normal indexing.

This adds a ResolveContentDatesAsync step that runs _update_by_query with the enrichment pipeline after indexing completes, and switches StopAsync to use read aliases instead of the write target (which is removed after CompleteAsync).

Includes integration tests against a real Elasticsearch container validating cold-start, date preservation, change detection, and the bulk-update pipeline gap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
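For readers unfamiliar with the mechanism, here is a minimal sketch of the request shape such a step might send. It only builds the request and does not talk to a cluster. The alias and pipeline names are hypothetical, and the match_all query is an assumption; the actual ResolveContentDatesAsync implementation is not shown on this page and may differ.

```python
from urllib.parse import urlencode


def build_update_by_query_request(alias: str, pipeline: str) -> tuple[str, str, dict]:
    """Return (method, path, body) for an _update_by_query routed through an ingest pipeline."""
    # `pipeline` is a query-string parameter of _update_by_query: each matched
    # document is rewritten in place after passing through that ingest pipeline.
    path = f"/{alias}/_update_by_query?{urlencode({'pipeline': pipeline})}"
    # Assumption: touch every document; the real step may use a narrower query.
    body = {"query": {"match_all": {}}}
    return "POST", path, body


method, path, body = build_update_by_query_request(
    "docs-lexical-read",        # hypothetical read alias
    "content-date-enrichment",  # hypothetical pipeline ID
)
print(method, path)
# POST /docs-lexical-read/_update_by_query?pipeline=content-date-enrichment
```

Targeting an alias rather than a concrete index name keeps the request valid after the write target has been removed, which is the reason the description gives for switching to read aliases.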
Walkthrough
This change adds post-indexing content date enrichment functionality to the Elasticsearch exporter.
Sequence Diagram
sequenceDiagram
participant Exporter as ElasticsearchMarkdownExporter
participant Enrichment as ContentDateEnrichment
participant ES as Elasticsearch Cluster
Exporter->>Exporter: StartAsync: Store read aliases<br/>(_lexicalReadAlias, _semanticReadAlias)
Note over Exporter,ES: Indexing occurs...
Exporter->>Exporter: StopAsync begins
Exporter->>Enrichment: ResolveContentDatesAsync<br/>(lexical alias)
Enrichment->>ES: _update_by_query on lexical alias<br/>with enrichment pipeline
ES-->>Enrichment: Apply pipeline, resolve dates
Exporter->>Enrichment: ResolveContentDatesAsync<br/>(semantic alias)
Enrichment->>ES: _update_by_query on semantic alias<br/>with enrichment pipeline
ES-->>Enrichment: Apply pipeline, resolve dates
Exporter->>Exporter: SyncLookupIndexAsync<br/>using lexical read alias
Exporter->>ES: Sync lookup index state
ES-->>Exporter: Lookup index updated
Exporter->>Exporter: StopAsync completes
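The ordering in the diagram above can be sketched roughly as follows. This is an illustrative model only: RecordingClient, stop, and the alias and pipeline names are hypothetical stand-ins, not the exporter's actual types, and the client records calls instead of contacting Elasticsearch.

```python
class RecordingClient:
    """Hypothetical stand-in for an Elasticsearch client; records calls in order."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def update_by_query(self, index: str, pipeline: str) -> None:
        self.calls.append(("update_by_query", index))

    def sync_lookup(self, source_index: str) -> None:
        self.calls.append(("sync_lookup", source_index))


def stop(client: RecordingClient, lexical_alias: str, semantic_alias: str, pipeline: str) -> None:
    # 1. Resolve content dates on both indices, addressed through their read
    #    aliases (per the PR description, the write target no longer exists here).
    for alias in (lexical_alias, semantic_alias):
        client.update_by_query(index=alias, pipeline=pipeline)
    # 2. Only then sync the lookup index, so it sees the resolved dates.
    client.sync_lookup(source_index=lexical_alias)


client = RecordingClient()
stop(client, "lexical-read", "semantic-read", "content-date-enrichment")
for call in client.calls:
    print(call)
```

Running date resolution before the lookup sync matters because the sync reads from the lexical alias; doing it in the other order would copy documents whose dates are still unset.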
* Search: Add post-indexing content date resolution via update_by_query

  HashedBulkUpdate uses bulk update actions (scripted upserts) which skip Elasticsearch ingest pipelines, so content_last_updated was never set during normal indexing. This adds a ResolveContentDatesAsync step that runs _update_by_query with the enrichment pipeline after indexing completes, and switches StopAsync to use read aliases instead of the write target (which is removed after CompleteAsync). Includes integration tests against a real Elasticsearch container validating cold-start, date preservation, change detection, and the bulk-update pipeline gap.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Search: Fix lint warnings in content date enrichment tests

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What
Add a post-indexing _update_by_query step that resolves content_last_updated on all documents after indexing completes, compensating for Elasticsearch bulk update actions skipping ingest pipelines.

Why
HashedBulkUpdate uses scripted upserts (bulk update actions), which skip default_pipeline and final_pipeline. As a result, the enrichment pipeline that stamps content_last_updated never fires during normal indexing, leaving the field unset.

How
* Add ResolveContentDatesAsync to ContentDateEnrichment: runs _update_by_query with the enrichment pipeline
* Store read aliases in StartAsync (write targets are removed after CompleteAsync)
* Call ResolveContentDatesAsync on both lexical and semantic indices in StopAsync before syncing the lookup
* Switch SyncLookupIndexAsync to use the read alias instead of the write target

Test plan
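As a rough illustration of the bulk-update pipeline gap and the date-preservation behaviour the integration tests are described as covering, here is a toy in-memory model. Everything in it is an assumption made for illustration: ToyIndex and enrich are hypothetical, and the rule "stamp the date only when it is missing" stands in for the real enrichment pipeline, whose logic is not shown on this page.

```python
def enrich(doc: dict, now: str) -> None:
    """Toy pipeline (assumed rule): stamp content_last_updated only if absent."""
    doc.setdefault("content_last_updated", now)


class ToyIndex:
    """In-memory stand-in for an index; not a real Elasticsearch client."""

    def __init__(self) -> None:
        self.docs: dict[str, dict] = {}

    def bulk_update_upsert(self, doc_id: str, fields: dict) -> None:
        # Models a bulk update action: the document is written directly and
        # the ingest pipeline is never invoked.
        self.docs.setdefault(doc_id, {}).update(fields)

    def update_by_query(self, now: str) -> None:
        # Models _update_by_query with a pipeline: every document passes through it.
        for doc in self.docs.values():
            enrich(doc, now)


index = ToyIndex()
index.bulk_update_upsert("page-1", {"title": "Intro"})
gap = "content_last_updated" not in index.docs["page-1"]  # the pipeline gap

index.update_by_query(now="2024-01-01T00:00:00Z")  # first run stamps the date
index.update_by_query(now="2024-02-01T00:00:00Z")  # later run preserves it
print(gap, index.docs["page-1"]["content_last_updated"])
# True 2024-01-01T00:00:00Z
```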
🤖 Generated with Claude Code