Fix: Eliminate redundant full table scans in messages and events collection #392

Open
MoralCode wants to merge 1 commit into main from PredictiveManish/map-once-update
Conversation

@MoralCode MoralCode commented Jun 14, 2026

The PR contents below are lightly modified (mostly to adjust issue and PR references) from github.com/augurlabs/augur/pull/3444, filed by @PredictiveManish. The PR itself has been rebased to account for changes to CollectOSS since the fork and to resolve merge conflicts.

Description

Moved the mapping queries outside the batch loops and passed the pre-built mappings as parameters to the processing functions, following the pattern established by Shlok in augurlabs/augur#3439.

Changes Made

collectoss/tasks/github/messages.py

  • Built issue_url_to_id_map and pr_issue_url_to_id_map once in collect_github_messages() before any batch processing
  • Updated process_messages() to accept mappings as parameters instead of rebuilding them
  • Updated process_large_issue_and_pr_message_collection() to accept and pass mappings
  • Increased batch size from 20 to 1000 (reduces batch overhead)
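
The build-once pattern described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `TableStub`, `chunked`, the message dictionary shape, and the simplified signatures of `collect_github_messages()` and `process_messages()` are all assumptions made for the example. Only the idea (build both maps before batching, then pass them down) comes from the description.

```python
from typing import Dict, Iterator, List, Tuple


class TableStub:
    """Hypothetical stand-in for a database table; counts full scans."""

    def __init__(self, rows: List[Tuple[int, str]]) -> None:
        self.rows = rows  # (row_id, url) pairs
        self.scans = 0

    def url_to_id_map(self) -> Dict[str, int]:
        # Each call walks every row, i.e. one full table scan.
        self.scans += 1
        return {url: row_id for row_id, url in self.rows}


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_messages(
    batch: List[dict],
    issue_url_to_id_map: Dict[str, int],
    pr_issue_url_to_id_map: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Resolve each message to its parent using the pre-built maps."""
    resolved = []
    for message in batch:
        url = message["issue_url"]
        if url in issue_url_to_id_map:
            resolved.append(("issue", issue_url_to_id_map[url]))
        elif url in pr_issue_url_to_id_map:
            resolved.append(("pr", pr_issue_url_to_id_map[url]))
    return resolved


def collect_github_messages(
    messages: List[dict],
    issues: TableStub,
    prs: TableStub,
    batch_size: int = 1000,
) -> List[Tuple[str, int]]:
    # Build both maps once, before any batch is processed.
    issue_url_to_id_map = issues.url_to_id_map()
    pr_issue_url_to_id_map = prs.url_to_id_map()

    resolved = []
    for batch in chunked(messages, batch_size):
        resolved.extend(
            process_messages(batch, issue_url_to_id_map, pr_issue_url_to_id_map)
        )
    return resolved
```

With this shape, the number of scans no longer depends on how many batches are processed, so the batch size can be tuned independently of the mapping cost.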

collectoss/tasks/github/events.py

  • Built issue_url_to_id_map and pr_url_to_id_map once in BulkGithubEventCollection.collect() before the batch loop
  • Updated _process_events(), _process_issue_events(), and _process_pr_events() to accept mappings as parameters
  • Removed redundant _get_map_from_*() calls from batch processing methods

Performance Improvement

  • Before: 1,000 messages -> 50 full scans of both the issues and PRs tables
  • After: 1,000 messages -> 1 full scan of each table (50x reduction)
  • Before: 10,000 events -> 40 full scans total
  • After: 10,000 events -> 1 full scan of each table (40x reduction)
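
The messages figure is consistent with one scan per batch at the previous batch size of 20. A small worked calculation, assuming that each batch previously triggered exactly one scan of each table (the helper name is made up for this example, and the events batch size is not stated in this PR, so it is left out):

```python
import math


def scans_per_table_before(item_count: int, batch_size: int) -> int:
    """Scans of one table when the map is rebuilt for every batch."""
    return math.ceil(item_count / batch_size)


# After the change the map is built once, regardless of batch count.
SCANS_PER_TABLE_AFTER = 1

messages_before = scans_per_table_before(1_000, 20)
print(messages_before)                           # 50
print(messages_before // SCANS_PER_TABLE_AFTER)  # 50 (the reduction factor)
```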

This PR fixes #146

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

Comment thread collectoss/tasks/github/messages.py Outdated
Comment thread collectoss/tasks/github/messages.py Outdated
@MoralCode MoralCode changed the title Fix: Eliminate redundant full table scans in messages and events coll… Fix: Eliminate redundant full table scans in messages and events collection Jun 14, 2026
…ection

Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
@MoralCode MoralCode force-pushed the PredictiveManish/map-once-update branch from 25b36d2 to e3c6796 Compare June 14, 2026 17:53

Development

Successfully merging this pull request may close these issues.

Full table scans on every batch in messages and events collection