Batch owner repository fetching with GraphQL #10

Open
pragnyanramtha wants to merge 2 commits into UC-OSPO-Network:main from pragnyanramtha:fix/graphql-repo-batching-8

pragnyanramtha commented May 13, 2026

Fixes #8

Summary

  • Add a GraphQL batch request helper for repository owner lookups.
  • Fetch user and organization repositories in batches of 50 owners with cursor pagination.
  • Keep a paginated REST fallback when GraphQL data is unavailable for an owner or batch.
  • Preserve usable GraphQL data when a batched response has partial errors.
  • Map additional GraphQL fields back into the REST-shaped repository dictionaries expected by downstream JSON/database consumers.
  • Report owner progress deterministically: once for each 10-owner threshold crossed, and once at completion.
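
To illustrate the partial-error bullet above: a GraphQL response can carry both `data` and `errors`, and each error may include a `path` naming the field that failed. The sketch below assumes each owner is queried under its own alias. `split_partial_response` and `alias_to_owner` are hypothetical names for illustration, not the helpers in this PR.

```python
from __future__ import annotations

from typing import Any


def split_partial_response(
    payload: dict[str, Any], alias_to_owner: dict[str, str]
) -> tuple[dict[str, dict], list[str]]:
    """Separate owners with usable data from owners that need a REST fallback.

    Hypothetical helper: assumes one top-level alias per owner in the query.
    """
    data = payload.get("data") or {}
    # An error's "path" starts with the alias of the top-level field that failed.
    failed_aliases = {
        err["path"][0]
        for err in (payload.get("errors") or [])
        if err.get("path")
    }
    usable: dict[str, dict] = {}
    fallback: list[str] = []
    for alias, owner in alias_to_owner.items():
        node = data.get(alias)
        if node is None or alias in failed_aliases:
            fallback.append(owner)  # retry this owner through REST
        else:
            usable[owner] = node  # keep the data GraphQL did return
    return usable, fallback
```

With this shape, one unresolved owner in a batch of 50 costs a single REST fallback rather than discarding the other 49 results.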

Root cause

The user and organization repository collection steps made one sequential REST request per owner and consumed only the first page of each response. As a result, large owner lists were slow to process, and the repository list for any owner with more than one page of repositories was incomplete.
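
A minimal sketch of the truncation problem and the paginated fix, assuming a page size of 100 (the maximum `per_page` the GitHub REST API accepts). `fetch_page` is a hypothetical callable standing in for the real HTTP request so that the loop can run without network access.

```python
from typing import Callable

PER_PAGE = 100  # GitHub's REST API caps per_page at 100


def fetch_all_pages(fetch_page: Callable[[int], list]) -> list:
    """Collect every page for one owner instead of stopping at page 1.

    `fetch_page(page)` is assumed to return the list of repositories on that
    1-indexed page (an empty or short list once the owner is exhausted).
    """
    repos = []
    page = 1
    while True:
        batch = fetch_page(page)
        repos.extend(batch)
        if len(batch) < PER_PAGE:  # a short page means there is nothing left
            return repos
        page += 1


# Simulated owner with 250 repositories.
all_repos = [{"name": f"repo-{i}"} for i in range(250)]


def fake_fetch(page: int) -> list:
    return all_repos[(page - 1) * PER_PAGE : page * PER_PAGE]


first_page_only = fake_fetch(1)           # what the old code consumed
complete = fetch_all_pages(fake_fetch)    # what the paginated path returns
```

In this simulation the first page alone holds 100 of the 250 repositories, which is the silent truncation described above.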

Files changed

  • repofinder/scraping/repo_scraping_utils.py
  • tests/test_repo_scraping_utils.py
  • tests/__init__.py

Tests run

  • python3 -m unittest discover -v
  • python3 -m py_compile repofinder/scraping/repo_scraping_utils.py tests/test_repo_scraping_utils.py tests/__init__.py
  • git diff --check
  • Mocked GraphQL partial-error handling, per-owner REST fallback, REST-compatible field mapping, and deterministic progress reporting
  • Earlier validation on the first commit: mocked GraphQL pagination, REST fallback, bot filtering, user filtering, JSON output behavior, live GraphQL query for UC-OSPO-Network organization repositories, live REST parity check, and temporary SQLite ingestion check
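
For context on the progress-reporting check: owners complete in batches, so the completed count can jump past several 10-owner marks at once. One way to keep the output deterministic is to emit every mark that was crossed, plus the final total. The function below is a hypothetical sketch of that idea, not the code in this PR.

```python
def crossed_thresholds(previous_done: int, done: int, total: int, step: int = 10) -> list:
    """Return the progress marks crossed when the count moves to `done`.

    Hypothetical helper: marks are multiples of `step`, plus `total` itself
    once every owner has been processed.
    """
    first_mark = (previous_done // step + 1) * step
    marks = list(range(first_mark, done + 1, step))
    if done >= total and (not marks or marks[-1] != total):
        marks.append(total)  # always report completion exactly once per call
    return marks


# A batch of 50 owners finishing at once still yields five separate marks.
after_first_batch = crossed_thresholds(0, 50, 123)
final_batch = crossed_thresholds(100, 123, 123)
```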

Risk

Low to medium. The downstream database ingestion path is covered, and REST fallback remains available if GraphQL fails for a batch or owner.

Copilot AI review requested due to automatic review settings May 13, 2026 19:26

Copilot AI left a comment

Pull request overview

This PR replaces the sequential, single-page REST calls used to enumerate per-owner repositories in get_repositories_from_users and get_repositories_from_organizations with a batched GraphQL implementation, while keeping a paginated REST path as a fallback. This addresses the long runtime and silent repo truncation reported in issue #8.

Changes:

  • Adds a graphql_api_request helper with retry, error, and rate-limit handling for https://api.github.com/graphql.
  • Adds private helpers (_is_bot_login, _repository_graphql_fields, _build_repository_batch_query, _repository_node_to_rest_dict, _fetch_repositories_rest_paginated, _fetch_repositories_graphql) that batch up to 50 owners per query, paginate up to 100 repos per owner, and re-queue owners with more pages.
  • Rewrites the user/org repo fetching functions to filter bots once, dispatch through the GraphQL helper, and fall back to per-owner REST pagination when GraphQL data is unavailable.
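
The batching and re-queueing described above can be sketched as follows. This is an illustrative approximation, not the PR's `_build_repository_batch_query`: the function names are hypothetical, the selected field (`nameWithOwner`) is a placeholder for the real field set, and the sketch assumes the query uses GitHub's `repositoryOwner(login:)` field with one alias per owner.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_query(owners_with_cursors):
    """Build one query for a batch of (login, cursor_or_None) pairs."""
    parts = []
    alias_to_owner = {}
    for index, (login, cursor) in enumerate(owners_with_cursors):
        alias = f"o{index}"  # aliases let one query hold many owners
        alias_to_owner[alias] = login
        # json.dumps produces a quoted, escaped string literal.
        after = f", after: {json.dumps(cursor)}" if cursor else ""
        parts.append(
            f"{alias}: repositoryOwner(login: {json.dumps(login)}) {{ "
            f"repositories(first: 100{after}) {{ "
            "nodes { nameWithOwner } "
            "pageInfo { hasNextPage endCursor } } }"
        )
    return "query { " + " ".join(parts) + " }", alias_to_owner


def owners_to_requeue(data, alias_to_owner):
    """Return (login, cursor) pairs for owners that still have more pages."""
    pending = []
    for alias, login in alias_to_owner.items():
        node = data.get(alias) or {}
        page_info = (node.get("repositories") or {}).get("pageInfo") or {}
        if page_info.get("hasNextPage"):
            pending.append((login, page_info["endCursor"]))
    return pending


query, aliases = build_batch_query([("alice", None), ("uc-org", "Y3Vyc29y")])
batch_sizes = [len(batch) for batch in chunked(list(range(120)), 50)]
```

Owners returned by `owners_to_requeue` would be placed back on the work queue with their cursor, so a later batch requests only their next page.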

Comment thread repofinder/scraping/repo_scraping_utils.py
Comment thread repofinder/scraping/repo_scraping_utils.py Outdated
Comment thread repofinder/scraping/repo_scraping_utils.py
@pragnyanramtha (Author)

Thanks for the review notes. I have a follow-up patch ready locally for the GraphQL partial-error handling, REST-shaped output fields, and deterministic progress reporting.

I noticed PR #11 now also closes #8 and changes the same file. Before I push more updates here, which PR would you prefer to continue with? If #10 is still the preferred thread, I can push the review fixes here; if #11 is preferred, I can stand down to avoid duplicate review effort.

Development

Successfully merging this pull request may close these issues.

perf: replace sequential REST calls with GraphQL batching for user and org repo fetching

2 participants