Batch owner repository fetching with GraphQL #10
Pull request overview
This PR replaces the sequential, single-page REST calls used to enumerate per-owner repositories in get_repositories_from_users and get_repositories_from_organizations with a batched GraphQL implementation, while keeping a paginated REST path as a fallback. This addresses the long runtime and silent repo truncation reported in issue #8.
Changes:
- Adds a `graphql_api_request` helper with retry, error, and rate-limit handling for `https://api.github.com/graphql`.
- Adds private helpers (`_is_bot_login`, `_repository_graphql_fields`, `_build_repository_batch_query`, `_repository_node_to_rest_dict`, `_fetch_repositories_rest_paginated`, `_fetch_repositories_graphql`) that batch up to 50 owners per query, paginate up to 100 repos per owner, and re-queue owners with more pages.
- Rewrites the user/org repo fetching functions to filter bots once, dispatch through the GraphQL helper, and fall back to per-owner REST pagination when GraphQL data is unavailable.
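The batching and re-queue behaviour described in the list above can be sketched roughly as follows. This is not the PR's code: `build_batch_query`, `fetch_all_repositories`, and the injected `run_query` callable (assumed to return the `data` object of a GraphQL response) are hypothetical names chosen for illustration, and only `nameWithOwner` is requested to keep the query short.

```python
import json
from collections import deque

BATCH_SIZE = 50   # owners per query, as stated in the PR overview
PAGE_SIZE = 100   # repositories requested per owner per page


def build_batch_query(batch):
    """Build one GraphQL document with an aliased field per (login, cursor) pair."""
    parts = []
    for i, (login, cursor) in enumerate(batch):
        # json.dumps produces a safely quoted string literal.
        after = f", after: {json.dumps(cursor)}" if cursor else ""
        parts.append(
            f'o{i}: repositoryOwner(login: {json.dumps(login)}) {{ '
            f"repositories(first: {PAGE_SIZE}{after}) {{ "
            "nodes { nameWithOwner } pageInfo { hasNextPage endCursor } } }"
        )
    return "query { " + " ".join(parts) + " }"


def fetch_all_repositories(logins, run_query):
    """Collect repository names per owner, re-queuing owners that have more pages."""
    queue = deque((login, None) for login in logins)
    repos = {login: [] for login in logins}
    while queue:
        batch = [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
        data = run_query(build_batch_query(batch))
        for i, (login, _cursor) in enumerate(batch):
            owner = data.get(f"o{i}")
            if owner is None:
                # Unknown owner or partial error: leave it for a fallback path.
                continue
            connection = owner["repositories"]
            repos[login].extend(n["nameWithOwner"] for n in connection["nodes"])
            if connection["pageInfo"]["hasNextPage"]:
                # Owner has more pages: put it back with its cursor.
                queue.append((login, connection["pageInfo"]["endCursor"]))
    return repos
```

Passing the query runner in as an argument keeps the sketch independent of any particular HTTP client or retry policy.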
Thanks for the review notes. I have a follow-up patch ready locally for the GraphQL partial-error handling, REST-shaped output fields, and deterministic progress reporting. I noticed PR #11 now also closes #8 and changes the same file. Before I push more updates here, which PR would you prefer to continue with? If #10 is still the preferred thread, I can push the review fixes here; if #11 is preferred, I can stand down to avoid duplicate review effort.
Fixes #8
Summary
Root cause
The user and organization repository collection steps made one sequential REST request per owner and only consumed the first response page, so large owner lists were slow and owners with more than one page of repositories were incomplete.
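To illustrate the truncation, here is a minimal sketch of reading every page of a list endpoint rather than only the first. It assumes GitHub's standard `per_page` and `page` query parameters; `fetch_all_pages` and the injected `get_page` callable (which should return the decoded JSON list for one page) are hypothetical names, not the PR's helpers.

```python
def fetch_all_pages(get_page, url, per_page=100):
    """Request pages until one comes back shorter than per_page."""
    items = []
    page = 1
    while True:
        batch = get_page(url, {"per_page": per_page, "page": page})
        items.extend(batch)
        if len(batch) < per_page:
            # A short (or empty) page means there is nothing further to fetch.
            return items
        page += 1
```

A caller that stops after the first call sees at most `per_page` repositories, which matches the incomplete results described in the root cause.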
Files changed
- `repofinder/scraping/repo_scraping_utils.py`
- `tests/test_repo_scraping_utils.py`
- `tests/__init__.py`
Tests run
- `python3 -m unittest discover -v`
- `python3 -m py_compile repofinder/scraping/repo_scraping_utils.py tests/test_repo_scraping_utils.py tests/__init__.py`
- `git diff --check`
Risk
Low to medium. The downstream database ingestion path is covered, and REST fallback remains available if GraphQL fails for a batch or owner.
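The fallback behaviour mentioned here might look something like the sketch below. It assumes, purely for illustration, that the GraphQL path returns a dict mapping each login to a list of repositories, with `None` or a missing key for owners it could not resolve; `fetch_with_fallback`, `fetch_graphql`, and `fetch_rest` are hypothetical names.

```python
def fetch_with_fallback(logins, fetch_graphql, fetch_rest):
    """Prefer batched GraphQL results; use per-owner REST where they are missing."""
    try:
        found = fetch_graphql(logins)
    except Exception:
        # Whole batch failed: every owner goes through the REST path.
        found = {}
    result = {}
    for login in logins:
        repos = found.get(login)
        if repos is None:
            # No GraphQL data for this owner: fall back to paginated REST.
            repos = fetch_rest(login)
        result[login] = repos
    return result
```

An empty list from GraphQL is treated as a valid answer, so only owners with no data at all trigger the slower REST request.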