Adding lots of new repos at once can create perceived bottlenecks #391

@MoralCode

Because CollectOSS schedules new repos first and needs to do a full collection on them (all GitHub messages and events), adding lots of new repos at once can cause a perceived bottleneck where:

  • system stats are relatively idle
  • flower reports all task workers are running
  • docker logs are moving, just maybe slower than expected with all workers running
  • API keys are being consumed (but maybe aren't fully the bottleneck)
  • grafana shows that the instance might be falling behind on existing collection upkeep

new_collection_git_list = get_newly_added_repos(session, limit, hook=self.name)
collection_list = [(repo_git, True) for repo_git in new_collection_git_list]
self.repo_list.extend(collection_list)
limit -= len(collection_list)

# Now start recollecting other repos if there is space to do so.
if limit <= 0:
    return

recollection_git_list = get_repos_for_recollection(session, limit, hook=self.name, days_until_collect_again=self.days_until_collect_again)
collection_list = [(repo_git, False) for repo_git in recollection_git_list]
self.repo_list.extend(collection_list)
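
To make the effect of this ordering concrete, here is a minimal standalone sketch. `plan_batch` is a hypothetical stand-in that mirrors the logic of the snippet above using plain lists instead of the database queries; it is not the project's code, and the repo names and the limit of 10 are made up for illustration.

```python
def plan_batch(new_repos, stale_repos, limit):
    """Mirror the snippet above: new repos fill the batch first."""
    batch = [(repo, True) for repo in new_repos[:limit]]
    limit -= len(batch)
    # Recollection only gets whatever space the new repos left over.
    if limit <= 0:
        return batch
    batch.extend((repo, False) for repo in stale_repos[:limit])
    return batch


# Hypothetical workload: 50 freshly added repos, 50 repos due for recollection.
new = [f"new-{i}" for i in range(50)]
stale = [f"stale-{i}" for i in range(50)]

batch = plan_batch(new, stale, limit=10)
recollections = [repo for repo, is_new in batch if not is_new]
print(len(batch), len(recollections))  # prints "10 0"
```

As long as the backlog of new repos is at least as large as the limit, every pass schedules zero recollections, which would match the Grafana symptom of upkeep falling behind.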

Possible solutions:

  • reserving some (maybe configurable) share of each scheduling pass for re-collections, so that there is some visible throughput that isn't just waiting on massive numbers of API calls to GitHub
  • improving the efficiency/parallelism of full collection, specifically for large tasks like message or event collection, so there's less waiting for slow network calls to come back (and we can better utilize the CPU)
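
A rough sketch of the first option, assuming a configurable share of each pass is reserved for re-collections. The names `split_limit`, `plan_batch_with_reserve`, and `recollection_share` are hypothetical and do not exist in the codebase; plain lists stand in for the results of `get_newly_added_repos` and `get_repos_for_recollection`.

```python
def split_limit(limit, recollection_share=0.25):
    """Return (new_slots, reserved_slots) for one scheduling pass."""
    if not 0.0 <= recollection_share <= 1.0:
        raise ValueError("recollection_share must be between 0 and 1")
    reserved = int(limit * recollection_share)
    return limit - reserved, reserved


def plan_batch_with_reserve(new_repos, stale_repos, limit, recollection_share=0.25):
    """Fill a batch while holding back some slots for re-collections."""
    if limit <= 0:
        return []
    new_slots, _reserved = split_limit(limit, recollection_share)

    picked_new = new_repos[:new_slots]
    # Slots that new repos did not use fall through to re-collection.
    picked_stale = stale_repos[:limit - len(picked_new)]
    # Reserved slots that re-collection could not use go back to new repos.
    spare = limit - len(picked_new) - len(picked_stale)
    if spare > 0:
        picked_new = new_repos[:new_slots + spare]

    return [(r, True) for r in picked_new] + [(r, False) for r in picked_stale]


new = [f"new-{i}" for i in range(50)]
stale = [f"stale-{i}" for i in range(50)]
batch = plan_batch_with_reserve(new, stale, limit=10)
print(sum(1 for _, is_new in batch if not is_new))  # prints "2"
```

With a share of 0.25 and a limit of 10, two slots per pass go to re-collections even when the new-repo backlog is large, and no slot is left empty when either list runs short. Whether a fixed share is the right policy (versus, say, one that scales with how far upkeep has fallen behind) is left open here.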
