Because CollectOSS schedules new repos first and must run a full collection on each of them (all GitHub messages and events), adding many new repos at once can cause a perceived bottleneck where:
- system stats are relatively idle
- Flower reports that all task workers are running
- Docker logs are moving, just maybe slower than expected with all workers running
- API keys are being consumed (but may not be the whole bottleneck)
- Grafana shows that the instance might be falling behind on upkeep of existing collections
```python
new_collection_git_list = get_newly_added_repos(session, limit, hook=self.name)
collection_list = [(repo_git, True) for repo_git in new_collection_git_list]
self.repo_list.extend(collection_list)
limit -= len(collection_list)

#Now start recollecting other repos if there is space to do so.
if limit <= 0:
    return

recollection_git_list = get_repos_for_recollection(session, limit, hook=self.name, days_until_collect_again=self.days_until_collect_again)
collection_list = [(repo_git, False) for repo_git in recollection_git_list]
self.repo_list.extend(collection_list)
```
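To make the starvation concrete, here is a minimal, self-contained sketch of the same slot arithmetic. `plan_slots` and its list inputs are hypothetical stand-ins for the database queries in the scheduling code; it only models how `limit` gets used up, not the real scheduler.

```python
def plan_slots(new_repos, stale_repos, limit):
    """Mimic the scheduling order: new repos first, recollection gets the rest.

    Hypothetical stand-in: plain lists replace the database queries.
    Returns (repo, is_new) tuples like the ones added to repo_list.
    """
    scheduled = [(repo, True) for repo in new_repos[:limit]]
    limit -= len(scheduled)

    # Recollection only runs if new repos left any slots free.
    if limit <= 0:
        return scheduled

    scheduled += [(repo, False) for repo in stale_repos[:limit]]
    return scheduled


new = [f"new-{i}" for i in range(500)]
stale = [f"stale-{i}" for i in range(200)]
plan = plan_slots(new, stale, limit=40)
recollections = [repo for repo, is_new in plan if not is_new]
print(len(plan), len(recollections))  # 40 0
```

With 500 new repos queued and 40 slots, every slot goes to a full collection, and no recollection is scheduled until the backlog of new repos drops below the limit.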
Possible solutions:
- scheduling some (maybe configurable) number of re-collections alongside new repos, so that there is some visible throughput that isn't just waiting on massive numbers of GitHub API calls
- improving the efficiency/parallelism of full collection, specifically for large tasks like message or event collection, so there's less waiting on slow network calls (and we can better utilize CPU)
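The first option could look roughly like the sketch below. It is not CollectOSS code: `plan_slots_with_reserve` and `recollect_fraction` are hypothetical names, and plain lists stand in for the database queries. In the real scheduler this would presumably mean passing a smaller limit to `get_newly_added_repos`, but that is an assumption.

```python
import math


def plan_slots_with_reserve(new_repos, stale_repos, limit, recollect_fraction=0.25):
    """Hypothetical variant: hold back a share of slots for recollection.

    recollect_fraction is an assumed config value, not an existing setting.
    Returns (repo, is_new) tuples.
    """
    if not 0.0 <= recollect_fraction <= 1.0:
        raise ValueError("recollect_fraction must be between 0 and 1")

    # Slots held back for recollection, never more than there are stale repos.
    reserved = min(math.floor(limit * recollect_fraction), len(stale_repos))

    # New repos still go first, but only into the unreserved slots.
    scheduled = [(repo, True) for repo in new_repos[: limit - reserved]]

    # Recollection gets the reserved slots plus anything new repos left unused.
    remaining = limit - len(scheduled)
    scheduled += [(repo, False) for repo in stale_repos[:remaining]]
    return scheduled


new = [f"new-{i}" for i in range(500)]
stale = [f"stale-{i}" for i in range(200)]
plan = plan_slots_with_reserve(new, stale, limit=40)
new_count = sum(1 for _, is_new in plan if is_new)
print(new_count, len(plan) - new_count)  # 30 10
```

Capping the reservation at the number of stale repos means no slot sits idle when there is nothing to recollect, and a fraction of 0 reproduces the current new-repos-first behavior.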
The code excerpt above is from `CollectOSS/collectoss/tasks/util/collection_util.py`, lines 46 to 57 in fd9a862.