A scraper for the Companies House Advanced Search API, built specifically to collect data on Third Sector Organisations. The open data on this website is sourced from public records made available by Companies House and licensed under the Open Government Licence.
It should run without any special installation or requirements; tqdm and numpy are mostly conveniences that make the scraper more pleasant to use. To install them, run `pip install -r requirements.txt`.

The scraper could be improved in several ways, including but not limited to:
- Logging.
- Better error handling of unknown status codes.
- De-duplicating and compressing the output once the script completes.
- Dynamically scaling the scrape window up or down, depending on whether the previous period came close to the 10k threshold.
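As a rough illustration of the last item, the sketch below shows one way a scrape window could be resized between requests. It is not part of the current scraper. It assumes the window is a date range measured in days, and the names `next_window_days`, `iter_windows` and `count_hits`, as well as the 80% and 30% thresholds, are hypothetical choices made for this example. `count_hits` stands in for whatever function actually queries the API and reports how many results a period returned.

```python
from datetime import date, timedelta

RESULT_CAP = 10_000  # the 10k threshold mentioned above


def next_window_days(current_days, hits, cap=RESULT_CAP,
                     high=0.8, low=0.3, min_days=1, max_days=365):
    """Pick the window length (in days) for the next period.

    Halve the window when the previous period came close to the cap,
    double it when the period was sparse, otherwise keep it unchanged.
    The 0.8 / 0.3 thresholds are arbitrary and only illustrative.
    """
    if hits >= cap * high:
        return max(min_days, current_days // 2)
    if hits <= cap * low:
        return min(max_days, current_days * 2)
    return current_days


def iter_windows(start, end, count_hits, initial_days=30):
    """Yield (window_start, window_end, hits) tuples covering start..end.

    count_hits(window_start, window_end) is a hypothetical callable that
    returns the number of results for that period.
    """
    days = initial_days
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        hits = count_hits(cursor, window_end)
        if hits >= RESULT_CAP and days > 1:
            # The period hit the cap, so results may be truncated:
            # shrink the window and retry the same start date.
            days = max(1, days // 2)
            continue
        yield cursor, window_end, hits
        cursor = window_end + timedelta(days=1)
        days = next_window_days(days, hits)


if __name__ == "__main__":
    # Fake data source: pretend every day contains 400 records.
    def fake_count(first, last):
        return 400 * ((last - first).days + 1)

    for first, last, hits in iter_windows(date(2020, 1, 1),
                                          date(2020, 3, 31), fake_count):
        print(first, last, hits)
```

Retrying a period that reaches the cap, rather than only adjusting the following one, is a design choice in this sketch: it avoids silently keeping a truncated period, at the cost of an extra request.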
This work is free. You can redistribute it and/or modify it under the terms of the GNU GPL 3.0 license.
We are grateful for funding from the ESRC (project reference: ES/X000524/1).