Fix/world bank okr doi fix #144
Merged
Pull request overview
This PR refactors the codebase to consolidate scraping-related utility functions under welearn_datastack.modules.scraping_utils (replacing welearn_datastack.utils_.scraping_utils) and introduces a shared clean_doi helper, updating collectors/scrapers and tests to use it for consistent DOI normalization.
Changes:
- Migrate scraping utility imports across scrapers, REST collectors, and supporting modules to welearn_datastack.modules.scraping_utils.
- Add clean_doi and replace duplicated DOI-prefix stripping logic in collectors/scrapers with the shared helper.
- Update affected unit tests to import from the new module location and add new clean_doi test cases.
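As a rough illustration of the second change, the sketch below shows duplicated prefix stripping being replaced by one shared helper. The PR's actual implementation is not shown on this page, so the body of clean_doi here is an assumption, and the call-site function names (open_alex_doi_before, open_alex_doi_after, world_bank_okr_doi_after) are hypothetical.

```python
DOI_PREFIX = "https://doi.org/"  # prefix named in the PR description


def clean_doi(doi: str) -> str:
    """Assumed stand-in for the shared helper: strip the resolver prefix if present."""
    if doi.startswith(DOI_PREFIX):
        return doi[len(DOI_PREFIX):]
    return doi


# Before (hypothetical): each collector carried its own copy of the stripping logic.
def open_alex_doi_before(raw: str) -> str:
    return raw.replace("https://doi.org/", "")


# After (hypothetical): call sites delegate to the single helper.
def open_alex_doi_after(raw: str) -> str:
    return clean_doi(raw)


def world_bank_okr_doi_after(raw: str) -> str:
    return clean_doi(raw)
```

With one helper, a later change to the normalization rule only has to be made in one place.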
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| welearn_datastack/plugins/scrapers/unccelearn.py | Switch remove_extra_whitespace import to new scraping utils module. |
| welearn_datastack/plugins/scrapers/plos.py | Use clean_doi helper; adjust typing imports. |
| welearn_datastack/plugins/scrapers/peerj.py | Migrate scraping utils imports to new module. |
| welearn_datastack/plugins/scrapers/oe_books.py | Migrate extract_property_from_html import to new module. |
| welearn_datastack/plugins/scrapers/ird_le_mag.py | Reorder and migrate scraping utils imports to new module. |
| welearn_datastack/plugins/scrapers/conversation.py | Migrate scraping utils imports and simplify typing imports. |
| welearn_datastack/plugins/rest_requesters/world_bank_okr.py | Use clean_doi for DOI normalization; migrate whitespace util import. |
| welearn_datastack/plugins/rest_requesters/uved.py | Migrate format_cc_license import to new module. |
| welearn_datastack/plugins/rest_requesters/ted.py | Migrate clean_return_to_line import to new module. |
| welearn_datastack/plugins/rest_requesters/pressbooks.py | Migrate clean_text import to new module. |
| welearn_datastack/plugins/rest_requesters/open_alex.py | Use clean_doi helper for DOI normalization. |
| welearn_datastack/plugins/rest_requesters/oapen.py | Remove unused imports (pdf/text helpers) and old scraping util import. |
| welearn_datastack/plugins/rest_requesters/hal.py | Migrate get_url_without_hal_like_versionning import to new module. |
| welearn_datastack/plugins/rest_requesters/fao_open_knowledge.py | Migrate license/whitespace utilities to new module; remove unused exception import. |
| welearn_datastack/modules/scraping_utils.py | Add clean_doi helper. |
| welearn_datastack/modules/pdf_extractor.py | Migrate remove_extra_whitespace import to new module. |
| welearn_datastack/collectors/hal_collector.py | Migrate HAL URL cleanup import; remove unused datetime import. |
| tests/url_collector/test_hal_collector.py | Update import for HAL URL cleanup helper. |
| tests/test_scraping_utils.py | Update scraping utils imports to new module. |
| tests/document_collector_hub/test_utils.py | Update scraping utils imports to new module. |
| tests/document_collector_hub/plugins_test/test_utils.py | Update references to scraping utils module and add clean_doi tests. |
Comments suppressed due to low confidence (1)
welearn_datastack/modules/scraping_utils.py:177
clean_doi currently accepts/returns non-str values (e.g., it returns the input unchanged when it is not a str, and callers/models use Optional[str] for DOI), but its signature/docstring claim str -> str. This is misleading for readers and type checkers; the signature/docstring should reflect the actual behavior.
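One way to address this comment would be to annotate the helper with Optional[str] on both sides. The sketch below is an assumption about how that could look, not the code merged in this PR; only the described behavior (non-str input returned unchanged, https://doi.org/ prefix removed) comes from this page.

```python
from typing import Optional

DOI_PREFIX = "https://doi.org/"


def clean_doi(doi: Optional[str]) -> Optional[str]:
    """Return the DOI without the https://doi.org/ prefix.

    None (or any other non-str value) is returned unchanged, which is
    what the annotation and this docstring now state explicitly.
    """
    if not isinstance(doi, str):
        return doi
    if doi.startswith(DOI_PREFIX):
        return doi[len(DOI_PREFIX):]
    return doi
```

Callers that store the DOI in an Optional[str] field can then pass the value straight through without a separate None check.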
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
sandragjacinto approved these changes on Jun 11, 2026
This pull request primarily refactors the codebase to move all scraping-related utility functions from welearn_datastack.utils_.scraping_utils to a new module, welearn_datastack.modules.scraping_utils. It also introduces a new utility function, clean_doi, and applies it throughout the codebase. Several import statements are updated for consistency and clarity, and redundant imports are removed.

Key changes include:

Refactoring and Code Organization:
- Moved the scraping utilities from welearn_datastack.utils_.scraping_utils to welearn_datastack.modules.scraping_utils, and updated all relevant imports across the codebase accordingly. This improves modularity and maintainability.

New Utility Functionality:
- Added clean_doi to welearn_datastack.modules.scraping_utils, which standardizes DOI strings by removing the https://doi.org/ prefix if present.

Usage of New Utility Function:
- Updated plugins (open_alex, world_bank_okr) to use the new clean_doi function, ensuring consistent DOI formatting.

Testing Improvements:
- Added tests for clean_doi in test_utils.py, covering typical, already-clean, and incorrect input cases.

General Code Cleanup:
- Removed redundant imports.

These changes collectively improve the clarity, maintainability, and consistency of the codebase regarding scraping utilities and DOI handling.
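The three test categories mentioned above could look roughly like the following. This is a sketch, not the tests added in the PR: clean_doi is redefined locally as an assumed stand-in, and "incorrect input" is interpreted here as a non-string value being returned unchanged, which matches the review comment but may not match the real test cases.

```python
import unittest

DOI_PREFIX = "https://doi.org/"


def clean_doi(doi):
    # Assumed stand-in for welearn_datastack.modules.scraping_utils.clean_doi.
    if isinstance(doi, str) and doi.startswith(DOI_PREFIX):
        return doi[len(DOI_PREFIX):]
    return doi


class TestCleanDoi(unittest.TestCase):
    def test_typical_doi_url(self):
        # A DOI given as a resolver URL loses its prefix.
        self.assertEqual(clean_doi("https://doi.org/10.1234/abc"), "10.1234/abc")

    def test_already_clean_doi(self):
        # A bare DOI is left untouched.
        self.assertEqual(clean_doi("10.1234/abc"), "10.1234/abc")

    def test_non_string_input(self):
        # Assumption: non-str input is passed through unchanged.
        self.assertIsNone(clean_doi(None))


if __name__ == "__main__":
    unittest.main()
```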