Skip to content

Fix/world bank okr doi fix#144

Merged
lpi-tn merged 4 commits into
mainfrom
Fix/world-bank-okr-doi-fix
Jun 11, 2026
Merged

Fix/world bank okr doi fix#144
lpi-tn merged 4 commits into
mainfrom
Fix/world-bank-okr-doi-fix

Conversation

@lpi-tn

@lpi-tn lpi-tn commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

This pull request primarily refactors the codebase to move all utility functions related to scraping from welearn_datastack.utils_.scraping_utils to a new module, welearn_datastack.modules.scraping_utils. Additionally, it introduces a new utility function, clean_doi, and ensures its usage throughout the codebase. Several import statements are updated for consistency and clarity, and redundant imports are removed.

Key changes include:

Refactoring and Code Organization:

  • All scraping utility functions are moved from welearn_datastack.utils_.scraping_utils to welearn_datastack.modules.scraping_utils, and all relevant imports across the codebase are updated accordingly. This improves modularity and maintainability. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]

New Utility Functionality:

  • Added a new function clean_doi to welearn_datastack.modules.scraping_utils, which standardizes DOI strings by removing the https://doi.org/ prefix if present.

Usage of New Utility Function:

  • Refactored all DOI cleaning logic in plugins (e.g., open_alex, world_bank_okr) to use the new clean_doi function, ensuring consistent DOI formatting. [1] [2] [3] [4] [5]

Testing Improvements:

  • Added new unit tests for clean_doi in test_utils.py, covering typical, already-clean, and incorrect input cases.

General Code Cleanup:

  • Removed unused imports and cleaned up type annotations in several files for better readability and to remove redundancy. [1] [2] [3] [4] [5] [6] [7] [8]

These changes collectively improve the clarity, maintainability, and consistency of the codebase regarding scraping utilities and DOI handling.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the codebase to consolidate scraping-related utility functions under welearn_datastack.modules.scraping_utils (replacing welearn_datastack.utils_.scraping_utils) and introduces a shared clean_doi helper, updating collectors/scrapers and tests to use it for consistent DOI normalization.

Changes:

  • Migrate scraping utility imports across scrapers, REST collectors, and supporting modules to welearn_datastack.modules.scraping_utils.
  • Add clean_doi and replace duplicated DOI-prefix stripping logic in collectors/scrapers with the shared helper.
  • Update affected unit tests to import from the new module location and add new clean_doi test cases.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
welearn_datastack/plugins/scrapers/unccelearn.py Switch remove_extra_whitespace import to new scraping utils module.
welearn_datastack/plugins/scrapers/plos.py Use clean_doi helper; adjust typing imports.
welearn_datastack/plugins/scrapers/peerj.py Migrate scraping utils imports to new module.
welearn_datastack/plugins/scrapers/oe_books.py Migrate extract_property_from_html import to new module.
welearn_datastack/plugins/scrapers/ird_le_mag.py Reorder and migrate scraping utils imports to new module.
welearn_datastack/plugins/scrapers/conversation.py Migrate scraping utils imports and simplify typing imports.
welearn_datastack/plugins/rest_requesters/world_bank_okr.py Use clean_doi for DOI normalization; migrate whitespace util import.
welearn_datastack/plugins/rest_requesters/uved.py Migrate format_cc_license import to new module.
welearn_datastack/plugins/rest_requesters/ted.py Migrate clean_return_to_line import to new module.
welearn_datastack/plugins/rest_requesters/pressbooks.py Migrate clean_text import to new module.
welearn_datastack/plugins/rest_requesters/open_alex.py Use clean_doi helper for DOI normalization.
welearn_datastack/plugins/rest_requesters/oapen.py Remove unused imports (pdf/text helpers) and old scraping util import.
welearn_datastack/plugins/rest_requesters/hal.py Migrate get_url_without_hal_like_versionning import to new module.
welearn_datastack/plugins/rest_requesters/fao_open_knowledge.py Migrate license/whitespace utilities to new module; remove unused exception import.
welearn_datastack/modules/scraping_utils.py Add clean_doi helper.
welearn_datastack/modules/pdf_extractor.py Migrate remove_extra_whitespace import to new module.
welearn_datastack/collectors/hal_collector.py Migrate HAL URL cleanup import; remove unused datetime import.
tests/url_collector/test_hal_collector.py Update import for HAL URL cleanup helper.
tests/test_scraping_utils.py Update scraping utils imports to new module.
tests/document_collector_hub/test_utils.py Update scraping utils imports to new module.
tests/document_collector_hub/plugins_test/test_utils.py Update references to scraping utils module and add clean_doi tests.
Comments suppressed due to low confidence (1)

welearn_datastack/modules/scraping_utils.py:177

  • clean_doi currently accepts/returns non-str values (e.g., it returns the input unchanged when it is not a str, and callers/models use Optional[str] for DOI), but its signature/docstring claim str -> str. This is misleading for readers and type checkers; the signature/docstring should reflect the actual behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/document_collector_hub/plugins_test/test_utils.py
Comment thread tests/document_collector_hub/plugins_test/test_utils.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@lpi-tn lpi-tn merged commit 621a2cd into main Jun 11, 2026
7 checks passed
@lpi-tn lpi-tn deleted the Fix/world-bank-okr-doi-fix branch June 11, 2026 12:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants