Modular collect_data package with per-source collectors and build layer #89

Closed
kupsas wants to merge 18 commits into oseymour:main from kupsas:feat/modular-collect-data

@kupsas kupsas commented May 18, 2026

Summary

This branch replaces the single large pipeline module with a modular collect_data package. Scraping logic is split one source per file under collect_data/collectors/ (FBref, Understat, SofaScore season and per-match packs, ClubElo, Transfermarkt, Capology). Merge and financial enrichment live under collect_data/build/: unified.py handles the layered merge and Parquet export, and financials.py handles the Transfermarkt and Capology joins. collect_data/pipeline.py is now CLI-only (argparse and dispatch). The root collect_data.py shim and the python -m collect_data entry point stay the same for users.
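
As a rough illustration of what a CLI-only pipeline.py can look like, the sketch below parses a subcommand with argparse and dispatches to per-source collectors. The subcommand names, the --source and --season flags, and the stub functions are hypothetical stand-ins, not the actual collect_data interface.

```python
import argparse
from typing import Callable, Dict, List, Optional, Sequence


# Hypothetical stand-ins for modules under collect_data/collectors/.
def collect_fbref(seasons: List[str]) -> str:
    return "fbref:" + ",".join(seasons)


def collect_understat(seasons: List[str]) -> str:
    return "understat:" + ",".join(seasons)


# Hypothetical stand-in for collect_data/build/unified.py.
def build_unified() -> str:
    return "build:unified"


# Registry mapping a source name to its collector; adding a source is one entry.
COLLECTORS: Dict[str, Callable[[List[str]], str]] = {
    "fbref": collect_fbref,
    "understat": collect_understat,
}


def build_parser() -> argparse.ArgumentParser:
    """Define the CLI only; no scraping or merge logic lives here."""
    parser = argparse.ArgumentParser(prog="collect_data")
    sub = parser.add_subparsers(dest="command", required=True)
    collect = sub.add_parser("collect", help="run one source collector")
    collect.add_argument("--source", choices=sorted(COLLECTORS), required=True)
    collect.add_argument("--season", action="append", default=None)
    sub.add_parser("build", help="merge collected sources")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> str:
    """Parse arguments and dispatch to the matching collector or build step."""
    args = build_parser().parse_args(argv)
    if args.command == "collect":
        return COLLECTORS[args.source](args.season or [])
    return build_unified()
```

Keeping the registry as a plain dictionary means the CLI layer never imports scraper internals beyond one callable per source.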

Supporting pieces: setuptools discovers the top-level collect_data package for editable installs; tests use find_root() for stable sys.path; fbref_season_to_understat lives in helpers.py for Understat; README structure diagram reflects the new layout.

Test plan

  • tox -e lint,typecheck,test-all,docs and per-source tox envs (test-botasaurus, test-sofascore, test-clubelo, test-transfermarkt, test-understat) — all pass locally (mirrors GitHub Actions test.yml).
  • CI green on the PR after merge queue / checks.

kupsas and others added 18 commits May 8, 2026 18:39
- collect_data.py: multi-source pipeline merging FBref, SofaScore,
  Understat, and Transfermarkt into a unified player stats CSV
  (18,854 rows x 146 cols, 10 leagues, 3 seasons)
- soccer_server.py: MCP server with 8 tools (get_player, scout_position,
  compare_players, find_similar_players, get_league_table, get_match,
  get_player_history, data_status)
- README.md: rewritten to describe this project on top of ScraperFC
- .gitignore: exclude data/ files (too large for git)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add football data pipeline and MCP server
Understat no longer embeds season/match/team data in <script> tags.
All data is now served via AJAX endpoints.

Changes:
- Add _AJAX_HEADERS and _ajax_get() helper for XHR-style requests
- scrape_season_data(): replace script-tag parsing with getLeagueData API
- scrape_match(): replace script-tag parsing with getMatchData API
- scrape_team_data(): replace script-tag parsing with getTeamData API

All public method signatures and return types are unchanged.
_json_from_script() is kept but no longer used — can be removed in a
follow-up cleanup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
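
The commit above names an _AJAX_HEADERS constant and an _ajax_get() helper but does not show them. A minimal sketch of that kind of helper follows, using only the standard library; the header values, the function name ajax_get_json, and the injectable opener parameter are assumptions made for illustration, and no real Understat endpoint URL is implied.

```python
import json
import urllib.request
from typing import Any, Callable

# Assumed header set for an XHR-style request; the real _AJAX_HEADERS
# contents are not shown in the commit message.
AJAX_HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json",
}


def ajax_get_json(
    url: str,
    opener: Callable[..., Any] = urllib.request.urlopen,
    timeout: float = 30.0,
) -> Any:
    """Send a GET request with XHR-style headers and decode the JSON body.

    `opener` is injectable so the helper can be exercised without network access.
    """
    request = urllib.request.Request(url, headers=AJAX_HEADERS)
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Routing every endpoint call through one helper keeps the public scrape_* signatures unchanged while the transport underneath switches from script-tag parsing to JSON.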
The Transfermarkt site no longer embeds market-value history in an
inline Highcharts script tag, and no longer renders the transfer-history
grid in the initial HTML response.  Both sections are now served as
JSON from dedicated ceapi endpoints that were already used internally
by the site's JavaScript.

Changes:
- Add _TM_AJAX_HEADERS module-level constant (X-Requested-With: XMLHttpRequest)
- scrape_player(): replace Highcharts script-tag parsing with a call to
  /ceapi/marketValueDevelopment/graph/{player_id}
- scrape_player(): replace div.grid.tm-player-transfer-history-grid DOM
  parsing with a call to /ceapi/transferHistory/list/{player_id}
- Both fixes preserve the original DataFrame column schema
  (market_value_history: date/value; transfer_history: Season/Date/Left/Joined/MV/Fee)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
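
To make the two ceapi paths and the preserved column schema concrete, here is a small sketch. The endpoint paths and column names are taken from the commit message above; the helper names ceapi_url and check_columns, and the idea of validating rows as plain dictionaries, are assumptions for illustration rather than the scraper's real code.

```python
from typing import Iterable, Mapping

# Paths quoted from the commit message.
MARKET_VALUE_PATH = "/ceapi/marketValueDevelopment/graph/{player_id}"
TRANSFER_HISTORY_PATH = "/ceapi/transferHistory/list/{player_id}"

# Column schema the commit says must be preserved.
EXPECTED_COLUMNS = {
    "market_value_history": ("date", "value"),
    "transfer_history": ("Season", "Date", "Left", "Joined", "MV", "Fee"),
}


def ceapi_url(base_url: str, path_template: str, player_id: str) -> str:
    """Join a site base URL with one of the ceapi path templates."""
    return base_url.rstrip("/") + path_template.format(player_id=player_id)


def check_columns(name: str, rows: Iterable[Mapping[str, object]]) -> None:
    """Raise ValueError if any row's keys differ from the expected schema."""
    expected = EXPECTED_COLUMNS[name]
    for index, row in enumerate(rows):
        if tuple(row) != expected:
            raise ValueError(
                f"{name} row {index}: expected {expected}, got {tuple(row)}"
            )
```

A guard like check_columns is one way to catch a silent schema drift when the upstream JSON changes shape.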
Add football data pipeline, MCP server, and scraper fixes
- Add timeout=30 to both ceapi requests.get() calls in transfermarkt.py
  (flagged HIGH by Codacy/DeepSource — legitimate; prevents hangs if TM
  server is slow)
- Remove unnecessary else-after-return in understat.scrape_match()
- Replace dict() calls with {} literals (idiomatic, marginally faster)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add ceapi request timeouts and Understat scrape_match cleanup
Co-authored-by: Cursor <cursoragent@cursor.com>
Add parallel per-match SofaScore collection, freshness sidecars, retry and
logging improvements, optional Bright Data routing for JSON fetches, and MCP
tools for SofaScore match packs and ClubElo lookups plus richer data_status.

Co-authored-by: Cursor <cursoragent@cursor.com>
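
The parallel per-match collection with retries described above could be structured roughly as follows. This is a generic sketch using a thread pool and exponential backoff; the function names, worker count, and retry parameters are hypothetical and do not reflect the actual SofaScore collector.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def with_retry(
    func: Callable[[K], V], attempts: int = 3, base_delay: float = 0.5
) -> Callable[[K], V]:
    """Wrap `func` so transient failures are retried with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapper(arg: K) -> V:
        for attempt in range(attempts):
            try:
                return func(arg)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: surface the last error
                time.sleep(base_delay * 2 ** attempt)
        raise AssertionError("unreachable")

    return wrapper


def collect_matches(
    match_ids: Iterable[K],
    fetch_one: Callable[[K], V],
    max_workers: int = 4,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> Dict[K, V]:
    """Fetch every match concurrently and return results keyed by match id."""
    ids = list(match_ids)
    fetch = with_retry(fetch_one, attempts=attempts, base_delay=base_delay)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs ids with their results.
        results = list(pool.map(fetch, ids))
    return dict(zip(ids, results))
```

Threads suit this workload because each fetch is I/O-bound; a failure that survives all retries propagates out of collect_matches instead of being silently dropped.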
Add offline botasaurus_getters tests (Bright Data + Chrome branches), mocked
SofaScore tests and an integration marker for live scrapes, brightdata-sdk as
a test extra, and swap the FBref tox/CI job for test-botasaurus.

Co-authored-by: Cursor <cursoragent@cursor.com>
Treat four parquets plus an active checkpoint as in-progress; reset checkpoint
JSON when force-matches starts; flush after the first match of each segment; clarify
progress logs (season index vs this-run counter). Add regression tests for pack-done
detection.

Co-authored-by: Cursor <cursoragent@cursor.com>
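
The pack-done rule in the commit above (four parquets plus an active checkpoint still counts as in-progress) can be sketched as below. The checkpoint file name, its "active" JSON key, and the function names are hypothetical; only the rule itself comes from the commit message.

```python
import json
from pathlib import Path

EXPECTED_PARQUET_COUNT = 4  # from the commit message: "four parquets"
CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def checkpoint_is_active(pack_dir: Path) -> bool:
    """Return True if a checkpoint exists and marks the run as still active."""
    path = pack_dir / CHECKPOINT_NAME
    if not path.exists():
        return False
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return True  # unreadable checkpoint: conservatively treat as active
    # "active" is an assumed key, used here only for illustration.
    return isinstance(state, dict) and bool(state.get("active", False))


def pack_is_done(pack_dir: Path) -> bool:
    """A pack is done only when all parquets exist and no checkpoint is active."""
    parquet_count = sum(1 for _ in pack_dir.glob("*.parquet"))
    return parquet_count >= EXPECTED_PARQUET_COUNT and not checkpoint_is_active(pack_dir)
```

Checking the checkpoint in addition to the file count avoids declaring a pack complete when an interrupted run has already written all four files but not every match.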
Schedule upstream shape checks while keeping default tox CI off live
ClubElo, Understat, and Transfermarkt calls via the integration marker.

Co-authored-by: Cursor <cursoragent@cursor.com>
collect_data no longer scrapes or merges wage files; Transfermarkt
remains the financial merge path. Remove the capology tox env and
GitHub Actions job (and DeepSource combine input). Add Sphinx
intersphinx for pandas, NumPy, and Python so API references resolve
without nitpick noise. Trim year-parameter docs and the dev marimo
notebook accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
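
For readers unfamiliar with intersphinx, a Sphinx conf.py fragment along these lines would resolve cross-references to pandas, NumPy, and Python. The extension name and mapping format are standard Sphinx; the exact entries in this repository's conf.py are not shown in the commit, so treat the values below, including the nitpicky setting, as an assumed example.

```python
# Fragment of a Sphinx conf.py (assumed example, not the repository's file).
extensions = [
    "sphinx.ext.intersphinx",
]

# Each entry maps a project name to (base URL of its docs, inventory location).
# None tells Sphinx to fetch objects.inv from the base URL.
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    "numpy": ("https://numpy.org/doc/stable/", None),
    "pandas": ("https://pandas.pydata.org/docs/", None),
}

# With nitpicky mode on, unresolved references are reported as warnings,
# which is where missing intersphinx mappings tend to show up.
nitpicky = True
```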
The canary.yml schedule is specific to the private deployment server.
Contract tests and all other changes are retained for the public repo.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add contract tests for scrapers and remove Capology
Introduce the ``collect_data`` package (config, storage, pipeline, ``python -m`` entry), keep the root ``collect_data.py`` shim, and document Parquet-first unified output plus server loading. Register the package in setuptools so editable installs and tox see ``collect_data`` imports. Normalize test and notebook ``sys.path`` setup with ``find_root()`` instead of cwd-relative paths. Fix a mypy issue in ``scrape_team_league_stats`` by separating Series vs one-row DataFrame names. Add checkpoint tracker unit tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
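
The commit above switches tests and notebooks from cwd-relative paths to find_root(). The project's own find_root() is not shown, so the following is a hypothetical equivalent: it walks up from a starting directory until it finds a marker file, then puts that directory on sys.path. The marker names and the add_root_to_sys_path helper are assumptions.

```python
import sys
from pathlib import Path
from typing import Optional, Sequence


def find_root(
    start: Optional[Path] = None,
    markers: Sequence[str] = ("pyproject.toml", ".git"),
) -> Path:
    """Return the nearest ancestor of `start` that contains a marker entry."""
    current = Path(start if start is not None else Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no project root found above {current}")


def add_root_to_sys_path(start: Optional[Path] = None) -> Path:
    """Insert the project root at the front of sys.path if it is not already there."""
    root = find_root(start)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

Because the lookup depends on marker files rather than the current working directory, imports resolve the same way whether tests run from the repository root, a subdirectory, or a tox environment.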
Move each scraper into ``collect_data/collectors`` (fbref, understat,
sofascore, sofascore_matches, clubelo, transfermarkt, capology) and merge
logic into ``collect_data/build`` (``unified.py``, ``financials.py``). Keep
``pipeline.py`` as CLI-only orchestration. Add ``fbref_season_to_understat`` to
helpers for Understat. Update the README structure diagram and public exports
so tests and ``python -m collect_data`` behave as before.

Co-authored-by: Cursor <cursoragent@cursor.com>

kupsas commented May 18, 2026

Opened by mistake (wrong default repo for gh); work belongs in kupsas/football-data-mcp.

@kupsas kupsas closed this May 18, 2026
@codacy-production

Not up to standards ⛔

🔴 Issues 54 high · 9 medium · 37 minor

Alerts:
⚠ 100 issues (≤ 0 issues of at least minor severity)

Results:
100 new issues

Category Results
Documentation 36 minor
ErrorProne 6 high
Security 4 medium · 48 high
CodeStyle 1 minor
Complexity 5 medium

🟢 Metrics 705 complexity · 15 duplication

Metric Results
Complexity 705
Duplication 15

deepsource-io Bot commented May 18, 2026

DeepSource Code Review

We reviewed changes in 50f5df9...94742d5 on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

PR Report Card

Overall Grade · Security · Reliability · Complexity · Hygiene

Code Review Summary

Analyzer · Updated (UTC)
Python · May 18, 2026 3:50 a.m.
Code coverage · May 18, 2026 3:50 a.m.
