Modular collect_data package with per-source collectors and build layer #89
kupsas wants to merge 18 commits into
- collect_data.py: multi-source pipeline merging FBref, SofaScore, Understat, and Transfermarkt into a unified player stats CSV (18,854 rows x 146 cols, 10 leagues, 3 seasons)
- soccer_server.py: MCP server with 8 tools (get_player, scout_position, compare_players, find_similar_players, get_league_table, get_match, get_player_history, data_status)
- README.md: rewritten to describe this project on top of ScraperFC
- .gitignore: exclude data/ files (too large for git)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add football data pipeline and MCP server
Understat no longer embeds season/match/team data in <script> tags. All data is now served via AJAX endpoints.
Changes:
- Add _AJAX_HEADERS and _ajax_get() helper for XHR-style requests
- scrape_season_data(): replace script-tag parsing with getLeagueData API
- scrape_match(): replace script-tag parsing with getMatchData API
- scrape_team_data(): replace script-tag parsing with getTeamData API
All public method signatures and return types are unchanged. _json_from_script() is kept but no longer used; it can be removed in a follow-up cleanup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
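As an illustration of the XHR-style helper described above, here is a minimal sketch. It is not the code from this PR: the function names, the header set, and the injectable `opener` parameter are assumptions, and the standard library's `urllib` is used in place of whatever HTTP client the scraper actually uses, so the sketch has no third-party dependencies.

```python
import json
import urllib.request

# Assumed header set: the commit only says the helper issues XHR-style requests.
AJAX_HEADERS = {"X-Requested-With": "XMLHttpRequest"}


def build_ajax_request(url):
    """Return a Request object carrying the XHR-style headers."""
    return urllib.request.Request(url, headers=AJAX_HEADERS)


def ajax_get(url, opener=urllib.request.urlopen, timeout=30):
    """Fetch `url` as an XHR-style request and return the decoded JSON body.

    `opener` is injectable (hypothetical design choice) so the function can be
    exercised in tests without touching the network.
    """
    with opener(build_ajax_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Passing a fake `opener` that returns an in-memory byte stream is one way to keep unit tests offline while live calls stay behind a separate marker.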
The Transfermarkt site no longer embeds market-value history in an
inline Highcharts script tag, and no longer renders the transfer-history
grid in the initial HTML response. Both sections are now served as
JSON from dedicated ceapi endpoints that were already used internally
by the site's JavaScript.
Changes:
- Add _TM_AJAX_HEADERS module-level constant (X-Requested-With: XMLHttpRequest)
- scrape_player(): replace Highcharts script-tag parsing with a call to
/ceapi/marketValueDevelopment/graph/{player_id}
- scrape_player(): replace div.grid.tm-player-transfer-history-grid DOM
parsing with a call to /ceapi/transferHistory/list/{player_id}
- Both fixes preserve the original DataFrame column schema
(market_value_history: date/value; transfer_history: Season/Date/Left/Joined/MV/Fee)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
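For orientation, the two endpoint paths named above could be assembled as in the sketch below. The host constant, the function name, and the dictionary keys are assumptions for illustration; only the `/ceapi/...` paths come from the commit message.

```python
# Assumed host; the commit message gives only the endpoint paths.
TM_BASE_URL = "https://www.transfermarkt.com"


def ceapi_urls(player_id, base_url=TM_BASE_URL):
    """Build the two ceapi URLs for one player (hypothetical helper)."""
    pid = str(player_id)
    return {
        "market_value_history": f"{base_url}/ceapi/marketValueDevelopment/graph/{pid}",
        "transfer_history": f"{base_url}/ceapi/transferHistory/list/{pid}",
    }
```

The response schema is not shown here because the commit message does not describe it; it only states that the resulting DataFrame columns are unchanged.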
Add football data pipeline, MCP server, and scraper fixes
- Add timeout=30 to both ceapi requests.get() calls in transfermarkt.py
(flagged HIGH by Codacy/DeepSource — legitimate; prevents hangs if TM
server is slow)
- Remove unnecessary else-after-return in understat.scrape_match()
- Replace dict() calls with {} literals (idiomatic, marginally faster)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add ceapi request timeouts and Understat scrape_match cleanup
Co-authored-by: Cursor <cursoragent@cursor.com>
Add parallel per-match SofaScore collection, freshness sidecars, retry and logging improvements, optional Bright Data routing for JSON fetches, and MCP tools for SofaScore match packs and ClubElo lookups plus richer data_status. Co-authored-by: Cursor <cursoragent@cursor.com>
Add offline botasaurus_getters tests (Bright Data + Chrome branches), mocked SofaScore tests and an integration marker for live scrapes, brightdata-sdk as a test extra, and swap the FBref tox/CI job for test-botasaurus. Co-authored-by: Cursor <cursoragent@cursor.com>
Treat four parquets plus an active checkpoint as in-progress; reset checkpoint JSON when force-matches starts; flush after the first match of each segment; clarify progress logs (season index vs this-run counter). Add regression tests for pack-done detection. Co-authored-by: Cursor <cursoragent@cursor.com>
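The pack-done rule described above might look roughly like the following sketch. The parquet and checkpoint file names are hypothetical (the commit does not list them), and "active checkpoint" is simplified here to "the checkpoint file exists"; the real check may inspect the checkpoint contents.

```python
from pathlib import Path

# Hypothetical file names; the commit does not list the four parquet outputs.
PACK_PARQUETS = ("events.parquet", "lineups.parquet", "shots.parquet", "stats.parquet")
CHECKPOINT_NAME = "checkpoint.json"


def pack_status(pack_dir):
    """Classify a match-pack directory as 'done', 'in-progress', or 'not-started'."""
    pack_dir = Path(pack_dir)
    have_all = all((pack_dir / name).is_file() for name in PACK_PARQUETS)
    # Assumption: a checkpoint file on disk means a run is still active.
    checkpoint_active = (pack_dir / CHECKPOINT_NAME).is_file()
    if checkpoint_active:
        # Four parquets plus an active checkpoint still counts as in-progress.
        return "in-progress"
    return "done" if have_all else "not-started"
```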
Schedule upstream shape checks while keeping default tox CI off live ClubElo, Understat, and Transfermarkt calls via the integration marker. Co-authored-by: Cursor <cursoragent@cursor.com>
collect_data no longer scrapes or merges wage files; Transfermarkt remains the financial merge path. Remove the capology tox env and GitHub Actions job (and DeepSource combine input). Add Sphinx intersphinx for pandas, NumPy, and Python so API references resolve without nitpick noise. Trim year-parameter docs and the dev marimo notebook accordingly. Co-authored-by: Cursor <cursoragent@cursor.com>
The canary.yml schedule is specific to the private deployment server. Contract tests and all other changes are retained for the public repo. Co-authored-by: Cursor <cursoragent@cursor.com>
Add contract tests for scrapers and remove Capology
Introduce the ``collect_data`` package (config, storage, pipeline, ``python -m`` entry), keep the root ``collect_data.py`` shim, and document Parquet-first unified output plus server loading. Register the package in setuptools so editable installs and tox see ``collect_data`` imports. Normalize test and notebook ``sys.path`` setup with ``find_root()`` instead of cwd-relative paths. Fix a mypy issue in ``scrape_team_league_stats`` by separating Series vs one-row DataFrame names. Add checkpoint tracker unit tests. Co-authored-by: Cursor <cursoragent@cursor.com>
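A `find_root()` helper of the kind mentioned above is commonly written as an upward search for a marker file. The sketch below is one possible shape under that assumption; the signature, the marker names, and the default start directory are not taken from this PR.

```python
from pathlib import Path


def find_root(start=None, markers=("pyproject.toml", "setup.py", ".git")):
    """Walk upward from `start` and return the first directory holding a marker.

    Hypothetical signature: the PR only states that tests call find_root()
    instead of relying on cwd-relative paths.
    """
    current = Path(start) if start is not None else Path.cwd()
    current = current.resolve()
    if current.is_file():
        current = current.parent
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no project root found above {current}")
```

A test or notebook could then call `sys.path.insert(0, str(find_root()))`, which behaves the same regardless of the directory the process was started from.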
Move each scraper into ``collect_data/collectors`` (fbref, understat, sofascore, sofascore_matches, clubelo, transfermarkt, capology) and merge logic into ``collect_data/build`` (``unified.py``, ``financials.py``). Keep ``pipeline.py`` as CLI-only orchestration. Add ``fbref_season_to_understat`` to helpers for Understat. Update the README structure diagram and public exports so tests and ``python -m collect_data`` behave as before. Co-authored-by: Cursor <cursoragent@cursor.com>
Opened by mistake (wrong default repo for gh); work belongs in kupsas/football-data-mcp.
Not up to standards ⛔
🔴 Issues
| Category | Results |
|---|---|
| Documentation | 36 minor |
| ErrorProne | 6 high |
| Security | 4 medium 48 high |
| CodeStyle | 1 minor |
| Complexity | 5 medium |
🟢 Metrics
| Metric | Results |
|---|---|
| Complexity | 705 |
| Duplication | 15 |
Code Review Summary
| Analyzer | Status | Updated (UTC) | Details |
|---|---|---|---|
| Python | | May 18, 2026 3:50 a.m. | Review ↗ |
| Code coverage | | May 18, 2026 3:50 a.m. | Review ↗ |
Important
AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.
Summary
This branch delivers a modular `collect_data` package instead of a single large pipeline module. Scraping logic is split one source per file under `collect_data/collectors/` (FBref, Understat, SofaScore season and per-match packs, ClubElo, Transfermarkt, Capology). Merge and financial enrichment live under `collect_data/build/` (`unified.py` for the layered merge and Parquet export, `financials.py` for Transfermarkt and Capology joins). `collect_data/pipeline.py` is now CLI-only (`argparse` and dispatch). The root `collect_data.py` shim and `python -m collect_data` entry stay the same for users.

Supporting pieces: setuptools discovers the top-level `collect_data` package for editable installs; tests use `find_root()` for stable `sys.path`; `fbref_season_to_understat` lives in `helpers.py` for Understat; README structure diagram reflects the new layout.

Test plan
`tox -e lint,typecheck,test-all,docs` and the per-source tox envs (`test-botasaurus`, `test-sofascore`, `test-clubelo`, `test-transfermarkt`, `test-understat`) all pass locally (mirrors GitHub Actions `test.yml`).