Add contract tests for scrapers and remove Capology #88

Closed
kupsas wants to merge 15 commits into oseymour:main from kupsas:sync/cherry-pick-pvt

kupsas commented May 18, 2026

What changed

Contract tests added — four new integration-style tests in test/contract/ cover the live shape of responses from ClubElo, SofaScore, Transfermarkt, and Understat. These are opt-in (marked contract) so they never run in standard CI — only when explicitly triggered. pytest.toml and tox.ini have been updated with the new markers and a dedicated [testenv:contract] environment.
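
As a rough illustration of what a shape-focused contract check could look like, here is a minimal sketch. The helper name `shape_violations`, the `EXPECTED_ROW` mapping, and the field names are all hypothetical and are not taken from the tests in this PR; a literal record stands in for a live response so the sketch runs offline.

```python
from typing import Any, Mapping

def shape_violations(record: Mapping[str, Any], expected: Mapping[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for key, expected_type in expected.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(record[key]).__name__}"
            )
    return problems

# Hypothetical expected shape for one ClubElo-style row (illustrative field names).
EXPECTED_ROW = {"team": str, "elo": float, "date": str}

# A real contract test would fetch this record from the live endpoint.
sample = {"team": "Arsenal", "elo": 2001.5, "date": "2026-05-18"}
violations = shape_violations(sample, EXPECTED_ROW)
```

In the suite, a check like this would presumably sit inside a test function decorated with `@pytest.mark.contract`, so that it is only collected when the contract marker is selected.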

Capology removed — the Capology scraper and its supporting code have been dropped from the collection pipeline, CI jobs, documentation, and tests. Transfermarkt remains the financial data path. This also removes a Selenium-dependent test that was blocking CI.

Why

The contract tests give us a lightweight canary for upstream API shape changes without polluting the regular test run. Removing Capology cleans up dead code and unblocks CI from a test that required a live browser.

How to test

# Standard CI (no live calls)
tox -e py312

# Contract tests (hits live endpoints — run manually)
tox -e contract

kupsas and others added 15 commits May 8, 2026 18:39
- collect_data.py: multi-source pipeline merging FBref, SofaScore,
  Understat, and Transfermarkt into a unified player stats CSV
  (18,854 rows x 146 cols, 10 leagues, 3 seasons)
- soccer_server.py: MCP server with 8 tools (get_player, scout_position,
  compare_players, find_similar_players, get_league_table, get_match,
  get_player_history, data_status)
- README.md: rewritten to describe this project on top of ScraperFC
- .gitignore: exclude data/ files (too large for git)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add football data pipeline and MCP server
Understat no longer embeds season/match/team data in <script> tags.
All data is now served via AJAX endpoints.

Changes:
- Add _AJAX_HEADERS and _ajax_get() helper for XHR-style requests
- scrape_season_data(): replace script-tag parsing with getLeagueData API
- scrape_match(): replace script-tag parsing with getMatchData API
- scrape_team_data(): replace script-tag parsing with getTeamData API

All public method signatures and return types are unchanged.
_json_from_script() is kept but no longer used — can be removed in a
follow-up cleanup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
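
For readers unfamiliar with XHR-style fetching, the helper described in this commit might look roughly like the sketch below. It uses only the standard library rather than whatever HTTP client the scraper actually uses, and the header contents, the `opener` parameter, the `timeout` default, and the example URL are assumptions made for illustration.

```python
import io
import json
import urllib.request

# Assumed header set; the real _AJAX_HEADERS in the PR may contain more fields.
_AJAX_HEADERS = {"X-Requested-With": "XMLHttpRequest"}

def build_ajax_request(url: str) -> urllib.request.Request:
    """Build a request that identifies itself as an XHR call."""
    return urllib.request.Request(url, headers=_AJAX_HEADERS)

def _ajax_get(url: str, opener=urllib.request.urlopen, timeout: float = 30):
    """Fetch url as an XHR-style request and decode the JSON body.

    `opener` is injectable (an assumption of this sketch) so the helper
    can be exercised without touching the network.
    """
    with opener(build_ajax_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Offline demonstration: a stand-in opener that returns canned JSON.
def _fake_opener(request, timeout):
    return io.BytesIO(b'{"teams": {}}')

# The host and path are placeholders, not the real Understat endpoint.
data = _ajax_get("https://example.invalid/getLeagueData/EPL/2025", opener=_fake_opener)
```

Keeping the request construction separate from the fetch makes it easy to assert on headers in offline tests while leaving the public scrape methods unchanged.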
The Transfermarkt site no longer embeds market-value history in an
inline Highcharts script tag, and no longer renders the transfer-history
grid in the initial HTML response.  Both sections are now served as
JSON from dedicated ceapi endpoints that were already used internally
by the site's JavaScript.

Changes:
- Add _TM_AJAX_HEADERS module-level constant (X-Requested-With: XMLHttpRequest)
- scrape_player(): replace Highcharts script-tag parsing with a call to
  /ceapi/marketValueDevelopment/graph/{player_id}
- scrape_player(): replace div.grid.tm-player-transfer-history-grid DOM
  parsing with a call to /ceapi/transferHistory/list/{player_id}
- Both fixes preserve the original DataFrame column schema
  (market_value_history: date/value; transfer_history: Season/Date/Left/Joined/MV/Fee)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
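
To make the "preserve the original column schema" point concrete, the sketch below maps a JSON payload onto the legacy transfer_history columns. The payload layout, the JSON key names, and the function name `transfers_to_rows` are hypothetical; the real ceapi field names may differ. Only the target column names come from the commit message.

```python
# Legacy column order named in the commit message.
TRANSFER_COLUMNS = ["Season", "Date", "Left", "Joined", "MV", "Fee"]

# Hypothetical mapping from JSON keys to the legacy column names.
_KEY_TO_COLUMN = {
    "season": "Season",
    "date": "Date",
    "from": "Left",
    "to": "Joined",
    "marketValue": "MV",
    "fee": "Fee",
}

def transfers_to_rows(payload: dict) -> list[dict]:
    """Convert a transfer-history payload into rows with the legacy columns."""
    rows = []
    for item in payload.get("transfers", []):
        # Start from a full row so missing keys become None, not absent columns.
        row = {column: None for column in TRANSFER_COLUMNS}
        for key, column in _KEY_TO_COLUMN.items():
            if key in item:
                row[column] = item[key]
        rows.append(row)
    return rows

# Illustrative payload; the values are made up for the example.
sample_payload = {
    "transfers": [
        {"season": "23/24", "date": "2023-07-01", "from": "Club A",
         "to": "Club B", "marketValue": "10m", "fee": "12m"}
    ]
}
rows = transfers_to_rows(sample_payload)
```

Rows built this way can be passed to a DataFrame constructor with `columns=TRANSFER_COLUMNS`, which keeps downstream consumers unaware of the change in data source.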
Add football data pipeline, MCP server, and scraper fixes
- Add timeout=30 to both ceapi requests.get() calls in transfermarkt.py
  (flagged HIGH by Codacy/DeepSource — legitimate; prevents hangs if TM
  server is slow)
- Remove unnecessary else-after-return in understat.scrape_match()
- Replace dict() calls with {} literals (idiomatic, marginally faster)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add ceapi request timeouts and Understat scrape_match cleanup
Co-authored-by: Cursor <cursoragent@cursor.com>
Add parallel per-match SofaScore collection, freshness sidecars, retry and
logging improvements, optional Bright Data routing for JSON fetches, and MCP
tools for SofaScore match packs and ClubElo lookups plus richer data_status.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add offline botasaurus_getters tests (Bright Data + Chrome branches), mocked
SofaScore tests and an integration marker for live scrapes, brightdata-sdk as
a test extra, and swap the FBref tox/CI job for test-botasaurus.

Co-authored-by: Cursor <cursoragent@cursor.com>
Treat four parquets plus an active checkpoint as in-progress; reset checkpoint
JSON when force-matches starts; flush after the first match of each segment; clarify
progress logs (season index vs this-run counter). Add regression tests for pack-done
detection.

Co-authored-by: Cursor <cursoragent@cursor.com>
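
The pack-done rule in this commit can be sketched as follows: a pack counts as done only when all four parquet files exist and no checkpoint file is present. The file names and the function name `pack_is_done` are hypothetical, and "active checkpoint" is assumed here to mean that a checkpoint JSON file exists on disk.

```python
import tempfile
from pathlib import Path

# Hypothetical file names; the PR does not list the actual ones.
PACK_PARQUETS = ("events.parquet", "lineups.parquet", "shots.parquet", "stats.parquet")
CHECKPOINT_NAME = "checkpoint.json"

def pack_is_done(pack_dir: Path) -> bool:
    """Return True only if every parquet exists and no checkpoint remains."""
    all_parquets = all((pack_dir / name).is_file() for name in PACK_PARQUETS)
    checkpoint_active = (pack_dir / CHECKPOINT_NAME).is_file()
    return all_parquets and not checkpoint_active

# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pack_dir = Path(tmp)
    for name in PACK_PARQUETS:
        (pack_dir / name).touch()
    done_without_checkpoint = pack_is_done(pack_dir)

    # Four parquets plus a checkpoint should read as in-progress, not done.
    (pack_dir / CHECKPOINT_NAME).write_text("{}")
    done_with_checkpoint = pack_is_done(pack_dir)
```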
Schedule upstream shape checks while keeping default tox CI off live
ClubElo, Understat, and Transfermarkt calls via the integration marker.

Co-authored-by: Cursor <cursoragent@cursor.com>
collect_data no longer scrapes or merges wage files; Transfermarkt
remains the financial merge path. Remove the capology tox env and
GitHub Actions job (and DeepSource combine input). Add Sphinx
intersphinx for pandas, NumPy, and Python so API references resolve
without nitpick noise. Trim year-parameter docs and the dev marimo
notebook accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
The canary.yml schedule is specific to the private deployment server.
Contract tests and all other changes are retained for the public repo.

Co-authored-by: Cursor <cursoragent@cursor.com>
@kupsas kupsas closed this May 18, 2026
@kupsas kupsas deleted the sync/cherry-pick-pvt branch May 18, 2026 03:20
@codacy-production

Not up to standards ⛔

🔴 Issues 45 high · 10 medium · 32 minor

Alerts:
⚠ 87 issues (quality gate: ≤ 0 issues of at least minor severity)

Results:
87 new issues

| Category | Results |
| --- | --- |
| UnusedCode | 1 minor |
| Documentation | 29 minor |
| ErrorProne | 2 high |
| Security | 3 medium, 43 high |
| CodeStyle | 2 minor |
| Complexity | 7 medium |

🟢 Metrics 777 complexity · 11 duplication

Metric Results
Complexity 777
Duplication 11

deepsource-io Bot commented May 18, 2026

DeepSource Code Review

We reviewed changes in 50f5df9...ec68847 on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

PR Report Card (grade badges not captured): Overall Grade, Security, Reliability, Complexity, Hygiene

Code Review Summary

| Analyzer | Updated (UTC) |
| --- | --- |
| Python | May 18, 2026 3:19 a.m. |
| Code coverage | May 18, 2026 3:19 a.m. |

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.
