A Python script to scrape and collect papers from Hugging Face's Daily Papers page. The scraper extracts comprehensive information about each paper, downloads PDFs, and organizes them by date.
- Scrape papers for a single date or date range
- Extract detailed paper information:
  - Title
  - Landing page URL
  - Complete list of authors
  - PDF link (arXiv)
  - Full abstract
  - Upvote count
- Automatic PDF downloads
- Organized file structure
- Progress tracking for downloads
- Rate limiting to be server-friendly
- Command-line interface with date parameters
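
The date parameters listed above could be handled roughly as follows. This is a minimal sketch using the standard library's `argparse`, not the script's actual code: the helper names (`parse_date`, `build_parser`, `date_range`, `resolve_dates`) are hypothetical, and the treatment of a lone `--end-date` (start defaults to today) is an assumption.

```python
import argparse
from datetime import date, datetime, timedelta


def parse_date(value):
    """Parse a YYYY-MM-DD string; argparse reports a clean error on bad input."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"invalid date {value!r}, expected YYYY-MM-DD"
        )


def build_parser():
    """Build a parser with the two documented flags (both optional)."""
    parser = argparse.ArgumentParser(description="Scrape Hugging Face Daily Papers")
    parser.add_argument("--start-date", type=parse_date, default=None)
    parser.add_argument("--end-date", type=parse_date, default=None)
    return parser


def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    if end < start:
        raise ValueError("end date must not be before start date")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def resolve_dates(args, today=None):
    """No flags means today; only --start-date means that single day."""
    today = today or date.today()
    start = args.start_date or today
    end = args.end_date or start  # assumption: a missing end date means one day
    return list(date_range(start, end))


# Usage: pass an explicit argument list instead of reading sys.argv.
args = build_parser().parse_args(["--start-date", "2024-01-01", "--end-date", "2024-01-03"])
for day in resolve_dates(args):
    print(day.isoformat())
```

Validating dates inside the `type=` callback keeps malformed input out of the scraping loop entirely.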
- Clone the repository:
    git clone [your-repo-url]
    cd hf-daily-paper-scraper
- Install dependencies:
    pip install -r requirements.txt

Run the script without parameters to scrape today's papers:
    python hf_paper_scraper.py

Use the --start-date parameter to scrape papers from a specific date:
    python hf_paper_scraper.py --start-date 2024-01-06

Use both --start-date and --end-date to scrape papers from a range of dates:
    python hf_paper_scraper.py --start-date 2024-01-01 --end-date 2024-01-06

The script creates an organized directory structure:
papers/
├── YYYY-MM-DD.json    # Paper metadata for each date
├── YYYY-MM-DD/        # PDF files for each date
│   ├── paper1.pdf
│   ├── paper2.pdf
│   └── ...
├── next-date.json
└── next-date/
    └── ...
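
One way to derive paths that match the layout above is sketched here with `pathlib`. The function names `clean_filename` and `paths_for` are hypothetical, and naming each PDF after its sanitized title is an assumption; the script's real naming scheme may differ.

```python
import re
from pathlib import Path


def clean_filename(title, max_length=100):
    """Turn a paper title into a filesystem-friendly PDF name."""
    # Collapse every run of characters other than letters, digits,
    # dots, and hyphens into a single underscore.
    name = re.sub(r"[^A-Za-z0-9.-]+", "_", title).strip("_.")
    return (name[:max_length] or "untitled") + ".pdf"


def paths_for(day, title, root="papers"):
    """Return (metadata JSON path, PDF path) for one paper on one date.

    `day` is a YYYY-MM-DD string.
    """
    base = Path(root)
    return base / f"{day}.json", base / day / clean_filename(title)


json_path, pdf_path = paths_for("2024-01-06", "Attention Is All You Need: A Study?")
print(json_path)
print(pdf_path)
```

Before writing a PDF, the per-date directory can be created with `pdf_path.parent.mkdir(parents=True, exist_ok=True)`.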
Each date's JSON file contains:
{
  "date": "YYYY-MM-DD",
  "papers": [
    {
      "title": "Paper Title",
      "url": "https://huggingface.co/papers/paper-id",
      "authors": ["Author 1", "Author 2"],
      "pdf_url": "https://arxiv.org/pdf/paper-id.pdf",
      "abstract": "Paper abstract...",
      "upvotes": 42
    }
  ]
}

- Python 3.6+
- requests==2.31.0
- beautifulsoup4==4.12.2
- Date validation and range support
- Error handling for failed requests
- Rate limiting (0.5s between requests)
- Progress tracking:
  - Overall date progress
  - Per-date PDF download progress
  - Success/failure indicators
- Automatic file organization
- PDF downloads with clean filenames
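
The rate limiting, error handling, and success/failure indicators listed above might fit together along these lines. This is a sketch under assumptions rather than the script's real implementation: `fetch_all` is a hypothetical helper, and the `fetch` callable is injected so the loop can be exercised without network access.

```python
import time


def fetch_all(urls, fetch, delay=0.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    Returns (results, failures). A failed URL is recorded, not raised,
    so one bad request does not abort the whole run.
    """
    results, failures = {}, {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        try:
            results[url] = fetch(url)
            print(f"[{index + 1}/{len(urls)}] ok     {url}")
        except Exception as error:  # broad on purpose: log the failure and continue
            failures[url] = str(error)
            print(f"[{index + 1}/{len(urls)}] failed {url}: {error}")
    return results, failures


def demo_fetch(url):
    """Stand-in for a real HTTP request; fails for one URL to show the error path."""
    if url.endswith("missing"):
        raise RuntimeError("404 Not Found")
    return f"content of {url}"


# Short delay here so the demo finishes quickly; the README states 0.5s.
results, failures = fetch_all(
    ["https://example.com/a", "https://example.com/missing"], demo_fetch, delay=0.01
)
```

With `requests`, the injected `fetch` could wrap `requests.get(url, timeout=30)` followed by `response.raise_for_status()`, so HTTP errors surface as exceptions and land in `failures`.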
Feel free to:
- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Add documentation or examples
[Your chosen license]
- Thanks to Hugging Face for their daily papers feature
- Built with BeautifulSoup4 for HTML parsing