Hugging Face Daily Papers Scraper

A Python script to scrape and collect papers from Hugging Face's Daily Papers page. The scraper extracts comprehensive information about each paper, downloads PDFs, and organizes them by date.

Features

- Scrape papers for a single date or a date range
- Extract detailed paper information:
  - Title
  - Landing page URL
  - Complete list of authors
  - PDF link (arXiv)
  - Full abstract
  - Upvote count
- Automatic PDF downloads
- Organized file structure
- Progress tracking for downloads
- Rate limiting to be server-friendly
- Command-line interface with date parameters

Installation

Clone the repository:

git clone [your-repo-url]
cd hf-daily-paper-scraper

Install dependencies:

pip install -r requirements.txt

Usage

Basic Usage (Today's Papers)

Run the script without parameters to scrape today's papers:

python hf_paper_scraper.py

Single Date

Use the --start-date parameter to scrape papers from a specific date:

python hf_paper_scraper.py --start-date 2024-01-06

Date Range

Use both --start-date and --end-date to scrape papers from a range of dates:

python hf_paper_scraper.py --start-date 2024-01-01 --end-date 2024-01-06
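
The three invocations above suggest that a missing --start-date falls back to today and a missing --end-date falls back to the start date. The following is a minimal sketch of that flag handling with argparse. The function names (parse_date, build_parser, resolve_dates) are hypothetical and are not taken from hf_paper_scraper.py; the actual script may handle its arguments differently.

```python
import argparse
from datetime import date, datetime, timedelta


def parse_date(text):
    """Parse a YYYY-MM-DD string; report bad input as an argparse error."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"invalid date {text!r}, expected YYYY-MM-DD"
        )


def build_parser():
    """Build a parser exposing the two flags documented above."""
    parser = argparse.ArgumentParser(description="Scrape Hugging Face Daily Papers")
    parser.add_argument("--start-date", type=parse_date, default=None)
    parser.add_argument("--end-date", type=parse_date, default=None)
    return parser


def resolve_dates(start, end, today=None):
    """Return every date to scrape, inclusive on both ends.

    Assumed defaults: no start date means today, no end date means a
    single-day run on the start date.
    """
    today = today or date.today()
    start = start or today
    end = end or start
    if end < start:
        raise ValueError("--end-date must not be earlier than --start-date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# Example: the date-range invocation shown above.
args = build_parser().parse_args(
    ["--start-date", "2024-01-01", "--end-date", "2024-01-06"]
)
for day in resolve_dates(args.start_date, args.end_date):
    print(day.isoformat())
```

Passing the argument list explicitly to parse_args keeps the example independent of the real command line; a script would call parse_args() with no arguments instead.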

Output Structure

The script creates an organized directory structure:

papers/
├── YYYY-MM-DD.json         # Paper metadata for each date
├── YYYY-MM-DD/            # PDF files for each date
│   ├── paper1.pdf
│   ├── paper2.pdf
│   └── ...
├── next-date.json
└── next-date/
    └── ...
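
As an illustration of the layout above, the sketch below derives the metadata file path and the PDF directory for a given date. The helper names paths_for_date and ensure_layout are hypothetical, and the default root "papers" is taken from the tree shown above.

```python
from pathlib import Path


def paths_for_date(day, root="papers"):
    """Return (metadata_json_path, pdf_directory) for a YYYY-MM-DD string."""
    root = Path(root)
    return root / f"{day}.json", root / day


def ensure_layout(day, root="papers"):
    """Create the PDF directory for `day` (and the root) if it is missing."""
    json_path, pdf_dir = paths_for_date(day, root)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    return json_path, pdf_dir


json_path, pdf_dir = paths_for_date("2024-01-06")
print(json_path)  # metadata file for that date
print(pdf_dir)    # directory holding that date's PDFs
```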

JSON Format

Each date's JSON file contains:

{
  "date": "YYYY-MM-DD",
  "papers": [
    {
      "title": "Paper Title",
      "url": "https://huggingface.co/papers/paper-id",
      "authors": ["Author 1", "Author 2"],
      "pdf_url": "https://arxiv.org/pdf/paper-id.pdf",
      "abstract": "Paper abstract...",
      "upvotes": 42
    }
  ]
}
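
Because the metadata is plain JSON, it can be consumed with the standard library alone. The sketch below ranks one day's papers by upvotes. The function top_papers and the sample titles, IDs, and counts are made up for illustration; only the field names come from the format above.

```python
import json


def top_papers(json_text, n=5):
    """Return (title, upvotes) pairs for the n most upvoted papers."""
    data = json.loads(json_text)
    ranked = sorted(data["papers"], key=lambda p: p["upvotes"], reverse=True)
    return [(p["title"], p["upvotes"]) for p in ranked[:n]]


# Illustrative sample following the documented schema (not real data).
sample = json.dumps({
    "date": "2024-01-06",
    "papers": [
        {"title": "Paper A", "url": "https://huggingface.co/papers/a",
         "authors": ["Author 1"], "pdf_url": "https://arxiv.org/pdf/a.pdf",
         "abstract": "...", "upvotes": 7},
        {"title": "Paper B", "url": "https://huggingface.co/papers/b",
         "authors": ["Author 2"], "pdf_url": "https://arxiv.org/pdf/b.pdf",
         "abstract": "...", "upvotes": 42},
    ],
})

for title, upvotes in top_papers(sample):
    print(f"{upvotes:>4}  {title}")
```

To run this against real output, read the text from a file such as papers/YYYY-MM-DD.json instead of using the inline sample.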

Dependencies

- Python 3.7+ (the pinned requests 2.31.0 does not support Python 3.6)
- requests==2.31.0
- beautifulsoup4==4.12.2

Additional Features

- Date validation and range support
- Error handling for failed requests
- Rate limiting (0.5s between requests)
- Progress tracking:
  - Overall date progress
  - Per-date PDF download progress
  - Success/failure indicators
- Automatic file organization
- PDF downloads with clean filenames
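
Two of the items above, the 0.5s rate limit and clean filenames, can be sketched with the standard library. This is one possible approach under stated assumptions, not necessarily how hf_paper_scraper.py implements them: the names clean_filename and throttled, the 100-character cap, and the "untitled" fallback are all hypothetical.

```python
import re
import time


def clean_filename(title, max_len=100):
    """Turn a paper title into a filesystem-friendly PDF filename."""
    name = re.sub(r"[^\w\s-]", "", title)     # drop punctuation such as : / ?
    name = re.sub(r"\s+", "_", name.strip())  # collapse whitespace to underscores
    return (name[:max_len] or "untitled") + ".pdf"


def throttled(items, delay=0.5, sleep=time.sleep):
    """Yield items, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(delay)
        yield item


# Example: process titles one at a time, waiting 0.5s between them.
for title in throttled(["First Paper: A Study", "Second Paper?"]):
    print(clean_filename(title))
```

The sleep parameter exists so the pause can be replaced in tests; a real download loop would issue its HTTP request inside the for body.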

Contributing

Feel free to:

- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Add documentation or examples

License

[Your chosen license]

Acknowledgments

- Thanks to Hugging Face for their Daily Papers feature
- Built with BeautifulSoup4 for HTML parsing
