rajputpriyankaa/scrapy-spider-framework

🕷️ Web Scraping with Scrapy

A web scraping project built on Scrapy, a Python framework for crawling and scraping at scale. It demonstrates how to use XPath and CSS selectors to extract data across multiple pages.


📌 What This Project Does

  • Crawls multi-page websites using Scrapy spiders
  • Extracts structured data using XPath and CSS selectors
  • Handles pagination and link following automatically
  • Exports data to CSV, JSON, or XML via Scrapy's built-in feed exports
  • Demonstrates Scrapy's item pipeline architecture

🧰 Tech Stack

Tool                    Purpose
Python 3.8+             Core language
Scrapy 2.x              Scraping framework
XPath                   Element targeting
CSS Selectors           Alternative element targeting
Scrapy Item Pipelines   Data processing & export

🚀 Getting Started

1. Run a Spider

scrapy crawl <spider_name>

2. Export Data

# Export to CSV
scrapy crawl <spider_name> -o output.csv

# Export to JSON
scrapy crawl <spider_name> -o output.json
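
The same exports can also be declared in settings.py through Scrapy's `FEEDS` setting instead of on the command line. A minimal sketch, assuming the output file names used above:

```python
# settings.py fragment: file names mirror the commands above,
# the options are illustrative assumptions.
FEEDS = {
    "output.json": {
        "format": "json",
        "overwrite": True,  # replace the file on each run instead of appending
    },
    "output.csv": {
        "format": "csv",
        "overwrite": True,
    },
}
```

On the command line, `-o` appends to an existing file, which leaves a JSON file invalid after a second run; `-O` overwrites the file instead.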


🔍 XPath vs CSS Selectors — Quick Reference

# XPath examples
response.xpath('//h1/text()').get()
response.xpath('//a/@href').getall()

# CSS selector examples
response.css('h1::text').get()
response.css('a::attr(href)').getall()

📦 Requirements

Scrapy>=2.8.0

⚠️ Notes

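  • Set DOWNLOAD_DELAY in settings.py to space out requests and avoid overwhelming servers
  • Keep ROBOTSTXT_OBEY = True to respect robots.txt. Projects generated by scrapy startproject set this in settings.py, but Scrapy's own fallback default has been False, so check that the line is present
  • For large crawls, consider enabling Scrapy's AutoThrottle extension, which adjusts the delay based on server response times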
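
The notes above correspond to a handful of lines in settings.py. This is a sketch with illustrative values; the numbers are assumptions, not this project's configuration, and should be tuned per target site.

```python
# settings.py fragment: all values below are illustrative assumptions.

# Honor robots.txt rules for every request.
ROBOTSTXT_OBEY = True

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Let AutoThrottle adapt the delay to observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per server
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as the lower bound on the delay it will choose.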

🙋 Author

Priyanka Rajput (GitHub)
