A web scraping project built with Scrapy, a fast and scalable Python scraping framework. Demonstrates XPath and CSS selector usage for precise data extraction across multiple pages.
- Crawls multi-page websites using Scrapy spiders
- Extracts structured data using XPath and CSS selectors
- Handles pagination and link following automatically
- Exports data to CSV, JSON, or XML via Scrapy's built-in pipelines
- Demonstrates Scrapy's item pipeline architecture
| Tool | Purpose |
|---|---|
| Python 3.8+ | Core language |
| Scrapy 2.x | Scraping framework |
| XPath | Element targeting |
| CSS Selectors | Alternative element targeting |
| Scrapy Item Pipelines | Data processing & export |
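The item pipeline stage named in the table can be sketched as a small class with a `process_item` method. The `price` field and its cleaning rule are illustrative, and `DropItem` is stubbed here so the sketch runs without Scrapy installed; in a real project it comes from `scrapy.exceptions` and the class lives in `pipelines.py`.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch is self-contained."""


class CleanPricePipeline:
    """Normalize the 'price' field and drop items that lack one."""

    def process_item(self, item, spider):
        price = item.get("price")
        if not price:
            # Raising DropItem tells Scrapy to discard this item
            raise DropItem("missing price")
        # "$1,299.00" -> 1299.0
        item["price"] = float(str(price).replace("$", "").replace(",", ""))
        return item
```

A pipeline is activated in `settings.py`, e.g. `ITEM_PIPELINES = {"myproject.pipelines.CleanPricePipeline": 300}`, where the number sets its order relative to other pipelines.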
Run a spider and export the scraped data:

```shell
scrapy crawl <spider_name>

# Export to CSV
scrapy crawl <spider_name> -o output.csv

# Export to JSON
scrapy crawl <spider_name> -o output.json
```

Selector usage inside a spider callback:

```python
# XPath examples
response.xpath('//h1/text()').get()
response.xpath('//a/@href').getall()

# CSS selector examples
response.css('h1::text').get()
response.css('a::attr(href)').getall()
```

`requirements.txt`:

```
Scrapy>=2.8.0
```
- Configure `DOWNLOAD_DELAY` in `settings.py` to avoid overwhelming servers
- Keep `ROBOTSTXT_OBEY = True` (Scrapy's default) to respect robots.txt
- For large crawls, consider enabling Scrapy's AutoThrottle extension
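The tips above can be combined into a `settings.py` fragment; the delay values are illustrative starting points, not recommendations.

```python
# settings.py -- politeness settings (values are illustrative)

# Respect robots.txt (Scrapy's default behavior)
ROBOTSTXT_OBEY = True

# Fixed delay in seconds between requests to the same site
DOWNLOAD_DELAY = 1.0

# AutoThrottle adjusts the delay dynamically based on server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as a lower bound while the extension raises or lowers the effective delay to match server latency.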
Priyanka Rajput — GitHub