This Python project scrapes web search results using the Google API and a Google Custom Search Engine ID, extracts useful information, and performs basic data analysis using the Gemini API. It is designed to be reliable, modular, and easy to run from the command line.
Extracting Titles, URLs, and Snippets
- Scrapes and saves the title, URL, and snippet/description from search results.
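As a rough illustration of this step, the sketch below pulls the three fields out of an already-parsed response. It assumes the response follows the Custom Search JSON API layout, where each entry under `items` carries `title`, `link`, and `snippet`. The function name `extract_results` is hypothetical and not taken from this project.

```python
def extract_results(response):
    """Return a list of {title, url, snippet} dicts from a parsed API response."""
    rows = []
    # "items" is missing when a query has no results, so fall back to an empty list.
    for item in response.get("items", []):
        rows.append({
            "title": item.get("title", ""),
            "url": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        })
    return rows


# Hand-written sample data, used only to show the shape of the output.
sample = {
    "items": [
        {"title": "AI in healthcare", "link": "https://example.com/ai", "snippet": "An overview."}
    ]
}
print(extract_results(sample))
```

Using `.get` with defaults keeps a single malformed entry from aborting the whole run.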
Taking Dynamic Input (Query from Command Line)
- Run the scraper with any search query directly from the command line:
python scraper.py <your query>
For example:
python scraper.py "AI in healthcare"
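One way the query could be read from the command line is with `argparse`. This is a minimal sketch, not necessarily how `scraper.py` does it. The helper name `parse_query` is an assumption.

```python
import argparse


def parse_query(argv=None):
    """Return the search query as one string; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(description="Scrape search results for a query.")
    # nargs="+" accepts a quoted query as well as several unquoted words.
    parser.add_argument("query", nargs="+", help="search query")
    args = parser.parse_args(argv)
    return " ".join(args.query)


# Passing the arguments explicitly keeps this example independent of sys.argv.
print(parse_query(["AI in healthcare"]))
print(parse_query(["AI", "in", "healthcare"]))
```

Joining the words means `python scraper.py AI in healthcare` and `python scraper.py "AI in healthcare"` would be treated the same way.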
Saving Results to CSV File
- Results are saved in a separate CSV file for each query.
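A per-query CSV file could be produced roughly as follows. The output directory `data`, the file-naming scheme, and both function names are assumptions made for illustration; the project may name its files differently.

```python
import csv
import re
from pathlib import Path


def query_to_filename(query):
    """Build a filesystem-safe CSV name from the query text (hypothetical scheme)."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", query).strip("_").lower()
    return f"{slug or 'query'}.csv"


def save_results(query, rows, out_dir="data"):
    """Write rows (dicts with title, url, snippet) to a CSV named after the query."""
    out_path = Path(out_dir) / query_to_filename(query)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url", "snippet"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

With this scheme, the query "AI in healthcare" would map to `ai_in_healthcare.csv`.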
Running in Headless Mode (Browser in Background)
- Because results are fetched through the Google API with the Custom Search Engine ID, no browser is launched, so the scraper runs fully headless.
Crawling Multiple Pages
- The scraper can crawl multiple pages of search results (the free tier of the Google API returns at most 10 results per request).
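Paging with a 10-result cap usually means issuing one request per page and advancing the `start` parameter. The sketch below only builds the request URLs and makes no network calls. It assumes the Custom Search JSON API endpoint and its `key`, `cx`, `q`, `num`, and `start` parameters. The function name `page_urls` is hypothetical.

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/customsearch/v1"
PAGE_SIZE = 10  # at most 10 results come back per request


def page_urls(query, api_key, cse_id, pages):
    """Return one request URL per page of results."""
    urls = []
    for page in range(pages):
        params = {
            "key": api_key,
            "cx": cse_id,
            "q": query,
            "num": PAGE_SIZE,
            # "start" is the 1-based index of the first result on the page.
            "start": page * PAGE_SIZE + 1,
        }
        urls.append(f"{API_URL}?{urlencode(params)}")
    return urls


# Placeholder credentials, for illustration only.
for url in page_urls("AI in healthcare", "YOUR_API_KEY", "YOUR_CSE_ID", pages=3):
    print(url)
```

Three pages therefore request results starting at 1, 11, and 21.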
Adding Logs
- Logs are stored in data/logs/.
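File logging into `data/logs/` could be set up with the standard `logging` module along these lines. The logger name, log file name, and message format are assumptions, not details taken from the project.

```python
import logging
from pathlib import Path


def setup_logger(log_dir="data/logs", name="scraper"):
    """Return a logger that appends to <log_dir>/<name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach the file handler only once, even if this function is called again.
    if not logger.handlers:
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log", encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

A call such as `setup_logger().info("Fetched page 1")` would then append a timestamped line to `data/logs/scraper.log`.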
Data Summarizer
- Summarizes all the results that were fetched and stores the summary in the data_analysis folder.
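The summarization step can be pictured as: build a prompt from the fetched rows, send it to the model, and write the reply to disk. In the sketch below the model call is passed in as `generate`, a placeholder for whatever function sends a prompt to the Gemini API and returns text; no Gemini client code is shown or assumed. The function names and the `summary.txt` file name are also hypothetical.

```python
from pathlib import Path


def build_prompt(query, rows):
    """Turn the fetched rows into a single text prompt."""
    lines = [f"Summarize the main themes in these search results for '{query}':"]
    for i, row in enumerate(rows, start=1):
        lines.append(f"{i}. {row['title']}: {row['snippet']}")
    return "\n".join(lines)


def summarize_to_file(query, rows, generate, out_dir="data_analysis"):
    """Ask `generate` for a summary and save it under out_dir."""
    # `generate` is a placeholder: any callable that maps prompt text to summary text.
    summary = generate(build_prompt(query, rows))
    out_path = Path(out_dir) / "summary.txt"  # hypothetical file name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(summary, encoding="utf-8")
    return out_path
```

Keeping the model call behind a plain callable makes the file-handling logic testable without an API key.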
- Install dependencies:
pip install -r requirements.txt
- Run the scraper:
python scraper.py <your query>
- Ensure you have your Google API key, Google Custom Search Engine ID, and Gemini API key set up in the script.
- Logs are automatically created for debugging and tracking scraping activity.