Google Scholar scraper with a Python library and a REST API backed by Redis caching.
- Python Library (`google_scholar_lib`) - Direct Python interface
- REST API (`src/api`) - FastAPI server with Redis caching
- Publication search, author profiles, citations (BibTeX, etc.)
- Automatic fallback for rate limiting
- Redis caching with configurable TTL
- Interactive API documentation at `/docs`
- ARM64 compatible (Jetson, Raspberry Pi)
- Python 3.8+
- Chrome or Chromium browser
- ChromeDriver
- Redis (for API caching - optional for library use)
pip install .

pip install -r requirements.txt
# or
pip install -e .

For ARM64 systems (Jetson, Raspberry Pi):
sudo apt install chromium-browser chromium-chromedriver redis-server

from google_scholar_lib import GoogleScholar
# Initialize
api = GoogleScholar()
# Search for publications
results = api.search_scholar("Deep Learning", num=5)
for paper in results.organic_results:
    print(f"{paper.title} - {paper.link}")

from google_scholar_lib import GoogleScholar
api = GoogleScholar()
results = api.search_scholar("Machine Learning", num=10)
for paper in results.organic_results:
    print(f"Title: {paper.title}")
    print(f"Authors: {', '.join([a.name for a in paper.authors])}")
    if paper.inline_links and paper.inline_links.cited_by:
        print(f"Citations: {paper.inline_links.cited_by['total']}")
    print()

If you know the author's Google Scholar ID:
api = GoogleScholar()
results = api.search_author(author_id="JicYPdAAAAAJ") # Geoffrey Hinton
print(f"Name: {results.author.name}")
print(f"Affiliation: {results.author.affiliations}")
print(f"Interests: {', '.join([i.title for i in results.author.interests])}")
print("\nTop Publications:")
for article in results.articles[:5]:
    print(f" - {article.title}")

Common Author IDs:
- `JicYPdAAAAAJ` - Geoffrey Hinton
- `WLN3QrAAAAAJ` - Yann LeCun
- `kukA0LcAAAAJ` - Yoshua Bengio
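For authors not listed here, the ID can usually be read from the author's profile URL, where it appears as the `user` query parameter. The helper below is a minimal sketch of that extraction using only the standard library; `author_id_from_url` is a hypothetical name and is not part of `google_scholar_lib`.

```python
from urllib.parse import parse_qs, urlparse


def author_id_from_url(profile_url: str) -> str:
    """Return the `user` query parameter from a Google Scholar profile URL.

    Hypothetical helper for illustration; not part of google_scholar_lib.
    """
    query = parse_qs(urlparse(profile_url).query)
    values = query.get("user")
    if not values:
        raise ValueError(f"No 'user' parameter found in {profile_url!r}")
    return values[0]


# Example: the profile URL for the first ID in the list above.
example_url = "https://scholar.google.com/citations?user=JicYPdAAAAAJ&hl=en"
print(author_id_from_url(example_url))
```

The returned string can then be passed as `author_id` to `search_author`.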
Search for an author by name. Uses automatic fallback if direct search is blocked:
api = GoogleScholar()
results = api.search(engine="google_scholar_profiles", q="Andrew Ng")
for profile in results.profiles:
    print(f"Name: {profile.name}")
    print(f"ID: {profile.author_id}")
    print(f"Affiliation: {profile.affiliations}")

api = GoogleScholar()
results = api.search_cite(data_cid="PAPER_CID")
# Available formats
for link in results.links:
    print(f"{link['title']}: {link['link']}")
# Citation text
for citation in results.citations:
    print(f"{citation['title']}")
    print(f"{citation['snippet']}\n")

# 1. Start Redis (if not already running)
sudo systemctl start redis-server
# 2. Start API
cd src/api
uvicorn main:app --host 0.0.0.0 --port 8765
# Or with reload for development
uvicorn main:app --host 0.0.0.0 --port 8765 --reload

Access the API at http://localhost:8765/docs
# Search publications
curl -X POST http://localhost:8765/api/v1/search/scholar \
-H "Content-Type: application/json" \
-d '{"q": "machine learning", "num": 10}'
# Get author profile
curl http://localhost:8765/api/v1/author/JicYPdAAAAAJ
# Check health & cache stats
curl http://localhost:8765/health

Python:
import requests
response = requests.post("http://localhost:8765/api/v1/search/scholar",
                         json={"q": "quantum computing", "num": 10})
print(response.headers.get("X-Cache-Status"))  # HIT or MISS

Full documentation: src/api/README.md, or visit /docs while the server is running.
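If `requests` is not installed, the same search call can be sketched with the standard library. The example below mirrors the documented curl command; `build_search_request` and `fetch` are hypothetical helper names, and the assumption that the response body is JSON is not confirmed by this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765"  # matches the uvicorn command shown earlier


def build_search_request(query: str, num: int = 10,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request equivalent to the documented curl search call."""
    payload = json.dumps({"q": query, "num": num}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/search/scholar",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request; return (parsed body, X-Cache-Status header value).

    Assumes the server replies with a JSON body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
        return body, response.headers.get("X-Cache-Status")
```

With the server running, `fetch(build_search_request("quantum computing"))` returns the parsed results together with the cache status.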
Test the API interactively:
python demo.py

The demo lets you:
- Choose between different search engines
- Enter custom search parameters
- View formatted results with metadata
The API supports four search engines:
| Engine | Purpose | Returns |
|---|---|---|
| `google_scholar` | Search publications | Papers, authors, citations |
| `google_scholar_author` | Get author by ID | Profile, publications, interests |
| `google_scholar_profiles` | Find author by name | Profile matches |
| `google_scholar_cite` | Get citation formats | BibTeX, EndNote, etc. |
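Each engine is driven by a different identifying parameter in the examples above (`q`, `author_id`, or `data_cid`). The sketch below collects that mapping in one place and validates arguments before a call is made. `check_search_args` is a hypothetical helper, not part of the library, and it assumes that `google_scholar` takes `q` as in the REST payload.

```python
# Parameter each engine needs, as inferred from the usage examples in this README.
REQUIRED_PARAM = {
    "google_scholar": "q",
    "google_scholar_author": "author_id",
    "google_scholar_profiles": "q",
    "google_scholar_cite": "data_cid",
}


def check_search_args(engine: str, **params: str) -> str:
    """Validate an engine name and its required parameter.

    Returns the name of the required parameter, or raises ValueError.
    Hypothetical helper for illustration only.
    """
    if engine not in REQUIRED_PARAM:
        known = ", ".join(sorted(REQUIRED_PARAM))
        raise ValueError(f"Unknown engine {engine!r}; expected one of: {known}")
    required = REQUIRED_PARAM[engine]
    if not params.get(required):
        raise ValueError(f"Engine {engine!r} requires the {required!r} parameter")
    return required


print(check_search_args("google_scholar_profiles", q="Andrew Ng"))
```

Checking arguments up front avoids launching a browser session for a request that cannot succeed.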
The library runs Chrome in headless mode by default (no visible browser window):
# Headless by default
api = GoogleScholar()
# Or explicitly set headless mode
from google_scholar_lib.backends.selenium_backend import SeleniumBackend
backend = SeleniumBackend(headless=True)

The REST API automatically caches responses with a configurable TTL:
- Scholar searches: 24 hours
- Author profiles: 7 days
- Profile searches: 12 hours
- Citations: 30 days
Cache status is indicated in the X-Cache-Status response header.
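To illustrate how per-engine TTLs and the HIT/MISS status fit together, here is a minimal in-memory sketch. It is not the project's implementation: it uses a plain dictionary instead of Redis, the pairing of each TTL with an engine name is an assumption, and `cache_key` and `TTLCache` are hypothetical names.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple

# TTLs from the list above, in seconds (engine pairing is an assumption).
TTL_SECONDS = {
    "google_scholar": 24 * 3600,            # Scholar searches: 24 hours
    "google_scholar_author": 7 * 24 * 3600,  # Author profiles: 7 days
    "google_scholar_profiles": 12 * 3600,    # Profile searches: 12 hours
    "google_scholar_cite": 30 * 24 * 3600,   # Citations: 30 days
}


def cache_key(engine: str, params: Dict[str, Any]) -> str:
    """Build a deterministic key: same engine and params give the same key."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{engine}:{digest}"


class TTLCache:
    """In-memory stand-in for a Redis cache with per-engine expiry."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, engine: str, params: Dict[str, Any],
                       compute: Callable[[], Any]) -> Tuple[Any, str]:
        """Return (value, status), where status is "HIT" or "MISS"."""
        key = cache_key(engine, params)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1], "HIT"
        value = compute()
        self._store[key] = (now + TTL_SECONDS[engine], value)
        return value, "MISS"
```

Sorting the parameters before hashing means `{"q": "x", "num": 5}` and `{"num": 5, "q": "x"}` share one cache entry.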
This project runs on Google Cloud's Always Free e2-micro (1GB RAM). A 2GB swap file is required to prevent Chrome crashes.
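Once the swap file from the steps in this section is active, its size can be confirmed from `/proc/meminfo`. The sketch below parses that file's text; `swap_total_kib` and `has_enough_swap` are hypothetical helper names, and the threshold is an assumption set slightly below 2 GiB because the kernel reports a little less than the file size.

```python
import re

# Assumed threshold: a 2G swap file reports slightly under 2097152 kB,
# since part of the file holds the swap header.
MIN_SWAP_KIB = 2_000_000


def swap_total_kib(meminfo_text: str) -> int:
    """Return the SwapTotal value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^SwapTotal:\s+(\d+)\s+kB$", meminfo_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("SwapTotal not found in meminfo text")
    return int(match.group(1))


def has_enough_swap(meminfo_text: str, minimum_kib: int = MIN_SWAP_KIB) -> bool:
    """True if the reported swap total meets the minimum."""
    return swap_total_kib(meminfo_text) >= minimum_kib
```

On the instance itself, read the file with `open("/proc/meminfo").read()` and pass the text to `has_enough_swap`.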
# 1. Create 2GB Swap (Required!)
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# 2. Install Dependencies
sudo apt update
sudo apt install -y git python3-venv python3-pip redis-server chromium chromium-driver
sudo systemctl enable redis-server && sudo systemctl start redis-server
# 3. Clone & Install
git clone https://github.com/YOUR_USERNAME/google-scholar-api.git
cd google-scholar-api
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -e .
# 4. Configure
cp .env.example .env
nano .env

Create /etc/systemd/system/scholar.service:
[Unit]
Description=Google Scholar API
After=network.target redis-server.service
[Service]
User=YOUR_USERNAME
WorkingDirectory=/home/YOUR_USERNAME/google-scholar-api
Environment="PATH=/home/YOUR_USERNAME/google-scholar-api/venv/bin"
ExecStart=/home/YOUR_USERNAME/google-scholar-api/venv/bin/uvicorn src.api.main:app --host 0.0.0.0 --port 8765
Restart=always
[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable scholar
sudo systemctl start scholar

# Install
curl -L --output cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb
# Quick tunnel (temporary URL)
cloudflared tunnel --url http://localhost:8765
# Or run as service for persistent tunnel
sudo cloudflared service install

Note: Use Standard Persistent Disk (30GB) in us-central1, us-west1, or us-east1 to stay within free tier limits.
This tool is for educational and research purposes only. Please respect Google's Terms of Service and robots.txt.