A fully serverless AWS data pipeline that scrapes GitHub Trending daily and visualises programming language trends over time.
Built as a portfolio project to demonstrate hands-on AWS experience across Lambda, S3, Glue, Athena, EventBridge, IAM, and CloudWatch — running entirely within the AWS Free Tier at $0/month.
Every day at 06:00 UTC, an AWS Lambda function automatically scrapes github.com/trending, extracts repository data, and stores it as JSON in S3. A second Lambda function transforms the raw data into CSV format partitioned by date. A Glue Crawler keeps the Data Catalog up to date, enabling SQL queries via Amazon Athena. A local Streamlit dashboard connects to Athena and visualises language trends, star metrics, and repo rankings over time.
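The 06:00 UTC schedule corresponds to an EventBridge cron expression. The sketch below shows how such an expression could be built; the helper name `daily_cron` is hypothetical and is not taken from this repository.

```python
def daily_cron(hour: int, minute: int = 0) -> str:
    """Build an EventBridge cron expression that fires once a day (UTC)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Field order: minutes hours day-of-month month day-of-week year.
    # EventBridge requires "?" in either day-of-month or day-of-week.
    return f"cron({minute} {hour} * * ? *)"


# Hypothetical constant name; the value matches the 06:00 UTC schedule.
SCHEDULE_EXPRESSION = daily_cron(6)
print(SCHEDULE_EXPRESSION)  # cron(0 6 * * ? *)
```

EventBridge evaluates cron schedules in UTC, so no timezone conversion is needed for a 06:00 UTC trigger.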
EventBridge (cron 06:00 UTC)
↓
Lambda — scraper (Python + BeautifulSoup)
↓
S3 raw/YYYY-MM-DD.json
↓
Lambda — ETL job
↓
S3 processed/year=/month=/day=/data.csv
↓
Glue Crawler → Data Catalog
↓
Athena (SQL queries)
↓
Streamlit dashboard (local)
Four layers:
- Collection: EventBridge triggers Lambda daily. Lambda scrapes GitHub Trending and writes raw JSON to S3. CloudWatch logs all runs.
- Storage: S3 bucket with two zones, `raw/` (one JSON per day) and `processed/` (CSV, partitioned by date).
- Analysis: ETL Lambda transforms raw JSON to clean CSV. Glue Crawler updates the Data Catalog. Athena exposes the data as a queryable SQL table.
- Visualisation: Streamlit dashboard connects to Athena via pyathena. Runs locally, so there is no AWS compute cost.
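To illustrate the two storage zones, here is a minimal sketch of how the object keys from the diagram could be derived from a scrape date. The function names are hypothetical, and zero-padding the month and day partitions is an assumption; the actual scraper and ETL code may build keys differently.

```python
from datetime import date


def raw_key(day: date) -> str:
    """Key for the raw zone: one JSON file per day."""
    return f"raw/{day.isoformat()}.json"


def processed_key(day: date) -> str:
    """Key for the processed zone, using Hive-style date partitions."""
    # Assumption: month and day are zero-padded to two digits.
    return (
        f"processed/year={day.year:04d}/"
        f"month={day.month:02d}/day={day.day:02d}/data.csv"
    )


example_day = date(2024, 3, 5)  # arbitrary example date
print(raw_key(example_day))        # raw/2024-03-05.json
print(processed_key(example_day))  # processed/year=2024/month=03/day=05/data.csv
```

The `key=value` folder naming is what lets a Glue Crawler register `year`, `month` and `day` as partition columns, so Athena can skip partitions that a query does not need.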
| Field | Description |
|---|---|
| `repo_name` | Repository name |
| `owner` | GitHub username |
| `language` | Programming language |
| `stars_today` | Stars gained on the scraped day |
| `total_stars` | Total stars at time of scraping |
| `forks` | Total forks |
| `rank_on_day` | Position on trending page (1–25) |
| `date_scraped` | Date of scraping |
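The JSON-to-CSV step can be sketched with the standard library alone. This is an illustrative version, not the code in `etl_job.py`: it assumes the raw file is a JSON array of objects whose keys match the field names above, and the sample record is made up.

```python
import csv
import io
import json

# Column order follows the schema table above.
FIELDS = [
    "repo_name", "owner", "language", "stars_today",
    "total_stars", "forks", "rank_on_day", "date_scraped",
]


def json_to_csv(raw_json: str) -> str:
    """Convert a JSON array of repo records into CSV text with a header row."""
    records = json.loads(raw_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only schema fields; missing values become empty cells.
        writer.writerow({field: record.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Made-up sample record, for illustration only.
sample = json.dumps([{
    "repo_name": "example-repo", "owner": "example-user", "language": "Python",
    "stars_today": 120, "total_stars": 4500, "forks": 300,
    "rank_on_day": 1, "date_scraped": "2024-03-05",
}])
print(json_to_csv(sample))
```

In a Lambda handler, the input string would come from the `raw/` object and the output would be written to the matching `processed/` partition.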
Based on 14 days of data (140 repos scraped across 14 languages):
- Python dominates GitHub Trending with 61 appearances — nearly 50% more than TypeScript (41 appearances), making it the most consistently trending language by a wide margin.
- Shell has the highest star potential — averaging 3,459 stars per trending repo, peaking at 10,749 in a single day. Shell repos trend rarely but attract massive attention when they do.
- Rust is consistent but niche — appearing 9 times with an average of 1,080 stars per repo, showing a stable and engaged community despite low volume.
- TypeScript and Python dominate by volume but C++ and Shell punch above their weight in stars-per-repo, suggesting different audience dynamics.
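Figures like these come down to a per-language count and average. The sketch below shows one way to compute them from rows shaped like the schema above. It assumes the star averages are based on `stars_today`, which the findings do not state explicitly, and the sample rows are placeholders rather than the project's data.

```python
from collections import defaultdict


def summarise_by_language(rows):
    """Return appearances and average stars_today for each language."""
    totals = defaultdict(lambda: {"appearances": 0, "stars": 0})
    for row in rows:
        entry = totals[row["language"]]
        entry["appearances"] += 1
        entry["stars"] += row["stars_today"]
    return {
        language: {
            "appearances": entry["appearances"],
            "avg_stars": entry["stars"] / entry["appearances"],
        }
        for language, entry in totals.items()
    }


# Placeholder rows, not real scraped data.
rows = [
    {"language": "Python", "stars_today": 100},
    {"language": "Python", "stars_today": 300},
    {"language": "Shell", "stars_today": 900},
]
summary = summarise_by_language(rows)
print(summary["Python"])  # {'appearances': 2, 'avg_stars': 200.0}
print(summary["Shell"])   # {'appearances': 1, 'avg_stars': 900.0}
```

The same aggregation could be expressed as a `GROUP BY language` query in Athena; the Python version is shown only to make the calculation explicit.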
| Layer | Technology |
|---|---|
| Scraping | Python 3.11, BeautifulSoup4, requests |
| Compute | AWS Lambda |
| Storage | Amazon S3 |
| Scheduling | Amazon EventBridge |
| ETL | AWS Lambda (Python) |
| Cataloguing | AWS Glue Crawler, Glue Data Catalog |
| Querying | Amazon Athena |
| Security | AWS IAM (least-privilege roles) |
| Monitoring | Amazon CloudWatch |
| Dashboard | Streamlit, Plotly, pyathena, pandas |
| Version control | Git, GitHub |
awsProject/
├── src/
│ ├── scraper/
│ │ ├── scraper.py # Lambda scraper function
│ │ └── __init__.py
│ ├── glue/
│ │ └── etl_job.py # ETL Lambda function
│ └── dashboard/
│ └── app.py # Streamlit dashboard
├── infrastructure/
│ ├── lambda-trust-policy.json
│ ├── lambda-s3-policy.json
│ ├── lambda-invoke-policy.json
│ ├── glue-trust-policy.json
│ ├── glue-s3-policy.json
│ └── glue-s3-write-policy.json
├── .python-version
├── requirements.txt
└── README.md
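The policy files under `infrastructure/` are not reproduced here. As a rough illustration of what least-privilege documents for the scraper role might look like, the sketch below builds a Lambda trust policy and a write-only S3 policy scoped to the `raw/` prefix. The bucket name is hypothetical and the statements are assumptions, not the contents of the actual files.

```python
import json

BUCKET = "example-trending-bucket"  # hypothetical bucket name

# Lets the Lambda service assume the execution role.
lambda_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Assumed scope: the scraper only needs to write objects under raw/.
scraper_s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": f"arn:aws:s3:::{BUCKET}/raw/*",
    }],
}

print(json.dumps(scraper_s3_policy, indent=2))
```

Limiting the resource to a single prefix means a bug in the scraper cannot overwrite anything in `processed/`.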
- Python 3.11
- Git
- An AWS account (Free Tier is sufficient)
- AWS CLI installed and configured (`aws configure`)

git clone https://github.com/YOUR-USERNAME/github-trending-tracker.git
cd github-trending-tracker
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

python src/scraper/scraper.py

python src/glue/etl_job.py

streamlit run src/dashboard/app.py

The dashboard opens at http://localhost:8501 and connects to Athena automatically using your AWS CLI credentials.
Runs entirely within the AWS Free Tier.
| Service | Usage | Cost |
|---|---|---|
| Lambda | ~60 invocations/month | Free (1M free/month) |
| S3 | ~1MB storage | Free (5GB free) |
| Glue Crawler | ~30 runs/month | Free (10 DPU-hours free) |
| Athena | ~30 queries/month, each scanning at most a few MB | ~$0.00 ($5/TB scanned) |
| EventBridge | 2 rules | Free |
| CloudWatch | Basic logging | Free |
| Total | | $0/month |
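The Athena line is the only usage-priced item in the table, so a quick back-of-the-envelope check is useful. The sketch below uses the $5/TB rate from the table and assumes a 10 MB minimum billed per query and a binary terabyte (1024 × 1024 MB). Both are assumptions that should be checked against current Athena pricing for your region.

```python
PRICE_PER_TB_USD = 5.0      # rate quoted in the cost table
MIN_MB_PER_QUERY = 10       # assumption: minimum data billed per query
MB_PER_TB = 1024 * 1024     # assumption: binary terabyte


def monthly_athena_cost(queries: int, mb_scanned_per_query: float) -> float:
    """Estimate the monthly Athena cost in USD for a number of small queries."""
    billed_mb = max(mb_scanned_per_query, MIN_MB_PER_QUERY) * queries
    return billed_mb / MB_PER_TB * PRICE_PER_TB_USD


cost = monthly_athena_cost(queries=30, mb_scanned_per_query=1)
print(f"${cost:.4f}")  # $0.0014
```

Under these assumptions, 30 small queries a month cost a fraction of a cent, which is consistent with the ~$0.00 shown in the table.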
Sondre Espe