Sondreespe/github-trending-tracker

GitHub Trending Language Tracker

A fully serverless AWS data pipeline that scrapes GitHub Trending daily and visualises programming language trends over time.

Built as a portfolio project to demonstrate hands-on AWS experience across Lambda, S3, Glue, Athena, EventBridge, IAM, and CloudWatch — running entirely within the AWS Free Tier at $0/month.


What it does

Every day at 06:00 UTC, an AWS Lambda function automatically scrapes github.com/trending, extracts repository data, and stores it as JSON in S3. A second Lambda function transforms the raw data into CSV format partitioned by date. A Glue Crawler keeps the Data Catalog up to date, enabling SQL queries via Amazon Athena. A local Streamlit dashboard connects to Athena and visualises language trends, star metrics, and repo rankings over time.
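
As a rough illustration of the daily trigger, 06:00 UTC every day can be written as the EventBridge schedule expression `cron(0 6 * * ? *)`. The sketch below assumes that expression (the rule actually deployed for this project may be written differently) and adds a hypothetical helper, `next_run`, that computes when the rule would fire next.

```python
from datetime import datetime, time, timedelta, timezone

# EventBridge cron fields: minutes hours day-of-month month day-of-week year.
# Assumed expression for "every day at 06:00 UTC"; not copied from the project.
SCHEDULE_EXPRESSION = "cron(0 6 * * ? *)"


def next_run(now: datetime) -> datetime:
    """Return the next 06:00 UTC firing time strictly after `now`.

    Hypothetical helper for illustration; `now` must be timezone-aware.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(6, 0), tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's run has already happened, so the next one is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```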


Architecture

EventBridge (cron 06:00 UTC)
    ↓
Lambda — scraper (Python + BeautifulSoup)
    ↓
S3 raw/YYYY-MM-DD.json
    ↓
Lambda — ETL job
    ↓
S3 processed/year=/month=/day=/data.csv
    ↓
Glue Crawler → Data Catalog
    ↓
Athena (SQL queries)
    ↓
Streamlit dashboard (local)

Four layers:

  • Collection — EventBridge triggers Lambda daily. Lambda scrapes GitHub Trending and writes raw JSON to S3. CloudWatch logs all runs.
  • Storage — S3 bucket with two zones: raw/ (one JSON per day) and processed/ (CSV, partitioned by date).
  • Analysis — ETL Lambda transforms raw JSON to clean CSV. Glue Crawler updates the Data Catalog. Athena exposes the data as a queryable SQL table.
  • Visualisation — Streamlit dashboard connects to Athena via pyathena. Runs locally — no AWS compute cost.
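
The storage and analysis layers can be sketched with the standard library alone. This is a minimal sketch, not the code in etl_job.py: the function names are hypothetical, the zero-padded month and day in the partition path are an assumption, and the column order simply follows the field list in this README.

```python
import csv
import io
import json
from datetime import date

# Column order follows the README's field list; the real ETL job may differ.
FIELDS = ["repo_name", "owner", "language", "stars_today",
          "total_stars", "forks", "rank_on_day", "date_scraped"]


def raw_key(day: date) -> str:
    """S3 key for one day's raw scrape, e.g. raw/2024-03-05.json."""
    return f"raw/{day.isoformat()}.json"


def processed_key(day: date) -> str:
    """Hive-style partitioned key (zero padding is an assumption)."""
    return (f"processed/year={day.year}/month={day.month:02d}"
            f"/day={day.day:02d}/data.csv")


def json_to_csv(raw_json: str) -> str:
    """Convert a JSON array of repo records into CSV text with a header row."""
    records = json.loads(raw_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only known fields; missing values become empty cells.
        writer.writerow({field: record.get(field, "") for field in FIELDS})
    return buffer.getvalue()
```

In a Lambda setting, the handler would read the object at `raw_key(day)`, pass its body through `json_to_csv`, and write the result to `processed_key(day)`.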

Data collected per repository

Field         Description
repo_name     Repository name
owner         GitHub username
language      Programming language
stars_today   Stars gained on the scraped day
total_stars   Total stars at time of scraping
forks         Total forks
rank_on_day   Position on trending page (1–25)
date_scraped  Date of scraping
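
One way to model a single row is a small dataclass. The types below are assumptions inferred from the field descriptions, not taken from scraper.py, and `parse_count` is a hypothetical helper for turning a displayed count such as "1,234 stars today" into an integer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TrendingRepo:
    """One scraped repository; field names match the README's field list."""
    repo_name: str
    owner: str
    language: Optional[str]  # assumption: some repos may have no language
    stars_today: int
    total_stars: int
    forks: int
    rank_on_day: int
    date_scraped: date


def parse_count(text: str) -> int:
    """Extract an integer from text like '1,234 stars today'.

    Assumes full, comma-separated numbers; abbreviated forms such as
    '1.2k' are not handled. Returns 0 when no digits are present.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else 0
```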

Key findings

Based on 14 days of data (140 trending-page entries across 14 languages):

  • Python dominates GitHub Trending with 61 appearances — nearly 50% more than TypeScript (41 appearances), making it the most consistently trending language by a wide margin.
  • Shell has the highest star potential — averaging 3,459 stars per trending repo, peaking at 10,749 in a single day. Shell repos trend rarely but attract massive attention when they do.
  • Rust is consistent but niche — appearing 9 times with an average of 1,080 stars per repo, showing a stable and engaged community despite low volume.
  • TypeScript and Python dominate by volume but C++ and Shell punch above their weight in stars-per-repo, suggesting different audience dynamics.
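
The two metrics behind these findings, appearances per language and average stars per repo, can be computed along the following lines. This is a sketch with hypothetical function names; it takes (language, stars) pairs and does not assume which star column the dashboard actually averages.

```python
from collections import defaultdict


def summarise_by_language(rows):
    """Map each language to (appearances, average stars per appearance).

    `rows` is an iterable of (language, stars) pairs.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for language, stars in rows:
        counts[language] += 1
        totals[language] += stars
    return {lang: (counts[lang], totals[lang] / counts[lang]) for lang in counts}


def percent_more(a: int, b: int) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


# Appearance counts reported above: Python 61, TypeScript 41.
print(round(percent_more(61, 41), 1))  # 48.8
```

The last line reproduces the "nearly 50% more" figure from the reported appearance counts.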

Tech stack

Layer            Technology
Scraping         Python 3.11, BeautifulSoup4, requests
Compute          AWS Lambda
Storage          Amazon S3
Scheduling       Amazon EventBridge
ETL              AWS Lambda (Python)
Cataloguing      AWS Glue Crawler, Glue Data Catalog
Querying         Amazon Athena
Security         AWS IAM (least-privilege roles)
Monitoring       Amazon CloudWatch
Dashboard        Streamlit, Plotly, pyathena, pandas
Version control  Git, GitHub

AWS services demonstrated

AWS Lambda, Amazon S3, AWS Glue, Amazon Athena, Amazon EventBridge, AWS IAM, Amazon CloudWatch


Project structure

awsProject/
├── src/
│   ├── scraper/
│   │   ├── scraper.py        # Lambda scraper function
│   │   └── __init__.py
│   ├── glue/
│   │   └── etl_job.py        # ETL Lambda function
│   └── dashboard/
│       └── app.py            # Streamlit dashboard
├── infrastructure/
│   ├── lambda-trust-policy.json
│   ├── lambda-s3-policy.json
│   ├── lambda-invoke-policy.json
│   ├── glue-trust-policy.json
│   ├── glue-s3-policy.json
│   └── glue-s3-write-policy.json
├── .python-version
├── requirements.txt
└── README.md

How to run locally

Prerequisites

  • Python 3.11
  • Git
  • An AWS account (Free Tier is sufficient)
  • AWS CLI installed and configured (aws configure)

Setup

git clone https://github.com/YOUR-USERNAME/github-trending-tracker.git
cd github-trending-tracker

python3.11 -m venv venv
source venv/bin/activate

pip install -r requirements.txt

Run the scraper locally

python src/scraper/scraper.py

Run the ETL job locally

python src/glue/etl_job.py

Run the dashboard

streamlit run src/dashboard/app.py

The dashboard opens at http://localhost:8501 and connects to Athena automatically using your AWS CLI credentials.
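
For orientation, a dashboard query against Athena through pyathena might look roughly like this. It is a sketch, not the code in app.py: the table name `trending`, the function names, and the staging directory, region, and database arguments are all placeholders you would replace with your own values.

```python
def build_language_query(table: str = "trending") -> str:
    """SQL counting trending appearances per language (table name is hypothetical)."""
    return (
        "SELECT language, COUNT(*) AS appearances "
        f"FROM {table} "
        "GROUP BY language "
        "ORDER BY appearances DESC"
    )


def fetch_language_counts(staging_dir: str, region: str, database: str):
    """Run the query on Athena and return (language, appearances) rows.

    Needs pyathena installed and AWS credentials configured; the import is
    deferred so the query builder above works without either.
    """
    from pyathena import connect

    conn = connect(
        s3_staging_dir=staging_dir,  # S3 location where Athena writes results
        region_name=region,
        schema_name=database,
    )
    cursor = conn.cursor()
    cursor.execute(build_language_query())
    return cursor.fetchall()
```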


Cost breakdown

Runs entirely within the AWS Free Tier.

Service       Usage                           Cost
Lambda        ~60 invocations/month           Free (1M free/month)
S3            ~1 MB storage                   Free (5 GB free)
Glue Crawler  ~30 runs/month                  Free (10 DPU-hours free)
Athena        ~30 queries/month scanning MB   ~$0.00 ($5/TB scanned)
EventBridge   2 rules                         Free
CloudWatch    Basic logging                   Free
Total                                         $0/month
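
The Athena line rounds to $0.00 rather than being strictly free, which a quick calculation makes concrete. The sketch below uses the $5/TB rate from the table and Athena's 10 MB minimum billing per query; the function name is hypothetical and the 1 MB scan size is an assumed stand-in for "scanning MB".

```python
ATHENA_USD_PER_TB = 5.0    # rate from the cost table above
MIN_MB_PER_QUERY = 10      # Athena bills at least 10 MB per query
MB_PER_TB = 1024 * 1024


def athena_monthly_cost(queries: int, mb_scanned_per_query: float) -> float:
    """Estimated monthly Athena cost in USD for uniformly sized queries."""
    billed_mb = max(mb_scanned_per_query, MIN_MB_PER_QUERY) * queries
    return billed_mb / MB_PER_TB * ATHENA_USD_PER_TB


# 30 queries a month, each scanning about 1 MB (billed as 10 MB each).
print(f"${athena_monthly_cost(30, 1):.4f}")  # $0.0014
```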

Author

Sondre Espe
