
SuccessFactors Data Pipeline

⚠️ Development Status: This project is currently under active development and is not yet complete. Features and documentation may change as development progresses.

A comprehensive data science project that collects and analyzes profiles of successful individuals through APIs and web scraping to identify common traits, educational patterns, and career paths that contribute to extraordinary success.

🎯 Project Overview

This project combines data collection, transformation, and machine learning to:

  • Collect billionaire profiles from Forbes API
  • Scrape biographical data from Wikipedia
  • Analyze educational backgrounds and career patterns
  • Model success factors and predictive traits
  • Visualize insights through interactive dashboards

🚀 Quick Start

Prerequisites

  • Python 3.12 recommended
  • uv (the Python package and project manager)
  • A valid RAPIDAPI_KEY in .env
  • Docker/PostgreSQL only if you want the optional database workflow

Installation

Install the project and development dependencies from pyproject.toml and uv.lock:

make install

This runs:

uv sync --frozen --extra dev

Run The Pipeline

make run

This command:

  • fetches the latest Forbes data
  • reuses existing Wikipedia scrape files if they already exist
  • otherwise scrapes missing Wikipedia birth-date and education data
  • writes the merged output to data/processed/merged_dataset_<YYYY-MM-DD>.csv

Use The Included Data

This repository already includes generated data files under data/, including:

  • data/external/api_fetched/
  • data/external/web_scraped/
  • data/interim/
  • data/processed/

If you only want to inspect the latest results, you can open the latest CSV in data/processed/ without rerunning the pipeline.

Scrape Limit

When the pipeline needs to generate fresh Wikipedia scrape files, it limits scraping to 10 people by default.

You can change that with SCRAPE_LIMIT:

SCRAPE_LIMIT=10 make run
SCRAPE_LIMIT=100 make run
SCRAPE_LIMIT=all make run

Notes:

  • SCRAPE_LIMIT=all removes the limit and attempts to scrape every Forbes row.
  • If scrape files already exist in data/external/web_scraped/, make run will reuse them.
  • To force a new scrape with a different limit, delete or rename the existing wiki_date_of_birth_*.csv and wiki_university_degree_*.csv files first.
  • The merged dataset still contains the full Forbes row set. The scrape limit only controls how many rows get newly scraped Wikipedia enrichment during regeneration.

Scrape Only

If you want to refresh only the Wikipedia scrape inputs:

make scrape

Optional Database Workflow

make run does not require PostgreSQL.

If you want the database container for SQL work, start it with:

docker compose up -d db

📊 Data Pipeline

The project follows a structured ETL pipeline:

  1. Extract: Fetch data from Forbes API and scrape Wikipedia
  2. Transform: Clean, normalize, and merge datasets
  3. Load: Store in PostgreSQL with proper schema
  4. Analyze: Generate insights and train ML models
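The Transform step can be illustrated with a small pandas merge. Column names here are hypothetical; the real schemas live in src/:

```python
import pandas as pd

# Hypothetical columns for illustration only.
forbes = pd.DataFrame({
    "name": ["Ada Example", "Ben Example"],
    "net_worth_usd_bn": [101.5, 98.2],
})
wiki = pd.DataFrame({
    "name": ["Ada Example"],
    "date_of_birth": ["1964-01-12"],
})

# A left join keeps every Forbes row; Wikipedia enrichment is optional,
# so unscraped rows simply get NaN in the enrichment columns.
merged = forbes.merge(wiki, on="name", how="left")
```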

Data Sources

  • Forbes Billionaires API: Financial and demographic data
  • Wikipedia: Educational backgrounds and biographical details

πŸ› οΈ Tech Stack

  • Language: Python 3.10+
  • Data Processing: pandas, numpy, scikit-learn
  • Web Scraping: BeautifulSoup4, httpx, requests-cache
  • Database: PostgreSQL, SQLAlchemy
  • Notebooks: Jupyter Lab
  • Containerization: Docker, Docker Compose
  • Code Quality: Ruff (linting & formatting), pytest
  • Documentation: MkDocs

πŸ“ Project Structure

├── data/
│   ├── external/
│   │   ├── api_fetched/
│   │   └── web_scraped/
│   ├── interim/
│   └── processed/
├── notebooks/
├── src/
│   ├── api_fetchers/
│   ├── data_transformers/
│   ├── modeling/
│   ├── scrapers/
│   ├── sql/
│   ├── config.py
│   ├── dataset.py
│   ├── features.py
│   └── plots.py
├── Dockerfile
├── docker-compose.yml
├── Makefile
├── pyproject.toml
└── uv.lock

🔧 Development

Code Quality

make lint      # Check code style
make format    # Format code

Common Commands

make help      # List available commands
make install   # Install dependencies with uv
make scrape    # Refresh the Wikipedia scrape files
make run       # Run the full pipeline

Database Operations

docker compose up -d db

Jupyter Notebooks

uv run jupyter lab

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines (enforced by Ruff)
  • Write tests for new features
  • Update documentation as needed
  • Use conventional commit messages

📋 Roadmap

  • Phase 1: Data collection and cleaning ✅
  • Phase 2: Database schema and ETL pipeline ✅
  • Phase 3: Exploratory data analysis (In Progress)
  • Phase 4: Machine learning models
  • Phase 5: Web dashboard and visualization
  • Phase 6: API for external access

⚠️ Limitations

  • Data limited to publicly available information
  • Forbes API rate limiting may affect data collection speed
  • Wikipedia scraping subject to website changes
  • Educational data may be incomplete for some individuals

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Forbes for providing billionaire data through their API
  • Wikipedia for biographical information
  • Open source community for the amazing tools and libraries
