⚠️ Development Status: This project is currently under active development and is not yet complete. Features and documentation may change as development progresses.
A comprehensive data science project that collects and analyzes profiles of successful individuals through APIs and web scraping to identify common traits, educational patterns, and career paths that contribute to extraordinary success.
This project combines data collection, transformation, and machine learning to:
- Collect billionaire profiles from Forbes API
- Scrape biographical data from Wikipedia
- Analyze educational backgrounds and career patterns
- Model success factors and predictive traits
- Visualize insights through interactive dashboards
- Included Dataset Snapshot (CSV)
- Exploratory Analysis Notebook
- Data Cleaning Workflow
- Visualization Notebook
- Education Split Notebook
- Dataset Merge Notebook
- Python 3.12 recommended
- `uv`
- A valid `RAPIDAPI_KEY` in `.env`
- Docker/PostgreSQL only if you want the optional database workflow
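Assuming the key name listed in the prerequisites, a minimal `.env` could look like the following (the value is a placeholder, not a real key):

```shell
# .env — read by the pipeline at startup
RAPIDAPI_KEY=your-rapidapi-key-here
```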
Install the project and development dependencies from `pyproject.toml` and `uv.lock`:

```bash
make install
```

This runs:

```bash
uv sync --frozen --extra dev
```

To execute the full pipeline:

```bash
make run
```

This command:
- fetches the latest Forbes data
- reuses existing Wikipedia scrape files if they already exist
- otherwise scrapes missing Wikipedia birth-date and education data
- writes the merged output to `data/processed/merged_dataset_<YYYY-MM-DD>.csv`
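The dated output path above can be sketched in a few lines of Python; `merged_output_path` is a hypothetical helper for illustration, not a function in the repository:

```python
from datetime import date
from pathlib import Path

def merged_output_path(processed_dir: str = "data/processed") -> Path:
    """Build the dated CSV path the pipeline writes its merged output to."""
    # ISO date in the filename keeps runs from different days side by side.
    return Path(processed_dir) / f"merged_dataset_{date.today():%Y-%m-%d}.csv"
```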
This repository already includes generated data files under `data/`, including:

- `data/external/api_fetched/`
- `data/external/web_scraped/`
- `data/interim/`
- `data/processed/`
If you only want to inspect the latest results, you can open the latest CSV in `data/processed/` without rerunning the pipeline.
When the pipeline needs to generate fresh Wikipedia scrape files, it limits scraping to 10 people by default.
You can change that with `SCRAPE_LIMIT`:

```bash
SCRAPE_LIMIT=10 make run
SCRAPE_LIMIT=100 make run
SCRAPE_LIMIT=all make run
```

Notes:

- `SCRAPE_LIMIT=all` removes the limit and attempts to scrape every Forbes row.
- If scrape files already exist in `data/external/web_scraped/`, `make run` will reuse them.
- To force a new scrape with a different limit, delete or rename the existing `wiki_date_of_birth_*.csv` and `wiki_university_degree_*.csv` files first.
- The merged dataset still contains the full Forbes row set. The scrape limit only controls how many rows get newly scraped Wikipedia enrichment during regeneration.
If you want to refresh only the Wikipedia scrape inputs:
```bash
make scrape
```

`make run` does not require PostgreSQL.
If you want the database container for SQL work, start it with:

```bash
docker compose up -d db
```

The project follows a structured ETL pipeline:
- Extract: Fetch data from Forbes API and scrape Wikipedia
- Transform: Clean, normalize, and merge datasets
- Load: Store in PostgreSQL with proper schema
- Analyze: Generate insights and train ML models
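A minimal sketch of the Transform step, assuming both sources share a `name` column (the real schema and join key may differ):

```python
import pandas as pd

def transform(forbes: pd.DataFrame, wiki: pd.DataFrame) -> pd.DataFrame:
    """Clean, normalize, and merge the two sources on person name."""
    forbes = forbes.assign(name=forbes["name"].str.strip())
    wiki = wiki.assign(name=wiki["name"].str.strip())
    # Left join keeps the full Forbes row set even when Wikipedia enrichment is missing.
    return forbes.merge(wiki, on="name", how="left")
```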
- Forbes Billionaires API: Financial and demographic data
- Wikipedia: Educational backgrounds and biographical details
- Language: Python 3.10+
- Data Processing: pandas, numpy, scikit-learn
- Web Scraping: BeautifulSoup4, httpx, requests-cache
- Database: PostgreSQL, SQLAlchemy
- Notebooks: Jupyter Lab
- Containerization: Docker, Docker Compose
- Code Quality: Ruff (linting & formatting), pytest
- Documentation: MkDocs
```
├── data/
│   ├── external/
│   │   ├── api_fetched/
│   │   └── web_scraped/
│   ├── interim/
│   └── processed/
├── notebooks/
├── src/
│   ├── api_fetchers/
│   ├── data_transformers/
│   ├── modeling/
│   ├── scrapers/
│   ├── sql/
│   ├── config.py
│   ├── dataset.py
│   ├── features.py
│   └── plots.py
├── Dockerfile
├── docker-compose.yml
├── Makefile
├── pyproject.toml
└── uv.lock
```
```bash
make help     # List available commands
make install  # Install dependencies with uv
make lint     # Check code style
make format   # Format code
make scrape   # Refresh the Wikipedia scrape files
make run      # Run the full pipeline
```

Start the database container:

```bash
docker compose up -d db
```

Launch Jupyter Lab:

```bash
uv run jupyter lab
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guidelines (enforced by Ruff)
- Write tests for new features
- Update documentation as needed
- Use conventional commit messages
- Phase 1: Data collection and cleaning ✅
- Phase 2: Database schema and ETL pipeline ✅
- Phase 3: Exploratory data analysis (In Progress)
- Phase 4: Machine learning models
- Phase 5: Web dashboard and visualization
- Phase 6: API for external access
- Data limited to publicly available information
- Forbes API rate limiting may affect data collection speed
- Wikipedia scraping subject to website changes
- Educational data may be incomplete for some individuals
This project is licensed under the MIT License - see the LICENSE file for details.
- Forbes for providing billionaire data through their API
- Wikipedia for biographical information
- Open source community for the amazing tools and libraries