SatvikPraveen/PandasPlayground


📊 PandasPlayground – A Comprehensive Data Manipulation Project

Python License: GPL v3 Dockerized Notebooks Tests Code style: black PRs Welcome

Master data manipulation with pandas, from fundamentals to advanced performance tuning, using realistic synthetic datasets and modular notebooks.


🧠 Project Purpose

PandasPlayground is designed to help aspiring data scientists and analysts master the pandas ecosystem through hands-on, progressive, and fully documented Jupyter notebooks. Each module targets a specific capability, from data loading and cleaning to advanced transformations and memory profiling, so the collection works both as a learning path and as a review reference.

Preview Dashboard

🔍 Preview: Explore datasets, visualize trends, and profile performance, all in one playground!


📁 Project Structure

PandasPlayground/
├── assets/               # Charts, exports, and visual output images
├── cheatsheets/          # Markdown-based reference sheets (e.g., pandas_cheatsheet.md)
├── data/                 # Raw datasets (CSV, Excel, JSON, Parquet)
├── exports/              # Final output files (CSV, Excel, styled reports)
├── notebooks/            # All 10 learning notebooks (01–10)
├── pages/                # Streamlit multipage app (expanded)
├── pandas_env/           # Local virtual environment (⚠️ add to .gitignore)
├── scripts/              # Modular reusable utility functions
├── Dockerfile            # Docker support for reproducible environments
├── LICENSE.md
├── README.md             # You're here!
├── requirements.txt      # Minimal dependencies to run the project
├── requirements_dev.txt  # Full dev environment
└── STREAMLIT_App.py      # Interactive dashboard using Streamlit

🧾 Datasets Used

This project uses artificially generated datasets designed to replicate common real-world scenarios. Each file highlights a unique aspect of data handling and analysis using pandas.

| Dataset File | Format | Purpose |
| --- | --- | --- |
| superstore_sales.csv | CSV | Simulated retail sales data for grouping, time series |
| weather_data.json | JSON | Unstructured data for parsing, cleaning, and visualization |
| bank_loans.xlsx | Excel | Tabular data for filtering, EDA, and feature engineering |
| bank_loans_multisheet.xlsx | Excel | Multi-sheet structure for advanced Excel parsing |
| covid_data.parquet | Parquet | Efficient columnar data for joins and time-based analysis |

🛠 These datasets are not from public sources and were created to demonstrate the versatility of pandas across different formats and data challenges. You can find them in the data/ folder.
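
As a rough illustration of how these formats are typically read with pandas, here is a minimal sketch. It uses small in-memory stand-ins rather than the real files in data/, and the column names (order_date, region, sales, city, readings) are assumptions made up for the example, not the actual schemas.

```python
import io
import json

import pandas as pd

# Hypothetical in-memory stand-ins for files in data/; the column names
# are illustrative and not taken from the real datasets.
csv_text = "order_date,region,sales\n2024-01-05,West,120.5\n2024-01-06,East,80.0\n"
json_text = '[{"city": "Pune", "readings": {"temp_c": 31.2, "humidity": 40}}]'

# CSV: parse the date column while loading.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

# JSON: flatten nested records into columns such as "readings.temp_c".
weather = pd.json_normalize(json.loads(json_text))

# Excel and Parquet follow the same pattern (not run here because they need
# real files plus openpyxl / pyarrow):
#   sheets = pd.read_excel(path, sheet_name=None)  # dict of DataFrames, one per sheet
#   covid = pd.read_parquet(path)

print(sales.dtypes)
print(weather.columns.tolist())
```

Passing sheet_name=None to read_excel is what makes a multi-sheet workbook come back as a dictionary keyed by sheet name.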


✅ Modules and Concepts

| Notebook | Concepts |
| --- | --- |
| 01_data_loading.ipynb | Load data, inspect structure, parse dates |
| 02_data_cleaning.ipynb | Handle missing values, type conversion, string ops |
| 03_aggregation_grouping.ipynb | GroupBy, pivot, window functions |
| 04_merging_joining.ipynb | Merge, concat, index joins |
| 05_time_series.ipynb | Resample, rolling, timezone handling |
| 06_advanced_pandas.ipynb | .apply(), .map(), method chaining, memory tuning |
| 07_visualization_with_pandas.ipynb | Bar, line, box, grouped plots |
| 08_final_pipeline.ipynb | End-to-end data workflow pipeline |
| 09_reporting_exporting.ipynb | Export to Excel/CSV/Parquet, styled reports |
| 10_performance_diagnostics.ipynb | Profiling, eval(), categorical, Dask |
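
To give a flavor of the concepts listed for notebooks 02 and 03, the sketch below chains a cleaning step, a GroupBy aggregation, a pivot, and a rolling window on a tiny made-up frame. The data and column names are hypothetical and are not taken from the notebooks themselves.

```python
import pandas as pd

# Hypothetical sales records; values and column names are illustrative only.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]
    ),
    "region": ["East", "West", "East", "West", "East"],
    "sales": [100.0, 150.0, 120.0, None, 90.0],
})

# Cleaning: fill the missing value with the median of its own region.
df["sales"] = df["sales"].fillna(df.groupby("region")["sales"].transform("median"))

# Grouping: total and mean sales per region (named aggregation).
summary = df.groupby("region").agg(total=("sales", "sum"), mean=("sales", "mean"))

# Pivot: one row per date, one column per region.
wide = df.pivot_table(index="date", columns="region", values="sales", aggfunc="sum")

# Window function: 2-day rolling mean of East sales.
east_roll = wide["East"].rolling(window=2).mean()

print(summary)
print(east_roll)
```

Using transform("median") keeps the result aligned with the original rows, which is why it can be passed straight to fillna.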

📚 Learning Outcomes

  • ✅ Develop fluency with pandas core APIs
  • ✅ Build modular, reusable data pipelines
  • ✅ Understand performance bottlenecks in large datasets
  • ✅ Practice version-controlled and containerized data science


Quick Start

Get up and running in under 2 minutes:

# 1. Clone the repository
git clone https://github.com/SatvikPraveen/PandasPlayground.git
cd PandasPlayground

# 2. Run the automated setup script (macOS/Linux)
chmod +x scripts/setup.sh
./scripts/setup.sh

# Or install manually:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3. Launch Jupyter Lab
jupyter lab
# Or start the Streamlit dashboard
streamlit run STREAMLIT_App.py

Using Make (Recommended):

make install        # Install dependencies
make run-jupyter    # Launch Jupyter Lab
make run-streamlit  # Launch Streamlit dashboard
make test           # Run all tests
make help           # See all available commands

📦 Installation

Option 1: Automated Setup (Recommended)

./scripts/setup.sh

This script will:

  • ✅ Check Python version (3.9+)
  • ✅ Create virtual environment
  • ✅ Install all dependencies
  • ✅ Run verification tests

Option 2: Manual Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# For development (includes testing tools)
pip install -r requirements_dev.txt

Option 3: Docker (Isolated Environment)

# Build image
docker build -t pandasplayground .

# Run container in the background; open http://localhost:8899
docker run -d -p 8899:8888 -v "$(pwd)":/app pandasplayground

💻 Using the Project

For Learning

  • 📖 Start with 01_data_loading.ipynb and progress sequentially
  • 🧪 Each notebook includes exercises and real-world examples
  • 📝 Refer to cheatsheets/pandas_cheatsheet.md for quick reference
  • 🎯 All notebooks are standalone and can be explored in any order

For Your Own Projects

  • 🔄 Start with 08_final_pipeline.ipynb as a template
  • 🧩 Reuse functions from scripts/ for your ETL workflows
  • 📊 Customize the Streamlit dashboard for your datasets
  • 🐳 Use Dockerfile for reproducible environments

Interactive Dashboard

streamlit run STREAMLIT_App.py
# Visit http://localhost:8501

Features:

  • 📈 Real-time data visualization
  • 🔍 Interactive filtering and exploration
  • 📊 KPI metrics and trend analysis
  • 📱 Multi-page navigation

📊 Performance Benchmarks

This project includes performance optimization techniques:

| Operation | Dataset Size | Standard pandas | Optimized | Improvement |
| --- | --- | --- | --- | --- |
| Memory Usage | 100K rows | ~45 MB | ~12 MB | 73% reduction |
| GroupBy Aggregation | 1M rows | 2.3s | 0.8s | 65% faster |
| String Operations | 500K rows | 5.1s | 1.2s | 76% faster |

See 10_performance_diagnostics.ipynb for detailed benchmarks and scripts/optimize_memory.py for optimization utilities.
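
For readers who want to see the general idea behind the memory row, here is a minimal sketch of two widely used techniques: downcasting numeric columns and converting low-cardinality string columns to the category dtype. The shrink helper and the synthetic data are assumptions for illustration; this is not the code in scripts/optimize_memory.py, and the numbers it prints will differ from the table above.

```python
import numpy as np
import pandas as pd


def shrink(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy with smaller dtypes where that looks safe (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            # float64 -> float32 when possible; note this reduces precision.
            out[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_string_dtype(s):
            # Only convert columns with few distinct values relative to length.
            if s.nunique() / max(len(s), 1) < max_unique_ratio:
                out[col] = s.astype("category")
    return out


# Synthetic data, generated only for this example.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "region": rng.choice(["East", "West", "North", "South"], size=n),
    "units": rng.integers(0, 100, size=n),
    "price": rng.random(n) * 100,
})

small = shrink(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```

Passing deep=True matters here: without it, pandas may report only the pointer size for string columns and understate the savings.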


โ“ FAQ (Frequently Asked Questions)

Q: I'm getting a "Module not found" error. What should I do?

A: Make sure you've activated your virtual environment and installed all dependencies:

source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Q: Can I use my own datasets?

A: Absolutely! Place your data files in the data/ folder and adapt the notebooks. Start with 08_final_pipeline.ipynb as a template for custom data workflows.

Q: Which notebook should I start with?

A: If you're new to pandas, start with 01_data_loading.ipynb. If you're experienced, jump to any topic of interest. Each notebook is self-contained.

Q: Do I need to run notebooks in order?

A: Not necessarily. While they're designed to build on each other, each notebook can run independently. However, notebooks 1-7 are recommended before attempting 8-10.

Q: How do I contribute a new notebook or feature?

A: See the How to Contribute or Fork section below. We welcome all contributions! Submit a pull request with your changes.

Q: Why use Docker?

A: Docker ensures a consistent environment across different machines, eliminating "works on my machine" issues. It's optional but recommended for deployment.

Q: Can I use this project for teaching?

A: Yes! This project is designed for education. Feel free to use it in courses, workshops, or tutorials. If you redistribute or modify the materials, the GPL-3.0 terms apply: keep the license and copyright notices intact. Additional attribution is appreciated.

Q: How do I update my fork with the latest changes?

A:

git remote add upstream https://github.com/SatvikPraveen/PandasPlayground.git
git fetch upstream
git merge upstream/main

Q: The Streamlit app isn't loading data. What's wrong?

A: Ensure you've run the pipeline notebooks (especially 08_final_pipeline.ipynb) to generate the required export files in the exports/ directory.
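
A quick way to check whether the expected export files exist before launching the app is sketched below. The missing_exports helper and both file names are hypothetical placeholders; substitute whatever files your pipeline run actually writes to exports/.

```python
import tempfile
from pathlib import Path


def missing_exports(exports_dir, expected_files):
    """Return the expected file names that are absent from exports_dir."""
    root = Path(exports_dir)
    return [name for name in expected_files if not (root / name).is_file()]


# Demo in a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sales_summary.csv").write_text("region,total\nEast,310\n")
    # Both file names below are hypothetical placeholders.
    missing = missing_exports(tmp, ["sales_summary.csv", "loan_report.xlsx"])

print(missing)
```

In a real checkout you would call missing_exports("exports", [...]) from the repository root and rerun the pipeline notebook if the list is not empty.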

Q: How do I run tests?

A:

make test  # Using Makefile
# or
pytest -v  # Direct command

🧰 Tools & Libraries

  • pandas - Core data manipulation
  • numpy - Numerical computing
  • matplotlib, seaborn - Data visualization
  • Jupyter, JupyterLab - Interactive notebooks
  • openpyxl, pyarrow - File format support
  • memory_profiler, psutil - Performance profiling
  • Streamlit, Plotly - Interactive dashboards
  • pytest - Testing framework
  • Dask - Parallel computing (optional)

📚 Documentation

Comprehensive guides and references:


🔗 Related Projects

  • 🧮 NumPyMasterPro – Master NumPy with modular walkthroughs


๐Ÿค How to Contribute or Fork

Whether you're fixing a bug, suggesting an enhancement, or adding new learning notebooks โ€” contributions are welcome and appreciated!

๐Ÿ”€ Fork & Clone the Repository

# Step 1: Fork this repository on GitHub
# Step 2: Clone your fork locally (replace <your-username> with your GitHub username)
git clone https://github.com/<your-username>/PandasPlayground.git
cd PandasPlayground

🌱 Create a Feature Branch

Always create a new branch for your changes instead of working on main:

git checkout -b feature/your-feature-name

🛠 Make Your Changes

  • Add your improvements (e.g., a new notebook, function in scripts/, or fixes in requirements.txt)
  • Follow consistent formatting, naming, and markdown style as used across the project
  • Update the README.md or cheatsheets if your change impacts the documentation
  • Test your code locally (if it includes logic)

✅ Commit and Push

git add .
git commit -m "✨ Added: Short summary of your feature"
git push origin feature/your-feature-name

📩 Submit a Pull Request

  • Go to your fork on GitHub
  • Click "Compare & pull request"
  • Provide a clear and concise description of your changes
  • If applicable, reference any related issue (e.g., Fixes #12)
  • Wait for review or feedback

🧪 Contribution Tips

  • Keep changes modular and atomic: one feature or fix per pull request

  • Be sure to sync your fork with the upstream repository periodically:

    git remote add upstream https://github.com/SatvikPraveen/PandasPlayground.git
    git pull upstream main

  • If your feature involves code, prefer writing reusable functions in scripts/ and importing them in your notebooks


๐Ÿ™ Thank You

Every contribution, no matter how small, helps improve this resource for the entire data science community. Letโ€™s build this playground together! ๐ŸŽ‰


📜 License

This project is licensed under the GNU General Public License v3.0. See the LICENSE.md file for more details.


🙋‍♂️ Author

Built with 💻 and ☕ by Satvik Praveen. Drop a ⭐ if you find this project helpful!
