Master data manipulation with pandas, from fundamentals to advanced performance tuning, using real-world datasets and modular notebooks.
PandasPlayground is designed to help aspiring data scientists and analysts master the entire pandas ecosystem through hands-on, progressive, and fully documented Jupyter notebooks. Each module targets a specific capability, from data loading and cleaning to advanced transformations and memory profiling, ensuring a complete learning and review reference.
🚀 Preview: Explore datasets, visualize trends, and profile performance, all in one playground!
```
PandasPlayground/
├── assets/               # Charts, exports, and visual output images
├── cheatsheets/          # Markdown-based reference sheets (e.g., pandas_cheatsheet.md)
├── data/                 # Raw datasets (CSV, Excel, JSON, Parquet)
├── exports/              # Final output files (CSV, Excel, styled reports)
├── notebooks/            # All 10 learning notebooks (01-10)
├── pages/                # Streamlit multipage app (expanded)
├── pandas_env/           # Local virtual environment (⚠️ add to .gitignore)
├── scripts/              # Modular reusable utility functions
├── Dockerfile            # Docker support for reproducible environments
├── LICENSE.md
├── README.md             # You're here!
├── requirements.txt      # Minimal dependencies to run the project
├── requirements_dev.txt  # Full dev environment
└── STREAMLIT_App.py      # Interactive dashboard using Streamlit
```
This project uses artificially generated datasets designed to replicate common real-world scenarios. Each file highlights a unique aspect of data handling and analysis using pandas.
| Dataset File | Format | Purpose |
|---|---|---|
| `superstore_sales.csv` | CSV | Simulated retail sales data for grouping, time series |
| `weather_data.json` | JSON | Unstructured data for parsing, cleaning, and visualization |
| `bank_loans.xlsx` | Excel | Tabular data for filtering, EDA, and feature engineering |
| `bank_loans_multisheet.xlsx` | Excel | Multi-sheet structure for advanced Excel parsing |
| `covid_data.parquet` | Parquet | Efficient columnar data for joins and time-based analysis |
📌 These datasets are not from public sources and were created to demonstrate the versatility of pandas across different formats and data challenges. You can find them in the `data/` folder.
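Loading these formats follows the same pattern throughout the notebooks. The sketch below is illustrative, not the repository's actual code: the column names (`order_date`, `region`, `sales`, `city`, `temp_c`) are invented stand-ins for the real schemas, and in-memory samples replace the files so the snippet runs anywhere:

```python
import io
import pandas as pd

# In-memory stand-ins for data/superstore_sales.csv and data/weather_data.json
# (column names here are illustrative guesses, not the real schema).
csv_sample = io.StringIO(
    "order_date,region,sales\n"
    "2023-01-05,West,120.50\n"
    "2023-01-06,East,89.99\n"
)
sales = pd.read_csv(csv_sample, parse_dates=["order_date"])

json_sample = io.StringIO('[{"city": "Austin", "temp_c": 31.2}]')
weather = pd.read_json(json_sample)

print(sales["order_date"].dtype)  # datetime64[ns]

# The real files would load the same way:
# sales = pd.read_csv("data/superstore_sales.csv", parse_dates=[...])
# loans = pd.read_excel("data/bank_loans.xlsx")        # needs openpyxl
# covid = pd.read_parquet("data/covid_data.parquet")   # needs pyarrow
```

Note that `parse_dates` converts date columns at load time, which is both faster and cleaner than converting afterwards with `pd.to_datetime`.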
| Notebook | Concepts |
|---|---|
| `01_data_loading.ipynb` | Load data, inspect structure, parse dates |
| `02_data_cleaning.ipynb` | Handle missing values, type conversion, string ops |
| `03_aggregation_grouping.ipynb` | GroupBy, pivot, window functions |
| `04_merging_joining.ipynb` | Merge, concat, index joins |
| `05_time_series.ipynb` | Resample, rolling, timezone handling |
| `06_advanced_pandas.ipynb` | `.apply()`, `.map()`, method chaining, memory tuning |
| `07_visualization_with_pandas.ipynb` | Bar, line, box, grouped plots |
| `08_final_pipeline.ipynb` | End-to-end data workflow pipeline |
| `09_reporting_exporting.ipynb` | Export to Excel/CSV/Parquet, styled reports |
| `10_performance_diagnostics.ipynb` | Profiling, `eval()`, categorical dtypes, Dask |
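As a taste of what notebooks 03 and 06 cover, here is a small grouped aggregation written in the method-chaining style; the frame is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["West", "East", "West", "East"],
    "sales": [120.5, 89.99, 200.0, 150.0],
})

# Method chaining: each step returns a new object, so the whole
# transformation reads top to bottom without temporary variables.
summary = (
    df.groupby("region", as_index=False)
      .agg(total_sales=("sales", "sum"), orders=("sales", "count"))
      .sort_values("total_sales", ascending=False)
      .reset_index(drop=True)
)
print(summary)
# West totals 320.50 (2 orders); East totals 239.99 (2 orders)
```

Named aggregation (`total_sales=("sales", "sum")`) keeps the output columns self-documenting, which pays off once pipelines grow past a few steps.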
- ✅ Develop fluency with pandas core APIs
- ✅ Build modular, reusable data pipelines
- ✅ Understand performance bottlenecks in large datasets
- ✅ Practice version-controlled and containerized data science
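For a quick preview of the time-series material (notebook 05), the core moves are resampling, rolling windows, and timezone conversion. The dates and values below are made up for illustration:

```python
import pandas as pd

# Ten daily readings, timezone-aware; values are arbitrary.
idx = pd.date_range("2023-01-01", periods=10, freq="D", tz="UTC")
ts = pd.Series(range(10), index=idx, dtype="float64")

weekly = ts.resample("W").mean()      # downsample to week-ending means
smooth = ts.rolling(window=2).mean()  # 2-day rolling average
local = ts.tz_convert("US/Central")   # convert the index to another timezone

print(weekly)
```

Because the index is timezone-aware, `tz_convert` relabels every timestamp correctly across DST boundaries, something manual offset arithmetic gets wrong.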
Get up and running in under 2 minutes:
```bash
# 1. Clone the repository
git clone https://github.com/SatvikPraveen/PandasPlayground.git
cd PandasPlayground

# 2. Run the automated setup script (macOS/Linux)
chmod +x scripts/setup.sh
./scripts/setup.sh

# Or install manually:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3. Launch Jupyter Lab
jupyter lab

# Or start the Streamlit dashboard
streamlit run STREAMLIT_App.py
```

Using Make (Recommended):
```bash
make install        # Install dependencies
make run-jupyter    # Launch Jupyter Lab
make run-streamlit  # Launch Streamlit dashboard
make test           # Run all tests
make help           # See all available commands
```

Or run the setup script directly:

```bash
./scripts/setup.sh
```

This script will:
- ✅ Check Python version (3.9+)
- ✅ Create virtual environment
- ✅ Install all dependencies
- ✅ Run verification tests
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# For development (includes testing tools)
pip install -r requirements_dev.txt
```

With Docker:

```bash
# Build image
docker build -t pandasplayground .

# Run container on http://localhost:8899
docker run -p 8899:8888 -v $(pwd):/app pandasplayground
```

For learners:

- 📘 Start with `01_data_loading.ipynb` and progress sequentially
- 🧪 Each notebook includes exercises and real-world examples
- 📚 Refer to `cheatsheets/pandas_cheatsheet.md` for quick reference
- 🎯 All notebooks are standalone and can be explored in any order

For practitioners:

- 🚀 Start with `08_final_pipeline.ipynb` as a template
- 🧩 Reuse functions from `scripts/` for your ETL workflows
- 📊 Customize the Streamlit dashboard for your datasets
- 🐳 Use the `Dockerfile` for reproducible environments
```bash
streamlit run STREAMLIT_App.py
# Visit http://localhost:8501
```

Features:

- 📊 Real-time data visualization
- 🔍 Interactive filtering and exploration
- 📈 KPI metrics and trend analysis
- 📱 Multi-page navigation
This project includes performance optimization techniques:
| Operation | Dataset Size | Standard pandas | Optimized | Improvement |
|---|---|---|---|---|
| Memory Usage | 100K rows | ~45 MB | ~12 MB | 73% reduction |
| GroupBy Aggregation | 1M rows | 2.3s | 0.8s | 65% faster |
| String Operations | 500K rows | 5.1s | 1.2s | 76% faster |
See `10_performance_diagnostics.ipynb` for detailed benchmarks and `scripts/optimize_memory.py` for optimization utilities.
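Savings like the memory figures above typically come from dtype choices. The snippet below is a generic illustration of the categorical-dtype trick, not the repository's exact benchmark:

```python
import pandas as pd

# A low-cardinality string column: 100K rows, only four distinct values.
n = 100_000
df = pd.DataFrame({"region": ["North", "South", "East", "West"] * (n // 4)})

before = df["region"].memory_usage(deep=True)
df["region"] = df["region"].astype("category")  # store int codes + a small lookup table
after = df["region"].memory_usage(deep=True)

print(f"object: {before:,} B -> category: {after:,} B "
      f"({100 * (1 - after / before):.0f}% smaller)")
```

Categoricals also speed up `groupby` and comparisons on such columns, since operations work on small integer codes instead of Python strings.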
Q: I'm getting a "Module not found" error. What should I do?
A: Make sure you've activated your virtual environment and installed all dependencies:
```bash
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Q: Can I use my own datasets?
A: Absolutely! Place your data files in the `data/` folder and adapt the notebooks. Start with `08_final_pipeline.ipynb` as a template for custom data workflows.
Q: Which notebook should I start with?
A: If you're new to pandas, start with `01_data_loading.ipynb`. If you're experienced, jump to any topic of interest. Each notebook is self-contained.
Q: Do I need to run notebooks in order?
A: Not necessarily. While they're designed to build on each other, each notebook can run independently. However, notebooks 1-7 are recommended before attempting 8-10.
Q: How do I contribute a new notebook or feature?
A: See the Contributing section below. We welcome all contributions! Submit a pull request with your changes.
Q: Why use Docker?
A: Docker ensures a consistent environment across different machines, eliminating "works on my machine" issues. It's optional but recommended for deployment.
Q: Can I use this project for teaching?
A: Yes! This project is designed for education. Feel free to use it in courses, workshops, or tutorials. Attribution is appreciated but not required under the GPL-3.0 license.
Q: How do I update my fork with the latest changes?
A:
```bash
git remote add upstream https://github.com/SatvikPraveen/PandasPlayground.git
git fetch upstream
git merge upstream/main
```

Q: The Streamlit app isn't loading data. What's wrong?
A: Ensure you've run the pipeline notebooks (especially `08_final_pipeline.ipynb`) to generate the required export files in the `exports/` directory.
Q: How do I run tests?
A:
```bash
make test   # Using Makefile
# or
pytest -v   # Direct command
```

Key dependencies:

- pandas - Core data manipulation
- numpy - Numerical computing
- matplotlib, seaborn - Data visualization
- Jupyter, JupyterLab - Interactive notebooks
- openpyxl, pyarrow - File format support
- memory_profiler, psutil - Performance profiling
- Streamlit, Plotly - Interactive dashboards
- pytest - Testing framework
- Dask - Parallel computing (optional)
Comprehensive guides and references:
- 📊 Data Dictionary - Complete dataset documentation
- ⚡ Performance Guidelines - Optimization techniques and benchmarks
- 📓 Notebook Template - Template for creating new notebooks
- 📋 Quick Reference - Pandas cheat sheet
- 🧮 NumPyMasterPro - Master NumPy with modular walkthroughs
Whether you're fixing a bug, suggesting an enhancement, or adding new learning notebooks, contributions are welcome and appreciated!
```bash
# Step 1: Fork this repository on GitHub
# Step 2: Clone your fork locally
git clone https://github.com/SatvikPraveen/PandasPlayground.git
cd PandasPlayground
```

Always create a new branch for your changes instead of working on main:
```bash
git checkout -b feature/your-feature-name
```

- Add your improvements (e.g., a new notebook, a function in `scripts/`, or fixes in `requirements.txt`)
- Follow consistent formatting, naming, and markdown style as used across the project
- Update the README.md or cheatsheets if your change impacts the documentation
- Test your code locally (if it includes logic)
```bash
git add .
git commit -m "✨ Added: Short summary of your feature"
git push origin feature/your-feature-name
```

- Go to your fork on GitHub
- Click "Compare & pull request"
- Provide a clear and concise description of your changes
- If applicable, reference any related issue (e.g., `Fixes #12`)
- Wait for review or feedback
- Keep changes modular and atomic: one feature or fix per pull request
- Sync your fork with the upstream repository periodically:

  ```bash
  git remote add upstream https://github.com/SatvikPraveen/PandasPlayground.git
  git pull upstream main
  ```

- If your feature involves code, prefer writing reusable functions in `scripts/` and importing them in your notebooks
Every contribution, no matter how small, helps improve this resource for the entire data science community. Let's build this playground together! 🚀
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.
Built with 💻 and ❤️ by Satvik Praveen. Drop a ⭐ if you find this project helpful!
