This project predicts house sale prices from the numerical features of the House Prices - Advanced Regression Techniques dataset (Kaggle). The notebook covers data preprocessing, feature selection, regression modeling, and evaluation.
Goal: Predict house prices from numerical property features using regression techniques.
Dataset: House Prices - Advanced Regression Techniques (Kaggle)
Key Steps:
- Select numeric features
- Handle missing values
- Explore correlations with target (SalePrice)
- Scale features and compare models
- Train & evaluate Linear Regression and XGBoost models
- Visualize residuals, predictions, and feature importance
- Loaded training (1460 rows, 81 columns) and test (1459 rows, 80 columns) datasets
- Initial exploration shows 35 numeric and 43 categorical features
- Target variable: SalePrice (house sale price)
- SalePrice distribution shows right-skewness
- Statistical summary (mean: $180,921, std: $79,442)
- Visualized using histogram with KDE plot
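The right-skew noted above can be checked numerically as well as visually. The sketch below uses synthetic log-normal prices rather than the real SalePrice column, so the distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for SalePrice (assumed log-normal; NOT the Kaggle data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Positive sample skewness indicates a longer right tail
skewness = prices.skew()

# In a right-skewed distribution the mean typically sits above the median
print(f"skewness: {skewness:.2f}")
print(f"mean:   {prices.mean():,.0f}")
print(f"median: {prices.median():,.0f}")
```

On the real data, `prices` would be replaced by the SalePrice column of the training set.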
- Identified outliers using GrLivArea vs SalePrice plot
- Removed houses with GrLivArea > 4000 (extremely large houses with low prices)
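The outlier rule above amounts to a single boolean filter. A minimal sketch, assuming a DataFrame with a GrLivArea column; the helper name `drop_large_area_outliers` and the toy rows are hypothetical, not taken from the notebook.

```python
import pandas as pd

def drop_large_area_outliers(df, max_area=4000):
    """Return a copy of df without rows whose GrLivArea exceeds max_area."""
    return df[df["GrLivArea"] <= max_area].reset_index(drop=True)

# Toy rows for illustration only (NOT the Kaggle data)
toy = pd.DataFrame({
    "GrLivArea": [1500, 2200, 4700, 5600],
    "SalePrice": [180_000, 250_000, 185_000, 160_000],
})

cleaned = drop_large_area_outliers(toy)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Only the training set should be filtered this way, since dropping test rows would leave predictions missing for them.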
- Selected numerical features only
- Handled missing values using SimpleImputer
- Explored feature correlations with SalePrice
- Split data into train/test sets (80/20)
- Scaled features using StandardScaler
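The preprocessing steps above could look roughly like the sketch below. It runs on a small synthetic frame, and the median imputation strategy and `random_state` are assumptions, since the source does not state them. One deliberate difference from the order listed: the imputer here is fitted after the split, on the training portion only, so that no statistics from the held-out rows leak into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (NOT the Kaggle data)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "LotFrontage": rng.normal(70, 20, n),
    "Neighborhood": rng.choice(["NAmes", "OldTown"], n),  # categorical, dropped below
})
toy.loc[::10, "LotFrontage"] = np.nan  # inject some missing values
toy["SalePrice"] = 100 * toy["GrLivArea"] + rng.normal(0, 10_000, n)

# 1. Keep numeric columns only and separate the target
numeric = toy.select_dtypes(include="number")
X = numeric.drop(columns="SalePrice")
y = numeric["SalePrice"]

# 2. 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Impute, then scale; both are fitted on the training split only
imputer = SimpleImputer(strategy="median")  # strategy is an assumption
scaler = StandardScaler()
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prep = scaler.transform(imputer.transform(X_test))

print(X_train_prep.shape, X_test_prep.shape)
```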
- Implemented:
  - Linear Regression
  - XGBoost (gradient boosting)
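As an illustration of the modeling step, the sketch below fits scikit-learn's `LinearRegression` on synthetic data with known coefficients. The data and coefficients are assumptions made up for the example. XGBoost's `XGBRegressor` exposes the same `fit`/`predict` interface, so it could be swapped in at the marked line if the xgboost package is installed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features with a known linear relationship (NOT the Kaggle data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# Any estimator with fit/predict could be used here instead
model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# The fitted coefficients should land close to the true values 3, -2 and 5
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```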
- Mean Squared Error (MSE)
- R² Score
- Mean Absolute Error (MAE)
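The three metrics above are all available in `sklearn.metrics`. A minimal sketch on hand-written values (the numbers are invented for illustration, not results from the notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual and predicted prices, for illustration only
y_true = np.array([200_000, 150_000, 300_000])
y_pred = np.array([210_000, 140_000, 290_000])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE: {mse:,.0f}")
print(f"MAE: {mae:,.0f}")
print(f"R²:  {r2:.4f}")
```

MSE is expressed in squared dollars and penalizes large misses more heavily, while MAE stays in dollars and is easier to read as a typical error size.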
- Residual plots
- Prediction vs actual values
- Feature importance analysis
Libraries Used:
- pandas, numpy: Data manipulation
- matplotlib, seaborn: Visualization
- scikit-learn: Preprocessing and machine learning
- xgboost: Gradient boosting implementation
Key Techniques:
- Correlation analysis for feature selection
- Handling missing values with imputation
- Feature scaling for model performance
- Residual analysis for model diagnostics
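Correlation-based feature selection, the first technique listed, can be sketched as ranking features by the absolute value of their correlation with SalePrice. The frame below is synthetic, and the noise column `NoiseFeature` is a hypothetical name added only to give the ranking something to separate.

```python
import numpy as np
import pandas as pd

# Synthetic data: one informative feature, one pure-noise feature (NOT the Kaggle data)
rng = np.random.default_rng(1)
n = 300
area = rng.normal(1500, 400, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "NoiseFeature": rng.normal(0, 1, n),  # hypothetical, unrelated to price
    "SalePrice": 100 * area + rng.normal(0, 20_000, n),
})

# Pearson correlation of every feature with the target
corr = toy.corr()["SalePrice"].drop("SalePrice")

# Rank by absolute correlation, strongest first
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

A threshold on `ranked` is one simple way to pick features, though it only captures linear relationships with the target.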
- Ensure required libraries are installed
- Place dataset files in the ../data/ directory:
  - train.csv
  - test.csv
- Run the notebook sequentially
- Models will be trained and evaluated automatically
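For reference, loading the two files from the directory named above might look like the following. The helper name `load_datasets` is hypothetical; the notebook may read the files differently.

```python
from pathlib import Path

import pandas as pd

def load_datasets(data_dir="../data"):
    """Read train.csv and test.csv from data_dir and return them as DataFrames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    return train, test
```

Calling `train, test = load_datasets()` from the notebook's folder and printing `train.shape` and `test.shape` is a quick way to confirm the files are in place before running the remaining cells.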
This project addresses the Kaggle House Prices competition, which challenges participants to predict residential home prices in Ames, Iowa, using 79 explanatory features that describe the properties.