A machine learning project that predicts housing prices in Melbourne using multiple regression algorithms and advanced feature engineering techniques.
This project includes data cleaning, feature engineering, model comparison, hyperparameter tuning, and model interpretation.
The goal of this project is to build a machine learning model capable of predicting house prices in Melbourne using property attributes such as:
- Rooms
- Distance from city
- Property type
- Landsize
- Building area
- Region
- Bathroom count
- Geographic coordinates
- And other property characteristics
The project evaluates multiple regression models and selects the best-performing model using cross-validation and hyperparameter tuning.
The dataset contains 34,857 records and 21 variables describing properties in Melbourne.
- Suburb
- Address
- Rooms: Number of rooms
- Price: Price in Australian dollars, target variable
- Method: S = property sold; SP = property sold prior; PI = property passed in; PN = sold prior not disclosed; SN = sold not disclosed; NB = no bid; VB = vendor bid; W = withdrawn prior to auction; SA = sold after auction; SS = sold after auction price not disclosed; N/A = price or highest bid not available
- Type: br = bedroom(s); h = house, cottage, villa, semi, terrace; u = unit, duplex; t = townhouse; dev site = development site; o res = other residential
- SellerG: Real estate agent
- Date: Date sold
- Distance: Distance from the CBD in kilometres
- Postcode: Postal code of the property
- Regionname: General region (West, North West, North, North East, etc.)
- Propertycount: Number of properties that exist in the suburb
- Bedroom2: Scraped number of bedrooms (from a different source)
- Bathroom: Number of bathrooms
- Car: Number of car spaces
- Landsize: Land size in square metres
- BuildingArea: Building size in square metres
- YearBuilt: Year the house was built
- CouncilArea: Governing council for the area
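- Lattitude: Latitude of the property (spelling follows the dataset's column name)
- Longtitude: Longitude of the property (spelling follows the dataset's column name)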
Target Variable:
- Price
Several preprocessing steps were performed.
- Median imputation for numerical features
- Mode imputation for categorical features
New features were created:
- Month_Sold
- Year_Sold
- Day_Sold
- Building_Age
These were derived from the Date and YearBuilt columns.
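
The derivation of these four features could be sketched as below. This is a minimal illustration, not the project's actual code: the function name `add_date_features`, the day/month/year date format, and the choice to measure building age at the time of sale are all assumptions.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive sale-date parts and building age from Date and YearBuilt."""
    out = df.copy()
    # Assumption: dates are written day/month/year, e.g. "3/09/2016".
    sold = pd.to_datetime(out["Date"], format="%d/%m/%Y")
    out["Year_Sold"] = sold.dt.year
    out["Month_Sold"] = sold.dt.month
    out["Day_Sold"] = sold.dt.day
    # Assumption: age is measured at the time of sale.
    # A missing YearBuilt yields a missing Building_Age.
    out["Building_Age"] = out["Year_Sold"] - out["YearBuilt"]
    return out


# Toy rows standing in for the real dataset.
sample = pd.DataFrame(
    {"Date": ["3/09/2016", "24/02/2018"], "YearBuilt": [1990.0, None]}
)
features = add_date_features(sample)
print(features[["Year_Sold", "Month_Sold", "Day_Sold", "Building_Age"]])
```

Missing values in `Building_Age` would still need imputation afterwards, for example by the median strategy listed earlier.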
Skewed distributions and outliers were handled with:
- Log transformation for skewed features
- IQR-based capping for extreme values
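
Both treatments can be illustrated in a few lines. The sketch below assumes the conventional 1.5 × IQR fences and uses `np.log1p` so that zero values stay defined; the multiplier, the helper name `cap_iqr`, and the choice of `log1p` over a plain logarithm are assumptions rather than details taken from the project.

```python
import numpy as np
import pandas as pd


def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy land sizes with one extreme value.
landsize = pd.Series([100.0, 200.0, 300.0, 400.0, 10_000.0])

capped = cap_iqr(landsize)    # the extreme value is pulled down to the upper fence
logged = np.log1p(landsize)   # log(1 + x) compresses the long right tail

print(capped.tolist())
print(logged.round(3).tolist())
```

Capping keeps every row but limits the influence of extreme values, while the log transform reshapes the whole distribution.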
Columns removed:
- Address
- Date
- YearBuilt
A Scikit-Learn pipeline was used for preprocessing and model training.
- Preprocessing
  - Numerical Pipeline
    - Median Imputation
    - StandardScaler
  - Categorical Pipeline
    - Mode Imputation
    - OneHotEncoder
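
A preprocessing structure like this one is commonly expressed with scikit-learn's `ColumnTransformer`. The following is a sketch under assumptions: the column subsets, the toy data, and the `handle_unknown="ignore"` setting are illustrative and not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumption: illustrative column subsets; the project's exact lists are not given.
numeric_cols = ["Rooms", "Distance"]
categorical_cols = ["Type"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
    # Assumption: categories unseen during fit are encoded as all zeros.
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Toy rows with missing values in every column.
toy = pd.DataFrame({
    "Rooms": [2, 3, np.nan, 4],
    "Distance": [2.5, np.nan, 10.0, 7.0],
    "Type": pd.Series(["h", "u", np.nan, "h"], dtype=object),
})
X = preprocessor.fit_transform(toy)
print(X.shape)
```

Appending a regressor as the last step of a `Pipeline` that starts with `preprocessor` would keep imputation and scaling statistics inside each cross-validation fold, which avoids leaking information from validation data.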
The following regression models were trained and compared:
- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet
- K-Nearest Neighbors
- Decision Tree
- Random Forest
- Gradient Boosting
- AdaBoost
- XGBoost
- LightGBM
- CatBoost
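
A comparison loop over candidate models might look like the sketch below. It uses synthetic data and only a few scikit-learn estimators so that it runs without extra dependencies; the data, the hyperparameters, and the 5-fold setup are assumptions. The XGBoost, LightGBM, and CatBoost regressors follow the scikit-learn estimator interface and could be added to the same dictionary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data stands in for the preprocessed Melbourne features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is returned negated.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    results[name] = -scores.mean()

# Rank models from lowest to highest cross-validated RMSE.
ranking = sorted(results, key=results.get)
for name in ranking:
    print(f"{name}: RMSE = {results[name]:.2f}")
```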
After evaluation, the top-performing models were:
| Model | RMSE | R² |
|---|---|---|
| CatBoost | 0.190 | 0.863 |
| LightGBM | 0.198 | 0.853 |
| XGBoost | 0.198 | 0.852 |
| Random Forest | 0.205 | 0.842 |
The CatBoost model achieved the best performance.
Two techniques were used:
- Grid search: used for structured parameter tuning over a fixed set of candidate values.
- Randomized search: used for faster optimization with random parameter sampling.
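
The two tuning styles described above, a structured search and random parameter sampling, could be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. Whether the project used these exact classes is an assumption, as are the estimator, the parameter ranges, and the synthetic data.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Assumption: synthetic data and a scikit-learn estimator as stand-ins.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Grid search: evaluates every combination in a fixed grid (2 x 2 = 4 candidates).
grid = GridSearchCV(
    model,
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

# Randomized search: draws n_iter candidates from the given distributions.
rand = RandomizedSearchCV(
    model,
    param_distributions={
        "max_depth": randint(2, 6),            # integers 2..5
        "learning_rate": uniform(0.01, 0.2),   # floats in [0.01, 0.21]
    },
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
rand.fit(X, y)

print("grid:", grid.best_params_)
print("random:", rand.best_params_)
```

The cost of grid search grows multiplicatively with each added parameter, whereas randomized search keeps the budget fixed at `n_iter` candidates regardless of how many parameters are sampled.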
Best model parameters (CatBoost):
- iterations = 1000
- depth = 9
- learning_rate = 0.1
Top predictive features include:
- Regionname
- Distance
- Property Type
- Rooms
- Postcode
These features have the highest impact on house price prediction.
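
The text does not say how feature impact was measured. One model-agnostic option is permutation importance, sketched below on synthetic data; the technique, the estimator, and the placeholder feature names are all assumptions and may differ from what the project used.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Assumption: synthetic data with placeholder feature names.
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature in turn and record how much the score drops.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

importance = pd.Series(result.importances_mean, index=names)
importance = importance.sort_values(ascending=False)
print(importance)
```

For an honest estimate, the importance would normally be computed on held-out data rather than on the training set used here for brevity.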
Cross-validation results:
- R²: 0.859 ± 0.015
- MAE: 0.141 ± 0.012
- RMSE: 0.192 ± 0.014
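
Mean ± standard deviation summaries of this kind can be produced with scikit-learn's `cross_validate` and several scorers at once. The sketch below runs on synthetic data with a random forest, so its numbers are unrelated to the results above; the estimator, data, and fold count are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Assumption: synthetic data and a stand-in estimator.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
}
cv = cross_validate(model, X, y, cv=5, scoring=scoring)

summary = {}
for name in scoring:
    scores = cv[f"test_{name}"]
    if name != "r2":
        scores = -scores  # error metrics are returned negated
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```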
Real estate companies need accurate house price predictions to help buyers, sellers, and investors make informed decisions.
This project builds a machine learning model to estimate property prices in Melbourne based on property characteristics and location features.
The model can be used for:
- Property valuation
- Real estate investment analysis
- Market trend analysis
- Automated pricing systems