Author: Mohammad Taha — November 2025
This project develops a regression model that predicts vehicle CO₂ emissions from technical specifications such as engine size, fuel consumption, and vehicle class. Using data from Transport Canada, the final Random Forest Regressor reaches an R² of 0.9972 on the held-out test set, with a mean absolute error of 1.94 g/km.
The model provides actionable insights for environmental policy, vehicle purchasing decisions, and automotive engineering, demonstrating how machine learning can support sustainability goals.
Transportation is a major contributor to greenhouse gas emissions. Accurate prediction of vehicle CO₂ emissions enables:
- Informed consumer choices: Help buyers select environmentally friendly vehicles
- Regulatory compliance: Support emission standards enforcement
- Automotive R&D: Guide manufacturers in designing cleaner vehicles
- Carbon footprint analysis: Enable accurate environmental impact assessments
- Policy evaluation: Test effectiveness of emission reduction strategies
Source: Transport Canada Vehicle Emissions Database (via Kaggle)
Size: 7,385 vehicles
Coverage: Multiple model years and manufacturers
Quality: Complete data with no missing values
| Feature | Type | Description |
|---|---|---|
| Make | Categorical | Vehicle manufacturer (e.g., Toyota, Ford, BMW) |
| Model | Categorical | Specific vehicle model |
| Vehicle Class | Categorical | Body type (SUV, Sedan, Pickup, Compact, etc.) |
| Engine Size(L) | Numeric | Engine displacement in liters |
| Cylinders | Numeric | Number of engine cylinders |
| Transmission | Categorical | Transmission type and number of gears |
| Fuel Type | Categorical | Gasoline, Diesel, E85, etc. |
| Fuel Consumption City (L/100 km) | Numeric | City driving fuel consumption |
| Fuel Consumption Hwy (L/100 km) | Numeric | Highway driving fuel consumption |
| Fuel Consumption Comb (L/100 km) | Numeric | Combined fuel consumption |
| Fuel Consumption Comb (mpg) | Numeric | Combined fuel economy (miles per gallon) |
| CO2 Emissions(g/km) | Numeric (Target) | Carbon dioxide emissions |
- Python 3.x: Core programming language
- scikit-learn: Machine learning models and pipelines
- pandas: Data manipulation and analysis
- NumPy: Numerical computing
- Matplotlib & Seaborn: Data visualization
- KaggleHub: Dataset integration
- Jupyter Notebook: Interactive development
- Completeness check: Verified no missing values
- Duplicate handling: Duplicates kept intentionally, since they reflect the market distribution
- Outlier analysis: Outliers identified but retained, since they represent real extreme cases
- Distribution analysis: Examined target variable and feature relationships
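The checks listed above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the `quality_report` helper is hypothetical, the sample rows are made up, and the 1.5 × IQR rule is an assumption, since the source does not say which outlier criterion was used.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, target: str) -> dict:
    """Count missing values, duplicate rows, and IQR outliers in the target."""
    q1, q3 = df[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = df[(df[target] < lower) | (df[target] > upper)]
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "target_outliers": int(len(outliers)),
    }


# Made-up rows for illustration only (not taken from the dataset).
sample = pd.DataFrame({
    "Engine Size(L)": [2.0, 2.0, 2.4, 3.5, 1.6, 8.0],
    "CO2 Emissions(g/km)": [196, 196, 221, 255, 180, 522],
})

report = quality_report(sample, "CO2 Emissions(g/km)")
print(report)
```

Counting rather than dropping matches the choice described above: duplicates and outliers are reported but stay in the data.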
Three engineered features were created:
- engine_per_cylinder: Engine size divided by number of cylinders
  - Captures engine design efficiency
  - Larger values indicate bigger cylinders → higher emissions
- city_hwy_ratio: City fuel consumption / Highway fuel consumption
  - Measures urban inefficiency
  - Higher values indicate vehicles poorly optimized for city driving
- total_consumption: Sum of city and highway consumption
  - Aggregate fuel usage metric
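A minimal sketch of how these three features could be derived with pandas, using the column names from the feature table. The `add_engineered_features` helper and the two sample rows are hypothetical and only serve as illustration.

```python
import pandas as pd

CITY = "Fuel Consumption City (L/100 km)"
HWY = "Fuel Consumption Hwy (L/100 km)"


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered columns appended."""
    out = df.copy()
    out["engine_per_cylinder"] = out["Engine Size(L)"] / out["Cylinders"]
    out["city_hwy_ratio"] = out[CITY] / out[HWY]
    out["total_consumption"] = out[CITY] + out[HWY]
    return out


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Engine Size(L)": [2.0, 3.0],
    "Cylinders": [4, 6],
    CITY: [9.0, 12.0],
    HWY: [6.0, 8.0],
})

features = add_engineered_features(sample)
print(features[["engine_per_cylinder", "city_hwy_ratio", "total_consumption"]])
```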
- Categorical encoding: One-hot encoding for Make, Model, Vehicle Class, Transmission, Fuel Type
- Feature scaling: StandardScaler for numeric features (essential for SVR)
- Train/Validation/Test split: 60% / 20% / 20%
- Pipeline construction: Integrated preprocessing and modeling
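The preprocessing steps above might be wired together as follows. This is a sketch under stated assumptions: the data is synthetic, only a subset of the columns is used (the real dataset also has Make, Model, and Transmission), and the step names, `random_state` values, and `handle_unknown="ignore"` setting are illustrative choices, not confirmed settings from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["Vehicle Class", "Fuel Type"]
numeric = ["Engine Size(L)", "Fuel Consumption Comb (L/100 km)"]

# Synthetic stand-in for the real dataset (illustration only).
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "Vehicle Class": rng.choice(["SUV", "Compact", "Pickup"], size=n),
    "Fuel Type": rng.choice(["Gasoline", "Diesel", "E85"], size=n),
    "Engine Size(L)": rng.uniform(1.0, 6.0, size=n),
    "Fuel Consumption Comb (L/100 km)": rng.uniform(5.0, 20.0, size=n),
})
y = 23.0 * X["Fuel Consumption Comb (L/100 km)"] + rng.normal(0, 3, size=n)

# 60/20/20 split: hold out 20% for test, then 25% of the remaining 80% for validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

# One-hot encode categoricals, standardize numerics.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Preprocessing and model in one pipeline, so scaling is fit on training data only.
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
val_predictions = pipeline.predict(X_val)
print(len(X_train), len(X_val), len(X_test))
```

Keeping the encoder and scaler inside the pipeline means the validation and test sets never influence the fitted preprocessing statistics.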
Four regression algorithms were evaluated:
- Linear Regression
  - Baseline model
  - Surprisingly effective (validation R² = 0.9906)
  - Indicates a near-linear relationship in the transformed feature space
- Random Forest
  - Ensemble of 100 decision trees
  - Handles non-linear relationships
  - Robust to outliers and feature interactions
- SVR (RBF)
  - Non-linear mapping via the kernel trick
  - Requires careful feature scaling
- Gradient Boosting
  - Sequential ensemble learning
  - Strong performance but slightly behind Random Forest
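A comparison of the four model families named in the results table (Linear Regression, Random Forest, SVR with an RBF kernel, Gradient Boosting) could look roughly like this. Only the 100-tree forest and the RBF kernel come from the text; the synthetic data, `C=100.0`, and the `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic, near-linear data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, size=300)
X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # SVR is scale-sensitive, so it is wrapped with a scaler; C is an assumed value.
    "SVR (RBF)": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0)),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

val_r2 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_r2[name] = r2_score(y_val, model.predict(X_val))
    print(f"{name}: validation R2 = {val_r2[name]:.4f}")
```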
| Model | R² (Train) | RMSE (Train) | MAE (Train) | R² (Val) | RMSE (Val) | MAE (Val) |
|---|---|---|---|---|---|---|
| Linear Regression | 0.9966 | 3.43 | 1.83 | 0.9906 | 5.64 | 3.24 |
| Random Forest | 0.9995 | 1.32 | 0.80 | 0.9960 | 3.69 | 1.99 |
| SVR (RBF) | 0.9891 | 6.14 | 1.80 | 0.9901 | 5.82 | 3.33 |
| Gradient Boosting | 0.9975 | 2.92 | 2.29 | 0.9954 | 3.97 | 2.47 |
The Random Forest Regressor achieved:
- R² (Test): 0.9972
- RMSE (Test): 3.03 g/km
- MAE (Test): 1.94 g/km
- High predictive accuracy: R² > 0.99 on validation and test data indicates that CO₂ emissions are highly predictable from vehicle specifications
- Low prediction error: A mean absolute error of 1.94 g/km on the test set is small enough for practical use
- Strong generalization: Test R² (0.9972) is in line with validation R² (0.9960)
- Feature importance: Combined fuel consumption is the dominant predictor, followed by engine design metrics
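For reference, the three metrics reported above (R², RMSE, MAE) can be computed from predictions as in the sketch below. The `regression_metrics` helper and the four sample values are hypothetical; scikit-learn's `r2_score`, `mean_squared_error`, and `mean_absolute_error` give the same quantities.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute R², RMSE, and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Made-up emissions in g/km, for illustration only.
metrics = regression_metrics([200, 250, 300, 350], [202, 247, 301, 352])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, which is why the reported RMSE (3.03 g/km) sits above the MAE (1.94 g/km).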
The Random Forest model identified the most influential predictors:
- Fuel Consumption Comb (L/100 km): Dominant factor (for a given fuel type, CO₂ output is close to proportional to fuel burned)
- engine_per_cylinder: Engine design efficiency metric
- city_hwy_ratio: Urban driving inefficiency indicator
- Vehicle Class: Body type and weight category
- Fuel Type: Different fuels have different carbon content per liter
These findings are consistent with physical and engineering principles, which increases confidence in the model's reliability.
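A ranking like the one above can be read from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which fuel consumption dominates by construction, so it illustrates the mechanism rather than reproducing the project's numbers; `noise_feature` is a made-up column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target is driven mostly by combined fuel consumption.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Fuel Consumption Comb (L/100 km)": rng.uniform(5, 20, n),
    "engine_per_cylinder": rng.uniform(0.3, 0.8, n),
    "noise_feature": rng.normal(size=n),  # hypothetical irrelevant column
})
y = (
    23.0 * X["Fuel Consumption Comb (L/100 km)"]
    + 10.0 * X["engine_per_cylinder"]
    + rng.normal(0, 2, n)
)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

When the forest sits behind one-hot encoding, the importances are per encoded column; the preprocessor's `get_feature_names_out()` can supply matching labels.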
vehicle-co2-emissions-prediction/
├── vehicle_co2_emissions_prediction.ipynb # Main analysis notebook
├── README.md # This file
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
└── .gitignore # Git ignore rules
- Python 3.8+
- Minimum 4GB RAM
- Clone the repository: `git clone https://github.com/tahamohmadf19-dev/vehicle-co2-emissions-prediction.git`, then `cd vehicle-co2-emissions-prediction`
- Install dependencies: `pip install -r requirements.txt`
- Run the notebook: `jupyter notebook vehicle_co2_emissions_prediction.ipynb`
- Upload the notebook to Google Colab
- The dataset will be automatically downloaded via KaggleHub
- Run all cells sequentially
- Compare emissions of different vehicles before purchase
- Estimate environmental impact of driving habits
- Calculate carbon footprint for personal sustainability goals
- Evaluate effectiveness of emission standards
- Identify vehicle categories requiring stricter regulations
- Design incentive programs for low-emission vehicles
- Benchmark new designs against market standards
- Optimize engine and transmission configurations
- Predict regulatory compliance before production
- Temporal analysis: Incorporate model year trends to track emission improvements over time
- Electric vehicle integration: Extend model to handle EVs and plug-in hybrids
- Uncertainty quantification: Implement prediction intervals using quantile regression
- Interactive web application: Deploy as a user-friendly emissions calculator
- API development: Create REST API for integration with automotive databases
- Multi-pollutant modeling: Extend to NOx, particulate matter, and other emissions
- Causal inference: Use causal models to identify design changes that reduce emissions
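As a rough idea of what the uncertainty quantification item might involve, the sketch below builds a 90% prediction interval from two quantile regressors. It is not part of the current project: it swaps in `GradientBoostingRegressor` with `loss="quantile"` because scikit-learn's `RandomForestRegressor` does not offer a quantile loss, and the data and quantile levels (0.05 and 0.95) are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic fuel consumption (L/100 km) vs. emissions (g/km), illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(5, 20, size=(400, 1))
y = 23.0 * X[:, 0] + rng.normal(0, 5, size=400)

# One model per quantile: the 5th and 95th percentiles bound a 90% interval.
lower_model = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0)
upper_model = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=0)
lower_model.fit(X, y)
upper_model.fit(X, y)

lower = lower_model.predict(X)
upper = upper_model.predict(X)

# Fraction of observed targets falling inside the predicted interval (in-sample).
coverage = float(np.mean((y >= lower) & (y <= upper)))
mean_width = float(np.mean(upper - lower))
print(f"coverage = {coverage:.2f}, mean interval width = {mean_width:.1f} g/km")
```

In practice coverage would be checked on held-out data, since in-sample coverage tends to be optimistic.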
This project demonstrates:
- End-to-end regression modeling workflow
- Feature engineering for domain-specific problems
- Model selection and hyperparameter tuning
- Interpretability and validation of ML models
- Application of machine learning to environmental challenges
- Best practices in data science project documentation
- Geographic scope: Data represents Canadian market; may not generalize to other regions
- Temporal scope: Does not account for technological evolution over time
- Correlation vs causation: Model captures empirical patterns, not causal mechanisms
- Real-world driving: Test cycle data may differ from actual on-road emissions
This project is licensed under the MIT License - see the LICENSE file for details.
- Transport Canada Vehicle Emissions Database
- Kaggle Dataset: CO2 Emission by Vehicles
- scikit-learn Documentation
- IPCC Guidelines for Greenhouse Gas Inventories
Mohammad Taha
📧 Email: tahamohmadf19@gmail.com
🔗 LinkedIn: Mohmad Taha Alhmad
🐙 GitHub: @tahamohmadf19-dev
This project contributes to environmental sustainability by making vehicle emissions transparent and predictable through data science.