Vehicle CO₂ Emissions Prediction

Author: Mohammad Taha — November 2025

Project Overview

This project builds a regression model that predicts vehicle CO₂ emissions (g/km) from technical specifications such as engine size, fuel consumption, and vehicle class. Trained on Transport Canada data, the final Random Forest Regressor reaches R² = 0.9972 on the held-out test set, with a mean absolute error (MAE) of 1.94 g/km.

The model provides actionable insights for environmental policy, vehicle purchasing decisions, and automotive engineering, demonstrating how machine learning can support sustainability goals.

Environmental Context

Transportation is a major contributor to greenhouse gas emissions. Accurate prediction of vehicle CO₂ emissions enables:

Informed consumer choices: Help buyers select environmentally friendly vehicles
Regulatory compliance: Support emission standards enforcement
Automotive R&D: Guide manufacturers in designing cleaner vehicles
Carbon footprint analysis: Enable accurate environmental impact assessments
Policy evaluation: Test effectiveness of emission reduction strategies

Dataset

Source: Transport Canada Vehicle Emissions Database (via Kaggle)
Size: 7,385 vehicles
Coverage: Multiple model years and manufacturers
Quality: Complete data with no missing values

Key Features

| Feature | Type | Description |
| --- | --- | --- |
| `Make` | Categorical | Vehicle manufacturer (e.g., Toyota, Ford, BMW) |
| `Model` | Categorical | Specific vehicle model |
| `Vehicle Class` | Categorical | Body type (SUV, Sedan, Pickup, Compact, etc.) |
| `Engine Size(L)` | Numeric | Engine displacement in liters |
| `Cylinders` | Numeric | Number of engine cylinders |
| `Transmission` | Categorical | Transmission type and number of gears |
| `Fuel Type` | Categorical | Gasoline, Diesel, E85, etc. |
| `Fuel Consumption City (L/100 km)` | Numeric | City driving fuel consumption |
| `Fuel Consumption Hwy (L/100 km)` | Numeric | Highway driving fuel consumption |
| `Fuel Consumption Comb (L/100 km)` | Numeric | Combined fuel consumption |
| `Fuel Consumption Comb (mpg)` | Numeric | Combined fuel economy (miles per gallon) |
| `CO2 Emissions(g/km)` | Numeric (Target) | Carbon dioxide emissions |

Technologies and Tools

Python 3.x: Core programming language
scikit-learn: Machine learning models and pipelines
pandas: Data manipulation and analysis
NumPy: Numerical computing
Matplotlib & Seaborn: Data visualization
KaggleHub: Dataset integration
Jupyter Notebook: Interactive development

Methodology

1. Data Exploration and Cleaning

Completeness check: Verified no missing values
Duplicate handling: Duplicate rows were kept intentionally, since they reflect the market distribution
Outlier analysis: Outliers were identified but retained, since they represent real extreme cases
Distribution analysis: Examined target variable and feature relationships

2. Feature Engineering

Three engineered features were created:

engine_per_cylinder: Engine size divided by number of cylinders
- Captures engine design efficiency
- Larger values indicate bigger cylinders → higher emissions
city_hwy_ratio: City fuel consumption / Highway fuel consumption
- Measures urban inefficiency
- Higher values indicate vehicles poorly optimized for city driving
total_consumption: Sum of city and highway consumption
- Aggregate fuel usage metric
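
The three features above can be sketched in pandas as follows. This is a minimal illustration rather than the notebook's exact code: the column names come from the Key Features table, while the helper name `add_engineered_features` and the two sample rows are made up for the example.

```python
import pandas as pd

# Two made-up rows that reuse the column names from the Key Features table.
df = pd.DataFrame({
    "Engine Size(L)": [2.0, 5.0],
    "Cylinders": [4, 8],
    "Fuel Consumption City (L/100 km)": [9.9, 16.0],
    "Fuel Consumption Hwy (L/100 km)": [6.6, 10.0],
})


def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the three engineered columns added."""
    out = frame.copy()
    city = out["Fuel Consumption City (L/100 km)"]
    hwy = out["Fuel Consumption Hwy (L/100 km)"]
    # Displacement per cylinder (liters).
    out["engine_per_cylinder"] = out["Engine Size(L)"] / out["Cylinders"]
    # Values above 1 mean the vehicle burns more fuel in the city than on the highway.
    out["city_hwy_ratio"] = city / hwy
    # Simple sum of the two consumption figures.
    out["total_consumption"] = city + hwy
    return out


features = add_engineered_features(df)
print(features[["engine_per_cylinder", "city_hwy_ratio", "total_consumption"]])
```

Working on a copy keeps the raw columns untouched, so the same helper can be applied to the training, validation, and test frames.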

3. Data Preprocessing Pipeline

Categorical encoding: One-hot encoding for Make, Model, Vehicle Class, Transmission, Fuel Type
Feature scaling: StandardScaler for numeric features (essential for SVR)
Train/Validation/Test split: 60% / 20% / 20%
Pipeline construction: Integrated preprocessing and modeling
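
One way to set up the split and the preprocessing described above is sketched below. It assumes scikit-learn's `ColumnTransformer`, `OneHotEncoder`, and `StandardScaler`; the synthetic data, the reduced column set, `handle_unknown="ignore"`, and `random_state=42` are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with a few of the dataset's column names.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "Vehicle Class": rng.choice(["SUV", "Sedan", "Pickup"], size=n),
    "Fuel Type": rng.choice(["Gasoline", "Diesel"], size=n),
    "Engine Size(L)": rng.uniform(1.0, 6.0, size=n),
    "Cylinders": rng.choice([4, 6, 8], size=n),
})
y = 50 * X["Engine Size(L)"] + rng.normal(0, 5, size=n)

# 60/20/20: hold out 20% for testing, then take 25% of the remaining 80%
# (that is, 20% of the total) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

categorical = ["Vehicle Class", "Fuel Type"]
numeric = ["Engine Size(L)", "Cylinders"]
preprocessor = ColumnTransformer([
    # Categories unseen during fitting are encoded as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Fit on the training split only, then reuse the fitted transformer.
X_train_t = preprocessor.fit_transform(X_train)
X_val_t = preprocessor.transform(X_val)
print(len(X_train), len(X_val), len(X_test))
```

Fitting the encoder and scaler on the training split alone keeps validation and test statistics out of the preprocessing step.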

4. Model Training and Comparison

Four regression algorithms were evaluated:

Linear Regression

Baseline model
Strong for a baseline (validation R² = 0.9906)
Indicates near-linear relationship in transformed feature space

Random Forest Regressor (Best Model)

Ensemble of 100 decision trees
Handles non-linear relationships
Robust to outliers and feature interactions

Support Vector Regressor (SVR) with RBF Kernel

Non-linear mapping via kernel trick
Requires careful feature scaling

Gradient Boosting Regressor

Sequential ensemble learning
Strong performance, slightly behind Random Forest (validation R² = 0.9954 vs. 0.9960)
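
A comparison of the four model families can be organized as a loop over scikit-learn pipelines, as in the hedged sketch below. The data is synthetic, and every hyperparameter other than the 100 trees mentioned above (including `C=100.0` for the SVR and the random seeds) is an assumption, so the printed scores will not match the Results tables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem standing in for the real dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 200 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 2, size=300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR (RBF)": SVR(kernel="rbf", C=100.0),  # C is an assumed value
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

val_r2 = {}
for name, model in models.items():
    # Scaling matters mainly for SVR; it is applied to every model for uniformity.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    val_r2[name] = r2_score(y_val, pipeline.predict(X_val))

best = max(val_r2, key=val_r2.get)
for name, score in val_r2.items():
    print(f"{name}: validation R² = {score:.4f}")
print("Best on validation:", best)
```

Selecting the model on the validation split and touching the test split only once at the end keeps the final test score an honest estimate.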

Results

Model Performance Comparison

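| Model | R² (Train) | RMSE (Train, g/km) | MAE (Train, g/km) | R² (Val) | RMSE (Val, g/km) | MAE (Val, g/km) |
| --- | --- | --- | --- | --- | --- | --- |
| Linear Regression | 0.9966 | 3.43 | 1.83 | 0.9906 | 5.64 | 3.24 |
| Random Forest | 0.9995 | 1.32 | 0.80 | 0.9960 | 3.69 | 1.99 |
| SVR (RBF) | 0.9891 | 6.14 | 1.80 | 0.9901 | 5.82 | 3.33 |
| Gradient Boosting | 0.9975 | 2.92 | 2.29 | 0.9954 | 3.97 | 2.47 |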

Final Test Set Performance

The Random Forest Regressor achieved:

R² (Test): 0.9972
RMSE (Test): 3.03 g/km
MAE (Test): 1.94 g/km
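
For reference, the three reported metrics follow their standard definitions, which can be written out in NumPy as below. The four target and prediction values are made up for illustration and are not drawn from the test set.

```python
import numpy as np

# Made-up true and predicted CO2 values in g/km.
y_true = np.array([200.0, 250.0, 300.0, 150.0])
y_pred = np.array([202.0, 247.0, 301.0, 150.0])

errors = y_pred - y_true

# Mean absolute error: average size of the error, in g/km.
mae = np.mean(np.abs(errors))

# Root mean squared error: also in g/km, but penalizes large errors more.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE = {mae:.2f} g/km, RMSE = {rmse:.2f} g/km, R² = {r2:.4f}")
```

RMSE is never smaller than MAE, so a wide gap between the two (as in the SVR training row above) points to a few large errors rather than uniformly small ones.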

Key Insights

High predictive accuracy: R² > 0.99 on both validation and test data indicates that CO₂ emissions are highly predictable from vehicle specifications
Low prediction error: The test MAE is 1.94 g/km and the test RMSE is 3.03 g/km
Strong generalization: Test performance (R² = 0.9972, MAE = 1.94 g/km) is in line with validation performance (R² = 0.9960, MAE = 1.99 g/km)
Feature importance: Combined fuel consumption is the dominant predictor, followed by engine design metrics

Feature Importance Analysis

The Random Forest model identified the most influential predictors:

Fuel Consumption Comb (L/100 km): Dominant factor (for a given fuel type, CO₂ output is directly proportional to the fuel burned)
engine_per_cylinder: Engine design efficiency metric
city_hwy_ratio: Urban driving inefficiency indicator
Vehicle Class: Body type and weight category
Fuel Type: Different fuels have different carbon content per liter

These findings are consistent with physical and engineering principles, which increases confidence in the model's reliability.
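
A ranking like the one above can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which the target is deliberately driven by the fuel consumption column; the coefficients and the extra `noise` column are assumptions for the example, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on fuel consumption,
# weakly on engine_per_cylinder, and not at all on the noise column.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Fuel Consumption Comb (L/100 km)": rng.uniform(4, 20, n),
    "engine_per_cylinder": rng.uniform(0.3, 0.8, n),
    "noise": rng.normal(size=n),
})
y = (
    23.0 * X["Fuel Consumption Comb (L/100 km)"]
    + 10.0 * X["engine_per_cylinder"]
    + rng.normal(0, 1, n)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

These importances are impurity-based, so they can be spread across correlated columns such as the city, highway, and combined consumption figures. With one-hot encoded inputs, the column names would come from the fitted preprocessor's `get_feature_names_out()` instead of `X.columns`.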

Project Structure

vehicle-co2-emissions-prediction/
├── vehicle_co2_emissions_prediction.ipynb  # Main analysis notebook
├── README.md                               # This file
├── LICENSE                                 # MIT License
├── requirements.txt                        # Python dependencies
└── .gitignore                              # Git ignore rules

How to Run

Prerequisites

Python 3.8+
At least 4 GB of RAM

Installation

Clone the repository:

git clone https://github.com/tahamohmadf19-dev/vehicle-co2-emissions-prediction.git
cd vehicle-co2-emissions-prediction

Install dependencies:
```
pip install -r requirements.txt
```

Run the notebook:

jupyter notebook vehicle_co2_emissions_prediction.ipynb

Running on Google Colab

Upload the notebook to Google Colab
The dataset will be automatically downloaded via KaggleHub
Run all cells sequentially

Real-World Applications

For Consumers

Compare emissions of different vehicles before purchase
Estimate environmental impact of driving habits
Calculate carbon footprint for personal sustainability goals

For Policymakers

Evaluate effectiveness of emission standards
Identify vehicle categories requiring stricter regulations
Design incentive programs for low-emission vehicles

For Manufacturers

Benchmark new designs against market standards
Optimize engine and transmission configurations
Predict regulatory compliance before production

Future Enhancements

Temporal analysis: Incorporate model year trends to track emission improvements over time
Electric vehicle integration: Extend model to handle EVs and plug-in hybrids
Uncertainty quantification: Implement prediction intervals using quantile regression
Interactive web application: Deploy as a user-friendly emissions calculator
API development: Create REST API for integration with automotive databases
Multi-pollutant modeling: Extend to NOx, particulate matter, and other emissions
Causal inference: Use causal models to identify design changes that reduce emissions
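
As a rough illustration of the uncertainty quantification item, prediction intervals could be built from two quantile regressors. The sketch below assumes scikit-learn's `GradientBoostingRegressor` with `loss="quantile"`; the synthetic data and the 5% and 95% quantile levels are arbitrary choices for the example and are not part of the current project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: emissions roughly linear in combined fuel consumption.
rng = np.random.default_rng(0)
X = rng.uniform(4, 20, size=(500, 1))
y = 23.0 * X[:, 0] + rng.normal(0, 5, size=500)

# One model per quantile; together they bound a nominal 90% interval.
lower = GradientBoostingRegressor(
    loss="quantile", alpha=0.05, random_state=0
).fit(X, y)
upper = GradientBoostingRegressor(
    loss="quantile", alpha=0.95, random_state=0
).fit(X, y)

X_new = np.array([[8.0], [12.0]])
lo_pred = lower.predict(X_new)
hi_pred = upper.predict(X_new)

# Share of training targets that fall inside the fitted interval.
coverage = np.mean((y >= lower.predict(X)) & (y <= upper.predict(X)))

for x, a, b in zip(X_new[:, 0], lo_pred, hi_pred):
    print(f"{x:.1f} L/100 km -> [{a:.1f}, {b:.1f}] g/km")
print(f"Training coverage: {coverage:.2f}")
```

Coverage measured on training data tends to be optimistic, so a real implementation would check it on held-out data.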

Educational Value

This project demonstrates:

End-to-end regression modeling workflow
Feature engineering for domain-specific problems
Model selection and hyperparameter tuning
Interpretability and validation of ML models
Application of machine learning to environmental challenges
Best practices in data science project documentation

Limitations

Geographic scope: Data represents Canadian market; may not generalize to other regions
Temporal scope: Does not account for technological evolution over time
Correlation vs causation: Model captures empirical patterns, not causal mechanisms
Real-world driving: Test cycle data may differ from actual on-road emissions

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

Transport Canada Vehicle Emissions Database
Kaggle Dataset: CO2 Emission by Vehicles
scikit-learn Documentation
IPCC Guidelines for Greenhouse Gas Inventories

Contact

Mohammad Taha
📧 Email: tahamohmadf19@gmail.com
🔗 LinkedIn: Mohmad Taha Alhmad
🐙 GitHub: @tahamohmadf19-dev

This project contributes to environmental sustainability by making vehicle emissions transparent and predictable through data science.
