Predict CO2 Emission

The construction industry has a significant impact on the environment, contributing substantially to global CO2 emissions. Understanding the factors behind these emissions is crucial for formulating effective reduction strategies. Besides daily energy consumption, the design, construction, and demolition of buildings also play a role in CO2 emissions.

OBJECTIVE The objective of this study is to develop a robust linear model, essential for its prospective use as a predictive tool on datasets not yet examined. The model is specifically aimed at discerning and quantifying the effect of multiple variables on the quantities of CO2 emitted by a residential building.

Exploratory Analysis

Dataset Description

The dataset used in this study was obtained from the OpenData Portal of the Lombardy Region, specifically from the CENED Database, which focuses on the Energy Certification of Buildings. This data archive, comprising Energy Performance Certificates, consists of 1.52 million observations organized into 40 variables. Due to the large volume of the dataset, systematic sampling was performed, reducing the number of observations to 202,019 units. The investigation was subsequently focused exclusively on buildings with continuous residential use (variable DESTINAZIONE_DI_USO = E.1(1) as per DPR 412/1993), thus excluding those for public use, which led to a further reduction in observations, stabilizing at a total of 16,858 units.

Target Variable

The selected target variable represents CO2 emissions, quantified annually. Specifically, this variable is measured in KgCO2eq/m2year, providing an indicator of greenhouse gas emissions per unit of surface area. This metric facilitates the comparison between buildings of different sizes. From Figure 1, it is observed that emissions are concentrated between 0 and 100 KgCO2eq/m2year.

Figure 1: Target Variable Distribution

Selection of Variables

During the model development phase, the entire range of 40 available variables was not utilized; instead, the selection was limited to 14 variables deemed most relevant, with their descriptions detailed in the Annex section. A graphical analysis of these variables in relation to the dependent variable (Figures 2 and 3) revealed anomalies suggesting the presence of input errors, likely due to the excessively broad scale. These observations will be eliminated in the next steps.

Figure 2: Distribution of Covariates vs. CO2 Emission Variable

Figure 3: Distribution of Covariates vs. CO2 Emission Variable

Missing Data

Before proceeding with the actual analysis, a phase is required where the dataset needs to be cleaned by selecting only the observations that can effectively be used in the analysis. From the examination of Figure 4, it is observed that the variables MOTIVAZIONE_APE and SUPERFICIE_VETRATA_OPACA have few missing values, 2 and 4 respectively, suggesting a possible random absence of data; therefore, these observations will be excluded.

Figure 4: Missing Data

Elimination of Problematic Observations

Initial graphical analyses revealed that the dataset contains some anomalous values. These are presumably due to errors during the data entry process by users. Tables 1 and 2 list the instances where such irregularities occur; specifically, for the variable EFER, which measures the energy contribution of renewable energy systems, negative values were detected, which are not plausible. Similarly, the CO2 variable shows incorrect orders of magnitude.

Optimal Grouping and Transformations of X

After addressing the missing values, before proceeding with the analysis, it is crucial to make adjustments to certain categorical variables, as they have an excessive number of levels. The variables in question are CLASSE_ENERGETICA, PROVINCIA, and MOTIVAZIONE_APE, as shown in Figure 5.

Figure 5: Covariates before grouping

The variable PROVINCIA, shown in the central graph of Figure 5, consists of 12 levels. Therefore, it was decided to categorize it into 3 distinct groups based on population density, as this can better reflect the intensity of energy use and the related infrastructure needs. The new partition is shown in Figure 6.

Figure 6: Provinces by population density

The variable CLASSE_ENERGETICA, initially divided into 8 levels as shown in the right graph of Figure 5, was reorganized into 3 distinct groups as shown in Table 3.

Group	Description	Count
1	High Class	332
2	Medium Class	3,678
3	Low Class	12,834

Table 3: Energy Class

The variable MOTIVAZIONE_APE had 14 levels, so the optimal grouping methodology was adopted to aggregate similar categories based on their similarity in the mean with respect to the target variable, reducing unnecessary levels. Following this process, 6 clusters were identified, visible in the dendrogram shown in Figure 7 and detailed in the Table.

Figure 7: Optimal Grouping

Group	Description	Count
1	New constructions	1,017
2	Major structural interventions	487
3	Energy renovation	950
4	Other	2,219
5	Transfer for consideration	6,976
6	Lease agreement	5,195

Table 4: Reasons for Opening the APE Procedure

Figure 8 shows that the optimal grouping was effectively carried out and that there is a noticeable difference between the groups with respect to the dependent variable.

Figure 8: Covariates after grouping

After addressing the missing values, correcting typographical errors, and recoding the categorical variables to have a manageable number of levels, the pre-processing phase of the dataset is concluded. This conclusion allows for the first estimation of the model to proceed. Specifically, Figure 9 clearly illustrates the distribution of continuous covariates in relation to the target, offering a more intelligible representation compared to the one presented in the previous section. It is observed that some non-linear relationships exist between the covariates and the dependent variable, which could present challenges related to the assumption of linearity within the model.

Figure 9: Covariates after pre-processing

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
Co2.RData		Co2.RData
Codice Co2.R		Codice Co2.R
Emissioni di CO2.pdf		Emissioni di CO2.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Predict CO2 Emission

Table of Contents

Exploratory Analysis

Dataset Description

Target Variable

Selection of Variables

Missing Data

Elimination of Problematic Observations

Optimal Grouping and Transformations of X

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Predict CO2 Emission

Table of Contents

Exploratory Analysis

Dataset Description

Target Variable

Selection of Variables

Missing Data

Elimination of Problematic Observations

Optimal Grouping and Transformations of X

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages