We will deal with predicting life expectancy in different countries and years.
You have access to training data in the file data.csv and evaluation data in the file evaluation.csv.
- Year: The year.
- Status: Status of the country (developed or developing).
- Life expectancy: Life expectancy in years; the target variable to be predicted.
- Adult Mortality: Adult mortality rate regardless of gender (probability that individuals who have reached age 15 will die before age 60, per 1,000 individuals).
- infant deaths: Number of infant deaths per 1,000 population.
- Alcohol: Recorded alcohol consumption per capita (age 15+) in liters of pure alcohol.
- percentage expenditure: Health expenditure as a percentage of GDP per capita (%).
- Hepatitis B: Coverage of vaccination against hepatitis B (HepB) among 1-year-olds (%).
- Measles: Number of reported measles cases per 1,000 population.
- BMI: Average Body Mass Index (BMI) of the entire population.
- under-five deaths: Number of deaths of children under five years of age per 1,000 population.
- Polio: Coverage of vaccination against polio (Pol3) among 1-year-olds (%).
- Total expenditure: Government expenditure on healthcare as a percentage of total government expenditure (%).
- Diphtheria: Coverage of vaccination against diphtheria, tetanus, and pertussis (DTP3) among 1-year-olds (%).
- HIV/AIDS: Number of deaths per 1,000 live births caused by HIV/AIDS (ages 0-4).
- GDP: Gross Domestic Product per capita (in USD).
- Population: Population of the country.
- thinness 1-19 years: Percentage of children aged 10-19 with a BMI more than 2 standard deviations below the median (%).
- thinness 5-9 years: Percentage of children aged 5-9 with a BMI more than 2 standard deviations below the median (%).
- Income composition of resources: Human Development Index based on income composition of resources (index range: 0 to 1).
- Schooling: Average number of years of schooling.
Assignment tasks for which you can earn 25 points:
- Load the data from the file data.csv in a notebook. Split the data into subsets suitable for training, model comparison (validation), and final model performance evaluation (test).
- Perform basic data preprocessing:
  - Review each feature and transform it into a suitable format for use in the selected regression model.
  - Handle missing values appropriately (even a trivial solution is acceptable). Beware of methodological errors!
  - Use visualizations as necessary. Provide concise yet adequate commentary.
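One possible way to approach the split and the missing values is sketched below. It is a minimal illustration, not a prescribed solution: the 60/20/20 proportions, the median imputation, the helper name `split_and_impute`, and the toy data frame are all assumptions made for the example. The point it tries to show is that imputation statistics come from the training subset only, so no information leaks from the validation or test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = "Life expectancy"


def split_and_impute(df, seed=42):
    """Split into 60/20/20 train/validation/test, then fill gaps with training medians.

    Hypothetical helper; assumes all feature columns are already numeric.
    """
    X = df.drop(columns=[TARGET])
    y = df[TARGET]
    # Carve off the test set first, then split the remainder into train and validation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed
    )
    # Medians are computed on the training subset only and reused for the other subsets.
    medians = X_train.median(numeric_only=True)
    X_train, X_val, X_test = (
        part.fillna(medians) for part in (X_train, X_val, X_test)
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Toy stand-in for data.csv (made-up values, every fifth GDP entry missing).
toy = pd.DataFrame({
    "Year": [2000 + i for i in range(20)],
    "GDP": [np.nan if i % 5 == 0 else 1000.0 + i for i in range(20)],
    TARGET: [60.0 + 0.5 * i for i in range(20)],
})
train, val, test = split_and_impute(toy)
```

Rows of the same country in neighbouring years are likely to be similar, so whether a purely row-wise random split is appropriate here is a judgment call worth commenting on in the notebook.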
- Implement your own random forest regression model. Use the pre-prepared skeleton below.
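For orientation only, the general idea of a random forest for regression can be sketched as follows. This is not the course skeleton, and its interface may differ from it. The class name `SimpleRandomForestRegressor`, its parameters, and the choice of scikit-learn's `DecisionTreeRegressor` as the base learner are assumptions made for this illustration. Each tree is trained on a bootstrap sample, and the forest prediction is the average of the tree predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class SimpleRandomForestRegressor:
    """Bagged regression trees with optional feature subsampling (illustrative sketch)."""

    def __init__(self, n_estimators=20, max_depth=None, max_features=1.0, random_state=0):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features  # fraction of features considered at each split
        self.random_state = random_state

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        rng = np.random.default_rng(self.random_state)
        self.trees_ = []
        for _ in range(self.n_estimators):
            # Bootstrap sample: draw len(X) row indices with replacement.
            idx = rng.integers(0, len(X), size=len(X))
            tree = DecisionTreeRegressor(
                max_depth=self.max_depth,
                max_features=self.max_features,
                random_state=int(rng.integers(0, 2**31 - 1)),
            )
            tree.fit(X[idx], y[idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # The forest prediction is the mean of the individual tree predictions.
        return np.mean([tree.predict(X) for tree in self.trees_], axis=0)


# Toy usage on a noiseless linear relationship (made-up data).
X_toy = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0
forest = SimpleRandomForestRegressor(n_estimators=10).fit(X_toy, y_toy)
preds = forest.predict(X_toy)
```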
- Apply your custom random forest model, as well as one of the following: linear regression or ridge regression, and at least one other model of your choice to the prepared data. For each model:
  - Comment on the suitability of the model for this type of problem.
  - Experiment with normalization (standardization/min-max scaling) if you expect it to benefit the model.
  - Select the key hyperparameters to tune and find their optimal values (based on RMSE).
  - For the model with the best hyperparameters on the validation set, calculate its error using RMSE and MAE.
  - Provide adequate commentary on the obtained results.
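The two error measures and the selection of a hyperparameter value by validation RMSE can be sketched in plain Python as shown below. The helper names (`rmse`, `mae`, `select_by_validation_rmse`) and the toy "model" that predicts a constant are assumptions for illustration. In the notebook, `predict_val` would stand for fitting a model with the given hyperparameter value on the training subset and predicting on the validation subset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean(|y - y_hat|)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def select_by_validation_rmse(candidates, predict_val, y_val):
    """Return the candidate with the lowest validation RMSE and all scores."""
    scores = {c: rmse(y_val, predict_val(c)) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Toy example: the "hyperparameter" is simply the constant the model predicts.
y_val = [68.0, 70.0, 74.0]
best, scores = select_by_validation_rmse(
    candidates=[60.0, 70.0, 80.0],
    predict_val=lambda c: [c] * len(y_val),  # hypothetical stand-in for fit + predict
    y_val=y_val,
)
best_rmse = scores[best]
best_mae = mae(y_val, [best] * len(y_val))
```

RMSE penalizes large errors more strongly than MAE, which is one reason the two values can rank models differently.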
- From all the tested options in the previous step, select the final model and estimate its expected error (RMSE) on new, unseen data. Avoid methodological errors!
- Finally, load the evaluation data from the file evaluation.csv. Use the final model to make predictions for this data. Create a file results.csv, where you save the predictions in the column Life expectancy and identify the individual records using the columns Country and Year (maintain the column names!). Submit this file (alongside the notebook in the repository).
- Example of how the first few rows of the results.csv file should look (values for Life expectancy will vary):
Country,Year,Life expectancy
Peru,2012,71.4
Peru,2013,72.6
...
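The file layout shown above could be produced, for instance, with the standard-library `csv` module, as in the following sketch. The helper name `write_results` and the hard-coded prediction values are assumptions for the example. In a notebook, pandas' `DataFrame.to_csv` with `index=False` is a common alternative.

```python
import csv
import io


def write_results(rows, out):
    """Write (country, year, prediction) tuples with the required column names."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Country", "Year", "Life expectancy"])
    for country, year, prediction in rows:
        writer.writerow([country, year, prediction])


# Made-up predictions, written to an in-memory buffer instead of results.csv.
buffer = io.StringIO()
write_results([("Peru", 2012, 71.4), ("Peru", 2013, 72.6)], buffer)
```

To create the actual file, the same function can be called with `open("results.csv", "w", newline="")` as the output.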
- Follow the guidelines on the page https://courses.fit.cvut.cz/BI-ML1/homeworks/index.html.