bursasha/sklearn-pandas-matplotlib-lifespan-analysis

📈 2. Regression

📂 Data Source

The task is to predict life expectancy for different countries and years.

You have access to training data in the file data.csv and evaluation data in the file evaluation.csv.
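The assignment asks for the training data to be split into training, validation, and test subsets. A minimal sketch of one way to do that with pandas and scikit-learn follows; the 60/20/20 proportions, the random seed, and the helper name `split_train_val_test` are assumptions, not requirements of the assignment. A small synthetic frame stands in for data.csv so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(df, val_size=0.2, test_size=0.2, seed=42):
    """Split df into train/validation/test parts (hypothetical helper).

    val_size and test_size are fractions of the full data set; the
    defaults are illustrative assumptions.
    """
    # Carve off the test set first, then split the remainder.
    rest, test = train_test_split(df, test_size=test_size, random_state=seed)
    # Rescale so that val_size stays a fraction of the full data set.
    val_fraction = val_size / (1.0 - test_size)
    train, val = train_test_split(rest, test_size=val_fraction, random_state=seed)
    return train, val, test


# With the real file this would be: df = pd.read_csv("data.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Year": rng.integers(2000, 2016, size=100),
    "Life expectancy": rng.normal(70, 8, size=100),
})

train, val, test = split_train_val_test(df)
print(len(train), len(val), len(test))
```

The test subset is set aside once and used only for the final error estimate; hyperparameters are compared on the validation subset.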

📝 Feature List

  • Year - The year.
  • Status - Status of the country: developed or developing.
  • Life expectancy - Life expectancy in years - the target variable to be predicted.
  • Adult Mortality - Adult mortality rate regardless of gender (probability that individuals who have reached age 15 will die before age 60, per 1,000 individuals).
  • infant deaths - Number of infant deaths per 1,000 population.
  • Alcohol - Recorded alcohol consumption per capita (age 15+) in liters of pure alcohol.
  • percentage expenditure - Health expenditure as a percentage of GDP per capita (%).
  • Hepatitis B - Coverage of vaccination against hepatitis B (HepB) among 1-year-olds (%).
  • Measles - Number of reported measles cases per 1,000 population.
  • BMI - Average Body Mass Index (BMI) of the entire population.
  • under-five deaths - Number of deaths of children under five years of age per 1,000 population.
  • Polio - Coverage of vaccination against polio (Pol3) among 1-year-olds (%).
  • Total expenditure - Government expenditure on healthcare as a percentage of total government expenditure (%).
  • Diphtheria - Coverage of vaccination against diphtheria, tetanus, and pertussis (DTP3) among 1-year-olds (%).
  • HIV/AIDS - Number of deaths per 1,000 live births caused by HIV/AIDS (ages 0-4).
  • GDP - Gross Domestic Product per capita (in USD).
  • Population - Population of the country.
  • thinness 1-19 years - Percentage of children aged 10-19 with a BMI less than 2 standard deviations below the median (%).
  • thinness 5-9 years - Percentage of children aged 5-9 with a BMI less than 2 standard deviations below the median (%).
  • Income composition of resources - Human Development Index based on income composition of resources (index range: 0 to 1).
  • Schooling - Average number of years of schooling.
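As a rough illustration of how features like these might be prepared, the sketch below encodes Status as a 0/1 indicator and fills missing numeric values with medians computed on the training subset only, so that no information from validation or test data leaks into preprocessing. The helper names `fit_medians` and `preprocess`, the exact Status strings, and the choice of median imputation are assumptions; other strategies are equally acceptable.

```python
import numpy as np
import pandas as pd


def fit_medians(train, columns):
    """Compute per-column medians on the training data only."""
    return train[columns].median()


def preprocess(df, medians):
    """Encode Status as 0/1 and fill gaps with medians learned on train."""
    out = df.copy()
    # Assumption: Status holds the strings "Developed" / "Developing".
    out["Status"] = (out["Status"].str.lower() == "developed").astype(int)
    cols = list(medians.index)
    # fillna with a Series fills each column with its own training median.
    out[cols] = out[cols].fillna(medians)
    return out


# Tiny synthetic frames standing in for the real train/validation subsets.
train = pd.DataFrame({
    "Status": ["Developed", "Developing", "Developing"],
    "Alcohol": [1.0, np.nan, 3.0],
    "GDP": [100.0, 200.0, np.nan],
})
val = pd.DataFrame({
    "Status": ["Developing"],
    "Alcohol": [np.nan],
    "GDP": [np.nan],
})

medians = fit_medians(train, ["Alcohol", "GDP"])
train_p = preprocess(train, medians)
val_p = preprocess(val, medians)
print(val_p)
```

The same `medians` object is reused for validation, test, and evaluation data, which is the point of fitting it separately from applying it.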

🤔 Assignment Instructions

Assignment tasks for which you can earn 25 points:

  • Load the data from the file data.csv in a notebook. Split the data into subsets suitable for training, model comparison (validation), and final model performance evaluation (test).

  • Perform basic data preprocessing:

    • Review each feature and transform it into a suitable format for use in the selected regression model.
    • Handle missing values appropriately (even a trivial solution is acceptable). Beware of methodological errors!
    • Use visualizations as necessary. Provide concise yet adequate commentary.
  • Implement your own random forest regression model. Use the pre-prepared skeleton below.

  • Apply your custom random forest model, as well as one of the following: linear regression or ridge regression, and at least one other model of your choice to the prepared data. For each model:

    • Comment on the suitability of the model for this type of problem.
    • Experiment with normalization (standardization/min-max scaling) if you expect it to benefit the model.
    • Select the key hyperparameters to tune and find their optimal values (based on RMSE).
    • For the model with the best hyperparameters on the validation set, calculate its error using RMSE and MAE.
    • Provide adequate commentary on the obtained results.
  • From all the tested options in the previous step, select the final model and estimate its expected error (RMSE) on new, unseen data. Avoid methodological errors!

  • Finally, load the evaluation data from the file evaluation.csv. Use the final model to make predictions for this data. Create a file results.csv, where you save the predictions in the column Life expectancy and identify the individual records using the columns Country and Year (maintain the column names!). Submit this file (alongside the notebook in the repository).

  • Example of how the first few rows of the results.csv file should look (values for Life expectancy will vary):

Country,Year,Life expectancy
Peru,2012,71.4
Peru,2013,72.6
...
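A minimal sketch of producing a file in this format with pandas is shown below. The helper name `build_results` and the placeholder predictions are assumptions; in the notebook the predictions would come from the final model applied to the preprocessed evaluation data. The example writes to an in-memory buffer so it runs without any files.

```python
import io

import pandas as pd


def build_results(evaluation, predictions):
    """Combine the identifying columns with the predicted target."""
    results = evaluation[["Country", "Year"]].copy()
    results["Life expectancy"] = predictions
    return results


# Stand-in for pd.read_csv("evaluation.csv"); values are illustrative.
evaluation = pd.DataFrame({
    "Country": ["Peru", "Peru"],
    "Year": [2012, 2013],
    "GDP": [1.0, 2.0],
})
# Placeholder values standing in for the final model's predictions.
predictions = [71.4, 72.6]

results = build_results(evaluation, predictions)

# With a real file this would be: results.to_csv("results.csv", index=False)
buffer = io.StringIO()
results.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the pandas row index out of the file, so the header contains only the three required column names.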

✍️ Submission Notes

About

Regression analysis project predicting life expectancy with data preprocessing, feature engineering, and Ridge Regression + Random Forest + AdaBoost models 📈
