This repository implements a full machine learning pipeline for predicting overqualification (underemployment) in recruitment using the NGS (National Graduate Survey) structured hiring dataset. It was developed in the context of the ML Hackathon hosted by the SFU Data Science Student Society, where teams worked with real-world datasets to design, train, and evaluate predictive models in a competitive setting.
The pipeline uses CatBoost as the primary model, with a focus on predictive performance (accuracy on Public/Private leaderboards) and interpretability (feature importance and optional SHAP).
The goals of this project are to:
- Build a robust model that accurately estimates overqualification probability based on candidate attributes: education level, years of experience, skill composition, prior roles, and demographics.
- Work with the NGS dataset and understand its feature structure (survey codes, missing-value conventions, mixed-type columns).
- Train and tune a CatBoost-based machine learning model with validation feedback and leaderboard-oriented iteration.
- Focus on both predictive performance and interpretability: accuracy on hold-out test sets and feature importance / SHAP-style explanations.
The solution achieved 0.75174 accuracy on the Public leaderboard and 0.70511 on the Private leaderboard, placing it very close to the top-performing teams and demonstrating strong generalization on unseen data.
Exploratory analysis: correlation and feature relationships in the NGS hiring dataset
- Modular ML pipeline (`src/` folder): clean separation of data loading, preprocessing, feature engineering, model training, evaluation, and prediction.
- NGS-aware preprocessing: handling of special codes (6, 9, 99) and normalization of mixed-type columns (e.g. `GENDER2`, `DDIS_FL`, `VISBMINP`); sketched after this list.
- CatBoost classifier with native categorical support, early stopping, and configurable hyperparameters (`depth`, `learning_rate`, `l2_leaf_reg`); see the training sketch below.
- Stratified K-fold cross-validation and optional grid search for hyperparameter tuning.
- Interpretability: CatBoost feature importance and optional SHAP integration for model explanation (see the sketch below).
- Reproducible workflow: `python3 -m src.train` and `python3 -m src.predict` for end-to-end training and submission generation.
- Five structured Jupyter notebooks documenting exploration, preprocessing, training/tuning, evaluation/interpretability, and the full pipeline demo.
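The special-code handling and mixed-type normalization described above can be illustrated with a minimal sketch. The real per-column rules live in `src/preprocess.py`; treating every column uniformly, as below, is a simplification:

```python
import numpy as np
import pandas as pd

# Illustrative NGS "valid skip / not stated" codes; the actual per-column
# rules live in src/preprocess.py.
SPECIAL_CODES = [6, 9, 99]

def clean_ngs(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize NGS columns to a single string dtype CatBoost can consume."""
    df = df.copy()
    # Treat special codes as missing whether they arrive as ints or strings.
    df = df.replace(SPECIAL_CODES + [str(c) for c in SPECIAL_CODES], np.nan)
    for col in df.columns:
        # Mixed-type columns (e.g. GENDER2, DDIS_FL, VISBMINP) can mix ints,
        # floats, and strings; cast to pandas' NA-aware string dtype.
        df[col] = df[col].astype("string").str.strip()
    # CatBoost rejects NaN in categorical features, so fill with a sentinel.
    return df.fillna("missing").astype(str)
```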
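Training then follows the standard CatBoost pattern: native categorical features, early stopping against a validation split, and a stratified K-fold accuracy estimate. A sketch under the assumptions above (hyperparameter values are placeholders, not the tuned ones; the real pipeline is `python3 -m src.train`):

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split

df = pd.read_csv("data/raw/train.csv")
y = df["overqualified"]
X = clean_ngs(df.drop(columns=["id", "overqualified"]))
cat_features = list(X.columns)  # every column treated as categorical here

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = CatBoostClassifier(
    depth=6, learning_rate=0.05, l2_leaf_reg=3.0,  # placeholder values
    iterations=2000,
    eval_metric="Accuracy",
    early_stopping_rounds=100,  # stop once validation accuracy plateaus
    random_seed=42,
    verbose=200,
)
model.fit(
    Pool(X_tr, y_tr, cat_features=cat_features),
    eval_set=Pool(X_val, y_val, cat_features=cat_features),
)

# Stratified 5-fold estimate of generalization accuracy.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for tr_idx, va_idx in skf.split(X, y):
    fold = clone(model)  # fresh, unfitted copy with the same hyperparameters
    fold.fit(
        Pool(X.iloc[tr_idx], y.iloc[tr_idx], cat_features=cat_features),
        eval_set=Pool(X.iloc[va_idx], y.iloc[va_idx], cat_features=cat_features),
        verbose=False,
    )
    scores.append(accuracy_score(y.iloc[va_idx], fold.predict(X.iloc[va_idx])))
print(f"5-fold CV accuracy: {sum(scores) / len(scores):.4f}")
```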
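For interpretability, feature importances come straight off the fitted model, and the optional SHAP route goes through `shap.TreeExplainer`, which supports CatBoost; with categorical features, SHAP must be fed a `Pool`. Continuing the sketch above:

```python
import shap  # optional dependency

# Built-in CatBoost importances (PredictionValuesChange by default).
print(model.get_feature_importance(prettified=True).head(10))

# SHAP values for the validation split; a Pool is required because the
# model was trained with categorical features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(Pool(X_val, y_val, cat_features=cat_features))
shap.summary_plot(shap_values, X_val)
```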
```
graduate-underemployment-prediction/
│
├── data/
│   ├── processed/                # Processed/cached data (optional); not in Git
│   └── raw/
│       ├── train.csv             # Training set (id, features, overqualified)
│       └── test.csv              # Test set (id, features; no target)
│
├── models/                       # Saved model artifacts (model.cbm, artifacts.pkl); not in Git
│
├── notebooks/
│   ├── 01_exploration.ipynb                        # EDA, NGS feature structure, target and correlations
│   ├── 02_preprocessing_feature_engineering.ipynb  # Cleaning and categorical encoding
│   ├── 03_catboost_training_tuning.ipynb           # Training, CV, hyperparameter tuning
│   ├── 04_evaluation_interpretability.ipynb        # Metrics, feature importance, SHAP
│   └── 05_pipeline_demo.ipynb                      # End-to-end pipeline demonstration
│
├── submissions/                  # Generated submission CSVs (id, overqualified)
│   ├── public_leaderboards.png   # Public leaderboard screenshot
│   └── submission.csv            # Default output from python3 -m src.predict
│
├── src/
│   ├── __init__.py
│   ├── config.py                 # Paths, target/id columns, validation settings
│   ├── data.py                   # Load train/test, split X/y, train/val split
│   ├── evaluate.py               # Stratified K-fold CV and accuracy
│   ├── features.py               # Categorical feature preparation for CatBoost
│   ├── hyperparameter_tuning.py  # Grid search for CatBoost params
│   ├── model.py                  # CatBoost classifier builder
│   ├── preprocess.py             # NGS cleaning and categorical normalization
│   ├── predict.py                # Load model, predict on test, write submission
│   └── train.py                  # End-to-end training pipeline
│
├── .gitignore                    # Git ignore rules (venv, models/*, cache, etc.)
├── LICENSE                       # MIT license
├── README.md                     # Project overview and usage
├── report.md                     # Detailed technical write-up
└── requirements.txt              # Python dependencies
```
🗒️ Note: The `data/raw/` directory should contain `train.csv` and `test.csv`. The `models/` directory is where the trained CatBoost model and artifacts are saved after running `python3 -m src.train`. Since `models/` is not tracked in Git (it is in `.gitignore`), you need to run the training pipeline locally to generate the model. Processed data is not stored on disk; all transformations are applied in memory during training and prediction.
You can run this project on your machine using Python 3.11+ and a virtual environment.
HTTPS (recommended for most users):

```bash
git clone https://github.com/florykhan/graduate-underemployment-prediction.git
cd graduate-underemployment-prediction
```

SSH (for users who have SSH keys configured):

```bash
git clone git@github.com:florykhan/graduate-underemployment-prediction.git
cd graduate-underemployment-prediction
```

Create and activate a virtual environment, then install dependencies:

```bash
python3 -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
pip install -r requirements.txt
```

Place the NGS hackathon data files in `data/raw/`:

```
data/raw/train.csv   # Training set (must include column: overqualified)
data/raw/test.csv    # Test set (same features, no target)
```
📥 Dataset: The NGS structured hiring dataset was provided as part of the SFU Data Science Student Society ML Hackathon. Ensure `train.csv` has an `id` column and an `overqualified` (0/1) target column; `test.csv` should have the same feature columns and `id`.
This step trains the CatBoost model, runs validation (and optional CV), and saves the model and artifacts.
```bash
python3 -m src.train
```

Then generate predictions and the submission file:

```bash
python3 -m src.predict
```

This writes `submissions/submission.csv` with columns `id` and `overqualified` (0/1 predictions).
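Under the hood, `src/predict.py` boils down to loading the saved model, re-applying the training-time preprocessing, and writing the CSV. A simplified sketch (it omits the `artifacts.pkl` handling; `clean_ngs` from the earlier sketch stands in for the real preprocessing):

```python
import pandas as pd
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.load_model("models/model.cbm")  # produced by python3 -m src.train

test = pd.read_csv("data/raw/test.csv")
preds = model.predict(clean_ngs(test.drop(columns=["id"]))).astype(int)

pd.DataFrame({"id": test["id"], "overqualified": preds}).to_csv(
    "submissions/submission.csv", index=False
)
```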
Launch Jupyter from the project root and open the notebooks (so that `notebooks/` is the working directory for paths):

```bash
jupyter notebook
```

Recommended order:

1. `notebooks/01_exploration.ipynb` – data exploration and NGS feature structure
2. `notebooks/02_preprocessing_feature_engineering.ipynb` – cleaning and categorical encoding
3. `notebooks/03_catboost_training_tuning.ipynb` – CatBoost training, CV, and tuning
4. `notebooks/04_evaluation_interpretability.ipynb` – metrics, feature importance, SHAP
5. `notebooks/05_pipeline_demo.ipynb` – end-to-end pipeline demo
Tip: If you run notebooks from inside `notebooks/`, the code uses `sys.path.insert(0, str(Path().resolve().parent))` so that `src` can be imported correctly.
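Concretely, the first notebook cell looks something like this (the imported module names are taken from the `src/` tree above):

```python
import sys
from pathlib import Path

# Running from notebooks/: add the project root to the import path
# so the src/ package resolves correctly.
sys.path.insert(0, str(Path().resolve().parent))

from src import config, preprocess
```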
The hackathon had 14 teams in total. Public leaderboard snapshot:
| Metric | Value |
|---|---|
| Public leaderboard accuracy | 0.75174 (best: 0.76623) |
| Private leaderboard accuracy | 0.70511 (best: 0.71304) |
The tuned CatBoost model placed the solution very close to the top-performing teams and demonstrated strong generalization on the private hold-out set. Validation and cross-validation accuracy (roughly 0.67–0.75, depending on split and hyperparameters) guided development; the leaderboard metrics above reflect the official hackathon evaluation.
➡️ For methodology, preprocessing details, model choices, and full discussion, see `report.md`.
The complete technical write-up, including pipeline design, preprocessing and feature engineering, CatBoost training and tuning, validation strategy, and interpretability, is in report.md. This document is intended for reviewers who want the full methodology behind the pipeline and results.
- Expand hyperparameter search: use RandomizedSearchCV or Optuna over a larger CatBoost parameter space (see the sketch after this list).
- Feature engineering: additional derived features (e.g. education–occupation match indicators) if metadata is available.
- Ensembles: combine CatBoost with other classifiers (e.g. XGBoost, LightGBM) for potential accuracy gains.
- Experiment tracking: integrate MLflow or Weights & Biases to log metrics and compare runs.
- Production readiness: API (FastAPI/Flask), Docker, or CI/CD for training and deployment.
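As a concrete sketch of the first item: an Optuna study over the same CatBoost knobs tuned by the existing grid search, with illustrative search ranges and `X`, `y`, `cat_features` as in the training sketch above:

```python
import optuna
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

def objective(trial: optuna.Trial) -> float:
    # Search ranges are illustrative, not the ones used in the hackathon.
    params = {
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
        "iterations": 500,
        "verbose": False,
    }
    model = CatBoostClassifier(cat_features=cat_features, **params)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best params:", study.best_params, "CV accuracy:", study.best_value)
```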
- Language: Python 3.11+
- Core libraries: pandas, numpy, scikit-learn, CatBoost, matplotlib, seaborn
- Pipeline: Modular `src/` package with config, data loading, preprocessing, feature engineering, model, evaluation, tuning, train, and predict
- Environment: Jupyter Notebook / VS Code; Git
MIT License: feel free to use and modify with attribution. See the LICENSE file for full details.
Ilian Khankhalaev
BSc Computing Science, Simon Fraser University
📍 Vancouver, BC | florykhan@gmail.com | GitHub | LinkedIn
Nikolay Deinego
BSc Computing Science, Simon Fraser University
📍 Vancouver, BC | GitHub | LinkedIn
Arina Veprikova
BSc Data Science, Simon Fraser University
📍 Vancouver, BC | GitHub | LinkedIn
Anna Cherkashina
BSc Data Science, Simon Fraser University
📍 Vancouver, BC | GitHub | LinkedIn
