Employee turnover is a multi-billion dollar problem. This project provides HR teams with an Intelligence Report that predicts "Flight Risk" with ~70% accuracy, allowing for proactive intervention. By automating the data extraction from resumes and predicting attrition, this tool directly aims to reduce turnover costs and improve organizational stability.
➤ Dataset: Human Resources Data Set by Dr. Carla Patalano and Dr. Rich Huebner
- GenAI Resume Parsing: Uses Google Gemini Pro to extract complex employee metrics from uploaded PDFs, bypassing manual data entry.
- Predictive Risk Modeling: Implements a cost-sensitive XGBoost classifier to identify high-risk employees.
- Dynamic HR Dashboard: A sleek, dark-themed UI featuring Risk Gauges, Intelligence Reports, and AI-generated Strategy Recommendations.
- Automated Intelligence Reports: Generates detailed PDF and CSV reports, enabling HR leads to move from raw data to board-ready presentations instantly.
- Enterprise-Ready Data Flow: Secure handling of employee records using MongoDB Atlas, ensuring data persistence and historical trend tracking (MDE).
| Category | Technology | Implementation |
|---|---|---|
| Frontend | React (Vite), Tailwind | Responsive SPA with a dark-themed HR workspace. |
| Frontend | Fetch API | Utilizing native browser APIs for asynchronous data fetching and promise-based HTTP requests. |
| Backend | FastAPI | Building a high-performance, asynchronous REST API with automatic OpenAPI documentation. |
| Backend | Pydantic | Enforcing strict data validation and type-safe schemas for incoming employee data. |
| Parsing | PyMuPDF (fitz) | Low-level PDF binary stream extraction before GenAI processing. |
| Security | CORS Middleware | Orchestrated Cross-Origin Resource Sharing for secure Vercel-to-Render communication. |
| Database | Motor (Async MongoDB) | Non-blocking database drivers to ensure high-concurrency performance. |
| AI Engine | Google Gemini Pro | Leveraging Large Language Models (LLMs) for intelligent data extraction from PDF resumes. |
| ML Model | XGBoost | Deploying Gradient Boosted Decision Trees with cost-sensitive weights for risk classification. |
| Deployment | Vercel, Render | Distributed cloud hosting with automated CI/CD. |
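The validation layer named in the table above can be sketched with Pydantic alone. The field names and ranges here are hypothetical, since the real schema depends on the HR dataset:

```python
from pydantic import BaseModel, Field, ValidationError

class EmployeeMetrics(BaseModel):
    """Hypothetical schema for incoming employee data; real fields differ."""
    department: str
    salary: float = Field(ge=0)                  # no negative salaries
    engagement_score: float = Field(ge=1, le=5)  # assumed 1-5 survey scale
    absences: int = Field(ge=0)

# Valid payloads parse cleanly...
record = EmployeeMetrics(department="IT", salary=62000, engagement_score=4.1, absences=3)

# ...while malformed ones are rejected before they can reach the ML model.
try:
    EmployeeMetrics(department="IT", salary=-1, engagement_score=9, absences=0)
except ValidationError as exc:
    errors = exc.errors()
```

In the real backend, FastAPI applies this validation automatically when the model is used as a request body type.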
- Data Acquisition (Dual Entry):
- Automated Path: User uploads a CV via React; FastAPI orchestrates the file stream to Google Gemini Pro for entity extraction.
- Manual Path: Users can directly input employee metrics into a structured form, bypassing the AI extraction for immediate results.
- Standardization & Validation: Both data paths converge at the Pydantic layer, which enforces strict schema validation and type-safety before the data reaches the ML model.
- Intelligence Layer (Inference): The validated data is processed by a pre-trained XGBoost model.
- Cloud Persistence: Prediction logs and metadata are asynchronously committed to MongoDB Atlas using the Motor driver.
- Real-Time Analytics: The dashboard leverages MongoDB Aggregation Pipelines (`$match`, `$group`, `$avg`) to offload heavy computations to the database. This enables real-time tracking of Risk Hotspots (high-risk departments) and Organizational Averages. Data is fetched via the native Fetch API for instant UI synchronization.
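The analytics step above might look like the following pipeline. The collection and field names (`risk_score`, `department`) and the 0.5 cutoff are assumptions, and the Motor call is shown as a comment since it requires a live Atlas cluster:

```python
# Hypothetical aggregation: average risk per department, flagged employees only
HIGH_RISK_THRESHOLD = 0.5  # assumed cutoff for "high risk"

pipeline = [
    {"$match": {"risk_score": {"$gte": HIGH_RISK_THRESHOLD}}},
    {"$group": {
        "_id": "$department",                  # one bucket per department
        "avg_risk": {"$avg": "$risk_score"},   # departmental average
        "headcount": {"$sum": 1},              # how many flagged employees
    }},
    {"$sort": {"avg_risk": -1}},               # riskiest departments first
]

# With Motor (async), roughly:
#   hotspots = await db.predictions.aggregate(pipeline).to_list(length=None)
```

Pushing the `$group`/`$avg` work into MongoDB keeps the FastAPI process free of per-request number crunching.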
The Problem: During the integration of Google Gemini, the system frequently hit API rate limits despite low usage volume, and the frontend-to-backend data stream was failing to trigger the extraction logic correctly.
- The Pivot:
  - Model Switching: Swapped models to optimize token usage and cost-efficiency.
  - Robust Debugging: Implemented a comprehensive logging and "Safety Mechanism" layer to catch rate-limit exceptions before they crashed the frontend.
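A safety mechanism like the one described could be a small retry wrapper with exponential backoff. The exception type below is a stand-in, since the actual rate-limit exception class depends on the Gemini SDK version:

```python
import logging
import time

def with_backoff(call, max_retries=3, base_delay=2.0):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError as exc:  # stand-in for the SDK's rate-limit exception
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the API layer
            delay = base_delay * (2 ** attempt)
            logging.warning("Rate limited (%s); retrying in %.1fs", exc, delay)
            time.sleep(delay)
```

In the real backend this would wrap the Gemini extraction call, so a transient 429 surfaces as a logged retry rather than a frontend crash.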
The Problem: Initial attempts to handle dataset imbalance using SMOTE (Synthetic Minority Over-sampling Technique) resulted in lower Precision and Recall, as the synthetic data introduced noise that hindered the model's ability to generalize to real employee behavior.
- The Pivot: I pivoted to Cost-Sensitive Learning by tuning the `scale_pos_weight` parameter, and abandoned the resampled `X_train_res` dataset in favor of the original, authentic `X_train`. This forced the model to learn from real-world distributions while penalizing the misclassification of flight risks, significantly improving the model's predictive reliability.
| Metric | Old Model (SMOTE) | New Model (Weighted XGBoost) | Impact |
|---|---|---|---|
| Overall Accuracy | 67.00% | 69.84% | +2.84% Improvement |
| Precision (Class 1) | 53.00% | 60.00% | Higher reliability in flags |
| Recall (Class 1) | 41.00% | 41.00% | Consistent detection rate |
| Class 0 Recall | 80.00% | 85.00% | 5% fewer False Positives |
| Data Integrity | Synthetic | Organic (Original) | No "hallucinated" data |
- Navigate to `/backend` and create a `.env` file with your `GEMINI_API_KEY` and `MONGO_URI`.
- Install dependencies: `pip install -r requirements.txt`
- Run the server: `uvicorn main:app --reload`
- Navigate to `/frontend` and install dependencies: `npm install`
- Start the development server: `npm run dev`
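The backend `.env` might look like this (placeholder values only; substitute your own credentials):

```shell
GEMINI_API_KEY=your-gemini-key
MONGO_URI=your-atlas-connection-string
```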
SMOTE • No module error • Deprecated dict in Pydantic • Gemini API quickstart • MongoDB Atlas pipeline stages
Made by Neha K Vallappil • LinkedIn
