🌐 English · Español
Quantifying the gender income gap among small merchants in Latin America using transactional data from a digital payments platform. The analysis combines descriptive exploration, fixed-effects regression, regularized linear models, and machine learning (CART, MARS, KNN, Random Forest) to measure the gap and identify which observable factors explain it.
→ Read the full rendered report — every plot, table, and model output, no R installation required.
Is there a gender income gap among small merchants? If so, how large is it, how dispersed is it across business categories and zones, and how much of it can be explained by observable controls (hours worked, business type, transactions, location, age)?
The gender pay gap is well documented in formal employment, but evidence is much thinner for self-employed micro-merchants — exactly the segment this dataset captures.
The analysis Rmd reads from data/sample.csv — a fully synthetic 540-row sample (60 merchants × 9 weeks) generated to:
- Respect the original schema column-by-column
- Match the marginal distributions of
genero,rubroandcuarentena1 - Approximately reproduce the headline finding (raw gap ~ 20%) and the expected mediation through hours worked and business category
The original 245,000-row commercial dataset is not redistributed because it was provided under the terms of an academic course at the Universidad de Chile. See data/README.md for the full schema.
The platform registers each payment device under a single responsible operator with known address and gender, which makes it a clean proxy for who actually runs the business.
- Univariate distributions of income by gender (kernel density, mean overlay)
- Income by age range and by business category (
rubro) - Time-series view of income across weeks of the year
- COVID lockdown effect on income by gender
| Model | Spec |
|---|---|
m1 |
log(income) ~ gender (raw gap) |
m2 |
+ hours, transactions/hr, business category dummies |
m3 |
+ zone and age-bracket fixed effects |
- Backward / Forward stepwise regression (
step) - Ridge / LASSO via
glmnet(α = 0 / α = 1, λ = 0.05) - Compared adjusted R², RMSE and AIC
Re-estimating the same controls on:
log(hours_worked)— does the gap come from working less?log(income_per_hour)— or from lower hourly productivity?
Train/test split + caret-based tuning:
- CART (
rpart2) — tree depth grid - MARS (
earth) — degree and #terms grid - KNN — k grid
- Random Forest
- Selected on test-set RMSE
- Raw gap ≈ 20.7% — women earn on average ~20.7% less than men per week, with
R² ≈ 0.02(gender alone explains almost nothing of the variance — there are large omitted factors). - Adding hours worked, transactions/hr and business category raises explanatory power to
R² ≈ 0.6and shrinks the gender coefficient meaningfully — much of the raw gap is mediated by hours and category mix. - The gap is heterogeneous across categories: largest in apparel (
vestuario), smallest in services / trades (oficios y otros servicios). - The COVID lockdown variable does not explain the gap — both genders' income series move in parallel through 2020–2021.
- Decomposition: the gap is driven by both fewer hours worked and lower income per hour, with the hourly-productivity channel being the more persistent of the two after controls.
- MARS outperformed CART, KNN and stepwise OLS on test-set RMSE while still capturing the gender effect — selected as the final predictive model.
R fixest glmnet caret earth rpart randomForest ggplot2 dplyr
gender-income-gap/
├── README.md
├── analysis.Rmd # Full analysis (EDA + models + ML)
├── data/
│ ├── README.md # Schema reference
│ └── sample.csv # Synthetic 540-row sample
└── .github/workflows/
└── render.yml # CI: render Rmd → HTML → Pages
Just open https://jsanchez-ds.github.io/gender-income-gap/. The CI rebuilds the report on every push to main.
install.packages(c(
"rmarkdown", "knitr", "tidyverse", "dplyr", "readr", "ggplot2",
"ggcorrplot", "fixest", "glmnet", "caret", "earth", "rpart",
"randomForest", "kableExtra", "modelsummary"
))
rmarkdown::render("analysis.Rmd", output_dir = "docs", output_file = "index.html")The Rmd reads data/sample.csv from the repo, so no data download is needed.
Originally developed for IN5162 — Marketing Engineering, taught by Prof. Marcel Goic at the Universidad de Chile (Industrial Engineering, 2023). Reframed and rewritten in English for portfolio presentation.
Jonathan Sánchez
- GitHub: @jsanchez-ds
- Universidad de Chile — Industrial Engineering