A portfolio-grade analytics project that turns raw “hours spent” into composition-first behavioral insights.
Most screen-time analyses stop at totals (“people spend 8 hours/day”).
This project focuses on how that time is distributed across:
- Social
- Work/Study
- Entertainment
It ships:
- a reproducible scoring pipeline (python -m src.pipeline)
- clean exported datasets + figures
- a Streamlit dashboard for interactive exploration across age_group, primary_device, and internet_type
This project uses the Kaggle dataset:
- Daily Internet Usage Statistics by Age Group by jayjoshi37
- Dataset URL: https://www.kaggle.com/datasets/jayjoshi37/daily-internet-usage-statistics-by-age-group
Thanks to the dataset author and Kaggle for making the data available.
Two users can both have 10 hours/day of screen time, but with completely different patterns:
- One: mostly Work/Study (structured use)
- Another: mostly Social/Entertainment (dominant categories)
Totals alone can’t describe behavior.
Composition metrics can.
For each record, we compute:
- Shares of total screen time for each category
- DBI (Digital Balance Index): how evenly those shares are distributed
- Dominance: how strongly one category “owns” the day
- Practical tiers and a flag: High-load & Skewed
After running the pipeline you get:
- outputs/scored_rows.csv: original data + engineered features:
  - shares (p_social, p_work, p_entertainment)
  - dbi, dominance, dominant_category
  - tiers (dbi_tier, load_tier)
  - risk flag (flag_highload_skewed)
- outputs/segment_summary.csv: aggregated statistics by age_group × primary_device × internet_type
- outputs/daily_summary.csv: daily aggregates (mean DBI / mean total / sample size (n) per day)
- outputs/metric_cards.json: KPI cards + thresholds and validation checks
- reports/figures/: all figures used in this README
The Streamlit dashboard allows:
- filtering by age group, device, internet type, DBI tier
- exploring DBI distributions and segment comparisons
- reading safe interpretation notes (“decision safety”)
Let:
- total = total_screen_time
- social = social_media_hours
- work = work_or_study_hours
- ent = entertainment_hours
- p_social = social / total
- p_work = work / total
- p_entertainment = ent / total
These are proportions in [0, 1] and (when total > 0) they sum to ~1.
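The share definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name `compute_shares` and the choice to return `None` when total is not positive are assumptions made here.

```python
def compute_shares(social, work, ent, total):
    """Return (p_social, p_work, p_entertainment) as fractions of total."""
    if total <= 0:
        # Assumption: shares are treated as undefined for non-positive totals.
        return None
    return (social / total, work / total, ent / total)


# Made-up record: 10 hours split into 3 social, 5 work/study, 2 entertainment.
shares = compute_shares(social=3.0, work=5.0, ent=2.0, total=10.0)
print(shares)
print(sum(shares))  # close to 1 when total equals the sum of the parts
```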
DBI uses normalized Shannon entropy over the three shares:
- H = - Σ (p_i * ln(p_i)) for i in {social, work, entertainment}, with 0 * ln(0) treated as 0 (the usual entropy convention)
- DBI = H / ln(3)
Interpretation:
- DBI ≈ 1.00 → time is distributed evenly (balanced composition)
- DBI ≈ 0.00 → time is concentrated in one category (skewed composition)
Important note:
- DBI does not say “good” or “bad”
- DBI says “balanced” vs “dominated”
- dominance = max(p_social, p_work, p_entertainment)
Interpretation:
- dominance close to 1.0 → one category dominates
- dominance close to 0.33 → near-even split
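As a rough sketch of the DBI and dominance formulas above, the following computes both for a few hand-picked share vectors. The helper names `dbi` and `dominance` are hypothetical, and skipping zero shares (the usual 0 * ln(0) = 0 convention) is an assumption about how the pipeline handles them.

```python
import math


def dbi(shares):
    """Normalized Shannon entropy of the shares, between 0 and 1."""
    # Zero shares are skipped: p * ln(p) tends to 0 as p -> 0 (assumed convention).
    h = sum(-p * math.log(p) for p in shares if p > 0)
    return h / math.log(len(shares))


def dominance(shares):
    """Largest single share."""
    return max(shares)


# Made-up share vectors (p_social, p_work, p_entertainment).
examples = {
    "even split": (1 / 3, 1 / 3, 1 / 3),
    "one category only": (1.0, 0.0, 0.0),
    "work-leaning": (0.3, 0.5, 0.2),
}
for label, shares in examples.items():
    print(f"{label}: DBI={dbi(shares):.3f}, dominance={dominance(shares):.2f}")
```

An even split gives a DBI of about 1 with dominance near 0.33, while a day spent entirely in one category gives a DBI of 0 with dominance 1. The work-leaning example still scores high on DBI (roughly 0.94), which shows that a moderate lead by one category does not by itself make a record skewed.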
DBI tiers:
- Balanced: DBI ≥ 0.80
- Mixed: 0.60 ≤ DBI < 0.80
- Skewed: < 0.60
Load tiers (by quantiles of total screen time):
- Low / Medium / High based on P33 and P66
Flag:
- flag_highload_skewed = 1 if load_tier == High AND dbi_tier == Skewed
This is intentionally phrased as “attention-worthy,” not “harmful.”
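The tiering rules above could be expressed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the sample totals are made up, the quantile method (the default of `statistics.quantiles`) may differ from what the pipeline uses, and how values exactly on P33 or P66 are assigned is a guess.

```python
import statistics


def dbi_tier(dbi):
    """Map a DBI value to Balanced / Mixed / Skewed."""
    if dbi >= 0.80:
        return "Balanced"
    if dbi >= 0.60:
        return "Mixed"
    return "Skewed"


def load_tier(total, p33, p66):
    """Map total screen time to Low / Medium / High."""
    # Assumption: a value exactly on a cut point falls into the lower tier.
    if total <= p33:
        return "Low"
    if total <= p66:
        return "Medium"
    return "High"


def flag_highload_skewed(total, dbi, p33, p66):
    """Return 1 when the record is both High load and Skewed, else 0."""
    return int(load_tier(total, p33, p66) == "High" and dbi_tier(dbi) == "Skewed")


# Made-up totals (hours/day) used only to derive illustrative cut points.
totals = [2.0, 4.0, 5.0, 6.5, 8.0, 9.0, 10.0, 12.0]
cuts = statistics.quantiles(totals, n=100)  # 99 percentile cut points
p33, p66 = cuts[32], cuts[65]

print(flag_highload_skewed(12.0, 0.40, p33, p66))  # heavy and concentrated
print(flag_highload_skewed(12.0, 0.90, p33, p66))  # heavy but balanced
```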
Set up the environment:

    python -m venv .venv
    # Windows:
    # .venv\Scripts\activate
    # macOS/Linux:
    # source .venv/bin/activate
    pip install -r requirements.txt

Run the pipeline:

    python -m src.pipeline --input data/raw/daily_internet_usage_by_age_group.csv

You should see a Done message and the output folders printed.

Run the dashboard:

    streamlit run app/app.py

Project structure:

    digital-balance-index-dashboard/
      data/
        raw/
        processed/
      outputs/
      reports/
        figures/
      src/
        io.py
        validate.py
        scoring.py
        aggregates.py
        reporting.py
        pipeline.py
      app/
        app.py
      README.md
      requirements.txt
Below are the key plots generated by the pipeline.
Each bar is an age group. The bar stacks to 1.0 (100% of total screen time), split into:
- Social
- Work/Study
- Entertainment
This answers:
- Do age groups differ more by total time, or by how time is distributed?
- Look for relative differences in shares (e.g., Work/Study share slightly higher in one group).
- A similar composition across groups suggests that differences may lie more in total screen time than in usage mix.
- Don’t interpret this as “who uses more.” This plot is composition, not absolute hours.
- A group can have the same composition but different total hours.
A histogram of DBI values across all records.
This is your “macro fingerprint”:
- If DBI is mostly high → usage is generally mixed/balanced in composition.
- If DBI has a heavy low tail → many users have dominated usage (one category owns most of the day).
- A concentration of high DBI values suggests many “mixed days,” not necessarily “low usage.”
- A low-DBI tail is where you’d investigate dominant categories:
  - dominated by Social?
  - dominated by Entertainment?
  - or dominated by Work/Study?
Each dot is a record:
- x-axis: total screen time (hours)
- y-axis: DBI (0–1)
This separates two different questions:
- How much? (load)
- How distributed? (composition)
Even without drawing lines, you can think in quadrants:
- High total + High DBI: heavy usage, but spread across categories (mixed day)
- High total + Low DBI: heavy usage, dominated by one category (attention-worthy)
- Low total + High DBI: light usage, balanced composition
- Low total + Low DBI: light usage, but concentrated (short + focused)
This plot is descriptive: it suggests where to look, not what to diagnose.
- Solid line: daily mean DBI
- Dashed line: sample size (n) that day
Sample size is shown alongside the mean because daily means are only meaningful if daily n is stable.
- If n swings sharply, the mean can swing even if behavior didn’t change.
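To make the point about sample size concrete, here is a small sketch of a per-day aggregation that keeps n next to the mean. The function name `daily_summary`, the input shape (pairs of date and DBI), and the records themselves are made up for illustration and are not taken from the pipeline.

```python
from collections import defaultdict


def daily_summary(rows):
    """Group (date, dbi) pairs by date and return {date: (mean_dbi, n)}."""
    by_day = defaultdict(list)
    for date, dbi in rows:
        by_day[date].append(dbi)
    return {day: (sum(vals) / len(vals), len(vals)) for day, vals in sorted(by_day.items())}


# Made-up records: three on the first day, only one on the second.
rows = [
    ("2024-01-01", 0.90),
    ("2024-01-01", 0.80),
    ("2024-01-01", 0.85),
    ("2024-01-02", 0.40),
]
summary = daily_summary(rows)
for day, (mean_dbi, n) in summary.items():
    print(f"{day}: mean DBI={mean_dbi:.2f}, n={n}")
```

In this toy data the second day's mean drops sharply, but it rests on a single record, so the drop says more about sample size than about behavior.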
Reasonable:
- “DBI is fairly stable across dates.”
- “Some days show deviations, but sample size should be checked.”
Not reasonable:
- “Behavior changed over time for the same people.” This dataset is not a per-person time series; it’s a collection of records by date.
Average DBI per primary device category.
- Quick comparison: are some devices associated with more balanced composition?
- Small differences can be real or just sample noise.
- Device does not “cause” DBI differences; it’s an association.
A strong follow-up is to check:
- DBI distribution by device (not just means)
- segment breakdown: age group × device
Average DBI for WiFi vs Mobile Data.
If values are similar:
- connectivity context may not be a major driver of composition balance
If values differ:
- could reflect behavior contexts (e.g., mobile data used on the go)
Again: this is association, not causation.
Mean DBI for each (age_group, primary_device) combination.
This reveals interaction patterns that “mean by device” hides:
- a device might look balanced overall,
- but show different balance depending on age group.
- Spot extremes (highest and lowest cells)
- Use segment_summary.csv to confirm:
  - n per segment
  - composition means
  - high-load & skewed rate
This project is designed for safe analytics:
- DBI is not a mental health score.
- DBI is not a productivity score.
- DBI is not a diagnosis.
Recommended language:
- “composition pattern”
- “dominant category”
- “balanced vs skewed distribution”
- “attention-worthy segment”
Avoid language like:
- “addicted”
- “harmful”
- “unhealthy”
- “caused by WiFi/device”
The pipeline includes:
- schema validation (required columns)
- identity validation: total ≈ social + work + entertainment
- consistent export paths and figures
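The first two checks might look something like the sketch below. It is illustrative only: the function names and the tolerance value are hypothetical, and the required-column set lists only the columns named in this README, so the real schema may include more.

```python
import math

# Columns named in this README; the actual schema may require others as well.
REQUIRED_COLUMNS = {
    "age_group",
    "primary_device",
    "internet_type",
    "total_screen_time",
    "social_media_hours",
    "work_or_study_hours",
    "entertainment_hours",
}


def missing_columns(columns):
    """Return the required columns that are absent, in sorted order."""
    return sorted(REQUIRED_COLUMNS - set(columns))


def identity_ok(row, abs_tol=0.05):
    """Check total ≈ social + work + entertainment within abs_tol hours."""
    # abs_tol is a made-up tolerance, not a value taken from the pipeline.
    parts = (
        row["social_media_hours"]
        + row["work_or_study_hours"]
        + row["entertainment_hours"]
    )
    return math.isclose(row["total_screen_time"], parts, abs_tol=abs_tol)


good = {"total_screen_time": 10.0, "social_media_hours": 3.0,
        "work_or_study_hours": 5.0, "entertainment_hours": 2.0}
bad = dict(good, entertainment_hours=1.0)  # parts now sum to 9, not 10

print(missing_columns(good.keys()))
print(identity_ok(good), identity_ok(bad))
```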
If you want to level this up further:
- Add bootstrap confidence intervals for segment means (DBI/total)
- Add a “What-if composition simulator” (move minutes between categories → new DBI)
- Add dominance + category-specific deep dives
- Add segment stability checks (min n threshold)