email-task-assignment

End-to-end NLP pipeline for assigning responsible persons to tasks detected in emails · DistilBERT fine-tuning · Logistic Regression baseline · FastAPI inference service

Problem statement

In compliance-heavy environments (RegTech, legal, finance), missed task assignments in email threads are costly — they create accountability gaps, audit failures, and regulatory risk. Manually tracking who is responsible for what across hundreds of daily emails is error-prone and unscalable.

This project builds an automated pipeline that:

1. Detects task sentences in email threads
2. Identifies candidate persons from Sender / To / Cc
3. Predicts, for each candidate, whether they are responsible for the task

This is a binary classification problem over (email, task, candidate) tuples, based on the EPA dataset paper (Rameshkumar et al., W-NUT @ EMNLP 2018).
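To make the tuple framing concrete, here is a minimal sketch of how one email might be expanded into classification examples. The `Candidate` and `Example` classes and the `build_examples` helper are hypothetical names used for illustration; they are not taken from src/prepare.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    email: str
    name: str
    role: str  # "sender", "to", or "cc"


@dataclass(frozen=True)
class Example:
    email_id: str
    task_sentence: str
    candidate: Candidate
    label: Optional[int] = None  # 1 = responsible, 0 = not responsible, None = unlabelled


def build_examples(email_id: str, task_sentences: List[str],
                   candidates: List[Candidate]) -> List[Example]:
    """Cross every detected task sentence with every candidate person."""
    return [
        Example(email_id, task, cand)
        for task in task_sentences
        for cand in candidates
    ]


examples = build_examples(
    "msg-001",
    ["Can you send me the Q3 report?"],
    [
        Candidate("alice@enron.com", "Alice Smith", "to"),
        Candidate("bob@enron.com", "Bob Jones", "cc"),
    ],
)
for ex in examples:
    print(ex.candidate.name, "|", ex.candidate.role, "|", ex.task_sentence)
```

One task sentence and two candidates yield two examples, each of which receives its own binary label.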

Pipeline overview

Raw Email
    │
    ▼
┌─────────────┐
│   Ingest    │  Download & parse Enron corpus → structured DataFrame
└─────────────┘
    │
    ▼
┌─────────────┐
│   Prepare   │  Extract task sentences + candidates → labelled tuples
└─────────────┘
    │
    ▼
┌─────────────┐
│  Features   │  16 handcrafted features (name match, role, linguistics)
└─────────────┘
    │
    ▼
┌──────────────────────────────────────┐
│  LR Baseline  │  DistilBERT          │
└──────────────────────────────────────┘
    │
    ▼
┌─────────────┐
│  FastAPI    │  POST /predict → responsible candidates + confidence
└─────────────┘

Approach & design decisions

Why two models?

Logistic Regression baseline serves as a strong, interpretable reference. In compliance contexts, interpretability matters — you need to explain to auditors why a person was flagged. The 16 handcrafted features directly mirror the annotation guidelines from the paper and are fully auditable.
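The auditability argument can be illustrated with a small sketch: in a logistic regression, each feature contributes `weight * value` to the logit, so a prediction can be broken down feature by feature. The weights and bias below are invented for illustration and only three of the 16 features are shown; the real values come from the trained model.

```python
import math

# Hypothetical weights for three of the 16 features (not the trained values).
WEIGHTS = {"name_in_task": 2.1, "is_to": 1.4, "single_to_recipient": 1.8}
BIAS = -3.0


def explain(features):
    """Return the predicted probability and each feature's contribution to the logit."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


prob, contrib = explain({"name_in_task": 0, "is_to": 1, "single_to_recipient": 1})
for name, value in sorted(contrib.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>20}: {value:+.2f}")
print(f"{'probability':>20}: {prob:.3f}")
```

A breakdown of this kind is what an auditor would see: which features pushed the candidate towards "responsible" and by how much.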

DistilBERT captures semantic context that rules cannot — implicit references, paraphrasing, domain-specific language. Input format:

[CLS] task_sentence [SEP] candidate_name [SEP] → responsible / not responsible

DistilBERT was chosen over BERT-base because it is about 40% smaller (268MB vs 440MB on disk) while retaining roughly 97% of BERT's performance, as reported by the DistilBERT authors. This is a deliberate engineering trade-off for deployment efficiency.

Including both models serves a deliberate purpose: DistilBERT is not here to win — it is here to demonstrate the limits of deep learning on small, heuristically-labelled data, and to show what a production upgrade path looks like once real annotations are available. The comparison is the contribution.

Why silver labels?

The original EPA annotation file (aka.ms/epadataset) was unavailable at the time of this project. Labels were derived from heuristics closely mirroring the paper's annotation guidelines: explicit name mentions, single To-recipient with "you", broadcast imperatives. This is documented transparently as a known limitation.
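A minimal sketch of what such a labelling function could look like is given below. The three rules follow the heuristics named above, but the function name, the first-name matching, and the list of broadcast cue words are assumptions made for illustration, not the code in src/prepare.py.

```python
import re


def silver_label(task_sentence, candidate_name, candidate_role, to_list):
    """Return 1 if the heuristics mark the candidate as responsible, else 0."""
    text = task_sentence.lower()
    first_name = candidate_name.split()[0].lower()

    # Rule 1: explicit name mention in the task sentence.
    if re.search(rf"\b{re.escape(first_name)}\b", text):
        return 1

    # Rule 2: the candidate is the sole To-recipient and is addressed as "you".
    if candidate_role == "to" and len(to_list) == 1 and re.search(r"\byou\b", text):
        return 1

    # Rule 3: broadcast imperative addressed to every To-recipient
    # (cue words are an assumption).
    if candidate_role == "to" and re.search(r"\b(everyone|everybody|team)\b", text):
        return 1

    return 0


print(silver_label("Can you send me the Q3 report?", "Alice Smith", "to", ["alice@enron.com"]))
print(silver_label("Can you send me the Q3 report?", "Bob Jones", "cc", ["alice@enron.com"]))
```

Note that Rule 1 fires on any name mention, including a beneficiary such as "John" in "Please handle this for John.", which is one way heuristic labels can diverge from human annotation.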

Results

Model	F1	Precision	Recall	AUC-ROC
Logistic Regression baseline	0.910	0.836	1.000	0.991
DistilBERT fine-tuned	0.703	0.698	0.707	—

Key finding: The LR baseline produces zero false negatives on the test set, so every candidate marked responsible by the silver labels is recovered. Recall is the critical property for compliance workflows, where missed task assignments carry regulatory risk.

Split by recipient type (paper protocol)

Model	N	Single-recipient F1	Multi-recipient F1
Logistic Regression	11,644	0.909	0.912
DistilBERT	4,036	0.561	0.803

Inference examples

Task	Candidate	Responsible	Confidence
Please handle this for John.	John Smith	✓ True	0.858
Please handle this for John.	Alice Smith	False	0.004
Everyone please review the document.	Carol White	✓ True	0.963
Can you send me the Q3 report?	Alice Smith	False	0.037

Why LR outperforms DistilBERT

This result is expected and explainable:

1. Silver label leakage: Labels were generated using heuristics (name matching, email role) identical to LR's input features. LR essentially replicates the labelling function; DistilBERT must learn from raw text alone without this shortcut.

2. Limited fine-tuning: Only 2 epochs on 5k tuples due to compute constraints. BERT-family fine-tuning commonly uses 2 to 4 epochs, so the small, noisy training set is probably the larger constraint; more epochs and substantially more examples would both be expected to help.

3. Coreference gap: Implicit references ("you", "we") cannot be resolved at sentence level without full thread context. LR sidesteps this via the is_to / single_to_recipient features.

4. Domain shift: DistilBERT was pretrained on Wikipedia and BookCorpus. Enron emails have a very different style: informal language, noisy formatting, reply-chain clutter.

With real EPA crowd-sourced labels, proper coreference resolution, and full GPU training on 50k+ tuples, DistilBERT would be expected to close this gap and could plausibly outperform the LR baseline. This has not been tested here.

A note on overfitting / label leakage

The LR model's perfect recall is not overfitting. It reflects a structural property of the silver labels: any tuple where the candidate is the sole To recipient AND the task contains "you" is always labelled positive, and LR learns this exact rule. Validation and test F1 are consistent (0.929 val → 0.910 test), so there is no sign of overfitting to the training split. This check does not show that the model generalises to human-annotated labels, because all splits share the same labelling heuristics.

Business framing (RegTech context)

In compliance workflows, the cost of errors is asymmetric:

False negative (missing a responsible person) → task falls through the cracks → regulatory breach, audit failure
False positive (over-notifying) → noise, alert fatigue, reduced trust in the system

A missed assignment is the costlier error, but precision still matters in production because alert fatigue erodes trust in the system. The current LR model achieves perfect recall (1.0) at the cost of moderate precision (0.84), which is appropriate for a first-pass filter where human review follows. A production deployment would expose a confidence threshold that compliance teams can tune based on their risk appetite:

Low threshold  → high recall  → catch everything, more noise
High threshold → high precision → only flag high-confidence assignments

The model's AUC-ROC of 0.991 means it has near-perfect ability to rank responsible candidates above non-responsible ones — excellent for threshold-based compliance workflows.
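The threshold trade-off can be sketched in a few lines. The scores and labels below are toy values invented for illustration, not model outputs, and `precision_recall` is a hypothetical helper rather than part of the project.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when candidates with score >= threshold are flagged."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy confidences and ground-truth labels, for illustration only.
scores = [0.96, 0.86, 0.60, 0.40, 0.04, 0.03]
labels = [1, 1, 0, 1, 0, 0]

low = precision_recall(scores, labels, 0.3)   # catches everything, one false alarm
high = precision_recall(scores, labels, 0.8)  # no false alarms, one miss

print("threshold 0.3 -> precision %.2f, recall %.2f" % low)
print("threshold 0.8 -> precision %.2f, recall %.2f" % high)
```

On this toy data the low threshold keeps recall at 1.0 with one false positive, while the high threshold removes the false positive but misses one responsible candidate.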

Error analysis

Where the model fails

1. Implicit pronoun assignment

Email:     "Hi Alice, can you send me the report?"
Task:      "can you send me the report?"
Candidate: Alice Smith
LR pred:   False ← WRONG
Reason:    "Alice" appears in greeting, not task sentence.
           name_in_task feature misses it.

2. Delegation without explicit name

Email:     "Please handle this for John."
Task:      "Please handle this for John."
Candidate: Bob Jones (actual delegate)
LR pred:   False ← Correct, but for wrong reasons
Reason:    "John" is the beneficiary, not the responsible person.
           Model correctly rejects Bob but doesn't understand delegation.

3. Group broadcast ambiguity

Email:     "Team, please review by EOD."
Task:      "please review by EOD"
Candidates: All To recipients
LR pred:   True for all ← over-predicts
Reason:    Broadcast heuristic assigns everyone;
           no signal for partial responsibility.

These failure modes map directly to the paper's identified challenges: implicit addressee resolution, deictic reference, and multi-party delegation.
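The first failure mode can be reproduced with a short sketch. `name_in_text` is a hypothetical stand-in for the project's name-matching feature, assuming it matches the candidate's first name as a whole word; the actual implementation in src/features.py may differ.

```python
import re


def name_in_text(candidate_name, text):
    """True if the candidate's first name appears as a whole word in the text."""
    first_name = candidate_name.split()[0].lower()
    return bool(re.search(rf"\b{re.escape(first_name)}\b", text.lower()))


email_body = "Hi Alice, can you send me the report?"
task_sentence = "can you send me the report?"

# Matching against the task sentence only misses the name in the greeting.
name_in_task = name_in_text("Alice Smith", task_sentence)
# Matching against the full email body would find it.
name_in_email = name_in_text("Alice Smith", email_body)

print("name_in_task:", name_in_task)
print("name_in_email:", name_in_email)
```

The signal exists in the email but sits outside the sentence the feature inspects, which is why thread-level context or coreference resolution is needed for this case.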

System design

                ┌─────────────────────────────┐
                │        Email System          │
                │  (Exchange / Gmail / SMTP)   │
                └─────────────┬───────────────┘
                              │ new emails
                              ▼
                ┌─────────────────────────────┐
                │      Ingestion Service       │
                │  incremental fetch + parse   │
                └─────────────┬───────────────┘
                              │ structured tuples
                              ▼
                ┌─────────────────────────────┐
                │     Inference Service        │
                │   FastAPI + trained model    │
                └──────┬──────────────┬───────┘
                       │              │
           responsible │              │ not responsible
                       ▼              ▼
           ┌───────────────┐    ┌───────────────┐
           │  Notify user  │    │   Discard /   │
           │  + audit log  │    │   log only    │
           └───────────────┘    └───────────────┘

Scaling

Batch inference: serve.py would consume email batches from a queue (Kafka / SQS) for high-throughput processing
Containerisation: FastAPI + uvicorn packaged as a Docker container, deployed behind a load balancer
Latency (approximate): LR inference <1ms per tuple; DistilBERT ~50ms on CPU, ~5ms on GPU

Daily pipeline

New emails arrive
    → incremental ingest (no full re-download)
    → prepare new tuples
    → run inference with trained model
    → push notifications to responsible persons
    → log predictions for audit trail

Retraining strategy

1. Collect human feedback on predictions (accept / reject)
2. Add corrected labels to training set
3. Retrain LR baseline weekly (fast, cheap, interpretable)
4. Fine-tune DistilBERT monthly on accumulated labels
5. A/B test new model before promoting to production
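Steps 1 and 2 can be sketched as a merge of reviewer feedback into the existing label set. The key layout (email ID plus candidate address) and the `merge_feedback` helper are assumptions for illustration; the project does not specify how feedback is stored.

```python
def merge_feedback(training_labels, feedback):
    """Return a new label dict in which human feedback overrides silver labels.

    training_labels: {key: label} with label 1 (responsible) or 0 (not responsible)
    feedback: iterable of (key, predicted_label, accepted) from human review
    """
    merged = dict(training_labels)
    for key, predicted, accepted in feedback:
        # An accepted prediction confirms the label; a rejected one flips it.
        merged[key] = predicted if accepted else 1 - predicted
    return merged


silver = {
    ("msg-001", "alice@enron.com"): 0,
    ("msg-002", "carol@enron.com"): 1,
}
feedback = [
    (("msg-001", "alice@enron.com"), 0, False),  # reviewer rejected "not responsible"
    (("msg-003", "bob@enron.com"), 1, True),     # reviewer accepted "responsible"
]

updated = merge_feedback(silver, feedback)
print(updated)
```

Because human corrections replace silver labels rather than sit alongside them, the training set gradually moves away from the labelling heuristics as feedback accumulates.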

Known limitations

Silver labels limit ceiling performance; real annotated labels needed for production
No coreference resolution — implicit "you" handled by role features only
Email thread text is truncated to 1,000 characters; long threads may lose context
English-only; multilingual emails not supported

Reproducing DistilBERT training

DistilBERT was trained on a Google Colab T4 GPU. This was a deliberate infrastructure choice, since transformer fine-tuning is a standard GPU workload and the 268MB model did not fit in the Codespace free-tier disk space left after the corpus download.

# Google Colab (Runtime → Change runtime type → T4 GPU)
!git clone https://github.com/SouRitra01/email-task-assignment
%cd email-task-assignment
!pip install -r requirements.txt
!python src/ingest.py --max-emails 50000
!python src/prepare.py --max-emails 15000
!python src/features.py
!python src/model.py --model bert --max-rows 20000
# Expected: ~10 minutes on T4 GPU

Quickstart

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the full pipeline
python src/ingest.py --max-emails 50000     # downloads ~2GB corpus
python src/prepare.py --max-emails 15000
python src/features.py
python src/model.py --model baseline        # LR (~3 min)
python src/model.py --model bert            # DistilBERT (GPU recommended)

# 3. Start the inference service
uvicorn src.serve:app --reload --port 8000

Inference API

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "task_sentence": "Can you please send me the Q3 report by Friday?",
    "candidates": [
      {"email": "alice@enron.com", "name": "Alice Smith", "role": "to"},
      {"email": "bob@enron.com",   "name": "Bob Jones",   "role": "cc"}
    ],
    "to_list": ["alice@enron.com"],
    "cc_list":  ["bob@enron.com"],
    "subject":  "Q3 Report"
  }'

Response:

{
  "task_sentence": "Can you please send me the Q3 report by Friday?",
  "model_used": "logistic_regression",
  "results": [
    {"email": "alice@enron.com", "name": "Alice Smith", "responsible": true,  "confidence": 0.923},
    {"email": "bob@enron.com",   "name": "Bob Jones",   "responsible": false, "confidence": 0.031}
  ]
}
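For callers working from Python rather than curl, a minimal client sketch using only the standard library is shown below. The request and response fields mirror the example above; `build_request` and `responsible_people` are hypothetical helpers, and deriving `to_list` and `cc_list` from each candidate's `role` is an assumption that happens to hold for the sample payload.

```python
import json
import urllib.request


def build_request(task_sentence, candidates, subject="",
                  url="http://localhost:8000/predict"):
    """Build a POST request for /predict without sending it."""
    payload = {
        "task_sentence": task_sentence,
        "candidates": candidates,
        # Assumption: the address lists mirror the candidates' roles.
        "to_list": [c["email"] for c in candidates if c["role"] == "to"],
        "cc_list": [c["email"] for c in candidates if c["role"] == "cc"],
        "subject": subject,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def responsible_people(response_body, min_confidence=0.5):
    """Names of candidates flagged responsible at or above min_confidence."""
    return [
        r["name"]
        for r in response_body["results"]
        if r["responsible"] and r["confidence"] >= min_confidence
    ]


request = build_request(
    "Can you please send me the Q3 report by Friday?",
    [
        {"email": "alice@enron.com", "name": "Alice Smith", "role": "to"},
        {"email": "bob@enron.com", "name": "Bob Jones", "role": "cc"},
    ],
    subject="Q3 Report",
)

# Parse the sample response shown above; no server is contacted here.
sample_response = {
    "task_sentence": "Can you please send me the Q3 report by Friday?",
    "model_used": "logistic_regression",
    "results": [
        {"email": "alice@enron.com", "name": "Alice Smith", "responsible": True, "confidence": 0.923},
        {"email": "bob@enron.com", "name": "Bob Jones", "responsible": False, "confidence": 0.031},
    ],
}
names = responsible_people(sample_response)
print(names)

# With the service running, the request could be sent like this:
# with urllib.request.urlopen(request) as resp:
#     print(responsible_people(json.load(resp)))
```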

Project structure

email-task-assignment/
├── .devcontainer/devcontainer.json   # Codespaces auto-setup (Ubuntu + Python 3.11)
├── src/
│   ├── ingest.py                     # Download & parse Enron corpus
│   ├── prepare.py                    # Build (email, task, candidate, label) tuples
│   ├── features.py                   # 16 handcrafted features for LR baseline
│   ├── model.py                      # LR baseline + DistilBERT fine-tuning
│   └── serve.py                      # FastAPI inference service
├── notebooks/
│   └── walkthrough.ipynb             # End-to-end walkthrough with plots & analysis
├── data/
│   ├── raw/                          # Downloaded corpus (git-ignored)
│   └── processed/                    # Parquet files (git-ignored)
├── models/                           # Saved weights (git-ignored)
├── requirements.txt
└── README.md

Tech stack

PyTorch + HuggingFace Transformers — DistilBERT fine-tuning
scikit-learn — logistic regression + evaluation
pandas + numpy — data handling
NLTK — sentence tokenisation
FastAPI + uvicorn — inference service
Google Colab T4 GPU — model training environment

References

Rameshkumar et al. (2018). Assigning people to tasks identified in email: The EPA dataset for addressee tagging for detected task intent. W-NUT @ EMNLP 2018. https://aclanthology.org/W18-6104/

Results

Logistic Regression baseline

Split	Precision	Recall	F1
Overall	0.861	1.000	0.925
Single-recipient	0.849	1.000	0.918
Multi-recipient	0.879	1.000	0.936

DistilBERT fine-tuning

The code is implemented in src/model.py. It could not be run in the development environment (Codespace free tier, 32GB disk / 8GB RAM) because too little disk space remained after the corpus download, so training was done on Google Colab instead. The implementation follows standard sentence-pair classification: [CLS] task_sentence [SEP] candidate_name [SEP]

To run on a GPU machine: python src/model.py --model bert --max-rows 10000

Trained models

Model weights are stored on Google Drive (too large for GitHub):

lr_baseline.joblib — Logistic Regression baseline (F1=0.93)
bert_best.pt — DistilBERT fine-tuned (F1=0.70)

To use: download from Drive and place in models/ directory.

Notes

Trained on Enron corpus with silver labels (original EPA dataset link unavailable)
DistilBERT weights on Google Drive due to GitHub 100MB limit
Trained on Google Colab T4 GPU
