End-to-end NLP pipeline for assigning responsible persons to tasks detected in emails · DistilBERT fine-tuning · Logistic Regression baseline · FastAPI inference service
In compliance-heavy environments (RegTech, legal, finance), missed task assignments in email threads are costly — they create accountability gaps, audit failures, and regulatory risk. Manually tracking who is responsible for what across hundreds of daily emails is error-prone and unscalable.
This project builds an automated pipeline that:
- Detects task sentences in email threads
- Identifies candidate persons from Sender / To / Cc
- Predicts — for each candidate — whether they are responsible for the task
This is a binary classification problem over (email, task, candidate) tuples, based on the EPA dataset paper (Rameshkumar et al., W-NUT @ EMNLP 2018).
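As a rough illustration of the tuple framing, the sketch below crosses every detected task sentence with every candidate person. The class and function names (`Candidate`, `Example`, `build_examples`) are hypothetical and are not taken from src/prepare.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    email: str
    name: str
    role: str  # "sender", "to", or "cc"


@dataclass(frozen=True)
class Example:
    email_id: str
    task_sentence: str
    candidate: Candidate


def build_examples(email_id: str, task_sentences: List[str],
                   candidates: List[Candidate]) -> List[Example]:
    """Cross every task sentence with every candidate person."""
    return [Example(email_id, task, cand)
            for task in task_sentences
            for cand in candidates]


candidates = [
    Candidate("alice@enron.com", "Alice Smith", "to"),
    Candidate("bob@enron.com", "Bob Jones", "cc"),
]
examples = build_examples("msg-001", ["Can you send me the Q3 report?"], candidates)
print(len(examples))  # one binary classification example per (task, candidate) pair
```

Each `Example` then receives a single responsible / not responsible label.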
Raw Email
│
▼
┌─────────────┐
│ Ingest │ Download & parse Enron corpus → structured DataFrame
└─────────────┘
│
▼
┌─────────────┐
│ Prepare │ Extract task sentences + candidates → labelled tuples
└─────────────┘
│
▼
┌─────────────┐
│ Features │ 16 handcrafted features (name match, role, linguistics)
└─────────────┘
│
▼
┌──────────────────────────────────────┐
│ LR Baseline │ DistilBERT │
└──────────────────────────────────────┘
│
▼
┌─────────────┐
│ FastAPI │ POST /predict → responsible candidates + confidence
└─────────────┘
Logistic Regression baseline serves as a strong, interpretable reference. In compliance contexts, interpretability matters — you need to explain to auditors why a person was flagged. The 16 handcrafted features directly mirror the annotation guidelines from the paper and are fully auditable.
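To give a sense of what auditable features look like, here is a minimal sketch of a few of them. The names name_in_task, is_to and single_to_recipient appear elsewhere in this README; `is_cc`, `has_you` and the function signature are assumptions for illustration, and the real src/features.py may compute them differently.

```python
import re
from typing import Dict, List


def extract_features(task_sentence: str, candidate_name: str, candidate_email: str,
                     to_list: List[str], cc_list: List[str]) -> Dict[str, int]:
    """Return a small, human-readable feature dict for one (task, candidate) pair."""
    tokens = set(re.findall(r"[a-z']+", task_sentence.lower()))
    name_parts = candidate_name.lower().split()
    is_to = candidate_email in to_list
    return {
        # Any part of the candidate's name appears in the task sentence
        "name_in_task": int(any(part in tokens for part in name_parts)),
        "is_to": int(is_to),
        "is_cc": int(candidate_email in cc_list),
        # Candidate is the only address on the To line
        "single_to_recipient": int(is_to and len(to_list) == 1),
        # Task sentence addresses someone as "you"
        "has_you": int("you" in tokens),
    }


features = extract_features(
    "Can you send me the Q3 report?", "Alice Smith", "alice@enron.com",
    to_list=["alice@enron.com"], cc_list=["bob@enron.com"],
)
print(features)
```

Because every feature is a named 0/1 flag, an LR coefficient can be read directly as "this signal pushes the prediction towards responsible".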
DistilBERT captures semantic context that rules cannot — implicit references, paraphrasing, domain-specific language. Input format:
[CLS] task_sentence [SEP] candidate_name [SEP] → responsible / not responsible
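The layout above can be spelled out with a small helper. This is only an illustration of the string layout; `format_pair` is a hypothetical name, and in practice a Hugging Face BERT-style tokenizer called with the two texts as a sentence pair inserts the special tokens itself.

```python
def format_pair(task_sentence: str, candidate_name: str) -> str:
    """Show the sentence-pair layout fed to the classifier (illustration only)."""
    return f"[CLS] {task_sentence} [SEP] {candidate_name} [SEP]"


# With Hugging Face Transformers, the equivalent encoding is normally produced by
# calling the tokenizer on the pair, e.g. tokenizer(task_sentence, candidate_name),
# rather than by building the string by hand.
print(format_pair("Everyone please review the document.", "Carol White"))
```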
DistilBERT was chosen over BERT-base because it is roughly 40% smaller (268MB vs 440MB) while retaining about 97% of BERT's language-understanding performance, as reported by the DistilBERT authors. This is a deliberate engineering trade-off for deployment efficiency.
Including both models serves a deliberate purpose: DistilBERT is not here to win — it is here to demonstrate the limits of deep learning on small, heuristically-labelled data, and to show what a production upgrade path looks like once real annotations are available. The comparison is the contribution.
The original EPA annotation file (aka.ms/epadataset) was unavailable at the time of this project. Labels were derived from heuristics closely mirroring the paper's annotation guidelines: explicit name mentions, single To-recipient with "you", broadcast imperatives. This is documented transparently as a known limitation.
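The three heuristics can be sketched as a small labelling function. This is an assumption-laden sketch, not the code in src/prepare.py: the function name `silver_label` and the `BROADCAST_WORDS` cue list are hypothetical.

```python
import re
from typing import List

# Hypothetical cue words for "broadcast imperative" sentences
BROADCAST_WORDS = {"everyone", "all", "team"}


def silver_label(task_sentence: str, candidate_name: str,
                 candidate_email: str, to_list: List[str]) -> int:
    """Return 1 if the heuristics mark the candidate as responsible, else 0."""
    tokens = set(re.findall(r"[a-z']+", task_sentence.lower()))
    is_to = candidate_email in to_list

    # Rule 1: explicit name mention in the task sentence
    if any(part in tokens for part in candidate_name.lower().split()):
        return 1
    # Rule 2: sole To recipient addressed as "you"
    if is_to and len(to_list) == 1 and "you" in tokens:
        return 1
    # Rule 3: broadcast imperative applies to every To recipient
    if is_to and tokens & BROADCAST_WORDS:
        return 1
    return 0


print(silver_label("Can you send me the Q3 report?", "Alice Smith",
                   "alice@enron.com", ["alice@enron.com"]))
```

Note that a naive name-mention rule like Rule 1 would also label the beneficiary in "Please handle this for John." as responsible, which is one way heuristic labels can diverge from human judgement.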
| Model | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|
| Logistic Regression baseline | 0.910 | 0.836 | 1.000 | 0.991 |
| DistilBERT fine-tuned | 0.703 | 0.698 | 0.707 | — |
Key finding: The LR baseline achieves zero false negatives on the test set — every responsible person is correctly identified. This is the critical property for compliance workflows where missed task assignments carry regulatory risk.
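As a sanity check, F1 can be recomputed from the precision and recall columns as their harmonic mean. The short sketch below uses the rounded table values, so the result agrees with the reported F1 only up to rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


lr_f1 = f1_score(0.836, 1.000)    # Logistic Regression row
bert_f1 = f1_score(0.698, 0.707)  # DistilBERT row
print(f"{lr_f1:.2f} {bert_f1:.2f}")  # 0.91 0.70
```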
| Model | N | Single-recipient F1 | Multi-recipient F1 |
|---|---|---|---|
| Logistic Regression | 11,644 | 0.909 | 0.912 |
| DistilBERT | 4,036 | 0.561 | 0.803 |
| Task | Candidate | Responsible | Confidence |
|---|---|---|---|
| Please handle this for John. | John Smith | ✓ True | 0.858 |
| Please handle this for John. | Alice Smith | False | 0.004 |
| Everyone please review the document. | Carol White | ✓ True | 0.963 |
| Can you send me the Q3 report? | Alice Smith | False | 0.037 |
This result is expected and explainable:
1. Silver label leakage — Labels were generated using heuristics (name matching, email role) identical to LR's input features. LR essentially replicates the labelling function; DistilBERT must learn from raw text alone without this shortcut.
2. Limited fine-tuning — Only 2 epochs on 5k tuples due to compute constraints. BERT-family models typically need 3-5 epochs on 50k+ examples to converge properly.
3. Coreference gap — Implicit references ("you", "we") cannot be resolved at sentence level without full thread context. LR sidesteps this via the is_to / single_to_recipient features.
4. Domain shift — DistilBERT was pretrained on Wikipedia and BookCorpus. Enron emails have very different style: informal language, noisy formatting, reply-chain clutter.
With real EPA crowd-sourced labels, proper coreference resolution, and full GPU training on 50k+ tuples, DistilBERT would be expected to significantly outperform the LR baseline.
The LR model's perfect recall is not a sign of overfitting. It reflects a structural property of the silver labels: any tuple where the candidate is the sole To recipient and the task contains "you" is always labelled positive, and LR learns this exact rule. Validation and test F1 are consistent (0.929 val → 0.910 test), so there is no significant degradation across splits. This shows that the learned rule transfers to held-out silver-labelled data; it does not show how the model would perform against human annotations.
In compliance workflows, the cost of errors is asymmetric:
- False negative (missing a responsible person) → task falls through the cracks → regulatory breach, audit failure
- False positive (over-notifying) → noise, alert fatigue, reduced trust in the system
Both error types carry a cost, so precision cannot be ignored in production even though false negatives are the more serious failure. The current LR model achieves perfect recall (1.0) at the cost of moderate precision (0.84), which is appropriate for a first-pass filter where human review follows. A production deployment would expose a confidence threshold that compliance teams can tune based on their risk appetite:
Low threshold → high recall → catch everything, more noise
High threshold → high precision → only flag high-confidence assignments
The model's AUC-ROC of 0.991 means it has near-perfect ability to rank responsible candidates above non-responsible ones — excellent for threshold-based compliance workflows.
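The threshold trade-off can be illustrated with a few lines of plain Python. The scores and labels below are toy values chosen for the example, not outputs of the project's models, and `precision_recall_at` is a hypothetical helper.

```python
from typing import List, Tuple


def precision_recall_at(scores: List[float], labels: List[int],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall when flagging every score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy confidences and ground-truth labels (illustrative only)
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data the lowest threshold catches every positive with some noise, while the highest threshold flags only the most confident case and misses the rest.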
1. Implicit pronoun assignment
Email: "Hi Alice, can you send me the report?"
Task: "can you send me the report?"
Candidate: Alice Smith
LR pred: False ← WRONG
Reason: "Alice" appears in greeting, not task sentence.
name_in_task feature misses it.
2. Delegation without explicit name
Email: "Please handle this for John."
Task: "Please handle this for John."
Candidate: Bob Jones (actual delegate)
LR pred: False ← Correct, but for wrong reasons
Reason: "John" is the beneficiary, not the responsible person.
Model correctly rejects Bob but doesn't understand delegation.
3. Group broadcast ambiguity
Email: "Team, please review by EOD."
Task: "please review by EOD"
Candidates: All To recipients
LR pred: True for all ← over-predicts
Reason: Broadcast heuristic assigns everyone;
no signal for partial responsibility.
These failure modes map directly to the paper's identified challenges: implicit addressee resolution, deictic reference, and multi-party delegation.
┌─────────────────────────────┐
│ Email System │
│ (Exchange / Gmail / SMTP) │
└─────────────┬───────────────┘
│ new emails
▼
┌─────────────────────────────┐
│ Ingestion Service │
│ incremental fetch + parse │
└─────────────┬───────────────┘
│ structured tuples
▼
┌─────────────────────────────┐
│ Inference Service │
│ FastAPI + trained model │
└──────┬──────────────┬───────┘
│ │
responsible │ │ not responsible
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Notify user │ │ Discard / │
│ + audit log │ │ log only │
└───────────────┘ └───────────────┘
- Batch inference: serve.py accepts email batches via a queue (Kafka / SQS) for high-throughput processing
- Containerisation: FastAPI + uvicorn packaged as a Docker container, deployed behind a load balancer
- Latency: LR inference <1ms per tuple; DistilBERT ~50ms CPU, ~5ms GPU
New emails arrive
→ incremental ingest (no full re-download)
→ prepare new tuples
→ run inference with trained model
→ push notifications to responsible persons
→ log predictions for audit trail
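The flow above could be wired together roughly as follows. This is a sketch under stated assumptions: `process_new_emails`, the email dict layout, and `stub_predict` (a stand-in for the trained model) are all hypothetical, and real notification and logging back ends are left out.

```python
from typing import Callable, Dict, Iterable, List, Set


def process_new_emails(emails: Iterable[Dict], seen_ids: Set[str],
                       predict: Callable[[Dict], List[Dict]]) -> List[Dict]:
    """Run inference on unseen emails only and return audit-log records."""
    audit_log: List[Dict] = []
    for email in emails:
        if email["id"] in seen_ids:
            continue  # incremental ingest: skip anything already processed
        seen_ids.add(email["id"])
        for result in predict(email):
            audit_log.append({"email_id": email["id"], **result})
    return audit_log


def stub_predict(email: Dict) -> List[Dict]:
    """Stand-in for the trained model: marks To recipients as responsible."""
    return [{"candidate": addr, "responsible": addr in email["to"]}
            for addr in email["to"] + email["cc"]]


emails = [{"id": "m1", "to": ["alice@enron.com"], "cc": ["bob@enron.com"]}]
seen: Set[str] = set()
first_run = process_new_emails(emails, seen, stub_predict)
second_run = process_new_emails(emails, seen, stub_predict)  # nothing new to process
to_notify = [r["candidate"] for r in first_run if r["responsible"]]
print(to_notify, len(second_run))
```

Every prediction is logged, responsible or not, so the audit trail covers discarded candidates as well.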
1. Collect human feedback on predictions (accept / reject)
2. Add corrected labels to training set
3. Retrain LR baseline weekly (fast, cheap, interpretable)
4. Fine-tune DistilBERT monthly on accumulated labels
5. A/B test new model before promoting to production
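Steps 1 and 2 might look like the sketch below, which turns reviewer verdicts into corrected labels. The record layout and the name `apply_feedback` are assumptions for illustration; because the task is binary, a rejected prediction is simply flipped.

```python
from typing import Dict, List, Tuple


def apply_feedback(predictions: List[Dict], feedback: Dict[str, str]) -> List[Tuple[str, int]]:
    """Convert accept / reject verdicts into (tuple_id, corrected_label) pairs."""
    corrected: List[Tuple[str, int]] = []
    for pred in predictions:
        verdict = feedback.get(pred["tuple_id"])
        if verdict is None:
            continue  # unreviewed predictions are not added to the training set
        label = pred["predicted"] if verdict == "accept" else 1 - pred["predicted"]
        corrected.append((pred["tuple_id"], label))
    return corrected


predictions = [
    {"tuple_id": "t1", "predicted": 1},
    {"tuple_id": "t2", "predicted": 1},
    {"tuple_id": "t3", "predicted": 0},
]
feedback = {"t1": "accept", "t2": "reject"}  # t3 was never reviewed
new_labels = apply_feedback(predictions, feedback)
print(new_labels)
```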
- Silver labels limit ceiling performance; real annotated labels needed for production
- No coreference resolution — implicit "you" handled by role features only
- Email threading truncated to 1000 chars; long threads may lose context
- English-only; multilingual emails not supported
DistilBERT was trained on a Google Colab T4 GPU, a deliberate infrastructure choice since transformer fine-tuning is a standard GPU workload. The 268MB model exceeds Codespace free-tier disk limits.
# Google Colab (Runtime → Change runtime type → T4 GPU)
!git clone https://github.com/SouRitra01/email-task-assignment
%cd email-task-assignment
!pip install -r requirements.txt
!python src/ingest.py --max-emails 50000
!python src/prepare.py --max-emails 15000
!python src/features.py
!python src/model.py --model bert --max-rows 20000
# Expected: ~10 minutes on T4 GPU

# 1. Install dependencies
pip install -r requirements.txt
# 2. Run the full pipeline
python src/ingest.py --max-emails 50000 # downloads ~2GB corpus
python src/prepare.py --max-emails 15000
python src/features.py
python src/model.py --model baseline # LR (~3 min)
python src/model.py --model bert # DistilBERT (GPU recommended)
# 3. Start the inference service
uvicorn src.serve:app --reload --port 8000

curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"task_sentence": "Can you please send me the Q3 report by Friday?",
"candidates": [
{"email": "alice@enron.com", "name": "Alice Smith", "role": "to"},
{"email": "bob@enron.com", "name": "Bob Jones", "role": "cc"}
],
"to_list": ["alice@enron.com"],
"cc_list": ["bob@enron.com"],
"subject": "Q3 Report"
}'

Response:
{
"task_sentence": "Can you please send me the Q3 report by Friday?",
"model_used": "logistic_regression",
"results": [
{"email": "alice@enron.com", "name": "Alice Smith", "responsible": true, "confidence": 0.923},
{"email": "bob@enron.com", "name": "Bob Jones", "responsible": false, "confidence": 0.031}
]
}

email-task-assignment/
├── .devcontainer/devcontainer.json # Codespaces auto-setup (Ubuntu + Python 3.11)
├── src/
│ ├── ingest.py # Download & parse Enron corpus
│ ├── prepare.py # Build (email, task, candidate, label) tuples
│ ├── features.py # 16 handcrafted features for LR baseline
│ ├── model.py # LR baseline + DistilBERT fine-tuning
│ └── serve.py # FastAPI inference service
├── notebooks/
│ └── walkthrough.ipynb # End-to-end walkthrough with plots & analysis
├── data/
│ ├── raw/ # Downloaded corpus (git-ignored)
│ └── processed/ # Parquet files (git-ignored)
├── models/ # Saved weights (git-ignored)
├── requirements.txt
└── README.md
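For calling the inference service from Python rather than curl, a standard-library client could look like this sketch. It assumes the request and response shapes shown in the /predict example; the helper names (`build_request`, `responsible_names`, `predict`) are hypothetical and not part of src/serve.py.

```python
import json
import urllib.request
from typing import Dict, List


def build_request(url: str, payload: Dict) -> urllib.request.Request:
    """Build a JSON POST request for the /predict endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def responsible_names(response_body: str) -> List[str]:
    """Extract the names flagged as responsible from a /predict response."""
    results = json.loads(response_body)["results"]
    return [r["name"] for r in results if r["responsible"]]


def predict(url: str, payload: Dict) -> List[str]:
    """Send the request to a running service (needs uvicorn to be up)."""
    with urllib.request.urlopen(build_request(url, payload)) as resp:
        return responsible_names(resp.read().decode("utf-8"))


# Parsing the sample response from this README, without a network call
sample_response = json.dumps({
    "task_sentence": "Can you please send me the Q3 report by Friday?",
    "model_used": "logistic_regression",
    "results": [
        {"email": "alice@enron.com", "name": "Alice Smith", "responsible": True, "confidence": 0.923},
        {"email": "bob@enron.com", "name": "Bob Jones", "responsible": False, "confidence": 0.031},
    ],
})
print(responsible_names(sample_response))
```

With the service running locally, `predict("http://localhost:8000/predict", payload)` would return the same list for the curl payload shown earlier.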
- PyTorch + HuggingFace Transformers — DistilBERT fine-tuning
- scikit-learn — logistic regression + evaluation
- pandas + numpy — data handling
- NLTK — sentence tokenisation
- FastAPI + uvicorn — inference service
- Google Colab T4 GPU — model training environment
Rameshkumar et al. (2018). Assigning people to tasks identified in email: The EPA dataset for addressee tagging for detected task intent. W-NUT @ EMNLP 2018. https://aclanthology.org/W18-6104/
| Split | Precision | Recall | F1 |
|---|---|---|---|
| Overall | 0.861 | 1.000 | 0.925 |
| Single-recipient | 0.849 | 1.000 | 0.918 |
| Multi-recipient | 0.879 | 1.000 | 0.936 |
Code implemented in src/model.py. Could not run on the
development environment (Codespace free tier, 32GB disk / 8GB RAM)
due to insufficient disk space after corpus download.
The implementation follows standard sentence-pair classification:
[CLS] task_sentence [SEP] candidate_name [SEP]
To run on a GPU machine: python src/model.py --model bert --max-rows 10000
Model weights are stored on Google Drive (too large for GitHub):
- lr_baseline.joblib: Logistic Regression baseline (F1=0.93)
- bert_best.pt: DistilBERT fine-tuned (F1=0.70)
To use: download from Drive and place in models/ directory.
| Model | F1 | Precision | Recall |
|---|---|---|---|
| Logistic Regression baseline | 0.925 | 0.861 | 1.000 |
| DistilBERT fine-tuned | 0.703 | 0.698 | 0.707 |
| Task | Candidate | Responsible | Confidence |
|---|---|---|---|
| Please handle this for John | John Smith | True | 0.858 |
| Please handle this for John | Alice Smith | False | 0.004 |
| Everyone please review the document | Carol White | True | 0.963 |
- Trained on Enron corpus with silver labels (original EPA dataset link unavailable)
- DistilBERT weights on Google Drive due to GitHub 100MB limit
- Trained on Google Colab T4 GPU