Detect and redact personally identifiable information using Hugging Face models
- Overview
- Features
- Quick Start
- Installation
- Usage
- Configuration
- Input / Output Format
- Project Structure
- Troubleshooting
- Contributing
- License
- Credits
A PII (Personally Identifiable Information) detection and redaction tool that processes text data to identify and mask sensitive information such as names, email addresses, and phone numbers. It uses the piiranha-v1 model from Hugging Face for token-level classification.
- Detect PII types: given names, surnames, emails, phone numbers
- Batch processing with configurable batch size
- Two redaction modes: aggregate (`[SENSITIVE_INFO]`) or type-specific (`[I-GIVENNAME]`)
- Selective PII cleaning: choose which PII types to redact
- Debug mode for fast iteration on subsets
- Automatic device selection (Apple Silicon MPS / CUDA / CPU)
- Long text handling via chunking
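Long text handling via chunking can be pictured as splitting a token sequence into overlapping windows so that each window fits the model's maximum input length. The sketch below is illustrative only: the function name `chunk_tokens`, the window size, and the overlap are assumptions rather than values taken from main.py, and it works on a plain list instead of real tokenizer output.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], max_len: int = 256, overlap: int = 32) -> List[List[T]]:
    """Split a token sequence into windows of at most max_len items.

    Consecutive windows share `overlap` items, so an entity no longer than
    the overlap that straddles a window boundary still appears whole in one
    window. The defaults are hypothetical; the real script may differ.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks
```

Predictions from overlapping windows would still need to be merged (for example by de-duplicating spans with identical offsets); that step is left out here.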
# 1. Clone
git clone https://github.com/unesco/pii-detector.git
cd pii-detector
# 2. Install
pip install torch transformers pandas
# 3. Place your CSV in data/
cp your_data.csv data/DATA_PII.csv
# 4. Run
python main.py
- Python 3.9+
- PyTorch (with MPS, CUDA, or CPU support)
poetry install
poetry run python main.py
pip install torch transformers pandas
python main.py
The model is downloaded automatically on first run and cached in huggingface/transformers/.
# Process data
python main.py
The script reads from data/DATA_PII.csv and writes results to data/ANSWERS_pii.csv.
Configuration is set via constants at the top of main.py:
| Parameter | Default | Description |
|---|---|---|
| `MODEL_NAME` | `iiiorg/piiranha-v1-detect-personal-information` | Hugging Face model ID |
| `DEBUG_MODE` | `True` | Process only 1000 random rows |
| `BATCH_SIZE` | `64` | Rows per batch |
| `PII_RESULTS_FULL` | `True` | Include detailed redaction in output |
| `INCLUDE_ANSWER_LABEL` | `True` | Include original text in output |
| `PII_CLEAN` | `True` | Apply PII cleaning |
| `PII_CLEAN_LIST` | `["I-GIVENNAME", "I-SURNAME", "I-EMAIL", "I-PHONE"]` | PII types to redact |
| `REMOVE_NO_PII` | `True` | Exclude rows with no PII detected |
| `COLUMN_TO_ANALYZE` | `"value"` | Column name containing text to analyze |
| `ID_COLUMN` | `"id"` | Column name for unique identifiers |
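To make the interplay of `DEBUG_MODE` and `BATCH_SIZE` concrete, here is a minimal sketch of how a row subset could be drawn and then split into batches. It is not the code from main.py: the helper names `select_rows` and `iter_batches`, the `DEBUG_SAMPLE_SIZE` constant, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

DEBUG_MODE = True
BATCH_SIZE = 64
DEBUG_SAMPLE_SIZE = 1000  # hypothetical name; the table only says "1000 random rows"


def select_rows(rows: Sequence[T], debug: bool = DEBUG_MODE, seed: int = 0) -> List[T]:
    """Return every row, or a random subset when debug mode is on."""
    if debug and len(rows) > DEBUG_SAMPLE_SIZE:
        # A seeded generator keeps debug runs repeatable (an assumption).
        return random.Random(seed).sample(list(rows), DEBUG_SAMPLE_SIZE)
    return list(rows)


def iter_batches(rows: Sequence[T], batch_size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])
```

With the defaults above, a debug run over a large file would process 1000 rows in 16 batches, the last one holding 40 rows.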
Input CSV (`data/DATA_PII.csv`):

| Column | Type | Description |
|---|---|---|
| `id` | string/int | Unique identifier |
| `value` | string | Text to analyze for PII |
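A quick way to catch a "Column not found" problem before the model loads is to check the CSV header first. The sketch below uses the standard-library `csv` module instead of pandas to stay dependency-free; the function name `read_input_rows` is hypothetical and not part of main.py.

```python
import csv
import io
from typing import Dict, List

ID_COLUMN = "id"
COLUMN_TO_ANALYZE = "value"


def read_input_rows(handle) -> List[Dict[str, str]]:
    """Read CSV rows and fail early if a required column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in (ID_COLUMN, COLUMN_TO_ANALYZE) if c not in header]
    if missing:
        raise KeyError(f"Input CSV is missing column(s): {', '.join(missing)}")
    return list(reader)


# Usage with an in-memory file; a real run would open data/DATA_PII.csv instead.
sample = io.StringIO("id,value\n1,Contact Jane at jane@example.com\n2,No personal data here\n")
rows = read_input_rows(sample)
```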
Output CSV (`data/ANSWERS_pii.csv`):

| Column | Type | Description |
|---|---|---|
| `id` | string/int | Unique identifier |
| `pii` | string | `"true"` or `"false"` |
| `pii_results` | JSON string | PII type counts (e.g., `{"I-GIVENNAME": 2}`) |
| `value` | string | Original text (if `INCLUDE_ANSWER_LABEL=True`) |
| `value_pii_clean` | string | Text with PII redacted (if `PII_CLEAN=True`) |
| `pii_results_full` | string | Detailed type-specific redaction (if `PII_RESULTS_FULL=True`) |
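The output columns can be illustrated with a small sketch that turns detected spans into one output record. Several things here are assumptions: the span format (dicts with `label`, `start`, `end` character offsets), the helper names `redact` and `build_output_row`, and the mapping of the aggregate `[SENSITIVE_INFO]` mode to `value_pii_clean`. The actual main.py may organize this differently.

```python
import json
from collections import Counter
from typing import Dict, List

PII_CLEAN_LIST = ["I-GIVENNAME", "I-SURNAME", "I-EMAIL", "I-PHONE"]


def redact(text: str, spans: List[Dict], type_specific: bool) -> str:
    """Replace each selected span with a placeholder, working right to left
    so that earlier character offsets stay valid."""
    out = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        if span["label"] not in PII_CLEAN_LIST:
            continue  # selective cleaning: leave unlisted types untouched
        tag = f"[{span['label']}]" if type_specific else "[SENSITIVE_INFO]"
        out = out[:span["start"]] + tag + out[span["end"]:]
    return out


def build_output_row(row_id, text: str, spans: List[Dict]) -> Dict[str, object]:
    """Assemble one record with the output columns (assumed span format)."""
    counts = Counter(span["label"] for span in spans)
    return {
        "id": row_id,
        "pii": "true" if spans else "false",
        "pii_results": json.dumps(dict(counts)),
        "value": text,
        "value_pii_clean": redact(text, spans, type_specific=False),
        "pii_results_full": redact(text, spans, type_specific=True),
    }


text = "Jane Doe: jane@example.com"
spans = [
    {"label": "I-GIVENNAME", "start": 0, "end": 4},
    {"label": "I-SURNAME", "start": 5, "end": 8},
    {"label": "I-EMAIL", "start": 10, "end": 26},
]
row = build_output_row(1, text, spans)
# row["value_pii_clean"]  -> "[SENSITIVE_INFO] [SENSITIVE_INFO]: [SENSITIVE_INFO]"
# row["pii_results_full"] -> "[I-GIVENNAME] [I-SURNAME]: [I-EMAIL]"
```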
pii-detector/
├── main.py # Main processing script
├── pyproject.toml # Dependencies
├── LICENSE
├── README.md
├── .gitignore
└── data/
    └── .gitkeep # Place your CSV files here
| Issue | Solution |
|---|---|
| Model download fails | Check internet connection; model is ~500MB |
| Out of memory | Reduce BATCH_SIZE or switch to CPU |
| MPS not available | Falls back to CUDA, then CPU automatically |
| Column not found | Update COLUMN_TO_ANALYZE and ID_COLUMN to match your CSV |
| No output rows | Set REMOVE_NO_PII = False to keep all rows |
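The device fallback order described in the table (MPS, then CUDA, then CPU) can be sketched as follows. The function names are hypothetical; only `torch.backends.mps.is_available()` and `torch.cuda.is_available()` are real PyTorch calls, and main.py may implement the selection differently.

```python
def choose_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the preferred device name given what the machine offers."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends and pick one."""
    import torch  # imported lazily so choose_device works without PyTorch

    return choose_device(
        mps_available=torch.backends.mps.is_available(),
        cuda_available=torch.cuda.is_available(),
    )
```

Keeping the decision in a pure function makes it easy to force the CPU path when debugging out-of-memory errors, for example by calling `choose_device(False, False)`.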
Contributions are welcome. Please read the contributing guidelines before submitting a pull request.
This project is licensed under the MIT License.
- iiiorg/piiranha-v1-detect-personal-information — PII detection model
- Hugging Face — Model hosting and Transformers library