PII Detector

Detect and redact personally identifiable information using Hugging Face models

UNESCO Data & AI | Python | License: MIT | Hugging Face

Overview

PII Detector is a detection and redaction tool for personally identifiable information (PII). It processes text data to identify and mask sensitive information such as names, email addresses, and phone numbers. It uses the piiranha-v1 model from Hugging Face (iiiorg/piiranha-v1-detect-personal-information) for token-level classification.

Features

  • Detect PII types: given names, surnames, emails, phone numbers
  • Batch processing with configurable batch size
  • Two redaction modes: aggregate ([SENSITIVE_INFO]) or type-specific ([I-GIVENNAME])
  • Selective PII cleaning: choose which PII types to redact
  • Debug mode for fast iteration on subsets
  • Automatic device selection (Apple Silicon MPS / CUDA / CPU)
  • Long text handling via chunking
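
The two redaction modes and the selective cleaning listed above can be illustrated with a minimal sketch. It assumes the model's token predictions have already been merged into character spans. The `Span` layout and the `redact` helper are hypothetical names chosen for illustration; they are not taken from main.py.

```python
from typing import List, Optional, Tuple

# Hypothetical span format: (start_char, end_char, label).
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span], aggregate: bool = True,
           clean_list: Optional[List[str]] = None) -> str:
    """Replace each detected span with a placeholder.

    aggregate=True  -> every span becomes [SENSITIVE_INFO]
    aggregate=False -> every span becomes its own label, e.g. [I-EMAIL]
    clean_list      -> if given, only spans with these labels are redacted
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        if clean_list is not None and label not in clean_list:
            continue
        placeholder = "[SENSITIVE_INFO]" if aggregate else f"[{label}]"
        text = text[:start] + placeholder + text[end:]
    return text


text = "Contact Maria Rossi at maria@example.org"
spans = [(8, 13, "I-GIVENNAME"), (14, 19, "I-SURNAME"), (23, 40, "I-EMAIL")]

print(redact(text, spans))
# Contact [SENSITIVE_INFO] [SENSITIVE_INFO] at [SENSITIVE_INFO]
print(redact(text, spans, aggregate=False))
# Contact [I-GIVENNAME] [I-SURNAME] at [I-EMAIL]
print(redact(text, spans, aggregate=False, clean_list=["I-EMAIL"]))
# Contact Maria Rossi at [I-EMAIL]
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution, which keeps the sketch short.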

Quick Start

# 1. Clone
git clone https://github.com/unesco/pii-detector.git
cd pii-detector

# 2. Install
pip install torch transformers pandas

# 3. Place your CSV in data/
cp your_data.csv data/DATA_PII.csv

# 4. Run
python main.py

Installation

Prerequisites

  • Python 3.9+
  • PyTorch (with MPS, CUDA, or CPU support)

With Poetry

poetry install
poetry run python main.py

With pip

pip install torch transformers pandas
python main.py

The model is downloaded automatically on first run and cached in huggingface/transformers/.

Usage

# Process data
python main.py

The script reads from data/DATA_PII.csv and writes results to data/ANSWERS_pii.csv.

Configuration

Configuration is set via constants at the top of main.py:

| Parameter | Default | Description |
| --- | --- | --- |
| MODEL_NAME | iiiorg/piiranha-v1-detect-personal-information | Hugging Face model ID |
| DEBUG_MODE | True | Process only 1000 random rows |
| BATCH_SIZE | 64 | Rows per batch |
| PII_RESULTS_FULL | True | Include detailed redaction in output |
| INCLUDE_ANSWER_LABEL | True | Include original text in output |
| PII_CLEAN | True | Apply PII cleaning |
| PII_CLEAN_LIST | ["I-GIVENNAME", "I-SURNAME", "I-EMAIL", "I-PHONE"] | PII types to redact |
| REMOVE_NO_PII | True | Exclude rows with no PII detected |
| COLUMN_TO_ANALYZE | "value" | Column name containing text to analyze |
| ID_COLUMN | "id" | Column name for unique identifiers |
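
As a rough sketch, the table above corresponds to a block of module-level constants like the one below. The names and values are the documented defaults; the exact layout in main.py may differ. The `iter_batches` helper is a hypothetical illustration of how `BATCH_SIZE` splits rows into batches, not the script's own code.

```python
from typing import Iterator, List

# Documented defaults from the configuration table.
MODEL_NAME = "iiiorg/piiranha-v1-detect-personal-information"
DEBUG_MODE = True             # process only 1000 random rows
BATCH_SIZE = 64               # rows per batch
PII_RESULTS_FULL = True       # include detailed redaction in output
INCLUDE_ANSWER_LABEL = True   # include original text in output
PII_CLEAN = True              # apply PII cleaning
PII_CLEAN_LIST = ["I-GIVENNAME", "I-SURNAME", "I-EMAIL", "I-PHONE"]
REMOVE_NO_PII = True          # exclude rows with no PII detected
COLUMN_TO_ANALYZE = "value"   # column containing text to analyze
ID_COLUMN = "id"              # column with unique identifiers


def iter_batches(rows: List[str], batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size rows (hypothetical helper)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [f"row {i}" for i in range(150)]
print([len(batch) for batch in iter_batches(rows)])  # [64, 64, 22]
```

When running out of memory, lowering `BATCH_SIZE` reduces the number of rows held in each slice.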

Input / Output Format

Input CSV (data/DATA_PII.csv)

| Column | Type | Description |
| --- | --- | --- |
| id | string/int | Unique identifier |
| value | string | Text to analyze for PII |
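
To check that a file matches this layout before running the script, a small sketch using only the standard library might look like the following. The helper names `write_rows` and `missing_columns` are hypothetical, and the sample rows are invented for illustration. An in-memory buffer stands in for data/DATA_PII.csv; with a real file, open it with `newline=""` as the csv module recommends.

```python
import csv
import io
from typing import Dict, Iterable, List, Sequence, TextIO

# Defaults of ID_COLUMN and COLUMN_TO_ANALYZE.
REQUIRED_COLUMNS = ("id", "value")


def write_rows(handle: TextIO, rows: Iterable[Dict[str, object]]) -> None:
    """Write rows with the expected header to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=list(REQUIRED_COLUMNS))
    writer.writeheader()
    writer.writerows(rows)


def missing_columns(handle: TextIO,
                    required: Sequence[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]


buffer = io.StringIO()
write_rows(buffer, [
    {"id": 1, "value": "My name is Maria Rossi."},
    {"id": 2, "value": "The meeting starts at noon."},
])
buffer.seek(0)
print(missing_columns(buffer))  # []
```

If your CSV uses different column names, the list returned by `missing_columns` shows which of `COLUMN_TO_ANALYZE` and `ID_COLUMN` need to be updated.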

Output CSV (data/ANSWERS_pii.csv)

| Column | Type | Description |
| --- | --- | --- |
| id | string/int | Unique identifier |
| pii | string | "true" or "false" |
| pii_results | JSON string | PII type counts (e.g., {"I-GIVENNAME": 2}) |
| value | string | Original text (if INCLUDE_ANSWER_LABEL=True) |
| value_pii_clean | string | Text with PII redacted (if PII_CLEAN=True) |
| pii_results_full | string | Detailed type-specific redaction (if PII_RESULTS_FULL=True) |
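
The `pii` and `pii_results` columns can be pictured with the sketch below, which counts detected labels and serializes the counts as JSON. The `build_output_row` helper is a hypothetical name, and representing a row without detections as an empty JSON object is an assumption rather than documented behaviour.

```python
import json
from collections import Counter
from typing import Dict, List


def build_output_row(row_id: object, labels: List[str]) -> Dict[str, object]:
    """Summarize the labels detected in one row (hypothetical helper)."""
    counts = Counter(labels)
    return {
        "id": row_id,
        "pii": "true" if counts else "false",
        # sort_keys makes the JSON string stable across runs.
        "pii_results": json.dumps(dict(counts), sort_keys=True),
    }


rows = [
    build_output_row(1, ["I-GIVENNAME", "I-GIVENNAME", "I-EMAIL"]),
    build_output_row(2, []),
]
print(rows[0]["pii_results"])  # {"I-EMAIL": 1, "I-GIVENNAME": 2}

# With REMOVE_NO_PII = True, rows without detections would be dropped.
kept = [row for row in rows if row["pii"] == "true"]
print([row["id"] for row in kept])  # [1]
```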

Project Structure

pii-detector/
├── main.py            # Main processing script
├── pyproject.toml     # Dependencies
├── LICENSE
├── README.md
├── .gitignore
└── data/
    └── .gitkeep       # Place your CSV files here

Troubleshooting

| Issue | Solution |
| --- | --- |
| Model download fails | Check internet connection; model is ~500MB |
| Out of memory | Reduce BATCH_SIZE or switch to CPU |
| MPS not available | Falls back to CUDA, then CPU automatically |
| Column not found | Update COLUMN_TO_ANALYZE and ID_COLUMN to match your CSV |
| No output rows | Set REMOVE_NO_PII = False to keep all rows |
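
The MPS, CUDA, CPU fallback order described above could be sketched as follows. The `pick_device` function is a hypothetical illustration, not the code in main.py. The sketch degrades to "cpu" if PyTorch is not installed, so it can be run on any machine to see which device would be chosen.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Prefer Apple Silicon MPS, then CUDA, then CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


try:
    import torch

    # torch.backends.mps is missing in older PyTorch releases, so guard the lookup.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    device = pick_device(mps_ok, torch.cuda.is_available())
except ImportError:
    # Without PyTorch there is nothing to accelerate; report CPU.
    device = pick_device(False, False)

print(f"Selected device: {device}")
```

Forcing `device = "cpu"` is one way to apply the "switch to CPU" advice for out-of-memory errors, at the cost of slower processing.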

Contributing

Contributions are welcome. Please read the contributing guidelines before submitting a pull request.

License

This project is licensed under the MIT License.
