PII Detector

Detect and redact personally identifiable information using Hugging Face models

Overview

PII Detector is a tool for detecting and redacting personally identifiable information (PII). It processes text data to identify and mask sensitive information such as names, email addresses, and phone numbers. It uses the piiranha-v1 model from Hugging Face for token-level classification.

Features

Detect PII types: given names, surnames, emails, phone numbers
Batch processing with configurable batch size
Two redaction modes: aggregate ([SENSITIVE_INFO]) or type-specific ([I-GIVENNAME])
Selective PII cleaning — choose which PII types to redact
Debug mode for fast iteration on subsets
Automatic device selection (Apple Silicon MPS / CUDA / CPU)
Long text handling via chunking
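
Long text handling via chunking could work along the following lines. This is a minimal sketch, not the code in main.py: the function name `chunk_text`, the character-based window, and the `max_chars` and `overlap` values are all assumptions made for illustration, and the actual script may split by tokens rather than characters.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into overlapping character windows.

    Returns (offset, chunk) pairs so that entity positions found inside a
    chunk can be mapped back to the original text by adding the offset.
    max_chars and overlap are illustrative values, not project defaults.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append((start, text[start:start + max_chars]))
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks
```

The overlap is there so that a name or email address cut by one window boundary still appears whole in the neighbouring window.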

Quick Start

# 1. Clone
git clone https://github.com/unesco/pii-detector.git
cd pii-detector

# 2. Install
pip install torch transformers pandas

# 3. Place your CSV in data/
cp your_data.csv data/DATA_PII.csv

# 4. Run
python main.py

Installation

Prerequisites

Python 3.9+
PyTorch (with MPS, CUDA, or CPU support)
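
The automatic device selection listed under Features can be sketched as follows. The preference order (MPS, then CUDA, then CPU) is taken from the Troubleshooting table; the function names `pick_device` and `detect_device` are hypothetical and are not taken from main.py.

```python
def pick_device(mps_available, cuda_available):
    """Return a device string, preferring MPS, then CUDA, then CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device():
    """Ask PyTorch which backends are usable; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(mps_ok, torch.cuda.is_available())


print(detect_device())
```

Keeping the decision in a function that takes plain booleans makes the fallback order easy to check without the matching hardware.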

With Poetry

poetry install
poetry run python main.py

With pip

pip install torch transformers pandas
python main.py

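The model is downloaded automatically on first run and stored in the local Hugging Face cache (huggingface/transformers/), so later runs reuse the cached copy instead of downloading it again.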

Usage

# Process data
python main.py

The script reads from data/DATA_PII.csv and writes results to data/ANSWERS_pii.csv.
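
The read-then-process-in-batches flow can be illustrated with a small sketch. It uses the standard-library `csv` module instead of pandas so that it runs without extra dependencies, and the helper name `iter_batches` is hypothetical. An in-memory string stands in for data/DATA_PII.csv.

```python
import csv
import io


def iter_batches(rows, batch_size=64):
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


# Stand-in for data/DATA_PII.csv so the example runs without a file on disk.
sample_csv = io.StringIO("id,value\n1,Hello Ana\n2,No names here\n3,Call 555-0100\n")
reader = csv.DictReader(sample_csv)

for batch in iter_batches(reader, batch_size=2):
    texts = [row["value"] for row in batch]
    print(len(texts), texts)  # each batch would be passed to the model here
```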

Configuration

Configuration is set via constants at the top of main.py:

| Parameter | Default | Description |
| --- | --- | --- |
| `MODEL_NAME` | `iiiorg/piiranha-v1-detect-personal-information` | Hugging Face model ID |
| `DEBUG_MODE` | `True` | Process only 1000 random rows |
| `BATCH_SIZE` | `64` | Rows per batch |
| `PII_RESULTS_FULL` | `True` | Include detailed redaction in output |
| `INCLUDE_ANSWER_LABEL` | `True` | Include original text in output |
| `PII_CLEAN` | `True` | Apply PII cleaning |
| `PII_CLEAN_LIST` | `["I-GIVENNAME", "I-SURNAME", "I-EMAIL", "I-PHONE"]` | PII types to redact |
| `REMOVE_NO_PII` | `True` | Exclude rows with no PII detected |
| `COLUMN_TO_ANALYZE` | `"value"` | Column name containing text to analyze |
| `ID_COLUMN` | `"id"` | Column name for unique identifiers |

Input / Output Format

Input CSV (`data/DATA_PII.csv`)

| Column | Type | Description |
| --- | --- | --- |
| `id` | string/int | Unique identifier |
| `value` | string | Text to analyze for PII |

Output CSV (`data/ANSWERS_pii.csv`)

| Column | Type | Description |
| --- | --- | --- |
| `id` | string/int | Unique identifier |
| `pii` | string | `"true"` or `"false"` |
| `pii_results` | JSON string | PII type counts (e.g., `{"I-GIVENNAME": 2}`) |
| `value` | string | Original text (if `INCLUDE_ANSWER_LABEL=True`) |
| `value_pii_clean` | string | Text with PII redacted (if `PII_CLEAN=True`) |
| `pii_results_full` | string | Detailed type-specific redaction (if `PII_RESULTS_FULL=True`) |
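
To make the output columns concrete, here is a minimal sketch of how one record might be assembled from detected entities. It assumes entities arrive as `(start, end, label)` character spans that do not overlap; the helper names `redact` and `build_row` are hypothetical, and whether `pii_results_full` honours `PII_CLEAN_LIST` in the real script is also an assumption.

```python
import json
from collections import Counter

PII_CLEAN_LIST = ["I-GIVENNAME", "I-SURNAME", "I-EMAIL", "I-PHONE"]


def redact(text, entities, clean_list=PII_CLEAN_LIST, type_specific=False):
    """Replace each selected entity span with a placeholder.

    entities: list of (start, end, label) tuples with character offsets.
    Spans are assumed not to overlap.
    """
    selected = [e for e in entities if e[2] in clean_list]
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(selected, reverse=True):
        placeholder = f"[{label}]" if type_specific else "[SENSITIVE_INFO]"
        text = text[:start] + placeholder + text[end:]
    return text


def build_row(row_id, text, entities):
    """Assemble one output record using the column names from the table above."""
    counts = Counter(label for _, _, label in entities)
    return {
        "id": row_id,
        "pii": "true" if entities else "false",
        "pii_results": json.dumps(dict(counts)),
        "value": text,
        "value_pii_clean": redact(text, entities),
        "pii_results_full": redact(text, entities, type_specific=True),
    }


text = "Contact Ana Silva at ana@example.org"
entities = [(8, 11, "I-GIVENNAME"), (12, 17, "I-SURNAME"), (21, 36, "I-EMAIL")]
row = build_row(7, text, entities)
print(row["value_pii_clean"])
print(row["pii_results_full"])
```

Passing a shorter `clean_list`, for example `["I-EMAIL"]`, leaves names in place and masks only email addresses, which is the selective cleaning described under Features.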

Project Structure

pii-detector/
├── main.py            # Main processing script
├── pyproject.toml     # Dependencies
├── LICENSE
├── README.md
├── .gitignore
└── data/
    └── .gitkeep       # Place your CSV files here

Troubleshooting

| Issue | Solution |
| --- | --- |
| Model download fails | Check internet connection; model is ~500MB |
| Out of memory | Reduce `BATCH_SIZE` or switch to CPU |
| MPS not available | Falls back to CUDA, then CPU automatically |
| Column not found | Update `COLUMN_TO_ANALYZE` and `ID_COLUMN` to match your CSV |
| No output rows | Set `REMOVE_NO_PII = False` to keep all rows |
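
For the "Column not found" case, a quick check of the CSV header before running the script can show which names to change. The sketch below is a hypothetical helper, not part of main.py; `missing_columns` is an illustrative name and the in-memory CSV stands in for your own file.

```python
import csv
import io

COLUMN_TO_ANALYZE = "value"
ID_COLUMN = "id"


def missing_columns(header, required=(ID_COLUMN, COLUMN_TO_ANALYZE)):
    """Return the required column names that are absent from the CSV header."""
    present = set(header or [])
    return [name for name in required if name not in present]


# Stand-in for a CSV whose columns do not match the configured names.
sample = io.StringIO("record_id,text\n1,Hello\n")
reader = csv.DictReader(sample)
problems = missing_columns(reader.fieldnames)
if problems:
    print(f"Missing columns: {problems}; found: {reader.fieldnames}")
```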

Contributing

Contributions are welcome. Please read the contributing guidelines before submitting a pull request.

License

This project is licensed under the MIT License.

Credits

iiiorg/piiranha-v1-detect-personal-information — PII detection model
Hugging Face — Model hosting and Transformers library
