This project provides a complete Jupyter notebook for fine-tuning Google's FLAN-T5-small model for anonymizing Personally Identifiable Information (PII) in Italian text.
- Model: FLAN-T5-small (80M parameters)
- Task: PII Anonymization
- Language: Italian
- Approach: Text-to-text format (original text → anonymized text with placeholders)
- ✅ Complete training pipeline
- ✅ Synthetic Italian PII dataset with 30+ examples
- ✅ Text-to-text anonymization format
- ✅ Training notebook
- ✅ Inference examples in notebook
- ✅ ONNX Conversion Notebook and Inference Test
- ✅ Example Kubeflow Pipeline for training
- ✅ GPU support (automatically detected)
The model learns to replace the following types of PII with generic placeholders:
- [NOME]: Person names (e.g., Mario Rossi → [NOME])
- [INDIRIZZO]: Street addresses (e.g., Via Roma 25 → [INDIRIZZO])
- [TELEFONO]: Phone numbers (e.g., 339-1234567 → [TELEFONO])
- [EMAIL]: Email addresses (e.g., mario@email.it → [EMAIL])
- [CARTA_CREDITO]: Credit card numbers (e.g., 4532-1234-5678-9010 → [CARTA_CREDITO])
- [CODICE_FISCALE]: Italian fiscal codes/tax IDs (e.g., RSSMRA85M01H501X → [CODICE_FISCALE])
- [DATA_NASCITA]: Dates of birth (e.g., 15/03/1985 → [DATA_NASCITA])
- Clone or download this repository
- Install dependencies:
uv init && uv syncTraining dataset is really (really) small and has been generated using AI. Add more datapoints as needed.
- Minimum: CPU with 8GB RAM (slower training)
- Recommended: GPU with 8GB+ VRAM (CUDA-enabled)
anonymize: [Italian text with PII]
[Same text with PII replaced by placeholders]
Input:
anonymize: Mi chiamo Mario Rossi, il mio numero è 339-1234567 e abito in Via Roma 25.
Output:
Mi chiamo [NOME], il mio numero è [TELEFONO] e abito in [INDIRIZZO].
Input:
anonymize: Per contattare Giulia Rossi chiamare il 339-8765432 o scrivere a giulia.rossi@email.it
Output:
Per contattare [NOME] chiamare il [TELEFONO] o scrivere a [EMAIL]
Input:
anonymize: Il paziente Marco Esposito, nato il 25/08/1982, codice fiscale SPSMRC82M25H501Z.
Output:
Il paziente [NOME], nato il [DATA_NASCITA], codice fiscale [CODICE_FISCALE].
After training, load and use the model:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the model
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(OUTPUT_DIR)
# Anonymize text
text = "Il signor Paolo Conti abita in Via Garibaldi 45."
input_text = f"anonymize: {text}"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Output: Il signor [NOME] abita in [INDIRIZZO].The notebook uses synthetic data. To improve performance, you can:
- Add more examples to the
dataset/ita_dataset.jsonlist in the notebook - Include more variations of PII patterns
- Add domain-specific examples (medical, legal, financial)
- Use real (properly consented) data if available
For better performance, use a larger model:
model_name = "google/flan-t5-base" # 250M parameters
# or
model_name = "google/flan-t5-large" # 780M parametersNote: Larger models require more memory and training time.
The notebooks and pipelines use secrets to connect to S3 storage and Hugging Face Repositories.
You need to create these secrets in the Kubernetes Project you run all experiments:
- An Hugging Face Token Secret:
$ oc create secret generic huggingface-secret --from-literal=HF_TOKEN=hf_api_token --from-literal=HF_HOME=hf_home_path- A secret holding the API Token to interface with Weights and Biases (optional)
$ oc create secret generic wandb-secret --from-literal=WB_APITOKEN=wandb-token --from-literal=WB_PROJECTNAME=project-nameAdditionally, on OCP AI you need to create connections for interacting with S3 Storage Buckets. For example, for the s3-artifacts connection you need to deploy a manifest like this:
kind: Secret
apiVersion: v1
metadata:
name: s3-artifacts
namespace: flan-t5-finetune
labels:
opendatahub.io/dashboard: 'true'
opendatahub.io/managed: 'true'
annotations:
opendatahub.io/connection-type: s3
opendatahub.io/connection-type-ref: s3
openshift.io/description: ''
openshift.io/display-name: s3-artifacts
data:
AWS_ACCESS_KEY_ID: <ACCESS_KEY_BASE64>
AWS_DEFAULT_REGION: <REGION_BASE64>
AWS_S3_BUCKET: <BUCKET_NAME_BASE64>
AWS_S3_ENDPOINT: <S3_API_ENDPOINT_BASE64>
AWS_SECRET_ACCESS_KEY: <SECRET_KEY_BASE64>
type: OpaqueThe required secrets (and corresponding buckets on S3) are:
s3-artifacts: for storing artifacts during training/experimentings3-models: for storing model checkpointss3-pipelines: for storing data used by kubeflows3-datasets: for storing datasets
- If you are using a proxy to connect to external dependencies (s3 buckets, huggingface, etc):
$ oc create configmap network-settings --from-literal=HTTP_PROXY="proxy address" --from-literal=HTTPS_PROXY="proxy address" --from-literal=NO_PROXY="localhost,127.0.0.1,.svc.cluster.local,kubernetes.default.svc,metadata-grpc-service,0,1,2,3,4,5,6,7,8,9"