PresidioRedactionProcessor

The Presidio Redaction Processor is an OpenTelemetry processor that detects and removes Personally Identifiable Information (PII) from OpenTelemetry logs and traces. Using it requires some familiarity with both Presidio and OpenTelemetry.

Architectural Overview (Read this first)
Processor Configuration
Deployment
Adding custom recognizers
Performance Benchmarks

Architectural Overview:

The Presidio Redaction processor relies on the capabilities built into Microsoft Presidio, an open source tool for identification and anonymization of PII data in text.

The Processor has been built with flexibility in mind, so it supports two deployment options.

Option 1: External Mode

Deploy the Collector with the Presidio Redaction Processor, then deploy separate Presidio containers into your environment. In this configuration, the Processor communicates with the Presidio containers over HTTP to analyze and anonymize instances of PII in your logs and traces.

Pros/Cons of this approach:

Pros	Cons
The Presidio containers being deployed are maintained and updated regularly by the Open Source maintainers of Presidio	As the containers are pre-built, there is limited flexibility for adding/altering custom recognizers
	Depending on your environment, communicating with the Presidio Containers via HTTP can add significant overhead
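
To make the HTTP overhead of external mode concrete, the following is a minimal Python sketch of the two round trips involved, written against Presidio's public REST endpoints (`/analyze` and `/anonymize`). It is an illustration only and not the Processor's own code: the localhost URLs assume the port mappings used later in this README, and the helper names (`build_analyze_request`, `build_anonymize_request`, `post_json`, `redact`) are hypothetical.

```python
import json
import urllib.request

# Assumed local endpoints, matching the docker run port mappings in this README.
ANALYZER_URL = "http://localhost:5002/analyze"
ANONYMIZER_URL = "http://localhost:5001/anonymize"


def build_analyze_request(text, language="en", score_threshold=0.5):
    """Body for POST /analyze on the Presidio analyzer service."""
    return {"text": text, "language": language, "score_threshold": score_threshold}


def build_anonymize_request(text, analyzer_results):
    """Body for POST /anonymize that hashes every detected entity with SHA-256."""
    return {
        "text": text,
        "analyzer_results": analyzer_results,
        "anonymizers": {"DEFAULT": {"type": "hash", "hash_type": "sha256"}},
    }


def post_json(url, payload, timeout=5.0):
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def redact(text, post=post_json):
    """Two round trips: analyze first, then anonymize using the findings."""
    findings = post(ANALYZER_URL, build_analyze_request(text))
    if not findings:
        return text  # nothing detected, so the second call is skipped
    result = post(ANONYMIZER_URL, build_anonymize_request(text, findings))
    return result["text"]
```

Every string that contains PII costs two HTTP requests in this model, which is why the volume of telemetry sent through the processor matters in external mode.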

Option 2: Embedded Mode

This repository contains an implementation of Presidio with a gRPC wrapper around it. This allows you to run a working instance of Presidio inside the OpenTelemetry Collector container, eliminating the need to deploy additional Presidio containers and make additional HTTP calls.

Pros/Cons of this approach:

Pros	Cons
No requirement for external HTTP calls to other containers	As this relies on a custom implementation of Presidio, there are no maintenance guarantees. The implementation is provided as-is, and you must maintain it yourself.
PII Data doesn't leave the OpenTelemetry Collector container
The custom implementation of Presidio allows you to add additional recognizers to suit your PII detection requirements.

Processor Configuration

Please refer to the schema.yaml or the config.go for all configuration options.

Example configuration:

processors:
  presidio_redaction:
    # Specifies whether Presidio is deployed externally to the collector or internally.
    # Refer to the architecture section of this README for more info.
    mode: "embedded"
    # Sets the behaviour of the processor if it encounters an error
    error_mode: "propagate"
    analyzer:
      language: "en"
      score_threshold: 0.5
    anonymizer:
      # Defines how PII gets anonymized when detected
      anonymizers:
        - entity: "default"
          type: "HASH"
          hash_type: "sha256"
    # Uses OTTL (OpenTelemetry Transformation Language) conditions to select
    # which telemetry is processed. It is recommended to set these conditions
    # and to tag the logs/traces you believe contain PII with attributes, so
    # that the processor does not add unnecessary overhead to the pipeline.
    process_trace_if:
      - 'attributes["contains_pii"] == true'
    process_log_if:
      - 'resource.attributes["service.name"] == "sample-service" and severity_text == "INFO"'

For the condition syntax, refer to the OpenTelemetry Transformation Language (OTTL) documentation.
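
To illustrate what the `score_threshold` and `HASH` settings in the example configuration imply, here is a rough Python model of the two steps: findings below the threshold are dropped, and each remaining span is replaced by the SHA-256 hex digest of its text. This is a sketch under stated assumptions and not Presidio's implementation: the function names and sample findings are hypothetical, and the digest here is unsalted, so the exact output of a given Presidio version may differ.

```python
import hashlib


def apply_threshold(findings, score_threshold=0.5):
    """Keep only findings whose confidence score meets the threshold."""
    return [f for f in findings if f["score"] >= score_threshold]


def hash_spans(text, findings):
    """Replace each detected span with the SHA-256 hex digest of its text."""
    # Work from the end of the string so earlier offsets stay valid.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        original = text[f["start"]:f["end"]]
        digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
        text = text[:f["start"]] + digest + text[f["end"]:]
    return text


# Hypothetical analyzer output for the message below.
message = "mail bob@example.com now"
findings = [
    {"entity_type": "EMAIL_ADDRESS", "start": 5, "end": 20, "score": 1.0},
    {"entity_type": "PERSON", "start": 0, "end": 4, "score": 0.3},  # weak match
]

kept = apply_threshold(findings, score_threshold=0.5)
redacted = hash_spans(message, kept)
print(redacted)
```

With a threshold of 0.5 the low-confidence `PERSON` match is ignored and only the email address is hashed. Raising the threshold reduces false positives at the cost of missing weaker matches.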

Deploying the Processor into your environment:

The Presidio Processor is intended to be run inside an OpenTelemetry Collector.

If you already have a custom OpenTelemetry Collector deployed into your environment:

Add the gomod reference to your builder-config:

- gomod: github.com/RKapadia01/presidioredactionprocessor/presidioredactionprocessor v0.1.0

Populate the config.yaml with the relevant configuration. Refer to the schema.yaml or the config.go for all configuration options.

Deploying a pre-built collector with the Presidio Redaction Processor:

This repository contains a set of Dockerfiles which provide pre-built and pre-configured OpenTelemetry Collectors that contain the Presidio Redaction Processor.

Firstly, please refer to the Architectural Overview to understand the deployment options (embedded vs. external).

If running Presidio in embedded mode, build the image from the root of the repository:

docker build . -f CollectorWithPresidio.Dockerfile

If running Presidio in external mode, build the collector-only image:

docker build . -f CollectorOnly.Dockerfile

Then pull and run the Presidio services from MCR:

docker run --rm -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
docker run --rm -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
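
Before sending telemetry through the collector in external mode, it can help to confirm that both services are reachable. The following Python sketch polls the `/health` endpoint that the Presidio analyzer and anonymizer services expose. The URLs assume the port mappings above, and the function names, retry count, and delay are illustrative choices, not part of this repository.

```python
import time
import urllib.error
import urllib.request

# Assumed endpoints, based on the docker run port mappings above.
SERVICES = {
    "analyzer": "http://localhost:5002/health",
    "anonymizer": "http://localhost:5001/health",
}


def is_up(url, timeout=2.0):
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(services=SERVICES, probe=is_up, attempts=30, delay=1.0):
    """Poll every service until all respond, or give up after `attempts` rounds."""
    pending = dict(services)
    for _ in range(attempts):
        # Keep only the services that still fail the probe.
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready()` after the two `docker run` commands returns `True` once both containers respond, or `False` if either is still unreachable after the configured number of attempts.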

Adding Custom Recognizers:

In order to identify additional types of PII that are not supported by Presidio out of the box, you need to develop and add a custom recognizer. Please refer to Presidio's documentation for information on how to customize the Presidio Analyzer: https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/

The Presidio Redaction Processor supports custom recognizers only when running in embedded mode (refer to the Architectural Overview for more info). Once you have developed a custom recognizer, open the server.py file and add the recognizer to the registry:

analyzer_registry.add_recognizer(CustomRecognizer())

Rebuild the image to incorporate the new recognizer into your OpenTelemetry Collector:

docker build . -f CollectorWithPresidio.Dockerfile
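
As an illustration of the kind of recognizer that could be registered this way, the sketch below defines a regex-based recognizer with Presidio's `Pattern` and `PatternRecognizer` classes. The entity name `EMPLOYEE_ID`, the `EMP-` format, the score of 0.6, and the function name are all hypothetical. The Presidio import is placed inside the function so the pattern itself can be tried without the `presidio-analyzer` package installed.

```python
import re

# Hypothetical entity: employee IDs such as "EMP-004217".
EMPLOYEE_ID_REGEX = r"\bEMP-\d{6}\b"


def build_employee_id_recognizer():
    """Build a regex-based recognizer; requires the presidio-analyzer package."""
    from presidio_analyzer import Pattern, PatternRecognizer

    # A score of 0.6 keeps matches above the 0.5 threshold used in the
    # example configuration earlier in this README.
    pattern = Pattern(name="employee_id", regex=EMPLOYEE_ID_REGEX, score=0.6)
    return PatternRecognizer(supported_entity="EMPLOYEE_ID", patterns=[pattern])


# In server.py this would be registered in the same way as CustomRecognizer:
#   analyzer_registry.add_recognizer(build_employee_id_recognizer())

# The pattern can be checked on its own before rebuilding the image.
sample = "ticket opened by EMP-004217 on Monday"
matches = [m.group(0) for m in re.finditer(EMPLOYEE_ID_REGEX, sample)]
print(matches)
```

Checking the regular expression in isolation first is a cheap way to catch mistakes, since each change to a recognizer otherwise requires a full image rebuild.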

Performance Benchmarks

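Performance testing of the Presidio Processor has revealed minimal latency impact in both external and embedded modes. In the two runs below (1,000 requests from 32 concurrent clients each), mean latency was 248.8 ms in external mode and 242.7 ms in embedded mode, and 99th-percentile latency was lower in embedded mode (285 ms versus 352 ms).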

Collector -> HTTP * 2 -> Presidio

Target URL:          http://localhost:4318/v1/traces
Max requests:        1000
Concurrent clients:  32
Running on cores:    16
Agent:               none

Completed requests:  1000
Total errors:        0
Total time:          7.905 s
Mean latency:        248.8 ms
Effective rps:       127

Percentage of requests served within a certain time
  50%      240 ms
  90%      321 ms
  95%      337 ms
  99%      352 ms
 100%      354 ms (longest request)

Collector -> gRPC -> Presidio

Target URL:          http://localhost:4318/v1/traces
Max requests:        1000
Concurrent clients:  32
Running on cores:    16
Agent:               none

Completed requests:  1000
Total errors:        0
Total time:          7.736 s
Mean latency:        242.7 ms
Effective rps:       129

Percentage of requests served within a certain time
  50%      243 ms
  90%      266 ms
  95%      271 ms
  99%      285 ms
 100%      295 ms (longest request)
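
The figures in the two runs above can be cross-checked with a little arithmetic. The Python sketch below recomputes the effective requests per second from the completed requests and total time, compares it with the throughput implied by Little's law (concurrency divided by mean latency), and derives the relative reduction in 99th-percentile latency. Only the numbers come from the benchmark output; the variable and function names are illustrative.

```python
# Values copied from the two benchmark runs above.
runs = {
    "external (HTTP * 2)": {"requests": 1000, "total_s": 7.905, "mean_ms": 248.8, "p99_ms": 352},
    "embedded (gRPC)": {"requests": 1000, "total_s": 7.736, "mean_ms": 242.7, "p99_ms": 285},
}
CONCURRENCY = 32  # "Concurrent clients" in both runs


def effective_rps(run):
    """Completed requests divided by wall-clock time."""
    return run["requests"] / run["total_s"]


def littles_law_rps(run, concurrency=CONCURRENCY):
    """Throughput implied by Little's law: concurrency / mean latency."""
    return concurrency / (run["mean_ms"] / 1000.0)


for name, run in runs.items():
    print(f"{name}: {effective_rps(run):.1f} rps measured, "
          f"{littles_law_rps(run):.1f} rps implied by mean latency")

external = runs["external (HTTP * 2)"]
embedded = runs["embedded (gRPC)"]
p99_reduction = (external["p99_ms"] - embedded["p99_ms"]) / external["p99_ms"]
print(f"p99 latency reduction in embedded mode: {p99_reduction:.1%}")
```

The recomputed throughput rounds to the reported 127 and 129 rps, and the two runs differ mainly in the tail: the 99th-percentile latency is roughly 19% lower in embedded mode, while the mean differs by only about 6 ms.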
