An AI-powered document redaction system for the Tucson Police Department that automatically identifies and applies redactions to scanned police documents based on configurable guidelines. The system uses Llama 4 for semantic analysis to determine what must be redacted.
Customers are responsible for making their own independent assessment of the information in this document.
This document:
(a) is for informational purposes only,
(b) references AWS product offerings and practices, which are subject to change without notice,
(c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers, and
(d) is not to be considered a recommendation or viewpoint of AWS.
Additionally, you are solely responsible for testing, security and optimizing all code and assets on GitHub repo, and all such code and assets should be considered:
(a) as-is and without warranties or representations of any kind,
(b) not suitable for production environments, or on production or other critical data, and
(c) to include shortcuts in order to support rapid prototyping such as, but not limited to, relaxed authentication and authorization and a lack of strict adherence to security best practices.
All work produced is open source. More information can be found in the GitHub repo.
- Features
- Architecture Overview
- How It Works
- Project Structure
- Backend Integration
- Deployment
- API Reference
- Guidelines Management
- User Guide
- Stack Management
- OCR Text Extraction: Tesseract-based OCR extracts all text with precise bounding boxes from scanned documents
- AI-Powered Semantic Analysis: Amazon Nova Pro identifies what must be redacted based on configurable guideline rules
- Cross-Page Consistency: Entity index tracks names, identifiers, and PII across all pages to ensure consistent redaction
- Parallel Processing: Concurrent Nova API calls for faster processing of multi-page documents
- Human Review Workflow: AI proposes redactions, reviewers approve or edit before final application
- Permanent Redaction: PyMuPDF applies redactions that permanently destroy underlying content — not visual overlays
- PDF-to-JSON Conversion: Upload redaction guideline PDFs and automatically convert them to structured rule sets
- Rule-Based Configuration: Each rule specifies what to redact, who it applies to, conditions, and exceptions
- Version Management: Activate, deactivate, and update guidelines independently of case processing
- Full Lifecycle Tracking: Cases progress through defined statuses from upload to completion
- S3-Based Storage: All documents (unredacted, proposals, edited, redacted) stored securely in S3
- DynamoDB Metadata: Case status, processing metadata, and error tracking in DynamoDB
- Presigned URL Access: Secure document upload and download via presigned S3 URLs
The application uses a serverless architecture on AWS:
- Frontend: React application (separate repository)
- API Layer: REST API Gateway with Lambda proxy integration
- Database Lambda: Handles all API operations — case CRUD, guideline management, presigned URLs, and triggers Bedrock processing
- Bedrock Lambda: Performs document processing — OCR extraction, Nova Pro analysis, and redaction application
- Storage: S3 for documents, DynamoDB for case and guideline metadata
- AI/ML: Amazon Nova Pro via Bedrock Converse API for semantic analysis, Tesseract OCR for text extraction
- Split OCR + Semantic Analysis: OCR provides pixel-precise text locations while Nova Pro focuses purely on semantic understanding of what to redact — each tool does what it's best at
- Human-in-the-Loop: AI proposes, humans approve — critical for legal compliance in law enforcement document handling
- Serverless: Lambda functions scale to demand with no idle infrastructure costs
- Configurable Guidelines: Redaction rules can be updated without code changes by uploading new guideline PDFs
- Permanent Redaction: PyMuPDF's
apply_redactions()destroys underlying content, meeting legal standards for document redaction
- Upload: Scanned police document (PDF) is uploaded to S3 via presigned URL
- OCR Extraction: Tesseract extracts all text from every page with word-level bounding boxes in PDF point coordinates
- Entity Indexing: Cross-page index identifies recurring names, phone numbers, SSNs, and other PII patterns across the document
- Document Summary: Nova Pro generates a summary from page 1 to provide context for per-page analysis
- Semantic Analysis: Nova Pro receives each page's OCR text map (not the image) plus guidelines, entity index, and document summary — identifies which text blocks to redact and which rules justify each redaction
- Coordinate Resolution: Bounding boxes are resolved from OCR output (not estimated by the AI), with fuzzy text matching as fallback
- Human Review: Redaction proposals are presented in the frontend for review and editing
- Application: Approved redactions are permanently applied to the PDF using PyMuPDF black rectangles
CASE_CREATED → UNREDACTED_UPLOADED → PROCESSING → REVIEW_READY → REVIEWING → APPLYING_REDACTIONS → COMPLETED
If processing fails, the status reverts to allow retry:
PROCESSINGfailure → reverts toUNREDACTED_UPLOADEDAPPLYING_REDACTIONSfailure → reverts toREVIEWING
TucsonPD/
├── .gitignore
├── README.md
├── setup.sh # Main deployment orchestration script
├── frontend_build.sh # Frontend build automation script
│
├── tucson-pd-backend/ # Backend CDK Stack
│ ├── app.py # CDK app entry point
│ ├── cdk.json # CDK configuration
│ ├── requirements.txt # Python dependencies
│ │
│ ├── tucson_pd_backend/
│ │ ├── __init__.py
│ │ └── tucson_pd_backend_stack.py # Infrastructure definition (CDK)
│ │ # Creates: DynamoDB tables, S3 bucket, Lambda functions,
│ │ # Lambda layers, API Gateway, IAM roles
│ │
│ └── lambdas/
│ ├── bedrock_lambda/ # Document processing Lambda
│ │ ├── lambda_function.py # Main handler — routes process/apply/convert actions
│ │ ├── process_document.py # OCR extraction → Nova analysis → bbox resolution
│ │ ├── apply_redactions.py # Applies permanent black-box redactions to PDF
│ │ ├── convert_guidelines.py # Converts guideline PDFs to structured JSON rules
│ │ ├── ocr_extraction.py # Tesseract OCR with word-level bounding boxes
│ │ ├── utils.py # AWS service helpers (S3, DynamoDB, Bedrock)
│ │ ├── bedrock_config.py # Model IDs, temperatures, prompt retrieval
│ │ ├── constants.py # Environment variables and configuration
│ │ └── prompts/
│ │ ├── document_summary.py # Prompt for document summarization
│ │ ├── page_analysis.py # Prompt for per-page redaction identification
│ │ ├── guidelines_conversion.py # Prompt for PDF-to-JSON guideline conversion
│ │ └── error.py # Fallback error prompt
│ │
│ ├── database_management_lambda/ # API operations Lambda
│ │ └── lambda_function.py # Case CRUD, guidelines, presigned URLs
│ │
│ └── layers/
│ ├── pdf-processing-layer.zip # PyMuPDF and pdfplumber
│ └── tesseract-ocr-layer.zip # Tesseract binary + English language data
│
└── tucson-pd-frontend/ # Frontend CDK Stack
├── app.py # CDK app entry point
├── cdk.json # CDK configuration
├── requirements.txt # Python dependencies
├── build.zip # Generated by frontend_build.sh (gitignored)
│
├── tucson_pd_frontend/
│ ├── __init__.py
│ └── tucson_pd_frontend_stack.py # Frontend infrastructure definition
│
├── lambdas/
│ ├── amplify_deployment_lambda/ # Amplify deployment automation
│ ├── create_admin_user_lambda/ # Admin user creation
│ └── post_confirmation_lambda/ # Post-signup confirmation
│
└── frontend/ # React application
└── ... # Frontend source code
The application uses a REST API Gateway with Lambda proxy integration. All API operations are handled by the Database Management Lambda, which triggers the Bedrock Lambda for processing tasks.
After deployment, the API endpoint is available in the CDK stack outputs as ApiGatewayUrl.
https://{api-id}.execute-api.{region}.amazonaws.com/prod/
Frontend → API Gateway → Database Lambda → (DynamoDB / S3 / Bedrock Lambda)
For document processing, the Database Lambda invokes the Bedrock Lambda asynchronously and returns immediately. The Bedrock Lambda updates DynamoDB status as it progresses, which the frontend can poll.
Ensure you have access to the Nova Pro LLM model in your AWS account and permissions to deploy resources.
Required AWS Services Access:
- Amazon Bedrock (Nova Pro model:
us.amazon.nova-pro-v1:0) - AWS Lambda
- Amazon S3
- Amazon DynamoDB
- Amazon API Gateway
- AWS IAM
Network Configuration:
- Default VPC available in your AWS region
- If default VPC doesn't exist, create using: AWS VPC Documentation
-
Navigate to EC2 Console
- Log into your AWS Console
- Search for "EC2" in the services search bar
- Click on "EC2" under Services
-
Launch Instance
- Click "Launch Instance" button
- Configure instance settings:
- Name: "TucsonPD-Deployment"
- AMI: Amazon Linux 2023
- Instance type: t2.micro (or larger)
- Use default VPC configuration
- Click "Launch Instance"
- Wait for instance to reach "running" state
-
Connect to Instance
- Navigate to the Instances page
- Select your created instance
- Click "Connect"
- Use EC2 Instance Connect or SSH with username:
ec2-user
Note: Save your instance details for future management operations. This instance will be needed for updates and stack deletion.
Execute the following commands in your EC2 terminal:
Note: When password prompts appear, press Enter. Use Ctrl+Shift+V for terminal paste operations.
-
Install Git and Node.js
sudo yum install -y git nodejs
-
Install AWS CDK
npm install -g aws-cdk
-
Install Python dependencies
pip install aws-cdk-lib constructs
-
Verify Installations
git --version cdk --version python3 --version
-
Clone the repository:
git clone https://github.com/ASUCICREPO/TucsonPD
-
Navigate to the project directory:
cd TucsonPD -
Make the setup script executable:
chmod +x setup.sh
-
Run the deployment script:
./setup.sh deploy
-
Enter your preferred admin email and password
-
Wait for the system to be deployed (may take up to 30 minutes)
-
View the application at the deployed Amplify link, which can be found in the Amplify console page
-
Stop the EC2 Instance
- Exit the EC2 instance
- Navigate to EC2 console
- Select your instance
- Click "Instance state" → "Stop instance"
Important: Keep the EC2 instance for future deployment management, updates, and deletion operations. Do not terminate the instance.
| Method | Endpoint | Description |
|---|---|---|
POST |
/cases |
Create a new case |
GET |
/cases |
List all cases |
GET |
/cases/{case_id} |
Get case details |
DELETE |
/cases/{case_id} |
Delete a case |
PUT |
/cases/{case_id}/status |
Update case status |
PUT |
/cases/{case_id}/s3-path |
Update S3 paths |
| Method | Endpoint | Description |
|---|---|---|
POST |
/presigned-url/upload |
Get presigned URL for document upload |
POST |
/presigned-url/download |
Get presigned URL for document download |
| Method | Endpoint | Description |
|---|---|---|
POST |
/guidelines/upload |
Upload a new guideline PDF |
GET |
/guidelines/all |
List all guidelines |
GET |
/guidelines/active |
Get the active guideline |
PUT |
/guidelines/{guideline_id} |
Update guideline JSON |
DELETE |
/guidelines/{guideline_id} |
Delete a guideline |
POST |
/guidelines/{guideline_id}/process |
Convert guideline PDF to JSON |
PUT |
/guidelines/{guideline_id}/activate |
Activate a guideline |
GET |
/guidelines/{guideline_id}/rules |
Get guideline rules JSON |
GET |
/guidelines/{guideline_id}/document |
Get presigned URL for guideline PDF |
Each guideline PDF is converted to a JSON rule set with this structure:
{
"version": "1.0",
"rules": [
{
"id": 1,
"description": "Redact victim name, DOB, SSN, and DL number per ARS 13-4434",
"applies_to": ["victim_adult"],
"conditions": ["all cases"],
"exceptions": ["Victim provides signed notarized request form"]
}
]
}| Value | Description |
|---|---|
all |
Applies regardless of role |
victim_adult |
Adult victims |
victim_juvenile |
Juvenile victims |
victim_sexual_assault |
Sexual assault victims |
witness_adult |
Adult witnesses |
witness_juvenile |
Juvenile witnesses |
suspect_adult |
Adult suspects |
suspect_juvenile |
Juvenile suspects |
arrestee_adult |
Adult arrestees |
arrestee_juvenile |
Juvenile arrestees |
government_employee |
Government employees |
confidential_informant |
Confidential informants |
undercover_officer |
Undercover officers |
Upload PDF → Process (convert to JSON) → Review rules → Activate
Only one guideline can be active at a time. Activating a new guideline automatically deactivates the previous one.
- Create a Case: Create a new case in the system with relevant metadata
- Upload Document: Upload the scanned police document PDF via presigned URL
- Ensure Active Guideline: Verify an active guideline exists (the system will fail with a clear error if not)
- Trigger Processing: Initiate document processing — the system will OCR the document, analyze each page, and generate redaction proposals
- Review Proposals: Review the AI-generated redaction proposals in the frontend — each proposal shows the text to redact, the justifying rules, and the bounding box location
- Edit if Needed: Add missed redactions, remove false positives, or adjust bounding boxes
- Apply Redactions: Submit approved redactions — the system permanently applies black boxes to the PDF
- Download Result: Download the final redacted document
- Upload: Upload a redaction guidelines PDF (e.g., department policy document)
- Process: Trigger PDF-to-JSON conversion — Nova Pro extracts structured rules
- Review: Inspect the generated rules for accuracy and completeness
- Edit: Modify rules if needed via the API
- Activate: Set the guideline as active for document processing
-
Start EC2 Instance
- Navigate to EC2 console in AWS
- Select the "TucsonPD-Deployment" instance
- Click "Instance state" → "Start instance"
- Wait for instance to reach "running" state
-
Connect to Instance
- Select your instance
- Click "Connect"
- Use EC2 Instance Connect
-
Navigate to Project Directory
cd TucsonPD -
Run Destroy Script
./setup.sh destroy
- Confirm deletion when prompted
- Wait for stack deletion to complete
-
Manual Cleanup (if needed)
- Navigate to S3 console and delete the redaction bucket if it wasn't automatically removed
- Navigate to Amplify console and delete the application if it wasn't automatically removed
- Check DynamoDB console for any remaining tables
-
Stop or Terminate EC2 Instance
- Once deletion is complete, you can terminate the EC2 instance
- Navigate to EC2 console → Select instance → "Instance state" → "Terminate instance"
Note: The destroy process removes Lambda functions, API Gateway, S3 buckets, DynamoDB tables, and related resources. S3 buckets with versioned objects may require manual deletion.
-
Start EC2 Instance
- Restart the "TucsonPD-Deployment" instance
- Connect using EC2 Instance Connect
-
Navigate to Project Directory
cd TucsonPD -
Pull Latest Changes
git pull
-
Redeploy with Updates
./setup.sh destroy ./setup.sh deploy
-
Stop EC2 Instance
- Exit and stop the instance when complete
Important: Always maintain the EC2 instance for future updates and management. Terminating it will require a new instance setup for future operations.
