WhatsApp Voice Message Processing System

This project provides an AWS CDK stack for processing WhatsApp voice messages with transcription. It receives voice messages via WhatsApp, transcribes them using either OpenAI's Whisper model (deployed through Amazon Bedrock Marketplace) or Amazon Transcribe, and sends the transcription back to the user. The system can optionally also reply with an audio message generated by Amazon Polly text-to-speech.

This solution provides the building blocks and blueprint for processing inbound voice messages and sending out voice messages using WhatsApp and AWS.

Demo

The demo (`WhatsApp-Voice-to-Voice-demo.gif` in this repository) shows the complete voice-to-voice workflow: sending a WhatsApp voice message, receiving the transcription as text, and receiving the same transcribed text converted back to speech by Amazon Polly as an audio response.

Architecture

The system consists of the following components:

- SNS Topic: Receives inbound WhatsApp messages and events
- SQS Queue: Subscribes to the SNS topic and buffers messages for processing
- Lambda Function: Processes text and voice messages from the queue
- S3 Buckets: Temporarily store audio files and hold access logs
- Amazon Polly: Converts text to speech for audio responses
- AWS KMS: Provides encryption for SNS, SQS, and S3 data

Features

- Secure Communication: All data is encrypted using AWS KMS
- Flexible Configuration: Use existing SNS topics or create new ones
- Dual Transcription Options: Choose between Whisper or Amazon Transcribe
- Audio Responses: Optional text-to-speech responses using Amazon Polly
- Bidirectional Communication: Process both text and audio messages

Prerequisites

- AWS account with appropriate permissions
- Node.js 14.x or later
- AWS CDK installed (`npm install -g aws-cdk`)
- WhatsApp Business account with a registered phone number
- For Whisper: a Whisper model deployed through an Amazon Bedrock Marketplace model deployment

Configuration

The system is configured through the config.params.json file:

{
    "CdkProjectName": "WhatsappVoiceStack",
    "Engine": "whisper",
    "WhisperEndpointName": "your-whisper-endpoint-name",
    "WhatsAppPhoneNumberId": "YOUR_WHATSAPP_PHONE_NUMBER_ID",
    "WhatsAppSNSTopicArn": "",
    "CreateNewSnsTopic": true,
    "EnableAudioResponses": true,
    "PollyVoiceId": "Joanna",
    "Tags": {
        "Project": "WhatsAppVoice",
        "Environment": "Development"
    }
}

Configuration Options

| Parameter | Description |
| --- | --- |
| `CdkProjectName` | Name of the CDK stack |
| `Engine` | Transcription engine to use (`whisper` or `transcribe`) |
| `WhisperEndpointName` | Name of the endpoint running Whisper (required if `Engine` is `whisper`); can be found in Amazon Bedrock => Tune => Marketplace model deployment => Managed deployments |
| `WhatsAppPhoneNumberId` | Your WhatsApp phone number ID |
| `WhatsAppSNSTopicArn` | ARN of an existing SNS topic (leave empty to create a new one) |
| `CreateNewSnsTopic` | Whether to create a new SNS topic (`true`) or use an existing one (`false`) |
| `EnableAudioResponses` | Whether to enable audio responses using Polly (`true` or `false`) |
| `PollyVoiceId` | The voice ID to use for Polly text-to-speech (e.g., `Joanna`, `Matthew`) |
| `Tags` | AWS resource tags |
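
The options above interact: `WhisperEndpointName` only matters when `Engine` is `whisper`. The following is a minimal sketch of a pre-deployment check, not code from this repository. The helper names `validateConfig` and `isNonEmptyString` are hypothetical, and the rule that an ARN must be supplied when `CreateNewSnsTopic` is `false` is an assumption inferred from the parameter descriptions.

```typescript
// Parsed contents of config.params.json; values are unknown until checked.
type RawConfig = Record<string, unknown>;

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0;
}

// Hypothetical helper: returns a list of problems (empty when the config looks usable).
function validateConfig(cfg: RawConfig): string[] {
  const problems: string[] = [];
  const engine = cfg["Engine"];

  if (engine !== "whisper" && engine !== "transcribe") {
    problems.push('Engine must be "whisper" or "transcribe"');
  }
  // Documented rule: the endpoint name is required for the Whisper engine.
  if (engine === "whisper" && !isNonEmptyString(cfg["WhisperEndpointName"])) {
    problems.push('WhisperEndpointName is required when Engine is "whisper"');
  }
  if (!isNonEmptyString(cfg["WhatsAppPhoneNumberId"])) {
    problems.push("WhatsAppPhoneNumberId must be set");
  }
  // Assumption: reusing an existing topic requires its ARN.
  if (cfg["CreateNewSnsTopic"] === false && !isNonEmptyString(cfg["WhatsAppSNSTopicArn"])) {
    problems.push("WhatsAppSNSTopicArn is required when CreateNewSnsTopic is false");
  }
  return problems;
}
```

Running such a check on the parsed JSON before `cdk deploy` would surface a missing endpoint name early instead of at transcription time.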

Deployment

1. Clone this repository.
2. Update the `config.params.json` file with your settings.
3. Install dependencies:
   ```
   npm install
   ```
4. Build the project:
   ```
   npm run build
   ```
5. Deploy the stack:
   ```
   cdk deploy
   ```

Usage

Once deployed, the system will automatically process WhatsApp messages:

Text Messages:
- A user sends a text message to your WhatsApp Business number
- The message is published to the SNS topic
- The SQS queue receives the message
- The Lambda function processes the text message and sends a response
- If audio responses are enabled, it also converts the text to speech using Polly and sends an audio response

Voice Messages:
- A user sends a voice message to your WhatsApp Business number
- The message is published to the SNS topic
- The SQS queue receives the message
- The Lambda function processes the voice message:
  - Downloads the audio file
  - Transcribes it using the configured engine
  - Sends the transcription back to the user
  - If audio responses are enabled, it also converts the transcription to speech using Polly and sends an audio response
  - Stores the audio in S3
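
The two flows above share the same first steps: the Lambda function receives an SQS record whose body is the SNS notification, and the notification's `Message` field carries the WhatsApp event. The sketch below shows one way to unwrap that envelope and separate text from audio messages. It is illustrative only: the `whatsAppWebhookEntry` field is an assumption about how the event is published to the topic (this README does not specify the payload), and `routeRecord`, `classify`, and the `Inbound` type are hypothetical names.

```typescript
// Only the fields this sketch reads; real events carry more.
interface SqsRecord { body: string }
interface WhatsAppMessage {
  from: string;
  type: string;
  text?: { body: string };
  audio?: { id: string };
}

type Inbound =
  | { kind: "text"; from: string; body: string }
  | { kind: "audio"; from: string; mediaId: string }
  | { kind: "ignored"; reason: string };

function classify(msg: WhatsAppMessage): Inbound {
  if (msg.type === "text" && msg.text) {
    return { kind: "text", from: msg.from, body: msg.text.body };
  }
  if (msg.type === "audio" && msg.audio) {
    return { kind: "audio", from: msg.from, mediaId: msg.audio.id };
  }
  return { kind: "ignored", reason: "unsupported type: " + msg.type };
}

// Unwraps SQS body -> SNS "Message" -> WhatsApp webhook entry, then classifies each message.
function routeRecord(record: SqsRecord): Inbound[] {
  const sns = JSON.parse(record.body) as { Message: string };
  // Assumption: the event wraps the webhook entry as a JSON string in this field.
  const event = JSON.parse(sns.Message) as { whatsAppWebhookEntry?: string };
  if (!event.whatsAppWebhookEntry) return [];

  const entry = JSON.parse(event.whatsAppWebhookEntry) as {
    changes?: { value?: { messages?: WhatsAppMessage[] } }[];
  };
  const routed: Inbound[] = [];
  for (const change of entry.changes ?? []) {
    for (const msg of change.value?.messages ?? []) {
      routed.push(classify(msg));
    }
  }
  return routed;
}
```

A handler would then branch on `kind`: reply directly for `text`, or download, convert, and transcribe for `audio`.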

Lambda Function Structure

The Lambda function consists of several modules:

- `whatsapp-processor.ts`: Main handler for processing messages
- `services/WhatsAppService.ts`: Service for interacting with the WhatsApp API
- `services/S3Service.ts`: Service for S3 operations
- `services/WTranscribeService.ts`: Service for Whisper transcription
- `services/TranscribeService.ts`: Service for Amazon Transcribe
- `services/PollyService.ts`: Service for Amazon Polly text-to-speech
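
Because there are two transcription services, the handler has to pick one based on the `Engine` setting. The sketch below shows one possible shape for that choice. The `Transcriber` interface and `pickTranscriber` function are hypothetical and are not taken from the service files listed above, whose actual class and method names may differ.

```typescript
type Engine = "whisper" | "transcribe";

// Hypothetical common interface that both transcription services could satisfy.
interface Transcriber {
  readonly engine: Engine;
  transcribe(wavAudio: Uint8Array): Promise<string>;
}

// Returns the implementation matching the configured engine, or fails fast on a typo.
function pickTranscriber(
  engine: string,
  available: Record<Engine, Transcriber>
): Transcriber {
  if (engine === "whisper" || engine === "transcribe") {
    return available[engine];
  }
  throw new Error(
    'Unsupported Engine "' + engine + '"; expected "whisper" or "transcribe"'
  );
}
```

Keeping the selection in one place means the rest of the handler depends only on the shared interface, not on which engine is configured.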

FFmpeg Lambda Layer

The system includes an FFmpeg Lambda layer for audio processing:

- Located in `layers/ffmpeg/`
- Contains the FFmpeg binary executable in `bin/ffmpeg`
- Used for converting audio formats (OGG to WAV/PCM) before transcription
- Automatically attached to the Lambda function during deployment
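
To illustrate the OGG to WAV/PCM conversion, the sketch below builds an FFmpeg argument list. It is an assumption about how the conversion could be invoked, not the repository's code: `buildOggToWavArgs` is a hypothetical helper, and the mono 16 kHz 16-bit PCM target is an assumed setting that the README does not state. The binary path follows from Lambda extracting layers under `/opt`, so `bin/ffmpeg` in the layer becomes `/opt/bin/ffmpeg`.

```typescript
// Layer contents are extracted under /opt inside the Lambda runtime.
const FFMPEG_PATH = "/opt/bin/ffmpeg";

// Hypothetical helper: arguments for converting an OGG voice note to PCM WAV.
function buildOggToWavArgs(
  inputPath: string,
  outputPath: string,
  sampleRateHz: number = 16000 // assumed target rate, not stated in this README
): string[] {
  return [
    "-y",                        // overwrite the output file if it exists
    "-i", inputPath,             // input file (OGG voice message)
    "-ac", "1",                  // downmix to mono
    "-ar", String(sampleRateHz), // resample to the target rate
    "-c:a", "pcm_s16le",         // 16-bit little-endian PCM
    outputPath,
  ];
}
```

In the handler, the arguments could be passed to `execFile(FFMPEG_PATH, args)` from `node:child_process`, with both files under `/tmp`, the writable directory in Lambda.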

Security Considerations

- All data in transit and at rest is encrypted
- SNS, SQS, and S3 use AWS KMS for encryption
- S3 buckets enforce SSL and block public access
- IAM policies follow the principle of least privilege

Monitoring and Logging

- CloudWatch Logs for the Lambda function
- S3 access logs for bucket operations
- CloudWatch Metrics for SNS, SQS, and Lambda

Cleanup

To remove all resources created by this stack:

cdk destroy
