This project provides an AWS CDK stack for receiving WhatsApp voice messages, transcribing them, and replying to the sender. Voice messages are transcribed with either Whisper (deployed through Amazon Bedrock Marketplace) or Amazon Transcribe, and the transcription is sent back to the user as text. Optionally, the system also replies with an audio message generated by Amazon Polly text-to-speech.
This solution provides the building blocks and blueprint for processing inbound voice messages and sending out voice messages using WhatsApp and AWS.
The demo above shows the complete voice-to-voice workflow: sending a WhatsApp voice message, receiving the transcription as text, and getting the same transcribed text converted back to speech using Amazon Polly as an audio response.
The system consists of the following components:
- SNS Topic: Receives inbound WhatsApp messages and events
- SQS Queue: Subscribes to the SNS topic and buffers messages for processing
- Lambda Function: Processes text and voice messages from the queue
- S3 Buckets: Temporarily store audio files and hold access logs
- Amazon Polly: Converts text to speech for audio responses
- AWS KMS: Provides encryption for SNS, SQS, and S3 data
- Secure Communication: All data is encrypted using AWS KMS
- Flexible Configuration: Use existing SNS topics or create new ones
- Dual Transcription Options: Choose between Whisper or Amazon Transcribe
- Audio Responses: Optional text-to-speech responses using Amazon Polly
- Bidirectional Communication: Process both text and audio messages
- AWS Account with appropriate permissions
- Node.js 14.x or later
- AWS CDK installed (`npm install -g aws-cdk`)
- WhatsApp Business account with a registered phone number
- For Whisper: the model deployed through an Amazon Bedrock Marketplace model deployment
The system is configured through the `config.params.json` file:

    {
      "CdkProjectName": "WhatsappVoiceStack",
      "Engine": "whisper",
      "WhisperEndpointName": "your-whisper-endpoint-name",
      "WhatsAppPhoneNumberId": "YOUR_WHATSAPP_PHONE_NUMBER_ID",
      "WhatsAppSNSTopicArn": "",
      "CreateNewSnsTopic": true,
      "EnableAudioResponses": true,
      "PollyVoiceId": "Joanna",
      "Tags": {
        "Project": "WhatsAppVoice",
        "Environment": "Development"
      }
    }

| Parameter | Description |
|---|---|
| `CdkProjectName` | Name of the CDK stack |
| `Engine` | Transcription engine to use (`whisper` or `transcribe`) |
| `WhisperEndpointName` | Name of the endpoint running Whisper, can be found in Amazon Bedrock => Tune => Marketplace model deployment => Managed deployments (required if `Engine` is `whisper`) |
| `WhatsAppPhoneNumberId` | Your WhatsApp phone number ID |
| `WhatsAppSNSTopicArn` | ARN of an existing SNS topic (leave empty to create a new one) |
| `CreateNewSnsTopic` | Whether to create a new SNS topic (`true`) or use existing (`false`) |
| `EnableAudioResponses` | Whether to enable audio responses using Polly (`true` or `false`) |
| `PollyVoiceId` | The voice ID to use for Polly text-to-speech (e.g., Joanna, Matthew) |
| `Tags` | AWS resource tags |
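
The dependencies between these parameters can be checked before running a deployment. The following is a minimal sketch, not code from this repository: the `StackConfig` interface and the `validateConfig` function are hypothetical names, and the rules for `WhatsAppSNSTopicArn` and `PollyVoiceId` are assumptions inferred from the parameter descriptions.

```typescript
// Hypothetical shape mirroring the keys of config.params.json.
interface StackConfig {
  CdkProjectName: string;
  Engine: string;
  WhisperEndpointName?: string;
  WhatsAppPhoneNumberId: string;
  WhatsAppSNSTopicArn?: string;
  CreateNewSnsTopic: boolean;
  EnableAudioResponses: boolean;
  PollyVoiceId?: string;
  Tags?: Record<string, string>;
}

// Returns a list of problems; an empty list means the config looks consistent.
function validateConfig(cfg: StackConfig): string[] {
  const problems: string[] = [];
  if (cfg.Engine !== "whisper" && cfg.Engine !== "transcribe") {
    problems.push(`Engine must be "whisper" or "transcribe", got "${cfg.Engine}"`);
  }
  if (cfg.Engine === "whisper" && !cfg.WhisperEndpointName) {
    problems.push("WhisperEndpointName is required when Engine is whisper");
  }
  if (!cfg.WhatsAppPhoneNumberId) {
    problems.push("WhatsAppPhoneNumberId must be set");
  }
  // Assumption: an existing topic ARN is needed only when no new topic is created.
  if (!cfg.CreateNewSnsTopic && !cfg.WhatsAppSNSTopicArn) {
    problems.push("WhatsAppSNSTopicArn is required when CreateNewSnsTopic is false");
  }
  // Assumption: a voice ID is needed only when audio responses are enabled.
  if (cfg.EnableAudioResponses && !cfg.PollyVoiceId) {
    problems.push("PollyVoiceId is required when EnableAudioResponses is true");
  }
  return problems;
}
```

A check like this could run at the start of the CDK app so that a misconfigured file fails early instead of partway through a deployment.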
- Clone this repository
- Update the `config.params.json` file with your settings
- Install dependencies: `npm install`
- Build the project: `npm run build`
- Deploy the stack: `cdk deploy`
Once deployed, the system will automatically process WhatsApp messages:
- Text Messages:
  - A user sends a text message to your WhatsApp Business number
  - The message is published to the SNS topic
  - The SQS queue receives the message
  - The Lambda function processes the text message and sends a response
  - If audio responses are enabled, it also converts the text to speech using Polly and sends an audio response
- Voice Messages:
  - A user sends a voice message to your WhatsApp Business number
  - The message is published to the SNS topic
  - The SQS queue receives the message
  - The Lambda function processes the voice message:
    - Downloads the audio file
    - Transcribes it using the configured engine
    - Sends the transcription back to the user
    - If audio responses are enabled, it also converts the transcription to speech using Polly and sends an audio response
    - Stores the audio in S3
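
The two flows above can be sketched as a small routing function. This is an illustration only, not the repository's handler: `InboundMessage`, `Step`, `unwrapSnsEnvelope`, and `planSteps` are hypothetical names, and the message shape is a simplified assumption (the real SNS message likely wraps the WhatsApp payload in additional fields that are not reproduced here). The one detail taken from SNS itself is that, with raw message delivery disabled, the SQS body is a JSON envelope whose `Message` field holds the published payload as a string.

```typescript
// Simplified, hypothetical view of an inbound WhatsApp message.
interface InboundMessage {
  from: string;
  type: string;
  text?: { body: string };
  audio?: { id: string };
}

type Step = "downloadAudio" | "transcribe" | "sendText" | "sendAudio" | "storeInS3";

// Unwraps the SNS notification envelope found in the SQS record body.
function unwrapSnsEnvelope(sqsBody: string): InboundMessage {
  const envelope = JSON.parse(sqsBody) as { Message: string };
  return JSON.parse(envelope.Message) as InboundMessage;
}

// Lists the processing steps for a message in the order described above.
function planSteps(msg: InboundMessage, audioResponses: boolean): Step[] {
  if (msg.type === "text") {
    return audioResponses ? ["sendText", "sendAudio"] : ["sendText"];
  }
  if (msg.type === "audio") {
    const steps: Step[] = ["downloadAudio", "transcribe", "sendText"];
    if (audioResponses) {
      steps.push("sendAudio");
    }
    steps.push("storeInS3");
    return steps;
  }
  // Other message types and status events are ignored in this sketch.
  return [];
}
```

Separating the plan from its execution keeps the branching on message type and on `EnableAudioResponses` easy to test without calling WhatsApp, Polly, or S3.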
The Lambda function consists of several modules:
- `whatsapp-processor.ts`: Main handler for processing messages
- `services/WhatsAppService.ts`: Service for interacting with WhatsApp API
- `services/S3Service.ts`: Service for S3 operations
- `services/WTranscribeService.ts`: Service for Whisper transcription
- `services/TranscribeService.ts`: Service for Amazon Transcribe
- `services/PollyService.ts`: Service for Amazon Polly text-to-speech
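
Because two transcription services sit behind one `Engine` setting, the handler needs some way to pick between them. One possible arrangement is a shared interface plus a factory, sketched below. The `Transcriber` interface, the stub classes, and `createTranscriber` are hypothetical; the actual service classes in this repository may expose different methods and constructors.

```typescript
// Hypothetical common interface for both transcription services.
interface Transcriber {
  readonly name: string;
  transcribe(audio: Uint8Array): Promise<string>;
}

// Stub standing in for the Whisper service; a real one would invoke the endpoint.
class WhisperStub implements Transcriber {
  readonly name = "whisper";
  private readonly endpointName: string;
  constructor(endpointName: string) {
    this.endpointName = endpointName;
  }
  async transcribe(audio: Uint8Array): Promise<string> {
    return `[whisper:${this.endpointName}] ${audio.length} bytes`;
  }
}

// Stub standing in for the Amazon Transcribe service.
class TranscribeStub implements Transcriber {
  readonly name = "transcribe";
  async transcribe(audio: Uint8Array): Promise<string> {
    return `[transcribe] ${audio.length} bytes`;
  }
}

// Chooses an implementation from the Engine configuration value.
function createTranscriber(engine: string, whisperEndpointName?: string): Transcriber {
  switch (engine) {
    case "whisper":
      if (!whisperEndpointName) {
        throw new Error("WhisperEndpointName is required for the whisper engine");
      }
      return new WhisperStub(whisperEndpointName);
    case "transcribe":
      return new TranscribeStub();
    default:
      throw new Error(`Unknown engine: ${engine}`);
  }
}
```

With this shape, the main handler depends only on `Transcriber`, so switching engines is a configuration change and not a code change.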
The system includes an FFmpeg Lambda layer for audio processing:
- Located in `layers/ffmpeg/`
- Contains the FFmpeg binary executable in `bin/ffmpeg`
- Used for converting audio formats (OGG to WAV/PCM) before transcription
- Automatically attached to the Lambda function during deployment
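
As an illustration of the OGG to WAV/PCM conversion, the sketch below builds an FFmpeg argument list. It is not taken from the repository: `buildFfmpegArgs` is a hypothetical helper, and the 16 kHz mono target is an assumption chosen because it is a common input format for speech models. Lambda extracts layers under `/opt`, so a layer containing `bin/ffmpeg` exposes the binary at `/opt/bin/ffmpeg`.

```typescript
// Path of the FFmpeg binary once the layer is mounted in the Lambda runtime.
const FFMPEG_PATH = "/opt/bin/ffmpeg";

// Builds the arguments for converting an OGG voice note to 16-bit PCM WAV.
// Assumption: 16 kHz mono output; the repository may use other settings.
function buildFfmpegArgs(
  inputPath: string,
  outputPath: string,
  sampleRateHz: number = 16000
): string[] {
  return [
    "-y", // overwrite the output file if it already exists
    "-i", inputPath, // input file, e.g. the downloaded OGG audio
    "-ac", "1", // downmix to a single (mono) channel
    "-ar", String(sampleRateHz), // resample to the target rate
    "-c:a", "pcm_s16le", // encode as 16-bit little-endian PCM
    outputPath, // a .wav extension selects the WAV container
  ];
}
```

The resulting array can be passed to Node's `execFile(FFMPEG_PATH, args)` from `node:child_process`, with both paths under `/tmp`, the writable directory in the Lambda runtime.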
- All data in transit and at rest is encrypted
- SNS, SQS, and S3 use AWS KMS for encryption
- S3 buckets enforce SSL and block public access
- IAM policies follow the principle of least privilege
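
To make "enforce SSL" concrete, the sketch below builds the kind of bucket policy statement that rejects requests made over plain HTTP. It is a hedged illustration: `denyInsecureTransport` is a hypothetical helper, and whether this stack produces the policy through the CDK `enforceSSL` bucket property or by other means is an assumption.

```typescript
interface PolicyStatement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Principal: string;
  Action: string;
  Resource: string[];
  Condition: Record<string, Record<string, string>>;
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

// Denies every S3 action on the bucket and its objects when the request is not sent over TLS.
function denyInsecureTransport(bucketArn: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyInsecureTransport",
        Effect: "Deny",
        Principal: "*",
        Action: "s3:*",
        Resource: [bucketArn, `${bucketArn}/*`],
        Condition: { Bool: { "aws:SecureTransport": "false" } },
      },
    ],
  };
}
```

An explicit deny takes precedence over any allow, so the statement applies even to principals that otherwise have access to the bucket.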
- CloudWatch Logs for Lambda function
- S3 access logs for bucket operations
- CloudWatch Metrics for SNS, SQS, and Lambda
To remove all resources created by this stack:
cdk destroy
