Skip to content

inclusion-international/speech-jokey

Repository files navigation

Speech Jokey

Speech-Jokey is a speech synthesis application with prosody parametrization for eye-tracking users.

  • Main features and objectives
    • Creation of synthesized audio using various speech APIs
    • Optimized for Eye-Tracking users (PwD: Cerbral Palsy,...)
    • Simplified input of prosody parametrization to change parameters like pitch, speaking rate, or emotion of generated audio.

speech-jokey

Demo video

Demo video of the speech jokey program showing the generation of synthetic speech using the Elevenlabs API.

Screencast of Speech Jokey showing user interface options and different synthetic voices

Install and Run

  1. Download and extract the current release.
  2. Double-click onto speech-jokey.exe (or the respective command for your OS)

Developer documentation

Please read the developer documentation which describes how to locally install and run the program.

User documentation

This is what the application currently looks like.

Main screen

picture of application

Settings dialog

picture of application

Speech Provider

The application currently supports the following speech synthesis providers:

ElevenLabs

ElevenLabs Speech Synthesis is an online TTS API and transforms written text into highly realistic speech by analyzing context and emotion, adjusting intonation and pacing accordingly. It supports over 30 languages, enabling users to generate lifelike audio content across diverse linguistic contexts.

To create an API-Key, create an ElevenLabs account, click on "My Account" > "API Keys" > "Create API Key". Copy the API key and paste it in the settings of the Speech-Jokey app. The free tier has 10,000 credits per month (~10 min audio).

ElevenLabs account add api key

OpenAI

OpenAI's online Text-to-Speech API converts written text into natural-sounding spoken language using the latest advances in machine learning. To create an API-Key, create an OpenAI account, go to the OpenAI website, click on "Your profile" in the upper right corner, click on "API Keys" > "Create new secret key". Copy the API key and paste it in the settings of the Speech-Jokey App.

OpenAI account add api key

OpenAI TTS has no free tier, there has to be added a credit balance with a credit or debit card. The pricing is $0.015 per 1000 characters with the tts−1 model and $0.030 per 1000 characters with tts-1-hd. Unfortunately, OpenAI TTS does not have support for SSML or any prosody parametrization.

Amazon Polly

Amazon Polly is a cloud-based service provided by Amazon Web Services (AWS) that uses advanced deep learning technologies to convert text into lifelike speech. It enables developers to integrate speech synthesis into their applications, making them more engaging and accessible.

To start, go to the Amazon Polly Website, click on "Get started with Amazon Polly" and create an account. Go to the Aws console and click on "Create API Key". You will see two keys: A normal access key and a secret access key. Together with your preferred region (e.g eu-north-1) copy the credentials and paste them in the settings of the Speech-Jokey app.

Amazon Polly account add api key

Amazon Polly has a 12-month trial phase in which 5 million characters for standard voices and 1 million characters for neural voices per month are free. After the free trial or outside of the free quota, Amazon Polly standard voices are charged $4 and neural voices with $16 per 1 million characters.

Amazon Polly has support for some SSML tags like break, emphasis, lang, prosody or say-as.

Microsoft Azure

Microsoft Azure Text-to-Speech online API supports Neural Text-to-Speech (Neural TTS), which enhances pronunciation, prosody, and overall voice quality. The API includes over 500 voices.

Create a Microsoft Azure Account and follow the steps of this guide to create a speech resource. On the site https://portal.azure.com/#home you can see your resources. Click on your text-to-speech resource and copy and paste your key and region into the settings of the Speech-Jokey app. The free subscription (F0) has 0.5 million characters free per month, the Standard (S0) is a pay-as-you-go service and TTS costs $15 per 1M characters.

Microsoft Azure supports a wide range of SSML content where each voice has their own supported styles.

gTTS

Google Text-to-Speech (gTTS) is a Python library and command-line utility that allows developers to convert text into speech using Google's Text-to-Speech online API without the need of an API-Key or a subscription. gTTS does not support SSML or prosody parametrization.

Ofai

Ofai is a TTS for Austrian dialects developed by the Austrian Research Institute for Artificial Intelligence (OFAI) that is specifically designed to handle Austrian German dialects. It is an online API and does not need an API-Key or subscription. Ofai does not support SSML or prosody parametrization.

Editing the text

To simplify editing the text, the cursor set via eye tracker is always placed at the end of a word. To move the cursor one position to the left or right, the user can use the arrow buttons at the bottom left of the application.

The editing feature is addressed especially to people who need eye tracking devices to move the cursor.

Selecting voice

The voice can be selected using the voice selection button or in the settings. All available voices are listed. On the selection of a voice a Popup will appear and the selected voice is displayed.

The currently selected voice is always displayed on the voice selection button.

Using the voice selection button: picture of application

Use Emojis as SSML tags

The App currently uses these emojis for following SSML tags:

⏸️: <break time="2s">

Adds a break of 2s.

😐: <emphasis level="reduced">
🙂: <emphasis level="moderate">
😁: <emphasis level="strong">

Sets the emphasis level of the voice. The emphasis of words affects the speed and volume of speech. With more emphasis, the text is spoken louder and slower. With less emphasis, words are spoken quieter and faster. You specify the level of emphasis with the level attribute.

🔈: <prosody volume="silent">
🔉: <prosody volume="medium">
🔊: "<prosody volume="loud">

Sets the prosody volume of the voice. Sets the volume to a predefined value for the current voice.

🐌: "<prosody rate="slow">
🚶: "<prosody rate="medium">
🏃: "<prosody rate="fast">

Sets the prosody rate of the voice. Sets the speaking speed to a predefined value for the selected voice.

🗣️⬇️: "<prosody pitch="low">
🗣️⬆️: "<prosody pitch="high">
🗣️⏫: "<prosody pitch="x-high">

Sets the prosody pitch of the voice. Sets the pitch to a predefined value for the selected voice.

🌏: "<lang xml:lang="en-US">

Use the tag to specify a different language for a word (set to english), phrase or sentence. Foreign-language words and phrases are usually better spoken if they are enclosed in a pair of tags.

The popup looks like this:

Emoji Popup for SSML

The buttons on the popup place emojis in the textfield of the App. Text enclosed between two identical emojis executes the action of the respective SSML tag. Here, the first sentence will be spoken with a lower voice and the second one with a higher voice.

Text with emojis for SSML

Here, there is a pause after the "Hello". The next sentence is spoken with a strong emphasis level. Text with emojis for SSML

Prosody parametrization

Several simplified textual shortcuts can be used to parametrize the prosody of the generated audio.

Pause

The following punctuation marks can be used to add a speaking pause between words:

  • , adds a break of 0.0s
  • . adds a break of 0.5s
  • ; adds a break of 0.5s

Synthesizing audio

An audio file is generated using the synthesizing button.

picture of application

Playing audio

Before playing the audio file, an audio file has to be synthesized using the synthesizing button.

The file can then be played with the play button.

Saving the text file

The final version of the edited text can be saved as a text file.

Setting up Speech Provider

To setup a speech provider (e.g. ElevenLabs) the API Key must be entered in the respective settings dialog. Please check the documenation of your speech provider to get an API key.

picture of application

About

A speech synthesis software with integration of several TTS APIs, SSML support and optimizations for users with motor impairment. (course ASSIST HEIDI WS2023 and SS2024)

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors