This script performs automatic speech recognition (speech-to-text) locally using the `openai/whisper-base` model via the Hugging Face `transformers` library.
It is designed to be flexible:
- It prioritizes a local audio file path that you specify within the script.
- If the specified local file is not found, it automatically downloads a short sample audio clip from the Hugging Face Hub (`hf-internal-testing/librispeech_asr_dummy`) and uses that instead for demonstration purposes.
- Performs ASR locally on your machine.
- Uses the efficient `openai/whisper-base` model.
- Prioritizes a user-specified local audio file.
- Includes a fallback to download a sample audio file if the local file is not found.
- Leverages the Hugging Face `transformers` and `datasets` libraries.
- Optionally utilizes a GPU for faster processing if available and `torch` is installed with CUDA support.
- ASR Model: `openai/whisper-base`
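The overall flow the script follows can be sketched as below. This is an illustrative sketch, not the exact contents of `run_asr_flexible.py`; the variable names are assumptions, but the model ID, sample dataset, and local-file-first behavior match the description above:

```python
# Illustrative sketch of the script's flow: local file first, Hub sample as fallback.
import os

import torch
from transformers import pipeline

user_audio_path = "my_audio.wav"  # the path you configure in the script

# Use the first CUDA GPU if available, otherwise fall back to the CPU.
device = 0 if torch.cuda.is_available() else -1
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    device=device,
)

if os.path.exists(user_audio_path):
    audio_input = user_audio_path  # the pipeline decodes the file itself
else:
    # Fallback: download the short sample clip from the Hugging Face Hub.
    from datasets import load_dataset

    sample = load_dataset(
        "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
    )[0]["audio"]
    audio_input = {"array": sample["array"], "sampling_rate": sample["sampling_rate"]}

result = asr(audio_input)
print(result["text"])
```

Passing a filename lets the pipeline handle decoding (which is where `ffmpeg` comes in), while the fallback passes raw samples plus a sampling rate directly.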
Before running the script, ensure you have the following installed:
- Python: version 3.8 or later is recommended.
- System Dependencies (Ubuntu/Debian):
  - `libsndfile`: required by the `soundfile` Python library for reading/writing audio files.
  - `ffmpeg`: required by underlying libraries (such as `librosa` or `transformers`) for decoding various audio formats when loading from a filename.

  ```bash
  sudo apt update && sudo apt install libsndfile1 ffmpeg
  ```
- Python Libraries: you can install these using pip. It is highly recommended to use a Python virtual environment.

  ```bash
  pip install transformers torch datasets soundfile librosa
  ```

  - `transformers`: the core Hugging Face library.
  - `torch`: the deep learning framework backend (PyTorch). Alternatively, install `tensorflow`.
  - `datasets`: used to download the sample audio file if your local file isn't found.
  - `soundfile`: used for handling audio file operations.
  - `librosa`: provides advanced audio analysis features, often required by `datasets` or `transformers` for audio loading/processing.
- Clone or Download: Get the `run_asr_flexible.py` script onto your local machine.
- Create Virtual Environment (Recommended):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

  (Use `.\.venv\Scripts\activate` on Windows.)
- Install System Dependencies: Follow the instructions in the Prerequisites section above.
- Install Python Libraries: Run the pip command from the Prerequisites section within your activated virtual environment.
- Configure Audio Input:
  - Open the `run_asr_flexible.py` script in a text editor.
  - Locate the line: `user_audio_path = "my_audio.wav"`
  - Option A (Recommended): Change `"my_audio.wav"` to the exact path of the audio file you want to transcribe on your computer (e.g., `/home/user/recordings/meeting.wav` or `C:/Users/user/Documents/sound.mp3`).
  - Option B: Place your audio file in the same directory as the `run_asr_flexible.py` script and ensure its name is exactly `my_audio.wav`.
  - Note: If the script does not find a file at the specified `user_audio_path`, it will attempt to download and use the sample audio. Common audio formats like WAV and FLAC are generally well supported; other formats like MP3 often depend on the `ffmpeg` installation.
- Run the Script:
  - Open your terminal or command prompt.
  - Make sure your virtual environment is activated (if you created one).
  - Navigate to the directory containing the script.
  - Execute the script using Python:

    ```bash
    python run_asr_flexible.py
    ```
The script will print status messages to the console, including:
- Which audio source is being used (your local file or the downloaded sample).
- Confirmation of model loading and the device being used (CPU or GPU).
- The final transcription result, for example:

  ```
  --- Transcription Result ---
  Recognized Text: "The birch canoe slid on the smooth planks."
  ----------------------------
  ```

  (The exact text will depend on the audio input.)
- `libsndfile` errors: Ensure you ran `sudo apt install libsndfile1`.
- `ffmpeg was not found`: Ensure you ran `sudo apt install ffmpeg`.
- `datasets` library not found: Run `pip install datasets` (only needed for the fallback sample).
- `soundfile`/`librosa` not found: Run `pip install soundfile librosa`.
- Errors during transcription: Ensure your audio file is not corrupted and is in a reasonably common format. Check the console for specific error messages. Very long audio files might require adjusting pipeline parameters (not implemented in this basic script).
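For the `ffmpeg` issues above, you can check directly from Python whether the binary is actually on your `PATH`:

```python
# Check whether ffmpeg is reachable; MP3 and similar formats depend on it.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found -- install it (e.g. 'sudo apt install ffmpeg')")
else:
    print("ffmpeg found at:", ffmpeg_path)
```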
- CPU: The script will run on a CPU, but transcription speed will depend on your processor and the audio length.
- GPU: If you have an NVIDIA GPU with CUDA set up correctly and the appropriate version of `torch` installed, the script will automatically use it for significantly faster processing.
- RAM: Model loading and processing require a moderate amount of RAM (a few GB should be sufficient for the `whisper-base` model).
- The `run_asr_flexible.py` script itself is provided as an example (consider adding an MIT License if distributing).
- The Hugging Face libraries (`transformers`, `datasets`) are typically under the Apache 2.0 License.
- The `openai/whisper-base` model has its own license terms (generally permissive for research/use, but check the model card on the Hugging Face Hub for specifics).