
Commit 4df0fc9 (1 parent: faf8f67)

Minor edit, add banner

2 files changed: 9 additions & 5 deletions


_data/blog/2025-01-02-thonburian-whisper.md

@@ -3,7 +3,7 @@ template: BlogPost
 path: /blog/thonburian-whisper-2024
 date: 2025-01-02T07:08:53.137Z
 title: From Thonburian Whisper to Multimodal Applications
-thumbnail: "/assets/blogs/thonburian-whisper.jpg"
+thumbnail: "/assets/blogs/thonburian-whisper-banner.jpg"
 metaDescription:
 ---

@@ -13,19 +13,23 @@ metaDescription:
   Thonburian Whisper
 </p>

-Hello January 2025! Thai automatic speech recognition (ASR) has made significant strides in recent years. I am particularly impressed by the development from Ekapol's lab on foundation techniques in ASR, NECTEC on releasing Pathumma (series of models including LLM, ASR, Multimodal models), and SCB10X on recently releasing Typhoon 2. Our lab is also excited to be a part of the journey by releasing Thonburian Whisper, a fine-tuned Whisper for Thai.
+Hello January 2025! Thai automatic speech recognition (ASR) has made significant strides in recent years. I am particularly impressed by the work from Ekapol's lab on foundational techniques in ASR, NECTEC on releasing Pathumma (a series of models including LLM, ASR, and multimodal models), and SCB 10X on recently releasing Typhoon 2 (a series of LLMs and multimodal LLMs). Our lab is also excited to be a part of the journey by releasing Thonburian Whisper, a Whisper model fine-tuned for Thai.

 ## Origins

 In December 2022, our lab and Wordsense (a company affiliated with Looloo Technology) released Thonburian Whisper as part of Hugging Face's Whisper fine-tuning event. When we first released the model, our goal was to address the challenges that vanilla Whisper models faced with Thai speech. Through a careful combination of audio datasets, strategic augmentation, and improved segmentation, we achieved significant improvements in Word Error Rate (WER) across different model sizes. The distilled models proved particularly interesting, achieving strong performance with less than 1,500 hours of audio data. We recently published our model at [ICNLSP 2024](https://aclanthology.org/2024.icnlsp-1.17/) and have grown a community around the model on [GitHub](https://github.com/biodatlab/thonburian-whisper).

 ## Community Adoption

-It's been incredibly encouraging to see the adoption and extension of our work by the Thai AI community. I'd like to particularly thank the team at NECTEC for their work on [Pathumma Whisper](https://huggingface.co/nectec/Pathumma-whisper-th-large-v3), which builds upon and references our Thonburian Whisper approach. I also would like to thank [DMIND](https://aimet.tech/en/all-projects/dmind/), a national depression screening application, developed from AIMET and [PresScribe](https://looloohealth.com/en/), application that summarizes patient conversations into medical records, that use our model for national depression screening. If you use our model for Thai speech-to-text application, please feel free send message to us anytime.
+It's been incredibly encouraging to see the adoption and extension of our work by the Thai AI community. We'd particularly like to thank the team at NECTEC for their work on [Pathumma Whisper](https://huggingface.co/nectec/Pathumma-whisper-th-large-v3), which builds upon and references our Thonburian Whisper approach. We would also like to thank
+
+- [DMIND](https://aimet.tech/en/all-projects/dmind/), a national depression screening application developed by AIMET
+- [PresScribe](https://looloohealth.com/en/), an application that summarizes patient conversations into medical records

-## Expanding into Multimodal Applications
+both of which use our model as a base model. If you use our model for a Thai speech-to-text application, please feel free to send us a message anytime.

-The impact of Thonburian Whisper has extended beyond pure speech recognition. We're particularly excited about its integration into larger multimodal language models. The recent [Typhoon2-Audio project](https://arxiv.org/abs/2412.13702) demonstrates this perfectly, using a fine-tuned version of Thonburian Whisper Large as part of its speech encoding stack with BEATs for audio encoder. I highly recommend reading their Arxiv paper where they went in details of their training recipes.
+## Expanding into Multimodal LLM
+
+The impact of Thonburian Whisper has also extended beyond pure speech recognition. We're particularly excited about its integration into larger multimodal language models. The recent [Typhoon2-Audio project](https://arxiv.org/abs/2412.13702) demonstrates this perfectly, using a fine-tuned version of Thonburian Whisper Large as part of its speech encoding stack alongside BEATs as the audio encoder. We highly recommend reading their arXiv paper, which goes into detail on their audio LLM training recipe.

 <p align="center">
   <img src="/assets/blogs/typhoon2-audio.jpg" width=400>