Commit 58678bb

otosaku-ai and claude committed
Initial commit: NeMoConformerASR-Android with ONNX Runtime

- NeMo Conformer CTC Small model for speech recognition
- ONNX Runtime for cross-device compatibility
- Support for audio up to 20 seconds (auto-chunking for longer)
- Returns timestamped segments for long audio
- Example app with recording and file import

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

0 parents  commit 58678bb

27 files changed

Lines changed: 1724 additions & 0 deletions

.gitignore

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Built application files
*.apk
*.ap_
*.aab

# Files for the ART/Dalvik VM
*.dex

# Java class files
*.class

# Generated files
bin/
gen/
out/
build/

# Gradle files
.gradle/
build/

# Local configuration file (sdk path, etc)
local.properties

# Log Files
*.log

# Android Studio Navigation editor temp files
.navigation/

# Android Studio captures folder
captures/

# IntelliJ
*.iml
.idea/

# Keystore files
*.jks
*.keystore

# Google Services
google-services.json

# MacOS
.DS_Store

# Model files (user should add these to app/src/main/assets/)
*.pte
*.onnx
app/src/main/assets/conformer_encoder.onnx
app/src/main/assets/conformer_decoder.onnx
app/src/main/assets/vocabulary.json
app/src/main/assets/sample_audio.wav

README.md

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
# NeMoConformerASR-Android

Kotlin library for speech recognition on Android using the NVIDIA NeMo Conformer CTC model with ONNX Runtime.

## Features

- NVIDIA NeMo Conformer CTC Small model (13M parameters)
- **ONNX Runtime** for reliable cross-device inference
- Returns both full text and timestamped segments
- Automatic audio chunking for long audio (>20 seconds)
- BPE tokenization (1024-token vocabulary)
- Pure Kotlin implementation

## Requirements

- Android API 26+
- Any ARM or x86 device (ONNX Runtime handles compatibility)

## Installation

### JitPack

Add JitPack to your root `settings.gradle.kts`:

```kotlin
dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}
```

Add the dependency to your module's `build.gradle.kts`:

```kotlin
dependencies {
    implementation("com.github.Otosaku:NeMoConformerASR-Android:1.0.0")
}
```

### Download Models

Download the ONNX models from Google Drive:

**[Download Models (65 MB)](https://drive.google.com/file/d/1F2QBIyvxONhufgIA5xD0aN07wuN6Bn9r/view?usp=sharing)**

The archive contains:
- `conformer_encoder.onnx` - Conformer encoder (64 MB)
- `conformer_decoder.onnx` - CTC decoder (0.7 MB)
- `vocabulary.json` - BPE vocabulary (1024 tokens)

Models should be downloaded to the app's internal storage rather than bundled in the APK, which keeps the app size down.
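
Because the models live outside the APK, it can help to confirm that all three files are present before constructing the recognizer. The following is a minimal sketch: `missingModelFiles` is a hypothetical helper, not part of the library, and the file names are the ones listed in the archive contents above.

```kotlin
import java.io.File

// File names as listed in the model archive.
val REQUIRED_MODEL_FILES = listOf(
    "conformer_encoder.onnx",
    "conformer_decoder.onnx",
    "vocabulary.json"
)

// Hypothetical helper (not part of the library): returns the names of
// required model files that are missing or empty in the given directory.
fun missingModelFiles(modelDir: File): List<String> =
    REQUIRED_MODEL_FILES.filter { name ->
        val file = File(modelDir, name)
        !file.isFile || file.length() == 0L
    }
```

On Android, `modelDir` would typically be `context.filesDir`. An empty result suggests the paths can be passed to the constructor; a non-empty one means the download still has to happen.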

## Usage

### Basic Recognition

```kotlin
import com.otosaku.nemoconformerasr.NeMoConformerASR

// Initialize with model file paths
val asr = NeMoConformerASR(
    context = context,
    encoderPath = "${context.filesDir}/conformer_encoder.onnx",
    decoderPath = "${context.filesDir}/conformer_decoder.onnx",
    vocabularyPath = "${context.filesDir}/vocabulary.json"
)

// Recognize speech (samples must be 16 kHz mono Float32).
// loadAudio() is a placeholder for your own audio source.
val audioSamples: FloatArray = loadAudio()
val result = asr.recognize(audioSamples)

// Full recognized text
println(result.text)

// Individual segments with timestamps
for (segment in result.segments) {
    println("[${segment.start}s - ${segment.end}s]: ${segment.text}")
}

// Audio duration
println("Duration: ${result.audioDuration}s")

// Don't forget to close when done
asr.close()
```

### ASRResult Structure

```kotlin
data class ASRResult(
    val text: String,                // Full recognized text
    val segments: List<ASRSegment>,  // Timestamped segments
    val audioDuration: Double        // Total audio duration in seconds
)

data class ASRSegment(
    val start: Double,  // Start time in seconds
    val end: Double,    // End time in seconds
    val text: String    // Recognized text for this segment
)
```

### Supported Input Durations

The model accepts up to 20 seconds of audio per inference. Longer audio is automatically split into chunks.

| Duration | Samples | Mel Frames | Encoded Frames |
|----------|---------|------------|----------------|
| 5 sec    | 80,000  | 501        | 126            |
| 10 sec   | 160,000 | 1,001      | 251            |
| 15 sec   | 240,000 | 1,501      | 376            |
| 20 sec   | 320,000 | 2,001      | 501            |
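
The numbers in this table are consistent with a 160-sample mel hop (10 ms at 16 kHz) plus one extra frame, followed by 4x time subsampling in the encoder. Both values are inferred from the table rather than taken from the library source, so the sketch below is only a sanity check under those assumptions; `frameCounts` and `FrameCounts` are illustrative names.

```kotlin
const val SAMPLE_RATE_HZ = 16_000
const val HOP_LENGTH = 160   // assumed: 10 ms hop, inferred from the table
const val SUBSAMPLING = 4    // assumed: encoder time reduction, inferred from the table

data class FrameCounts(val samples: Int, val melFrames: Int, val encodedFrames: Int)

// Reproduces one table row from a duration in whole seconds.
fun frameCounts(durationSeconds: Int): FrameCounts {
    val samples = durationSeconds * SAMPLE_RATE_HZ
    val melFrames = samples / HOP_LENGTH + 1
    // Ceiling division: a partial group of mel frames still yields one encoded frame.
    val encodedFrames = (melFrames + SUBSAMPLING - 1) / SUBSAMPLING
    return FrameCounts(samples, melFrames, encodedFrames)
}
```

For example, `frameCounts(20)` gives 320,000 samples, 2,001 mel frames, and 501 encoded frames, matching the last row.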

### Long Audio Processing

For audio longer than 20 seconds, the library automatically:
1. Splits audio into 20-second chunks
2. Processes each chunk independently
3. Combines results with proper timestamps
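
The splitting step above can be sketched as follows. This is an illustration, not the library's code: `AudioChunk` and `splitIntoChunks` are hypothetical names, and the sketch assumes plain non-overlapping slices, since the README does not say whether chunks overlap or how words cut at a boundary are handled.

```kotlin
// Hypothetical type for illustration: a slice of audio plus its start time.
class AudioChunk(val startSeconds: Double, val samples: FloatArray)

// Splits audio into consecutive, non-overlapping chunks of at most
// `chunkSeconds` each. The final chunk may be shorter.
fun splitIntoChunks(
    audio: FloatArray,
    sampleRate: Int = 16_000,
    chunkSeconds: Int = 20
): List<AudioChunk> {
    val chunkSamples = sampleRate * chunkSeconds
    val chunks = mutableListOf<AudioChunk>()
    var offset = 0
    while (offset < audio.size) {
        val end = minOf(offset + chunkSamples, audio.size)
        chunks += AudioChunk(
            startSeconds = offset.toDouble() / sampleRate,
            samples = audio.copyOfRange(offset, end)
        )
        offset = end
    }
    return chunks
}
```

Under this scheme, combining results would amount to adding each chunk's `startSeconds` to the `start` and `end` of the segments recognized in that chunk.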

## Example Project

The repository includes a complete example app with audio recording and file import.

### Running the Example

1. Open the project in Android Studio

2. Download and add models:
   - Download models from the link above
   - Unzip the archive
   - Copy files to `app/src/main/assets/`:
     - `conformer_encoder.onnx`
     - `conformer_decoder.onnx`
     - `vocabulary.json`

3. Build and run on a device

### Example Features

- **Record Audio**: Hold button to record from microphone
- **Test File**: Import audio file for testing
- **Results**: Shows recognized text, duration, and processing time

## Model Information

- **Model**: nvidia/stt_en_conformer_ctc_small
- **Parameters**: 13.15M
- **Architecture**: Conformer encoder (16 layers) + CTC decoder
- **Hidden dim**: 176
- **Attention heads**: 4
- **Vocabulary**: 1024 BPE tokens + 1 blank

## Audio Requirements

- Sample rate: 16,000 Hz
- Channels: Mono
- Format: Float32
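
Android's `AudioRecord` is often configured for 16-bit PCM, in which case the samples need converting before they are passed to `recognize`. A minimal sketch, assuming the model expects values scaled to [-1.0, 1.0) (a common convention that this README does not confirm); `pcm16ToFloat` is a hypothetical helper, not part of the library.

```kotlin
// Hypothetical helper: converts 16-bit PCM samples to Float32.
// Assumption: the model expects values scaled to [-1.0, 1.0).
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }
```

Dividing by 32768 maps `Short.MIN_VALUE` to exactly -1.0 and keeps `Short.MAX_VALUE` just below 1.0.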

## Model Architecture

| Component | Input | Output | Size |
|-----------|-------|--------|------|
| Feature Extractor | audio (16 kHz) | mel (80, frames) | - |
| Encoder | mel (1, 80, 2001) | hidden (1, 176, 501) | 64 MB |
| Decoder | hidden (1, 176, 501) | logits (1, 501, 1025) | 0.7 MB |
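
The decoder's `(1, 501, 1025)` logits hold one score per vocabulary entry for each encoded frame. A common way to turn CTC output into token IDs is greedy decoding: take the arg-max per frame, collapse consecutive repeats, and drop blanks. The sketch below assumes the blank is the last index (1024), which matches NeMo's usual convention but is not stated in this README. Whether the library itself decodes greedily is also an assumption, and `ctcGreedyDecode` is an illustrative name.

```kotlin
// Greedy CTC decoding sketch. `logits` is [frames][vocabSize] for one utterance.
// Assumption: the blank token is the last index (1024 for this model).
fun ctcGreedyDecode(logits: Array<FloatArray>, blankId: Int = 1024): List<Int> {
    val tokens = mutableListOf<Int>()
    var previous = blankId
    for (frame in logits) {
        // Arg-max over the vocabulary dimension for this frame.
        var best = 0
        for (i in 1 until frame.size) {
            if (frame[i] > frame[best]) best = i
        }
        // Emit only when the label changes and is not the blank.
        if (best != previous && best != blankId) tokens += best
        previous = best
    }
    return tokens
}
```

The resulting IDs would then be mapped to BPE pieces through `vocabulary.json` and joined into text.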

## Dependencies

- [ONNX Runtime Android](https://onnxruntime.ai/) - ML inference runtime
- [NeMoFeatureExtractor-Android](https://github.com/Otosaku/NeMoFeatureExtractor-Android) - Mel spectrogram extraction
- [Gson](https://github.com/google/gson) - JSON parsing

## License

MIT License

## Acknowledgments

- [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) - Original model and training
- [ONNX Runtime](https://onnxruntime.ai/) - Cross-platform ML inference

app/build.gradle.kts

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
plugins {
    id("com.android.application")
    id("org.jetbrains.kotlin.android")
}

android {
    namespace = "com.otosaku.conformerexample"
    compileSdk = 34

    defaultConfig {
        applicationId = "com.otosaku.conformerexample"
        minSdk = 26
        targetSdk = 34
        versionCode = 1
        versionName = "1.0"
    }

    buildTypes {
        release {
            isMinifyEnabled = false
            proguardFiles(
                getDefaultProguardFile("proguard-android-optimize.txt"),
                "proguard-rules.pro"
            )
        }
    }

    compileOptions {
        sourceCompatibility = JavaVersion.VERSION_1_8
        targetCompatibility = JavaVersion.VERSION_1_8
    }

    kotlinOptions {
        jvmTarget = "1.8"
    }

    buildFeatures {
        viewBinding = true
    }
}

dependencies {
    implementation(project(":library"))

    implementation("androidx.core:core-ktx:1.12.0")
    implementation("androidx.appcompat:appcompat:1.6.1")
    implementation("com.google.android.material:material:1.11.0")
    implementation("androidx.activity:activity-ktx:1.8.2")
    implementation("androidx.lifecycle:lifecycle-runtime-ktx:2.7.0")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3")
}

app/proguard-rules.pro

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
# Add project specific ProGuard rules here.

app/src/main/AndroidManifest.xml

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
<?xml version="1.0" encoding="utf-8"?>
2+
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
3+
4+
<uses-permission android:name="android.permission.RECORD_AUDIO" />
5+
6+
<application
7+
android:allowBackup="true"
8+
android:icon="@mipmap/ic_launcher"
9+
android:label="@string/app_name"
10+
android:roundIcon="@mipmap/ic_launcher_round"
11+
android:supportsRtl="true"
12+
android:theme="@style/Theme.ConformerExample">
13+
<activity
14+
android:name=".MainActivity"
15+
android:exported="true">
16+
<intent-filter>
17+
<action android:name="android.intent.action.MAIN" />
18+
<category android:name="android.intent.category.LAUNCHER" />
19+
</intent-filter>
20+
</activity>
21+
</application>
22+
23+
</manifest>
