# NeMoConformerASR-Android

Kotlin library for speech recognition using the NVIDIA NeMo Conformer CTC model on Android with ONNX Runtime.

## Features

- NVIDIA NeMo Conformer CTC Small model (13M parameters)
- **ONNX Runtime** for reliable cross-device inference
- Returns both full text and timestamped segments
- Automatic audio chunking for long audio (>20 seconds)
- BPE tokenization (1,024-token vocabulary)
- Pure Kotlin implementation

## Requirements

- Android API 26+
- Any ARM or x86 device (ONNX Runtime handles compatibility)
## Installation

### JitPack

Add JitPack to your root `settings.gradle.kts`:

```kotlin
dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}
```

Add the dependency to your module's `build.gradle.kts`:

```kotlin
dependencies {
    implementation("com.github.Otosaku:NeMoConformerASR-Android:1.0.0")
}
```
### Download Models

Download the ONNX models from Google Drive:

**[Download Models (65 MB)](https://drive.google.com/file/d/1F2QBIyvxONhufgIA5xD0aN07wuN6Bn9r/view?usp=sharing)**

The archive contains:
- `conformer_encoder.onnx` - Conformer encoder (64 MB)
- `conformer_decoder.onnx` - CTC decoder (0.7 MB)
- `vocabulary.json` - BPE vocabulary (1024 tokens)

Download the models to the app's internal storage at runtime; they are not bundled in the APK, which keeps the app size small.
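One way to get the models into internal storage is to copy them from the app's assets on first launch (the example project below ships them in `assets/`). A minimal sketch; `copyModelFromAssets` is an illustrative helper, not part of the library's API:

```kotlin
import android.content.Context
import java.io.File

// Hypothetical helper: copy a model file from assets into filesDir
// on first run, returning the on-disk File either way.
fun copyModelFromAssets(context: Context, name: String): File {
    val target = File(context.filesDir, name)
    if (!target.exists()) {
        context.assets.open(name).use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return target
}
```

Call it once per model file (`conformer_encoder.onnx`, `conformer_decoder.onnx`, `vocabulary.json`) before constructing the recognizer, and pass the resulting paths in.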

## Usage

### Basic Recognition

```kotlin
import com.otosaku.nemoconformerasr.NeMoConformerASR

// Initialize with model file paths
val asr = NeMoConformerASR(
    context = context,
    encoderPath = "${context.filesDir}/conformer_encoder.onnx",
    decoderPath = "${context.filesDir}/conformer_decoder.onnx",
    vocabularyPath = "${context.filesDir}/vocabulary.json"
)

// Recognize speech (samples must be 16kHz mono Float32)
val audioSamples: FloatArray = loadAudio()
val result = asr.recognize(audioSamples)

// Full recognized text
println(result.text)

// Individual segments with timestamps
for (segment in result.segments) {
    println("[${segment.start}s - ${segment.end}s]: ${segment.text}")
}

// Audio duration
println("Duration: ${result.audioDuration}s")

// Don't forget to close when done
asr.close()
```
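`loadAudio()` above is left to the caller. One possible implementation, assuming a 16 kHz mono 16-bit PCM WAV file with a plain 44-byte header (real WAV files can carry extra chunks, so a production parser should walk the chunk list instead of hard-coding the offset):

```kotlin
import java.io.File

// Read a 16-bit PCM WAV file into normalized Float32 samples.
// Assumes a plain 44-byte header (RIFF + fmt + data, no extra chunks).
fun loadWavAsFloats(path: String): FloatArray {
    val bytes = File(path).readBytes()
    val pcm = bytes.copyOfRange(44, bytes.size)
    val samples = FloatArray(pcm.size / 2)
    for (i in samples.indices) {
        // Little-endian signed 16-bit sample, scaled to [-1.0, 1.0)
        val lo = pcm[2 * i].toInt() and 0xFF
        val hi = pcm[2 * i + 1].toInt()
        samples[i] = ((hi shl 8) or lo) / 32768.0f
    }
    return samples
}
```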

### ASRResult Structure

```kotlin
data class ASRResult(
    val text: String,               // Full recognized text
    val segments: List<ASRSegment>, // Timestamped segments
    val audioDuration: Double       // Total audio duration in seconds
)

data class ASRSegment(
    val start: Double, // Start time in seconds
    val end: Double,   // End time in seconds
    val text: String   // Recognized text for this segment
)
```
### Supported Input Durations

The model accepts up to 20 seconds of audio per inference. Longer audio is automatically split into chunks.

| Duration | Samples | Mel Frames | Encoded Frames |
|----------|---------|------------|----------------|
| 5 sec    | 80,000  | 501        | 126            |
| 10 sec   | 160,000 | 1,001      | 251            |
| 15 sec   | 240,000 | 1,501      | 376            |
| 20 sec   | 320,000 | 2,001      | 501            |
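The numbers in the table follow from a 10 ms hop (160 samples at 16 kHz) and 4x time subsampling in the Conformer encoder. A small sketch of the arithmetic, inferred from the table rather than taken from the library's code:

```kotlin
// Mel frames: one frame per 160-sample hop, plus one for the initial window.
fun melFrames(samples: Int): Int = samples / 160 + 1

// Encoded frames: the encoder subsamples time by 4x.
fun encodedFrames(mel: Int): Int = (mel - 1) / 4 + 1
```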

### Long Audio Processing

For audio longer than 20 seconds, the library automatically:
1. Splits the audio into 20-second chunks
2. Processes each chunk independently
3. Combines the results with corrected timestamps
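The chunking step can be pictured as follows. This is an illustrative sketch, not the library's internal implementation; each chunk carries its start offset in seconds so segment timestamps can be shifted back into the original timeline:

```kotlin
// Split audio into fixed 20-second chunks (320,000 samples at 16 kHz),
// pairing each chunk with its start offset in seconds.
fun chunkAudio(
    samples: FloatArray,
    chunkSize: Int = 20 * 16_000
): List<Pair<Double, FloatArray>> {
    val chunks = mutableListOf<Pair<Double, FloatArray>>()
    var offset = 0
    while (offset < samples.size) {
        val end = minOf(offset + chunkSize, samples.size)
        chunks += (offset / 16_000.0) to samples.copyOfRange(offset, end)
        offset = end
    }
    return chunks
}
```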

## Example Project

The repository includes a complete example app with audio recording and file import.

### Running the Example

1. Open the project in Android Studio

2. Download and add the models:
   - Download the models from the link above
   - Unzip the archive
   - Copy the files to `app/src/main/assets/`:
     - `conformer_encoder.onnx`
     - `conformer_decoder.onnx`
     - `vocabulary.json`

3. Build and run on a device
### Example Features

- **Record Audio**: Hold the button to record from the microphone
- **Test File**: Import an audio file for testing
- **Results**: Shows the recognized text, audio duration, and processing time
## Model Information

- **Model**: nvidia/stt_en_conformer_ctc_small
- **Parameters**: 13.15M
- **Architecture**: Conformer encoder (16 layers) + CTC decoder
- **Hidden dim**: 176
- **Attention heads**: 4
- **Vocabulary**: 1024 BPE tokens + 1 blank
## Audio Requirements

- Sample rate: 16,000 Hz
- Channels: mono
- Format: Float32
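Recordings from Android's `AudioRecord` typically arrive as signed 16-bit PCM, so a conversion to the required Float32 format is usually needed. An illustrative helper (not part of the library):

```kotlin
// Convert signed 16-bit PCM samples to normalized Float32 in [-1.0, 1.0).
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768.0f }
```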

## Model Architecture

| Component         | Input                | Output                | Size   |
|-------------------|----------------------|-----------------------|--------|
| Feature Extractor | audio (16 kHz)       | mel (80, frames)      | -      |
| Encoder           | mel (1, 80, 2001)    | hidden (1, 176, 501)  | 64 MB  |
| Decoder           | hidden (1, 176, 501) | logits (1, 501, 1025) | 0.7 MB |
## Dependencies

- [ONNX Runtime Android](https://onnxruntime.ai/) - ML inference runtime
- [NeMoFeatureExtractor-Android](https://github.com/Otosaku/NeMoFeatureExtractor-Android) - Mel spectrogram extraction
- [Gson](https://github.com/google/gson) - JSON parsing
## License

MIT License

## Acknowledgments

- [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) - Original model and training
- [ONNX Runtime](https://onnxruntime.ai/) - Cross-platform ML inference