TDNN-F + HMM VS TRANSFORMER FOR ENGLISH SPEECH RECOGNITION ON EDGE DEVICES
Keywords:
Automatic Speech Recognition (ASR), Edge Computing, TDNN-F HMM, Transformer, Performance Benchmark
Abstract
Deploying Automatic Speech Recognition (ASR) on edge devices poses an engineering challenge: balancing transcription accuracy against computational efficiency. This study empirically compares a hybrid ASR architecture, the Factorized Time-Delay Neural Network with a Hidden Markov Model (TDNN-F+HMM) as implemented in Vosk, with a modern end-to-end Transformer architecture as implemented in the Whisper tiny model. Tests were run directly on a Raspberry Pi 4 Model B edge device using a subset of the LibriSpeech corpus, and two main metrics were evaluated: Word Error Rate (WER) for accuracy and execution time for efficiency. The experiments show that the Transformer architecture (Whisper) consistently achieved better accuracy, with a mean WER of 0.096 versus 0.108 for Vosk, a relative error reduction of 11.1%. In terms of efficiency, however, the TDNN-F+HMM architecture (Vosk) was significantly faster, with a mean execution time of 4.043 seconds against 7.290 seconds for Whisper; Vosk was therefore about 1.8 times faster (Whisper took roughly 80% longer, and Vosk's execution time was about 44.5% lower). The study concludes that there is a clear trade-off between the two architectures: Whisper offers higher accuracy, while Vosk delivers much lower latency. These findings give developers evidence-based guidance for choosing an ASR architecture that matches the priorities of a specific use case, whether the application demands high precision or real-time responsiveness.
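The two metrics and the reported ratios can be checked with a short sketch. It assumes WER is the word-level edit distance divided by the number of reference words, which is the usual definition; the abstract does not state which text normalization or tooling the authors used, so the function names `word_error_rate` and `relative_reduction` and the lowercase-and-split normalization are illustrative assumptions, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Mean values reported in the abstract.
wer_vosk, wer_whisper = 0.108, 0.096
time_vosk, time_whisper = 4.043, 7.290

print(f"Relative WER reduction (Whisper vs Vosk): {relative_reduction(wer_vosk, wer_whisper):.1%}")
print(f"Execution time reduction (Vosk vs Whisper): {relative_reduction(time_whisper, time_vosk):.1%}")
print(f"Speed ratio (Whisper time / Vosk time): {time_whisper / time_vosk:.2f}x")
```

With the reported means, the relative WER reduction comes out at 11.1%, the execution time reduction at 44.5%, and the speed ratio at about 1.80. Calling `word_error_rate("the cat sat on the mat", "the cat sit on mat")` gives 2/6, since the hypothesis contains one substitution and one deletion against six reference words.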




