TDNN-F + HMM VS TRANSFORMER FOR ENGLISH SPEECH RECOGNITION ON EDGE DEVICES

Authors

  • Azarya Aditya Krisna Moeljono Departemen Teknik Elektro dan Informatika, Universitas Negeri Malang, Malang, Indonesia
  • Muhammad Jauharul Fuady Departemen Teknik Elektro dan Informatika, Universitas Negeri Malang, Malang, Indonesia

Keywords:

Automatic Speech Recognition (ASR), Edge Computing, TDNN-F HMM, Transformer, Performance Benchmark

Abstract

Deploying Automatic Speech Recognition (ASR) on edge devices poses an engineering challenge: balancing transcription accuracy against computational efficiency. This study empirically compares a hybrid ASR architecture, the Factorized Time-Delay Neural Network with a Hidden Markov Model (TDNN-F+HMM) as implemented in Vosk, with a modern end-to-end architecture, the Transformer, as implemented in the Whisper tiny model. Tests were run directly on a Raspberry Pi 4 Model B edge device using a subset of the LibriSpeech corpus, and evaluated two main metrics: Word Error Rate (WER) for accuracy and execution time for efficiency. The experiments show that the Transformer architecture (Whisper) consistently achieves better accuracy, with an average WER of 0.096 against 0.108 for Vosk, a relative error reduction of 11.1%. In terms of efficiency, however, the TDNN-F+HMM architecture (Vosk) is substantially faster, with an average execution time of 4.043 seconds against 7.290 seconds for Whisper. Vosk therefore needs about 44.5% less time per run; equivalently, Whisper takes about 80% longer. The study concludes that there is a clear trade-off between the two architectures: Whisper offers higher accuracy, while Vosk delivers much lower latency. These findings give developers evidence-based guidance for choosing an ASR architecture that matches the priorities of a specific use case, whether the application demands high precision or real-time responsiveness.
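The two metrics and the relative figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, the WER routine assumes plain whitespace tokenization with no text normalization, and the only data used are the four averages reported in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Averages as reported in the abstract.
wer_vosk, wer_whisper = 0.108, 0.096
time_vosk, time_whisper = 4.043, 7.290  # seconds

wer_gain = relative_reduction(wer_vosk, wer_whisper)      # Whisper vs Vosk, accuracy
time_saving = relative_reduction(time_whisper, time_vosk)  # Vosk vs Whisper, time
speedup = time_whisper / time_vosk                         # how many times faster Vosk runs

print(f"Relative WER reduction: {wer_gain:.1%}")
print(f"Execution time saved:   {time_saving:.1%}")
print(f"Speedup factor:         {speedup:.2f}x")
```

Reporting both the time saved and the speedup factor avoids ambiguity, since the same pair of timings can be described with either baseline.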

Published

10-06-2025

Section

Articles