Research Article

Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması

Year 2024, Volume: 6 Issue: 1, 113 - 124, 30.04.2024
https://doi.org/10.46387/bjesr.1452937

Abstract

In this study, a compression method based on a deep learning autoencoder and residual vector quantization is proposed for compressing speech signals. In the proposed method, an autoencoder is first used to map the input speech signal to a lower-dimensional space, and the autoencoder output is then further compressed with residual vector quantization. The method offers different compression ratios by means of two decoder structures operating in parallel and two codebooks. Its performance was tested on the TIMIT dataset using the Perceptual Evaluation of Speech Quality metric. The proposed speech compression method achieved Perceptual Evaluation of Speech Quality scores of 1.665 and 1.985 at transmission rates of 1.25 and 2.5 kbps, respectively.

Supporting Institution

Bursa Teknik Üniversitesi

Project Number

230D005

Thanks

This study was supported by the Scientific Research Projects Unit of Bursa Teknik Üniversitesi (Bursa Technical University). Project No: 230D005.

References

  • P.K. Mongia, and R.K. Sharma, “Estimation and statistical analysis of human voice parameters to investigate the influence of psychological stress and to determine the vocal tract transfer function of an individual,” Journal of Computer Networks and Communications, vol. 2014, no. 17, pp. 1-17, 2014.
  • T.F. Quatieri, “Discrete-time speech signal processing: principles and practice,” Pearson Education India, 2002.
  • P. Warkade, and A. Mishra, “Lossless Speech Compression Techniques: A Literature Review,” International Journal of Innovative Research in Computer Science & Technology, vol. 3, pp. 25-32, 2015.
  • T. Ogunfunmi, and M. Narasimha, “Principles of speech coding.” CRC Press, 2010.
  • L. Rabiner, and R. Schafer, “Theory and applications of digital speech processing.” Prentice Hall Press, USA, 2010.
  • D. O'Shaughnessy, “Linear predictive coding,” IEEE Potentials, vol. 7, pp. 29-32, 1988.
  • M. Schroeder, and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 10, pp. 937-940, 1985.
  • T. Unno, T.P. Barnwell, and K. Truong, “An improved mixed excitation linear prediction (MELP) coder,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 245-248, 1999.
  • Ü. Güz, H. Gürkan, and B.S. Yarman, “A new method to represent speech signals via predefined signature and envelope sequences,” EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1–17, 2006.
  • B.S. Yarman, Ü. Güz, and H. Gürkan, “On the comparative results of ‘sympes: A new method of speech modeling’,” AEU-International Journal of Electronics and Communications, vol. 60, no. 6, pp. 421–427, 2006.
  • A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
  • S. Kankanahalli, “End-to-end optimized speech coding with deep neural networks,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2521-2525, 2018.
  • H.Y. Keles, J. Rozhon, H.G. Ilk, and M. Voznak, “DeepVoCoder: A CNN model for compression and coding of narrow band speech,” IEEE Access, vol. 7, pp. 75081-75089, 2019.
  • Y.T. Lo, S.S. Wang, Y. Tsao, and S.Y.A. Peng, “Pruned-CELP Speech Codec Using Denoising Autoencoder with Spectral Compensation for Quality and Intelligibility Enhancement,” IEEE International Conference on Artificial Intelligence Circuits and Systems, pp. 150-151, 2019.
  • K. Zhen, J. Sung, M.S. Lee, S. Beack, and M. Kim, “Cascaded cross-module residual learning towards lightweight end-to-end speech coding,” arXiv:1906.07769, 2019.
  • D.N. Rim, I. Jang, and H. Choi, "Deep neural networks and end-to-end learning for audio compression," arXiv:2105.11681, 2021.
  • J. Byun, S. Shin, Y. Park, J. Sung, and S. Beack, “Optimization of deep neural network (DNN) speech coder using a multi time scale perceptual loss function,” in Proceedings of the Annual Conference of the International Speech Communication Association, pp. 4411–4415, 2022.
  • H. Yang, K. Zhen, S. Beack, and M. Kim, “Source-aware neural speech coding for noisy speech compression,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 706-710, 2021.
  • J. Zhang, C. Zhao, and W. Gao, “Optimization-inspired compact deep compressive sensing,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 4, pp. 765-774, 2020.
  • M. Zhang, S. Liu, and Y. Wu, “Compression and Enhancement of Speech with Low SNR based on Deep Learning,” IEEE International Conference on Machine Learning, Big Data and Business Intelligence, pp. 242-248, 2022.
  • K. Zhen, J. Sung, M. S. Lee, S. Beack, and M. Kim, “Scalable and efficient neural speech coding: A hybrid design,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 12-25, 2022.
  • R. Lotfidereshgi, and P. Gournay, “Practical cognitive speech compression,” IEEE Data Science and Learning Workshop, pp. 1-6, 2022.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, and A. Rabinovich, “Going deeper with convolutions,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
  • N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495-507, 2021.
  • D.P. Kingma, and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
  • J.S. Garofolo, “TIMIT acoustic-phonetic continuous speech corpus LDC93S1,” Linguistic Data Consortium, 1993. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93S1
  • R.F. Kubichek, “Standards and technology issues in objective voice quality assessment,” Digital Signal Processing, vol. 1, no. 2, pp. 38–44, 1991.
  • A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749-752, 2001.

Deep Convolutional Autoencoder and Residual Vector Quantization-Based Compression of Speech Signals


Abstract

This paper proposes a compression method for speech signals based on a deep convolutional autoencoder and residual vector quantization. In the proposed method, the encoder part of an autoencoder first maps the input speech signal to a lower-dimensional (code) space, and the code is then further compressed via residual vector quantization. The method offers different compression ratios by means of two decoder structures operating in parallel and two codebooks. Its performance is evaluated on the TIMIT dataset using the Perceptual Evaluation of Speech Quality (PESQ) metric. The proposed speech compression method achieves PESQ scores of 1.665 and 1.985 at transmission rates of 1.25 and 2.5 kbps, respectively.
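For readers unfamiliar with residual vector quantization, the sketch below illustrates the general technique named in the abstract: each codebook stage quantizes the residual error left by the previous stage, so decoding from one stage or from both yields different bit rates and reconstruction qualities. This is a minimal illustration only, not the authors' implementation; the function name, codebook sizes, and code dimension are hypothetical.

```python
import numpy as np

def residual_vector_quantize(code, codebooks):
    """Quantize a code vector with a cascade of codebooks (residual VQ).

    Each stage picks the codeword nearest to the current residual and
    hands the remaining error to the next stage; transmitting more stage
    indices raises the bit rate and lowers the distortion.
    """
    residual = np.asarray(code, dtype=np.float64).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:                              # cb: (num_codewords, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)                           # index sent to the decoder
        quantized += cb[idx]                          # cumulative reconstruction
        residual -= cb[idx]                           # error passed to the next stage
    return indices, quantized

# Toy usage with two random (hypothetical) codebooks, echoing the two-codebook
# idea: stage 1 alone gives a coarse, lower-rate reconstruction, while both
# stages together give a finer, higher-rate one.
rng = np.random.default_rng(0)
dim = 16
codebooks = [rng.normal(size=(256, dim)) for _ in range(2)]   # 8 bits per stage
code = rng.normal(size=dim)                                   # stand-in for an encoder output
indices, fine = residual_vector_quantize(code, codebooks)
coarse = codebooks[0][indices[0]]
print(indices,
      round(float(np.linalg.norm(code - coarse)), 3),
      round(float(np.linalg.norm(code - fine)), 3))
```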


Details

Primary Language Turkish
Subjects Coding, Information Theory and Compression, Deep Learning
Journal Section Research Articles
Authors

Tahir Bekiryazıcı 0000-0002-0664-649X

Gürkan Aydemir 0000-0001-9213-576X

Hakan Gürkan 0000-0002-7008-4778

Project Number 230D005
Early Pub Date April 27, 2024
Publication Date April 30, 2024
Submission Date March 14, 2024
Acceptance Date April 15, 2024
Published in Issue Year 2024 Volume: 6 Issue: 1

Cite

APA Bekiryazıcı, T., Aydemir, G., & Gürkan, H. (2024). Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. Mühendislik Bilimleri Ve Araştırmaları Dergisi, 6(1), 113-124. https://doi.org/10.46387/bjesr.1452937
AMA Bekiryazıcı T, Aydemir G, Gürkan H. Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. BJESR. April 2024;6(1):113-124. doi:10.46387/bjesr.1452937
Chicago Bekiryazıcı, Tahir, Gürkan Aydemir, and Hakan Gürkan. “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı Ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”. Mühendislik Bilimleri Ve Araştırmaları Dergisi 6, no. 1 (April 2024): 113-24. https://doi.org/10.46387/bjesr.1452937.
EndNote Bekiryazıcı T, Aydemir G, Gürkan H (April 1, 2024) Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. Mühendislik Bilimleri ve Araştırmaları Dergisi 6 1 113–124.
IEEE T. Bekiryazıcı, G. Aydemir, and H. Gürkan, “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”, BJESR, vol. 6, no. 1, pp. 113–124, 2024, doi: 10.46387/bjesr.1452937.
ISNAD Bekiryazıcı, Tahir et al. “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı Ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”. Mühendislik Bilimleri ve Araştırmaları Dergisi 6/1 (April 2024), 113-124. https://doi.org/10.46387/bjesr.1452937.
JAMA Bekiryazıcı T, Aydemir G, Gürkan H. Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. BJESR. 2024;6:113–124.
MLA Bekiryazıcı, Tahir et al. “Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı Ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması”. Mühendislik Bilimleri Ve Araştırmaları Dergisi, vol. 6, no. 1, 2024, pp. 113-24, doi:10.46387/bjesr.1452937.
Vancouver Bekiryazıcı T, Aydemir G, Gürkan H. Konuşma İşaretlerinin Derin Evrişimsel Oto Kodlayıcı ve Artık Vektör Nicemleme Tabanlı Sıkıştırılması. BJESR. 2024;6(1):113-24.