Bilingual Speech Emotion Recognition Using Neural Networks: A Case Study for Turkish and English Languages

Özsönmez D. B., ACARMAN T., PARLAK İ. B.

International Conference on Intelligent and Fuzzy Systems, INFUS 2021, İstanbul, Türkiye, 24-26 August 2021, vol. 308, pp. 313-320 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
Volume: 308
DOI: 10.1007/978-3-030-85577-2_37
City of Publication: İstanbul
Country of Publication: Türkiye
Pages: 313-320
Keywords: Deep learning, Emotion detection, Machine learning, Natural language processing, Speech analysis
Affiliated with Galatasaray University: Yes

Abstract

© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Emotion extraction and detection are considered complex tasks due to the nature of the data and the subjects involved in the acquisition of sentiments. Speech analysis becomes a critical gateway in deep learning, where acoustic features are used in training to obtain more accurate descriptors that disentangle sentiments and customs in natural language. Speech feature extraction varies with the quality of the audio recordings and with linguistic properties. The nature of speech is handled through a broad spectrum of emotions with regard to the age, gender and social effects of the subjects. Speech emotion analysis has been fostered in English and German through multilevel corpora. The emotion features disseminate the acoustic analysis in videos or texts. In this study, we propose a multilingual analysis of emotion extraction using the Turkish and English languages. Mel-Frequency Cepstral Coefficients (MFCC), Mel spectrogram, Linear Predictive Coding (LPC) and PLP-RASTA techniques are used to extract acoustic features. Three different data sets are analyzed using a feed-forward neural network hierarchy. Different emotion states such as happy, calm, sad and angry are compared in bilingual speech recordings. The accuracy and precision metrics reach levels higher than 80%. Emotion classification for Turkish is concluded to be more accurate with respect to speech features.
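
As a rough illustration of the classification stage named in the abstract, the following is a minimal sketch of a feed-forward network that maps an acoustic feature vector to one of the four emotion states. It is not the authors' architecture: the layer sizes, the 13-dimensional feature vector, the random untrained weights, and all function names are assumptions made for illustration only, and the feature vector is a placeholder rather than real MFCC, Mel spectrogram, LPC or PLP-RASTA output.

```python
import math
import random

EMOTIONS = ["happy", "calm", "sad", "angry"]  # emotion states named in the abstract


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine map: one output value per row of the weight matrix."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, hidden, output):
    """Forward pass: features -> ReLU hidden layer -> softmax over emotions."""
    probs = softmax(dense(relu(dense(features, hidden)), output))
    label = EMOTIONS[max(range(len(probs)), key=probs.__getitem__)]
    return label, probs


rng = random.Random(0)
N_FEATURES = 13  # assumption: e.g. 13 time-averaged MFCCs per utterance
N_HIDDEN = 8     # assumption: hidden width chosen arbitrarily for the sketch
hidden_layer = init_layer(N_FEATURES, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, len(EMOTIONS), rng)

# Placeholder feature vector; a real one would come from an audio front end.
features = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
label, probs = predict(features, hidden_layer, output_layer)
print(label, [round(p, 3) for p in probs])
```

Because the weights are untrained, the predicted label is arbitrary; the sketch only shows how a fixed-length feature vector flows through a hidden layer to a probability for each emotion class.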