Article

Multimodal Emotion Recognition in English Conversations Using Fusion of NLP and Computer Vision Techniques

A. Venu Gopal Reddy, Siddhartha Academy of Higher Education (SAHE) Deemed to be University, Department of English, Vijayawada, Andhra Pradesh, India
Rashmi B, Vemana Institute of Technology, Department of ECE, Bengaluru, Karnataka, India, 560034
Dilfuza Gulyamova, University of Information Technologies named after Muhammad al-Khwarizmi, Computer Engineering Department, Tashkent, Uzbekistan
G. S. Bansode, KL (Deemed to Be) University (KLEF), Department of English, Guntur, Andhra Pradesh, India
Pavan Kumar Nowbattula
Anuradha. S

2026

ABI

Abstract

Accurate emotion recognition is vital for natural human-computer interaction. This study proposes a multimodal approach that combines recent NLP and computer vision techniques to identify emotions in English conversations. RoBERTa captures linguistic meaning and context, while a CNN trained on AffectNet data detects facial expressions. Consistent preprocessing standardizes the data format and the conversation frames. A late-fusion strategy that combines weighted streams with a meta-classifier improves the reliability of predictions. On MELD, the multimodal system reaches 98.7 % accuracy, outperforming the text-only and vision-only models. These findings indicate the strong potential of multimodal learning for emotion-sensitive virtual assistants, affective computing, and intelligent interaction systems.
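The late-fusion idea in the abstract can be illustrated with a minimal sketch. It assumes each stream (text and face) already outputs a probability per emotion class; the function names, the default weight of 0.6, and the example scores are hypothetical and not taken from the paper. The seven labels are the MELD emotion classes. The meta-classifier itself is not implemented here; `meta_features` only shows the stacked input such a classifier might receive.

```python
from typing import List, Sequence

# The seven MELD emotion classes; the ordering here is an arbitrary choice.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def weighted_late_fusion(
    text_probs: Sequence[float],
    face_probs: Sequence[float],
    text_weight: float = 0.6,  # assumed value, not reported in the source
) -> List[float]:
    """Blend two per-class probability vectors with a convex weight."""
    if len(text_probs) != len(face_probs):
        raise ValueError("both streams must score the same classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    fused = [
        text_weight * t + (1.0 - text_weight) * f
        for t, f in zip(text_probs, face_probs)
    ]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [p / total for p in fused]  # renormalise so the scores sum to 1


def meta_features(
    text_probs: Sequence[float], face_probs: Sequence[float]
) -> List[float]:
    """Concatenate both stream outputs as input for a stacked meta-classifier."""
    return list(text_probs) + list(face_probs)


def predict_emotion(
    text_probs: Sequence[float],
    face_probs: Sequence[float],
    text_weight: float = 0.6,
) -> str:
    """Return the label with the highest fused probability."""
    fused = weighted_late_fusion(text_probs, face_probs, text_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


# Hypothetical scores: the text stream favours "joy", the face stream "neutral".
text_scores = [0.05, 0.05, 0.05, 0.50, 0.25, 0.05, 0.05]
face_scores = [0.05, 0.05, 0.05, 0.25, 0.50, 0.05, 0.05]
print(predict_emotion(text_scores, face_scores, text_weight=0.6))
print(predict_emotion(text_scores, face_scores, text_weight=0.3))
```

With these made-up scores, the stream with the larger weight decides the outcome, which is why the weight (or a trained meta-classifier in its place) matters when the modalities disagree.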

Topics

Emotion and Mood Recognition
Speech and dialogue systems
Subtitles and Audiovisual Media

Identifiers

DOI: 10.1109/icaect68478.2026.11426023

Citations and references

Cited by 0
13 references
