Article

Deep Learning Algorithms for Multimodal Interaction Using Speech and Motion Data in Virtual Reality Systems

Ahmed Zubair, Faculty of Computer Science, American University of Sharjah, UAE

Fatima Al Rashed, Faculty of Computer Science, American University of Sharjah, UAE

2024 (en)

ABI

Abstract

Multimodal interaction (MMI) based on speech and motion data (SMD) has enormous potential in virtual reality (VR) systems. However, real-time synchronization, context-sensitive interpretation, and effective fusion of heterogeneous data modalities remain open problems. This study presents a deep learning-based framework that fuses speech and motion data to improve interaction performance. The proposed method, MMI-CNNRNN, combines a Convolutional Neural Network (CNN) for speech feature extraction with a Recurrent Neural Network (RNN) for temporal motion analysis, integrated into a Transformer-based architecture to enhance the system's contextual understanding and responsiveness. The framework is evaluated on benchmark multimodal datasets such as IEMOCAP. The results show a 20% increase in interaction accuracy and a 15% reduction in latency compared to unimodal and early-fusion methods. The fusion of CNN and RNN mechanisms yields more natural and intuitive interactions, making both assistive devices and VR environments more adaptive and user-friendly. The findings suggest that efficient multimodal system development supports better accessibility and engagement for users with diverse needs.
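The abstract names three components (a CNN over speech, an RNN over motion, and a Transformer-style fusion stage) but gives no layer sizes, weights, or training details. The following is therefore only a minimal sketch of that general pattern, not the authors' MMI-CNNRNN implementation. It uses plain Python with random, untrained weights; the function names (`speech_cnn`, `motion_rnn`, `self_attention`, `fuse`), the dimensions, and the use of a single attention head with identity projections are all assumptions made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random matrix standing in for trained weights (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def speech_cnn(frames, W, width=3):
    """1D convolution over time (no padding) followed by ReLU.

    frames: T vectors of size D; W: H x (width * D) filter bank.
    Returns T - width + 1 vectors of size H.
    """
    out = []
    for t in range(len(frames) - width + 1):
        window = [v for frame in frames[t:t + width] for v in frame]
        out.append([max(0.0, h) for h in matvec(W, window)])
    return out


def motion_rnn(poses, W_in, W_rec):
    """Elman RNN: h_t = tanh(W_in x_t + W_rec h_{t-1}); returns all states."""
    h = [0.0] * len(W_rec)
    states = []
    for x in poses:
        a, b = matvec(W_in, x), matvec(W_rec, h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        states.append(h)
    return states


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention, identity projections."""
    d = len(tokens[0])
    fused = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        fused.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return fused


def fuse(speech_frames, motion_poses, params):
    """Encode each modality, attend over the joint sequence, mean-pool."""
    speech_tokens = speech_cnn(speech_frames, params["conv"])
    motion_tokens = motion_rnn(motion_poses, params["rnn_in"], params["rnn_rec"])
    tokens = speech_tokens + motion_tokens  # concatenate along the time axis
    attended = self_attention(tokens)
    d = len(attended[0])
    return [sum(tok[i] for tok in attended) / len(attended) for i in range(d)]


# Hypothetical sizes: 4 hidden units, 5 speech features, 3 motion features.
rng = random.Random(0)
H, D_SPEECH, D_MOTION, WIDTH = 4, 5, 3, 3
params = {
    "conv": rand_matrix(H, WIDTH * D_SPEECH, rng),
    "rnn_in": rand_matrix(H, D_MOTION, rng),
    "rnn_rec": rand_matrix(H, H, rng),
}
speech = rand_matrix(10, D_SPEECH, rng)  # 10 synthetic speech frames
motion = rand_matrix(6, D_MOTION, rng)   # 6 synthetic motion samples
embedding = fuse(speech, motion, params)
print(len(embedding))
```

Because attention runs over the concatenated token sequence, every speech token can weight every motion token and vice versa, which is one common way to combine modalities after separate encoding. A trained system would add learned query, key, and value projections, multiple heads, positional information, and a task-specific output layer.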


Identifiers

DOI: 10.70023/sahd/241105

Citations and references

Citations: 3

References used: 0

Metrics (AkademScholar)