Article

Deep Learning-Powered Gesture and Speech Recognition in Augmented Reality Interfaces

Odilbek Kosimov, Termez University of Economics and Service, Department of Information Technology and Exact Sciences, Termez, Uzbekistan
Matyakubov Nurbek, Urgench Innovation University, Social-Humaniratidan Department, Urgench City, Uzbekistan
Otajanov Olimboy, Urgench State University, Department of Pedagogy and Psychology, Urgench, Uzbekistan
Aditya Kumar Sharma, Tula’s Institute, Department of Computer Science and Engineering, Dehradun, India, 248197
Feruza Jumaniyazova, Mamun University, Department Romano-Germanic Philology, Uzbekistan
Farrukh Nurullayev, Bukhara State Pedagogical Institute, Department of Music Education, Bukhara, Uzbekistan

2025

ABI

Abstract

Augmented Reality (AR) interfaces demand intuitive and natural human-computer interaction modalities to enhance user experience and accessibility. Traditional AR input methods often rely on handheld controllers or touch-based interactions, which can limit the immersive potential of AR applications. This research develops and evaluates a multimodal deep learning framework that integrates gesture and speech recognition for seamless AR interface control. The methodology employs a hybrid architecture combining Convolutional Neural Networks (CNNs) for real-time hand gesture recognition and Transformer-based models for continuous speech recognition, integrated through a fusion layer that processes both modalities simultaneously. The system was trained on a custom dataset of 50,000 gesture samples and 100,000 speech utterances collected from 200 participants across diverse demographic groups. Experimental results show that the proposed system attains 94.7% accuracy in gesture recognition and 96.2% in speech recognition, rising to 97.8% when both modalities are used jointly. The framework runs with an average latency of 45 ms, which supports real-time operation in AR applications. This study advances natural user interfaces for AR, with implications for accessibility, productivity applications, and immersive computing experiences.
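The abstract does not say how the fusion layer combines the two recognizers. As a minimal sketch of one common option, the code below averages per-class probabilities from a gesture model and a speech model (weighted late fusion). The function names, the `gesture_weight` parameter, the command labels, and the example scores are all illustrative assumptions, not the authors' method or data.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(gesture_logits, speech_logits, gesture_weight=0.5):
    """Weighted average of per-class probabilities from two modalities.

    Assumption: both recognizers score the same ordered set of commands.
    This is a stand-in for the paper's unspecified fusion layer.
    """
    if len(gesture_logits) != len(speech_logits):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= gesture_weight <= 1.0:
        raise ValueError("gesture_weight must lie in [0, 1]")
    gesture_probs = softmax(gesture_logits)
    speech_probs = softmax(speech_logits)
    speech_weight = 1.0 - gesture_weight
    return [
        gesture_weight * pg + speech_weight * ps
        for pg, ps in zip(gesture_probs, speech_probs)
    ]


def predict(commands, gesture_logits, speech_logits, gesture_weight=0.5):
    """Return the most likely command and its fused probability."""
    fused = fuse(gesture_logits, speech_logits, gesture_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return commands[best], fused[best]


# Hypothetical AR commands and made-up scores, for illustration only.
COMMANDS = ["select", "zoom", "dismiss"]
GESTURE_LOGITS = [2.0, 1.9, 0.1]  # gesture model is torn between two classes
SPEECH_LOGITS = [0.2, 3.0, 0.1]   # speech model clearly favors "zoom"

if __name__ == "__main__":
    label, prob = predict(COMMANDS, GESTURE_LOGITS, SPEECH_LOGITS)
    print(label, round(prob, 3))
```

In this toy case the gesture scores alone are nearly tied, and the speech scores break the tie, which illustrates in general terms why joint use of two modalities can outperform either one alone.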


Topics

Hand Gesture Recognition Systems; Emotion and Mood Recognition; Speech and Audio Processing

Identifiers

DOI: 10.1109/aisummit66170.2025.11411613

Citations and references

0 citations, 20 references used
