Deep Learning-Powered Gesture and Speech Recognition in Augmented Reality Interfaces
Abstract
Augmented Reality (AR) interfaces demand intuitive and natural human-computer interaction modalities to enhance user experience and accessibility. Traditional AR input methods often rely on handheld controllers or touch-based interactions, which can limit the immersive potential of AR applications. This research develops and evaluates a multimodal deep learning framework that integrates gesture and speech recognition for seamless AR interface control. The methodology employs a hybrid architecture that combines Convolutional Neural Networks (CNNs) for real-time hand gesture recognition with Transformer-based models for continuous speech recognition, integrated through a fusion layer that processes both modalities simultaneously. The system was trained on a custom dataset of 50,000 gesture samples and 100,000 speech utterances collected from 200 participants across diverse demographic groups. Experimental results show that the proposed system attains 94.7% accuracy in gesture recognition and 96.2% accuracy in speech recognition, rising to 97.8% accuracy when both modalities are used jointly. The framework responds with an average latency of 45 ms, which supports real-time operation in AR applications. This study advances natural user interfaces for AR, with implications for accessibility, productivity applications, and immersive computing experiences.
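To make the idea of a fusion layer concrete, the following is a minimal illustrative sketch of one common approach, weighted late fusion of per-modality class scores. It is an assumption for illustration only: the abstract does not specify how the fusion layer works, and the command vocabulary (`COMMANDS`), the function names, and the equal weights are hypothetical. The logits stand in for the outputs of the CNN gesture branch and the Transformer speech branch.

```python
import math

# Hypothetical command vocabulary shared by both modalities (assumption).
COMMANDS = ["select", "zoom", "dismiss"]


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(gesture_logits, speech_logits, w_gesture=0.5, w_speech=0.5):
    """Weighted late fusion: combine per-class logits, then normalize.

    The weights are illustrative; a trained fusion layer would learn them.
    """
    if len(gesture_logits) != len(speech_logits):
        raise ValueError("both modalities must score the same set of classes")
    fused = [w_gesture * g + w_speech * s
             for g, s in zip(gesture_logits, speech_logits)]
    return softmax(fused)


def predict(gesture_logits, speech_logits):
    """Return the most probable command and its fused probability."""
    probs = fuse(gesture_logits, speech_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return COMMANDS[best], probs[best]


if __name__ == "__main__":
    # The gesture branch weakly prefers "zoom"; the speech branch strongly
    # prefers "dismiss". The fused decision follows the more confident one.
    command, confidence = predict([0.2, 1.0, 0.8], [0.0, 0.1, 3.0])
    print(command, round(confidence, 3))
```

In this sketch a confident modality can override an uncertain one, which is one plausible reason joint accuracy can exceed the accuracy of either modality alone.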