
RingGesture: A Ring-Based Mid-Air Gesture Typing System Powered by a Deep-Learning Word Prediction Framework

Published 8 Oct 2024 in cs.CV, cs.AI, and cs.CL | (2410.18100v1)

Abstract: Text entry is a critical capability for any modern computing experience, and lightweight augmented reality (AR) glasses are no exception. Designed for all-day wearability, lightweight AR glasses cannot accommodate the multiple cameras with extensive fields of view required for hand tracking. This constraint underscores the need for an additional input device. We propose a system to address this gap: RingGesture, a ring-based mid-air gesture typing technique that uses electrodes to mark the start and end of gesture trajectories and inertial measurement unit (IMU) sensors for hand tracking. The method offers an intuitive experience similar to the raycast-based mid-air gesture typing found in VR headsets, seamlessly translating hand movements into cursor navigation. To enhance both accuracy and input speed, we propose a novel deep-learning word prediction framework, Score Fusion, comprising three key components: a) a word-gesture decoding model, b) a spatial spelling correction model, and c) a lightweight contextual language model. Rather than relying on any single model, the framework fuses the scores from all three to predict the most likely words with higher precision. We conducted comparative and longitudinal studies that demonstrate two key findings: first, the overall effectiveness of RingGesture, which achieves an average text entry speed of 27.3 words per minute (WPM) and a peak performance of 47.9 WPM; second, the superior performance of the Score Fusion framework, which offers a 28.2% improvement in uncorrected Character Error Rate over a conventional word prediction framework, Naive Correction, translating to a 55.2% improvement in text entry speed for RingGesture. Additionally, RingGesture received a System Usability Score of 83, signifying excellent usability.


Summary

  • The paper introduces RingGesture, a novel system for mid-air gesture typing in AR using a ring device and a deep-learning word prediction framework.
  • RingGesture utilizes a Score Fusion framework integrating word-gesture decoding, spatial spelling correction, and contextual language models to enhance accuracy and speed.
  • Evaluations show RingGesture achieves an average of 27.3 WPM (peak 47.9 WPM), a 28.2% improvement in uncorrected Character Error Rate over a conventional word prediction framework, and a System Usability Score of 83.

Overview of RingGesture: A Ring-Based Mid-Air Gesture Typing System Powered by a Deep Learning Word Prediction Framework

The paper introduces RingGesture, a novel system designed for text input via a ring-based device intended for augmented reality (AR) glasses. This system addresses text entry challenges imposed by AR's constraints, particularly when lightweight glasses cannot incorporate multiple cameras for hand tracking. As a solution, RingGesture uses an innovative combination of gesture detection and inertial measurement to capture hand movements and translate them into text entries. The system features a pinch gesture to initiate and conclude hand movements, mimicking the interaction style of virtual reality raycast-based gesture typing systems.
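The pinch-delimited interaction described above can be sketched as a simple segmentation loop: electrode contact marks when a gesture begins and ends, and the IMU-driven cursor samples in between form one word-gesture trajectory. The function and signal names below are illustrative assumptions, not the paper's implementation:

```python
def segment_gestures(pinch_signal, imu_samples):
    """Slice the IMU cursor stream into pinch-delimited gesture trajectories.

    pinch_signal: list of booleans, True while the fingers are pinched
                  (electrode contact detected).
    imu_samples:  list of cursor positions, same length as pinch_signal.
    Returns a list of trajectories, one per pinch-delimited gesture.
    """
    gestures, current = [], None
    for pinched, sample in zip(pinch_signal, imu_samples):
        if pinched:
            if current is None:       # pinch onset: start a new gesture
                current = []
            current.append(sample)
        elif current is not None:     # pinch release: close the gesture
            gestures.append(current)
            current = None
    if current is not None:           # stream ended mid-pinch
        gestures.append(current)
    return gestures
```

Each returned trajectory would then be handed to the word prediction framework for decoding.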

Key Components and Methodology

RingGesture operates on a deep-learning framework termed Score Fusion, which enhances accuracy and speed in text entry. This framework integrates three core components: a word-gesture decoding model, a spatial spelling correction model, and a contextual LLM. This integration works by fusing individual model scores to predict the most probable word or phrase being typed.

  1. Word-Gesture Decoding Model: Interprets the user's mid-air gesture paths and generates candidate words, using a deep-learning model trained on gesture-typing data.
  2. Spatial Spelling Correction Model: Adjusts predictions based on the proximity of gesture paths to target keys on the virtual keyboard, correcting errors induced by hand-movement noise and vibration.
  3. Contextual LLM: Uses the context of preceding words and phrases to favor coherent sentence continuations, relying on a lightweight model that permits real-time processing.
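The fusion step itself can be sketched as a weighted sum of per-candidate log-probabilities from the three models. The candidate scores and weights below are illustrative toy values; the paper does not publish its fusion weights here:

```python
import math

def fuse_scores(candidates, weights=(1.0, 1.0, 1.0)):
    """Rank candidate words by a weighted sum of log-probabilities.

    candidates: {word: (gesture_logp, spelling_logp, lm_logp)}
    weights:    per-model fusion weights (illustrative defaults).
    Returns the candidate words, best first.
    """
    w_g, w_s, w_l = weights
    fused = {
        word: w_g * g + w_s * s + w_l * l
        for word, (g, s, l) in candidates.items()
    }
    return sorted(fused, key=fused.get, reverse=True)

# Toy candidate set for a noisy gesture near "hello":
candidates = {
    "hello": (math.log(0.6), math.log(0.5), math.log(0.4)),
    "hells": (math.log(0.3), math.log(0.1), math.log(0.01)),
    "jello": (math.log(0.1), math.log(0.4), math.log(0.09)),
}
ranked = fuse_scores(candidates)  # "hello" ranks first despite no single
                                  # model being decisive on its own
```

The point of the design is that a word the gesture decoder slightly misranks can still win when the spelling and language models agree on it.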

Performance and Usability

Empirical evaluations show that integrating these models significantly boosts text input speed and accuracy. RingGesture achieves an average of 27.3 words per minute (WPM) with a peak of 47.9 WPM, competitive with established mobile text entry methods. RingGesture also received a System Usability Score of 83, indicating excellent usability.

The framework's efficacy is highlighted by a 28.2% improvement in uncorrected Character Error Rate (CER) over the conventional Naive Correction word prediction framework. This, in turn, yields a 55.2% improvement in RingGesture's text entry speed, indicating the potential for enhanced productivity within AR contexts.
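For readers unfamiliar with these metrics, the standard text-entry definitions behind such numbers can be sketched as follows; the exact formulas the authors used are not reproduced in this summary, so treat this as the conventional computation rather than the paper's:

```python
def wpm(transcribed, seconds):
    """Words per minute, with one 'word' defined as 5 characters
    (the common MacKenzie convention for text-entry studies)."""
    return ((len(transcribed) - 1) / 5) / (seconds / 60)

def cer(transcribed, target):
    """Character error rate: Levenshtein edit distance divided by
    the length of the target phrase."""
    m, n = len(transcribed), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if transcribed[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / n

speed = wpm("the quick brown fox", 10)       # 19 chars in 10 s -> 21.6 WPM
error = cer("helo world", "hello world")     # 1 edit over 11 chars
```

An "uncorrected" CER, as reported in the paper, is computed on the final transcribed text, so it reflects errors the user did not fix.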

Implications and Future Work

The implications of this research are substantial, not only within the AR domain but potentially extending to other applications requiring novel input modalities. RingGesture affirms the viability of ring-based text entry for devices that cannot physically integrate more complex input mechanisms. On a theoretical level, this work contributes insights into machine learning's role in translating human gestures into computational input.

Future research directions could explore the refinement of tactile feedback to further reduce input errors and the integration of advanced machine learning models for even more efficient context understanding. Additionally, expanding evaluations in terms of user-specific adaptations could further tailor the system's usability across different demographics and use cases.

In conclusion, RingGesture stands as a promising advancement in AR input methods. By leveraging deep learning for enhanced text prediction and user interaction, it sets a precedent for practical, efficient text entry in augmented settings, paving the way for more integrated and ubiquitous technology solutions.
