Zero-Shot, Low-Latency Intent Detection via sEMG

We propose ReactEMG, a model that predicts the user's hand gesture from forearm EMG data at every timestep, with low latency and high accuracy. Our masked segmentation architecture jointly models EMG signals and user intent, and generalizes zero-shot to new users with high accuracy, without any subject-specific calibration. The combination of low latency, high accuracy, and stable predictions makes ReactEMG well-suited for controlling robotic devices.

Surface electromyography (sEMG) signals show promise for effective human–computer interfaces, particularly in rehabilitation and prosthetics. However, challenges remain in developing systems that respond quickly and reliably to user intent, across different subjects and without requiring time-consuming calibration. In this work, we propose a framework for EMG-based intent detection that addresses these challenges. Unlike traditional gesture recognition models that wait until a gesture is completed before classifying it, ReactEMG uses a segmentation strategy to assign intent labels at every timestep as the gesture unfolds. We introduce a novel masked modeling strategy that aligns muscle activations with their corresponding user intents, enabling rapid onset detection and stable tracking of ongoing gestures. In evaluations against baseline methods, using metrics that capture both accuracy and stability for device control, our approach surpasses state-of-the-art performance under zero-shot transfer, demonstrating its potential for wearable robotics and next-generation prosthetic systems.

(1) Same Subject, Different Task. On a new, unseen subject, ReactEMG accurately detects open and close intents across diverse real-world tasks that involve different arm movements.

(2) Same Task, Different Subjects. In the static hanging task (subjects perform open and close gestures with the arm hanging freely at the side), ReactEMG accurately detects intents across different subjects, even when their EMG signals exhibit vastly different coactivation patterns.

Our model treats 8-channel EMG signals and intent labels as two synchronized input streams: each is embedded, randomly masked in contiguous spans, and concatenated into a single sequence that is fed to a Transformer encoder. The network is trained to reconstruct the masked EMG signals (via a regression loss) and the masked intent tokens (via a classification loss), which forces tight alignment between muscle activity and user intention. As a result, it delivers fast, stable intent predictions at every timestep, with no hand-crafted features or per-user calibration required.
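
The two training ingredients described above, contiguous-span masking of each stream and a combined regression-plus-classification objective over the masked positions, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not ReactEMG's implementation: the mask ratio, span length, the independent masks per stream, the `intent_weight` balance term, and all function names are hypothetical, and the embedding, concatenation, and Transformer encoder are omitted.

```python
import math
import random
from typing import List, Sequence

# Hypothetical helpers: ReactEMG's actual mask ratio, span length and
# loss weighting are not specified here; these values are illustrative.

def span_mask(length: int, mask_ratio: float, span_len: int,
              rng: random.Random) -> List[bool]:
    """Mark about mask_ratio of `length` timesteps as masked, in contiguous spans."""
    target = min(length, int(round(length * mask_ratio)))
    mask = [False] * length
    masked = 0
    while masked < target:
        start = rng.randrange(length)  # random span start
        for i in range(start, min(length, start + span_len)):
            if masked >= target:
                break
            if not mask[i]:
                mask[i] = True
                masked += 1
    return mask

def masked_mse(pred: Sequence[Sequence[float]],
               true: Sequence[Sequence[float]],
               mask: Sequence[bool]) -> float:
    """Regression loss: mean squared error over masked timesteps, all EMG channels."""
    errs = [(p - t) ** 2
            for p_row, t_row, m in zip(pred, true, mask) if m
            for p, t in zip(p_row, t_row)]
    return sum(errs) / len(errs) if errs else 0.0

def masked_cross_entropy(logits: Sequence[Sequence[float]],
                         labels: Sequence[int],
                         mask: Sequence[bool]) -> float:
    """Classification loss: mean cross-entropy over masked intent tokens only."""
    losses = []
    for row, y, m in zip(logits, labels, mask):
        if not m:
            continue
        mx = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = mx + math.log(sum(math.exp(v - mx) for v in row))
        losses.append(log_z - row[y])
    return sum(losses) / len(losses) if losses else 0.0

def joint_loss(emg_pred, emg_true, emg_mask,
               intent_logits, intent_true, intent_mask,
               intent_weight: float = 1.0) -> float:
    """Sum of the EMG reconstruction loss and the (weighted) intent loss."""
    return (masked_mse(emg_pred, emg_true, emg_mask)
            + intent_weight * masked_cross_entropy(intent_logits, intent_true, intent_mask))
```

Computing both losses only on masked positions means the model cannot simply copy its input: it must infer missing intent labels from the surrounding EMG (and vice versa), which is the alignment pressure the method description relies on.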

While traditional metrics such as per-timestep raw accuracy are important, they fail to capture cases where a transition to a new gesture is detected late, or where the prediction flickers while a gesture is being held. To capture such cases, we use a new metric dubbed transition accuracy. Under this metric, a transition counts as successfully detected only if the model output switches to the new intent close enough to the ground-truth change ("reaction buffer" in the image below) and shows no instability either before or after the transition ("maintenance period").
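
The exact windowing rule is not spelled out here, so the following is a hedged pure-Python sketch of one plausible reading of transition accuracy, placed next to raw accuracy for contrast. The parameters `buffer` (reaction buffer in timesteps, assumed symmetric around the ground-truth change) and `maintain` (maintenance period in timesteps), the function names, and the assumption that neighbouring transitions are spaced at least `buffer + maintain` steps apart are all ours, not ReactEMG's.

```python
import math
from typing import Sequence

def raw_accuracy(gt: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of timesteps where the predicted intent matches the label."""
    return sum(g == p for g, p in zip(gt, pred)) / len(gt)

def transition_accuracy(gt: Sequence[int], pred: Sequence[int],
                        buffer: int, maintain: int) -> float:
    """Fraction of ground-truth transitions a -> b that the prediction reproduces cleanly.

    Hypothetical rule for a change at index t:
      * pred == a for `maintain` steps before the reaction buffer (no early flicker),
      * inside [t - buffer, t + buffer) pred takes only values a or b, never goes
        back from b to a, and ends on b (a single switch within the buffer),
      * pred == b for `maintain` steps after the buffer (no later flicker).
    Assumes neighbouring transitions are at least buffer + maintain steps apart.
    """
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same length")
    if buffer < 1:
        raise ValueError("buffer must be at least 1 timestep")
    gt, pred = list(gt), list(pred)
    n = len(gt)
    hits = total = 0
    for t in range(1, n):
        a, b = gt[t - 1], gt[t]
        if a == b:
            continue  # not a ground-truth transition
        total += 1
        pre = pred[max(0, t - buffer - maintain):max(0, t - buffer)]
        window = pred[max(0, t - buffer):min(n, t + buffer)]
        post = pred[min(n, t + buffer):min(n, t + buffer + maintain)]
        stable_before = all(p == a for p in pre)
        single_switch = (all(p in (a, b) for p in window)
                         and not any(x == b and y == a
                                     for x, y in zip(window, window[1:]))
                         and bool(window) and window[-1] == b)
        stable_after = all(p == b for p in post)
        if stable_before and single_switch and stable_after:
            hits += 1
    return hits / total if total else math.nan
```

For example, with `gt = [0] * 10 + [1] * 10`, `buffer=3` and `maintain=5`, a prediction that switches one step late scores 0.95 raw accuracy and 1.0 transition accuracy; adding a single flicker back to 0 at index 15 still leaves 0.9 raw accuracy but drops transition accuracy to 0.0, which is exactly the kind of instability raw accuracy hides.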

ReactEMG outperforms baselines on both raw accuracy and transition accuracy, particularly on datasets with direct transitions between different hand gestures and maintenance windows of varying lengths, suggesting its applicability to controlling devices such as wearable robots.
