We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offers both input flexibility and high recognition accuracy.
Existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, fixed numbers of frames and instances, and a lack of actor consistency. To bridge this gap, we propose PromptGAR, the first GAR model to provide input flexibility across prompts, frames, and instances without retraining. We leverage diverse visual prompts, such as bounding boxes, skeletal keypoints, and instance identities, by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance.
To ensure actor consistency over extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations show that PromptGAR achieves competitive performance on both full and partial prompt inputs, demonstrating its input flexibility and generalization ability for real-world applications.
A sequence of frames \( I \) is processed by the video encoder, yielding RGB features \( F_I \) and a GAR class token \( X_{\mathrm{gar}} \). Prompts, such as bounding boxes and skeletal keypoints, are transformed into prompt tokens \( F_p \) by the prompt encoder. These tokens, along with instance identities, are then fed into the recognition decoder to obtain the group activity prediction \( Y_{\mathrm{gar}} \).
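For concreteness, here is a minimal PyTorch-style wiring sketch of this pipeline; the class name, module interfaces, and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn


class PromptGAR(nn.Module):
    """Wiring sketch of the overall pipeline; the three sub-modules stand in
    for the paper's video encoder, prompt encoder, and recognition decoder."""

    def __init__(self, video_encoder, prompt_encoder, recognition_decoder):
        super().__init__()
        self.video_encoder = video_encoder            # frames -> (F_I, X_gar)
        self.prompt_encoder = prompt_encoder          # prompts -> F_p
        self.recognition_decoder = recognition_decoder

    def forward(self, frames, boxes, keypoints, instance_ids):
        # frames: (B, T, 3, H, W); boxes and keypoints may be partially provided
        F_I, X_gar = self.video_encoder(frames)       # RGB features + GAR class token
        F_p = self.prompt_encoder(boxes, keypoints)   # unified point-prompt tokens
        # The decoder cross-updates X_gar and F_p, using instance identities
        # for actor consistency, and outputs the group activity prediction.
        Y_gar = self.recognition_decoder(F_I, X_gar, F_p, instance_ids)
        return Y_gar
```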
(a) Bounding boxes are represented by three points (upper-left, center, lower-right). (b) Skeletal keypoints consist of 17 points. (c) Positional encoding captures both spatial and temporal coordinates. (d) Point types are distinguished using learnable embeddings. (e) Depth-wise prompt pooling reduces temporal and type dimensions to 1, then up-projects to the number of pooled prompts \( O \).
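A sketch of how these prompts can be unified as point tokens, assuming a sinusoidal encoding over normalized (x, y, t) coordinates, 3 box point types plus 17 keypoint types, and a linear up-projection for the pooled depth; the exact encoding, dimensions, and projection are assumptions for illustration.

```python
import torch
import torch.nn as nn


def sincos_point_encoding(coords, dim):
    """Sinusoidal encoding of normalized (x, y, t) coordinates.
    A common positional-encoding choice, assumed here for illustration."""
    # coords: (..., 3) with x, y, t scaled to [0, 1]; dim must be divisible by 6
    freqs = torch.arange(dim // 6, dtype=torch.float32, device=coords.device)
    freqs = 1.0 / (10000.0 ** (freqs / (dim // 6)))
    angles = coords.unsqueeze(-1) * freqs                   # (..., 3, dim//6)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., 3, dim//3)
    return enc.flatten(-2)                                  # (..., dim)


class PointPromptEncoder(nn.Module):
    def __init__(self, dim=240, num_types=20, num_pooled=12):
        super().__init__()
        # 3 box point types (upper-left, center, lower-right) + 17 keypoint types
        self.type_embed = nn.Embedding(num_types, dim)
        # up-projects the pooled depth of 1 to O pooled prompts per instance
        self.pool_proj = nn.Linear(1, num_pooled)
        self.dim = dim

    def forward(self, points, types):
        # points: (B, N, T, P, 3) normalized (x, y, t); types: (B, N, T, P) long
        tokens = sincos_point_encoding(points, self.dim) + self.type_embed(types)
        pooled = tokens.mean(dim=(2, 3))                    # reduce T and P -> (B, N, dim)
        F_p = self.pool_proj(pooled.unsqueeze(-1))          # (B, N, dim, O)
        return F_p.permute(0, 1, 3, 2)                      # (B, N, O, dim) prompt tokens
```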
The decoder processes (a) RGB features \( F_I \) with positional embeddings and (b) the GAR class token \( X_{\mathrm{gar}} \) together with prompt tokens \( F_p \). (c) The two-way transformer performs cross-updating between these features. (d) Relative instance self-attention, using instance identities, ensures actor consistency. (e) The GAR head takes the updated GAR class token \( \hat{X}_{\mathrm{gar}} \) and the average of updated prompt tokens \( \hat{F}_{p} \) to predict the group activity label \( Y_{\mathrm{gar}} \).
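A minimal sketch of step (e), the GAR head; fusing the updated class token with the mean prompt token by concatenation before a linear classifier is an assumption, since the caption only specifies which inputs the head receives.

```python
import torch
import torch.nn as nn


class GARHead(nn.Module):
    """Classify the group activity from the updated class token and the
    average of the updated prompt tokens (fusion scheme assumed)."""

    def __init__(self, dim=240, num_classes=8):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, x_gar_hat, f_p_hat):
        # x_gar_hat: (B, dim) updated GAR class token
        # f_p_hat:   (B, M, dim) updated prompt tokens
        fused = torch.cat([x_gar_hat, f_p_hat.mean(dim=1)], dim=-1)
        return self.classifier(fused)                 # logits for Y_gar
```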
Prior GAR approaches rely on a fixed player order, and their performance degrades when that order changes: the interaction between two prompts depends on their arbitrary instance IDs, even when they refer to the same underlying actor. Inspired by relative positional embeddings, we introduce a relative instance identity embedding that addresses this issue by conditioning attention only on whether two tokens belong to the same instance.
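A minimal sketch of this idea, assuming a per-head learned bias added to the attention logits for same-instance versus different-instance token pairs; the class, parameter names, and two-value parameterization are illustrative rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


class RelativeInstanceAttention(nn.Module):
    """Self-attention whose bias depends only on whether two tokens share an
    instance identity, not on the arbitrary IDs themselves."""

    def __init__(self, dim=240, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # one scalar bias per head for same-instance and different-instance pairs
        self.same_bias = nn.Parameter(torch.zeros(num_heads))
        self.diff_bias = nn.Parameter(torch.zeros(num_heads))

    def forward(self, tokens, instance_ids):
        # tokens: (B, M, dim); instance_ids: (B, M) integer identity per token
        B, M, _ = tokens.shape
        same = instance_ids.unsqueeze(1) == instance_ids.unsqueeze(2)   # (B, M, M)
        bias = torch.where(same.unsqueeze(1),
                           self.same_bias.view(1, -1, 1, 1),
                           self.diff_bias.view(1, -1, 1, 1))            # (B, H, M, M)
        attn_mask = bias.reshape(B * self.num_heads, M, M)              # added to logits
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=attn_mask)
        return out
```

Because the bias is a function of identity equality rather than identity value, permuting the player order leaves the attention pattern unchanged.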
@article{jin2025promptgar,
  title={PromptGAR: Flexible Promptive Group Activity Recognition},
  author={Jin, Zhangyu and Feng, Andrew and Chemburkar, Ankur and De Melo, Celso M},
  journal={arXiv preprint arXiv:2503.08933},
  year={2025}
}
The project or effort depicted was or is sponsored by the DEVCOM Army Research Lab under contract number W911QX-21-D-0001. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.