We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offers both input flexibility and high recognition accuracy.
Existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, fixed numbers of frames and instances, and a lack of actor consistency. To bridge this gap, we propose PromptGAR, the first GAR model to provide input flexibility across prompts, frames, and instances without retraining. We leverage diverse visual prompts, such as bounding boxes, skeletal keypoints, and instance identities, by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance.
To ensure actor consistency over extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations show that PromptGAR achieves competitive performance on both full and partial prompt inputs, demonstrating its input flexibility and generalization ability for real-world applications.
A sequence of frames \( I \) is processed by the video encoder, yielding RGB features \( F_I \) and a GAR class token \( X_{\mathrm{gar}} \). Prompts, such as bounding boxes and skeletal keypoints, are transformed into prompt tokens \( F_p \) by the prompt encoder. These tokens, along with instance identities, are then fed into the recognition decoder to obtain the group activity prediction \( Y_{\mathrm{gar}} \).
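For concreteness, here is a minimal PyTorch-style wiring sketch of this pipeline; the class name, module interfaces, and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn


class PromptGAR(nn.Module):
    """Wiring sketch of the overall pipeline; the three sub-modules stand in
    for the paper's video encoder, prompt encoder, and recognition decoder."""

    def __init__(self, video_encoder, prompt_encoder, recognition_decoder):
        super().__init__()
        self.video_encoder = video_encoder            # frames -> (F_I, X_gar)
        self.prompt_encoder = prompt_encoder          # prompts -> F_p
        self.recognition_decoder = recognition_decoder

    def forward(self, frames, boxes, keypoints, instance_ids):
        # frames: (B, T, 3, H, W); boxes and keypoints may be partially provided
        F_I, X_gar = self.video_encoder(frames)       # RGB features + GAR class token
        F_p = self.prompt_encoder(boxes, keypoints)   # unified point-prompt tokens
        # The decoder cross-updates X_gar and F_p, using instance identities
        # for actor consistency, and outputs the group activity prediction.
        Y_gar = self.recognition_decoder(F_I, X_gar, F_p, instance_ids)
        return Y_gar
```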
(a) Bounding boxes are represented by three points (upper-left, center, lower-right). (b) Skeletal keypoints consist of 17 points. (c) Positional encoding captures both spatial and temporal coordinates. (d) Point types are distinguished using learnable embeddings. (e) Depth-wise prompt pooling reduces temporal and type dimensions to 1, then up-projects to the number of pooled prompts \( O \).
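A sketch of how these prompts can be unified as point tokens, assuming a sinusoidal encoding over normalized (x, y, t) coordinates, 3 box point types plus 17 keypoint types, and a linear up-projection for the pooled depth; the exact encoding, dimensions, and projection are assumptions for illustration.

```python
import torch
import torch.nn as nn


def sincos_point_encoding(coords, dim):
    """Sinusoidal encoding of normalized (x, y, t) coordinates.
    A common positional-encoding choice, assumed here for illustration."""
    # coords: (..., 3) with x, y, t scaled to [0, 1]; dim must be divisible by 6
    freqs = torch.arange(dim // 6, dtype=torch.float32, device=coords.device)
    freqs = 1.0 / (10000.0 ** (freqs / (dim // 6)))
    angles = coords.unsqueeze(-1) * freqs                   # (..., 3, dim//6)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., 3, dim//3)
    return enc.flatten(-2)                                  # (..., dim)


class PointPromptEncoder(nn.Module):
    def __init__(self, dim=240, num_types=20, num_pooled=12):
        super().__init__()
        # 3 box point types (upper-left, center, lower-right) + 17 keypoint types
        self.type_embed = nn.Embedding(num_types, dim)
        # up-projects the pooled depth of 1 to O pooled prompts per instance
        self.pool_proj = nn.Linear(1, num_pooled)
        self.dim = dim

    def forward(self, points, types):
        # points: (B, N, T, P, 3) normalized (x, y, t); types: (B, N, T, P) long
        tokens = sincos_point_encoding(points, self.dim) + self.type_embed(types)
        pooled = tokens.mean(dim=(2, 3))                    # reduce T and P -> (B, N, dim)
        F_p = self.pool_proj(pooled.unsqueeze(-1))          # (B, N, dim, O)
        return F_p.permute(0, 1, 3, 2)                      # (B, N, O, dim) prompt tokens
```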
The decoder processes (a) RGB features \( F_I \) with positional embeddings and (b) the GAR class token \( X_{\mathrm{gar}} \) together with prompt tokens \( F_p \). (c) The two-way transformer performs cross-updating between these features. (d) Relative instance self-attention, using instance identities, ensures actor consistency. (e) The GAR head takes the updated GAR class token \( \hat{X}_{\mathrm{gar}} \) and the average of updated prompt tokens \( \hat{F}_{p} \) to predict the group activity label \( Y_{\mathrm{gar}} \).
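A minimal sketch of step (e), the GAR head; fusing the updated class token with the mean prompt token by concatenation before a linear classifier is an assumption, since the caption only specifies which inputs the head receives.

```python
import torch
import torch.nn as nn


class GARHead(nn.Module):
    """Classify the group activity from the updated class token and the
    average of the updated prompt tokens (fusion scheme assumed)."""

    def __init__(self, dim=240, num_classes=8):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, x_gar_hat, f_p_hat):
        # x_gar_hat: (B, dim) updated GAR class token
        # f_p_hat:   (B, M, dim) updated prompt tokens
        fused = torch.cat([x_gar_hat, f_p_hat.mean(dim=1)], dim=-1)
        return self.classifier(fused)                 # logits for Y_gar
```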
Prior GAR approaches rely on a fixed player order, and their performance degrades when that order changes: the interaction between two prompts depends on their arbitrary instance IDs, even when they refer to the same underlying actor. Inspired by relative positional embeddings, we introduce a relative instance identity embedding that addresses this issue by conditioning attention only on whether two tokens belong to the same instance.
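A minimal sketch of this idea, assuming a per-head learned bias added to the attention logits for same-instance versus different-instance token pairs; the class, parameter names, and two-value parameterization are illustrative rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


class RelativeInstanceAttention(nn.Module):
    """Self-attention whose bias depends only on whether two tokens share an
    instance identity, not on the arbitrary IDs themselves."""

    def __init__(self, dim=240, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # one scalar bias per head for same-instance and different-instance pairs
        self.same_bias = nn.Parameter(torch.zeros(num_heads))
        self.diff_bias = nn.Parameter(torch.zeros(num_heads))

    def forward(self, tokens, instance_ids):
        # tokens: (B, M, dim); instance_ids: (B, M) integer identity per token
        B, M, _ = tokens.shape
        same = instance_ids.unsqueeze(1) == instance_ids.unsqueeze(2)   # (B, M, M)
        bias = torch.where(same.unsqueeze(1),
                           self.same_bias.view(1, -1, 1, 1),
                           self.diff_bias.view(1, -1, 1, 1))            # (B, H, M, M)
        attn_mask = bias.reshape(B * self.num_heads, M, M)              # added to logits
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=attn_mask)
        return out
```

Because the bias is a function of identity equality rather than identity value, permuting the player order leaves the attention pattern unchanged.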
@article{jin2025promptgar,
  title={PromptGAR: Flexible Promptive Group Activity Recognition},
  author={Jin, Zhangyu and Feng, Andrew and Chemburkar, Ankur and De Melo, Celso M},
  journal={arXiv preprint arXiv:2503.08933},
  year={2025}
}
The project or effort depicted was or is sponsored by the DEVCOM Army Research Lab under contract number W911QX-21-D-0001. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.