PromptGAR: Flexible Promptive Group Activity Recognition

WACV 2026
¹University of Southern California, Institute for Creative Technologies
²Army Research Laboratory
Overview diagram of the PromptGAR framework.

This diagram illustrates PromptGAR's performance and key capabilities. Compared to existing methods, PromptGAR achieves competitive group activity recognition accuracy and superior input flexibility without the need for retraining, including: (a) flexible visual prompt inputs, (b) flexible instance counts, and (c) flexible frame sampling.

Abstract

We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offers both input flexibility and high recognition accuracy.

Existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, fixed numbers of frames and instances, and their lack of actor consistency. To bridge this gap, we propose PromptGAR, the first GAR model to provide input flexibility across prompts, frames, and instances without retraining. We leverage diverse visual prompts, such as bounding boxes, skeletal keypoints, and instance identities, by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance.

To ensure actor consistency over extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performance on both full and partial prompt inputs, establishing its effectiveness in input flexibility and its generalization ability for real-world applications.

PromptGAR Architecture

High-level architecture diagram of the PromptGAR model.

A sequence of frames \( I \) is processed by the video encoder, yielding RGB features \( F_I \) and GAR class token \( X_{\mathrm{gar}} \). Prompts, such as bounding boxes and skeletal keypoints, are transformed to prompt tokens \( F_p \) by the prompt encoder. These tokens, along with instance identities, are then fed into the recognition decoder to obtain the group activity prediction \( Y_{\mathrm{gar}} \).

Prompt Encoder

(a) Bounding boxes are represented by three points (upper-left, center, lower-right). (b) Skeletal keypoints consist of 17 points. (c) Positional encoding captures both spatial and temporal coordinates. (d) Point types are distinguished using learnable embeddings. (e) Depth-wise prompt pooling reduces temporal and type dimensions to 1, then up-projects to the number of pooled prompts \( O \).

Diagram of the Prompt Encoder module.
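The box-to-points conversion and the spatial-temporal positional encoding described above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function names, the encoding dimension, and the sinusoidal formula are assumptions standing in for the paper's actual prompt encoder.

```python
import numpy as np

def box_to_points(box):
    """Represent a bounding box as three point prompts:
    upper-left, center, lower-right (as in panel (a))."""
    x1, y1, x2, y2 = box
    return np.array([
        [x1, y1],
        [(x1 + x2) / 2.0, (y1 + y2) / 2.0],
        [x2, y2],
    ])

def positional_encoding(points, t, dim=8):
    """Sinusoidal encoding over (x, y, t): a simplified stand-in for
    the spatial-temporal positional encoding in panel (c). `dim` is a
    hypothetical number of channels per coordinate."""
    coords = np.concatenate([points, np.full((len(points), 1), t)], axis=1)
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = coords[..., None] * freqs          # (P, 3, dim/2)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(points), -1)         # (P, 3*dim)

pts = box_to_points([10, 20, 50, 80])
tokens = positional_encoding(pts, t=3)
print(pts.shape, tokens.shape)  # (3, 2) (3, 24)
```

In the full model, learnable point-type embeddings (panel (d)) would be added to these encodings before pooling.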

Recognition Decoder

The decoder processes (a) RGB features \( F_I \) with positional embeddings and (b) the GAR class token \( X_{\mathrm{gar}} \) together with prompt tokens \( F_p \). (c) The two-way transformer performs cross-updating between these features. (d) Relative instance self-attention, using instance identities, ensures actor consistency. (e) The GAR head takes the updated GAR class token \( \hat{X}_{\mathrm{gar}} \) and the average of updated prompt tokens \( \hat{F}_{p} \) to predict the group activity label \( Y_{\mathrm{gar}} \).

Diagram of the Recognition Decoder module.
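The GAR head in panel (e) can be sketched as below. This is a hedged NumPy sketch, not the paper's code: the additive fusion of the class token with the prompt-token mean and the linear classifier are assumptions. Averaging over prompt tokens is what makes the head agnostic to how many prompts or instances were supplied.

```python
import numpy as np

def gar_head(x_gar_hat, f_p_hat, w, b):
    """Hypothetical GAR head: fuse the updated class token with the
    mean of the updated prompt tokens, then classify linearly."""
    fused = x_gar_hat + f_p_hat.mean(axis=0)    # (D,)
    return w @ fused + b                        # (C,) class logits

rng = np.random.default_rng(0)
D, C, P = 16, 8, 5                  # feature dim, classes, prompt tokens
x_gar_hat = rng.normal(size=D)      # updated GAR class token
f_p_hat = rng.normal(size=(P, D))   # updated prompt tokens
w, b = rng.normal(size=(C, D)), np.zeros(C)
logits = gar_head(x_gar_hat, f_p_hat, w, b)
print(logits.shape)  # (8,)
```

Because the mean is permutation-invariant, reordering the prompt tokens leaves the prediction unchanged.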

Relative Instance Attention

Prior GAR approaches rely on a fixed player order, and their performance degrades when this order changes: the interaction between two prompts becomes dependent on their arbitrary instance IDs, even when they refer to the same underlying entity. Inspired by relative positional embeddings, we introduce a relative instance identity embedding that addresses this issue by conditioning attention only on whether two tokens belong to the same instance.

Diagram of the relative instance attention mechanism.
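The idea can be sketched as attention whose logits receive a bias that depends only on ID equality. This is a minimal single-head NumPy sketch under assumed names (`same_bias`, `diff_bias` are hypothetical scalar stand-ins for the learned relative embeddings); the key property, invariance to relabeling instance IDs, still holds.

```python
import numpy as np

def relative_instance_attention(q, k, v, inst_ids, same_bias, diff_bias):
    """Self-attention with a relative instance-identity bias: the logit
    between two tokens gets same_bias if they share an instance ID and
    diff_bias otherwise. The bias depends only on ID equality, so
    permuting the ID labels leaves the output unchanged."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    same = inst_ids[:, None] == inst_ids[None, :]
    logits = logits + np.where(same, same_bias, diff_bias)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)    # row-wise softmax
    return attn @ v

rng = np.random.default_rng(1)
T, d = 6, 4
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
ids = np.array([0, 0, 1, 1, 2, 2])
out = relative_instance_attention(q, k, v, ids, same_bias=1.0, diff_bias=0.0)
# Relabeling IDs (swapping labels 0 and 2) yields the same output:
relabeled = np.array([2, 2, 1, 1, 0, 0])
out2 = relative_instance_attention(q, k, v, relabeled, 1.0, 0.0)
```

An absolute instance embedding would change `out2`, which is exactly the order sensitivity this mechanism removes.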

Benchmark Results

Quantitative comparisons on the Volleyball dataset.


Quantitative comparisons on the NBA dataset.
Detailed results for flexible prompts, instances, frames, and actor orders.

BibTeX

@article{jin2025promptgar,
  title={PromptGAR: Flexible Promptive Group Activity Recognition},
  author={Jin, Zhangyu and Feng, Andrew and Chemburkar, Ankur and De Melo, Celso M},
  journal={arXiv preprint arXiv:2503.08933},
  year={2025}
}

Acknowledgement

The project or effort depicted was or is sponsored by the DEVCOM Army Research Lab under contract number W911QX-21-D-0001. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.