LLMs are Good Action Recognizers
Abstract: Skeleton-based action recognition has attracted considerable research attention. Recently, a variety of works have been proposed to build accurate skeleton-based action recognizers. Among them, some works use large model architectures as backbones of their recognizers to boost the skeleton data representation capability, while others pre-train their recognizers on external data to enrich their knowledge. In this work, we observe that large language models (LLMs), which have been extensively used in various natural language processing tasks, generally hold both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the LLM as an Action Recognizer. In our framework, we propose a linguistic projection process to project each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). Moreover, we also incorporate several designs into our framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.
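To make the idea of an "action sentence" concrete, the following is a minimal toy sketch, not the paper's method. The abstract does not specify how the linguistic projection is implemented. This sketch assumes a simple nearest-neighbor lookup into a small codebook, with each codebook entry paired with a placeholder word. The names `nearest_token`, `to_action_sentence`, `codebook`, and `vocab`, as well as the 2-D frame features, are all hypothetical and chosen only for illustration.

```python
import math

def nearest_token(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(frame, codebook[i]))

def to_action_sentence(sequence, codebook, vocab):
    """Map each per-frame feature to a word, yielding a toy 'action sentence'."""
    return [vocab[nearest_token(frame, codebook)] for frame in sequence]

# Hypothetical toy data: 2-D per-frame features and a 3-entry codebook.
# A real skeleton sequence would have many joints and frames.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
vocab = ["tok_a", "tok_b", "tok_c"]  # placeholder words, one per codebook entry
sequence = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.8)]

sentence = to_action_sentence(sequence, codebook, vocab)
print(" ".join(sentence))
```

In this sketch the resulting word sequence plays the role of the sentence that would be handed to a language model. How the actual framework learns the mapping and which designs it adds on top are not described in the abstract.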