PhD defence: Efficiently moving forward in video-based human action recognition
PLEASE NOTE: If a candidate gives a layman's talk, the livestream will start fifteen minutes earlier.
This PhD thesis addresses the challenge of improving the efficiency and scalability of video-based human action recognition, an essential task in areas such as surveillance, healthcare, and human-computer interaction. While modern transformer-based models deliver high accuracy, their heavy computational demands limit practical deployment.
To tackle this, the thesis proposes three key contributions. First, it introduces the Local Attention Layer (LA-layer), a convolution-style attention mechanism with a deformable kernel and a constraint rule. This design captures local spatio-temporal patterns effectively while reducing computational cost. Second, it proposes the Trajectory-Correlation (TC) block, a hybrid spatio-temporal module that improves recognition of fine-grained and complex actions, including continuous sign language.
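To give a rough sense of what convolution-style local attention means, the sketch below restricts attention to a small spatio-temporal window around each position (using a fixed rather than deformable window for simplicity). It is a generic illustration under assumed tensor shapes and window size, not the thesis's LA-layer.

```python
# Illustrative sketch of local (windowed) spatio-temporal attention.
# NOT the thesis's LA-layer: the tensor layout, window size, and function
# name are assumptions, and the window here is fixed rather than deformable.
import torch
import torch.nn.functional as F

def local_window_attention(x, window=3):
    """x: (batch, time, height, width, channels). Each position attends only to
    a small T x H x W neighbourhood, so cost scales with the window size rather
    than with the full number of tokens, as in global self-attention."""
    b, t, h, w, c = x.shape
    pad = window // 2
    # Zero-pad and gather the local neighbourhood of every position.
    xp = F.pad(x.permute(0, 4, 1, 2, 3), (pad,) * 6)             # (b, c, T+2p, H+2p, W+2p)
    neigh = xp.unfold(2, window, 1).unfold(3, window, 1).unfold(4, window, 1)
    neigh = neigh.reshape(b, c, t, h, w, -1).permute(0, 2, 3, 4, 5, 1)  # (b, t, h, w, k, c)
    q = x.unsqueeze(4)                                            # (b, t, h, w, 1, c)
    attn = torch.softmax((q * neigh).sum(-1) / c ** 0.5, dim=-1)  # (b, t, h, w, k)
    return (attn.unsqueeze(-1) * neigh).sum(4)                    # (b, t, h, w, c)

out = local_window_attention(torch.randn(1, 4, 8, 8, 16))         # same shape in and out
```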
Third, the thesis focuses on enhancing transformer efficiency. It presents VideoMambaPro, a compact and fast architecture based on the Mamba state-space model that achieves competitive accuracy with significantly fewer resources than traditional Vision Transformers. It also proposes the Four-Tiered Prompts (FTP) framework, which leverages external knowledge from Vision-Language Models (VLMs) to improve generalisation across datasets and tasks without task-specific fine-tuning.
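For context on the Mamba family of models, the sketch below shows a plain linear state-space recurrence, the generic mechanism these models build on; its cost grows linearly rather than quadratically with sequence length. It is not VideoMambaPro itself, and all names and dimensions are illustrative assumptions (real Mamba-style layers use input-dependent parameters and a parallel scan).

```python
# Illustrative sketch of a linear state-space model (SSM) recurrence.
# NOT VideoMambaPro: dimensions and names are assumptions.
import torch

def ssm_scan(u, A, B, C):
    """u: (time, d_in). State h_t = A h_{t-1} + B u_t, output y_t = C h_t.
    The loop is linear in sequence length; Mamba-style models compute the same
    recurrence with input-dependent parameters and a hardware-friendly scan."""
    h = torch.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A @ h + B @ u_t
        ys.append(C @ h)
    return torch.stack(ys)

T, d_in, d_state, d_out = 16, 8, 4, 8
y = ssm_scan(torch.randn(T, d_in),
             torch.randn(d_state, d_state) * 0.1,    # state transition
             torch.randn(d_state, d_in),              # input projection
             torch.randn(d_out, d_state))             # output projection
```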
The effectiveness of these methods is validated on multiple benchmark datasets, including Kinetics-400, Something-Something V2, and PHOENIX14. The results show state-of-the-art performance with reduced memory and computational requirements.
This thesis contributes to the development of efficient, generalisable, and scalable action recognition systems, advancing the practical deployment of video understanding technologies.
- PhD candidate: H. Lu
- Dissertation: Efficiently moving forward in video-based human action recognition
- PhD supervisor(s): prof. dr. A.A. Salah
- Co-supervisor(s): dr. ir. R.W. Poppe