PhD defence: Efficiently moving forward in video-based human action recognition


PLEASE NOTE: If a candidate gives a layman's talk, the livestream will start fifteen minutes earlier.

This PhD thesis addresses the challenge of improving the efficiency and scalability of video-based human action recognition, an essential task in areas such as surveillance, healthcare, and human-computer interaction. Modern transformer-based models deliver high accuracy, but their computational demands limit practical deployment.

To tackle this, the thesis makes three key contributions. First, it introduces the Local Attention Layer (LA-layer), a convolution-style attention mechanism with a deformable kernel and constraint rule. This design captures local spatio-temporal patterns effectively while reducing computational cost. Second, it proposes the Trajectory-Correlation (TC) block, a hybrid spatio-temporal module that enhances recognition of fine-grained and complex actions, including continuous sign language.

Third, the thesis focuses on enhancing transformer efficiency. It presents VideoMambaPro, a compact and fast architecture based on the Mamba state-space model, which achieves competitive accuracy with significantly fewer resources than traditional Vision Transformers. Additionally, the Four-Tiered Prompts (FTP) framework is proposed to leverage external knowledge from Visual Language Models (VLMs), improving generalisation across datasets and tasks without the need for task-specific fine-tuning.

The effectiveness of these methods is validated on multiple benchmark datasets, including Kinetics-400, Something-Something V2, and PHOENIX14. Results show state-of-the-art performance with reduced memory and computation requirements.

This thesis contributes to the development of efficient, generalisable, and scalable action recognition systems, advancing the practical deployment of video understanding technologies.

PhD candidate
H. Lu
Dissertation
Efficiently moving forward in video-based human action recognition
PhD supervisor(s)
prof. dr. A.A. Salah
Co-supervisor(s)
dr. ir. R.W. Poppe