Research Article

A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51

Volume: 9, Number: 2, December 15, 2025

Abstract

Video-based Human Action Recognition (HAR) remains challenging due to inter-class similarity, background noise, and the need to capture long-term temporal dependencies. This study proposes a hybrid deep learning model that integrates 3D Convolutional Neural Networks (3D CNNs) with Transformer-based attention mechanisms to jointly capture spatio-temporal features and long-range motion context. The architecture was optimized for parameter efficiency and trained on the UCF101 and HMDB51 benchmark datasets using standardized preprocessing and training strategies. Experimental results indicate that the proposed model reaches 97% accuracy and a 96.8% mean F1-score on UCF101, and 85% accuracy and an 83.8% F1-score on HMDB51, showing consistent improvements over the standalone 3D CNN and Transformer variants under identical settings. Ablation studies confirm that combining convolutional and attention layers significantly improves recognition performance while maintaining competitive computational cost (3.78M parameters, 17.75 GFLOPs/video, ~7 ms GPU latency). These findings highlight the effectiveness of the hybrid design for accurate and efficient HAR. Future work will address class imbalance using focal loss or weighted training, explore multimodal data integration, and develop more lightweight Transformer modules for real-time deployment on resource-constrained devices.
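The focal loss mentioned as future work for handling class imbalance follows the standard formulation FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where the (1 - p_t)^γ factor down-weights easy, well-classified examples so training focuses on hard ones. The sketch below is an illustrative implementation of that general formula, not code from the paper; the function name `focal_loss` and the defaults γ = 2 and α = 0.25 are common choices from the focal loss literature, not values reported in this study:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p:     predicted probability of the positive class (0 < p < 1)
    y:     ground-truth label, 0 or 1
    gamma: focusing parameter; gamma = 0 recovers plain cross-entropy
    alpha: class-balancing weight applied to the positive class
    """
    # p_t is the probability the model assigns to the true class
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss for well-classified examples,
    # so rare/hard classes dominate the gradient signal
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 and α = 1 the function reduces to ordinary cross-entropy; increasing γ sharply reduces the contribution of confident correct predictions while leaving hard examples nearly untouched, which is why it is a natural candidate for the imbalanced classes noted in the abstract.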

Keywords


Details

Primary Language

English

Subjects

Image Processing

Journal Section

Research Article

Early Pub Date

November 18, 2025

Publication Date

December 15, 2025

Submission Date

May 21, 2025

Acceptance Date

September 29, 2025

Published in Issue

Year: 2025, Volume: 9, Number: 2

APA
Seven, E., & Yücel Demirel, E. (2025). A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51. Journal of Innovative Science and Engineering, 9(2), 327-342. https://doi.org/10.38088/jise.1703936


Creative Commons License

The works published in the Journal of Innovative Science and Engineering (JISE) are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.