Research Article

A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51

Volume: 9, Number: 2, December 15, 2025

Abstract

Video-based Human Action Recognition (HAR) remains challenging due to inter-class similarity, background noise, and the need to capture long-term temporal dependencies. This study proposes a hybrid deep learning model that integrates 3D Convolutional Neural Networks (3D CNNs) with Transformer-based attention mechanisms to jointly capture spatio-temporal features and long-range motion context. The architecture was optimized for parameter efficiency and trained on the UCF101 and HMDB51 benchmark datasets using standardized preprocessing and training strategies. Experimental results indicate that the proposed model reaches 97% accuracy and a 96.8% mean F1-score on UCF101, and 85% accuracy and an 83.8% F1-score on HMDB51, showing consistent improvements over the standalone 3D CNN and Transformer variants under identical settings. Ablation studies confirm that combining convolutional and attention layers significantly improves recognition performance while maintaining competitive computational cost (3.78M parameters, 17.75 GFLOPs/video, ~7 ms GPU latency). These findings highlight the effectiveness of the hybrid design for accurate and efficient HAR. Future work will address class imbalance using focal loss or weighted training, explore multimodal data integration, and develop more lightweight Transformer modules for real-time deployment on resource-constrained devices.
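The focal loss mentioned as future work for handling class imbalance follows the standard formulation FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where the (1 - p_t)^γ factor down-weights easy, well-classified examples so training focuses on hard ones. The sketch below is an illustrative implementation of that general formula, not code from the paper; the function name `focal_loss` and the defaults γ = 2 and α = 0.25 are common choices from the focal loss literature, not values reported in this study:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p:     predicted probability of the positive class (0 < p < 1)
    y:     ground-truth label, 0 or 1
    gamma: focusing parameter; gamma = 0 recovers plain cross-entropy
    alpha: class-balancing weight applied to the positive class
    """
    # p_t is the probability the model assigns to the true class
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss for well-classified examples,
    # so rare/hard classes dominate the gradient signal
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 and α = 1 the function reduces to ordinary cross-entropy; increasing γ sharply reduces the contribution of confident correct predictions while leaving hard examples nearly untouched, which is why it is a natural candidate for the imbalanced classes noted in the abstract.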

Keywords


Details

Primary Language

English

Subjects

Image Processing

Journal Section

Research Article

Early Pub Date

November 18, 2025

Publication Date

December 15, 2025

Submission Date

May 21, 2025

Acceptance Date

September 29, 2025

Published in Issue

Year: 2025, Volume: 9, Number: 2

APA
Seven, E., & Yücel Demirel, E. (2025). A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51. Journal of Innovative Science and Engineering, 9(2), 327-342. https://doi.org/10.38088/jise.1703936


Creative Commons License

The works published in the Journal of Innovative Science and Engineering (JISE) are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.