Research Article

A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51

Year 2025, Volume: 9, Issue: 2, 327-342

Abstract

Video-Based Human Action Recognition (HAR) remains challenging due to inter-class similarity, background noise, and the need to capture long-term temporal dependencies. This study proposes a hybrid deep learning model that integrates 3D Convolutional Neural Networks (3D CNNs) with Transformer-based attention mechanisms to jointly capture spatio-temporal features and long-range motion context. The architecture was optimized for parameter efficiency and trained on the UCF101 and HMDB51 benchmark datasets using standardized preprocessing and training strategies. Experimental results indicate that the proposed model reaches 97% accuracy and 96.8% mean F1-score on UCF101, and 85% accuracy and 83.8% F1-score on HMDB51, showing consistent improvements over the standalone 3D CNN and Transformer variants under identical settings. Ablation studies confirm that the combination of convolutional and attention layers significantly improves recognition performance while maintaining competitive computational cost (3.78M parameters, 17.75 GFLOPs/video, ~7 ms GPU latency). These findings highlight the effectiveness of the hybrid design for accurate and efficient HAR. Future work will address class imbalance using focal loss or weighted training, explore multimodal data integration, and develop more lightweight Transformer modules for real-time deployment on resource-constrained devices.
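
As a rough illustration of the design described in the abstract, the PyTorch sketch below pairs a small 3D CNN stem (short-range spatio-temporal features) with a Transformer encoder applied to per-frame tokens (long-range temporal context), followed by a classification head. This is a minimal sketch under stated assumptions, not the authors' architecture: the class name Hybrid3DCNNTransformer, the layer sizes, the token layout, and the hyperparameters are illustrative and do not reproduce the reported 3.78M-parameter model.

```python
# Minimal sketch (illustrative, not the paper's exact model) of a hybrid
# 3D CNN + Transformer classifier for clip-level action recognition.
import torch
import torch.nn as nn


class Hybrid3DCNNTransformer(nn.Module):
    def __init__(self, num_classes=101, embed_dim=256, num_heads=4, depth=2):
        super().__init__()
        # 3D CNN stem: extracts short-range spatio-temporal features.
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, embed_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(embed_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
        )
        # Transformer encoder: attends across the per-frame feature tokens
        # to model long-range temporal dependencies.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (batch, channels=3, frames, height, width)
        feats = self.cnn(x)                         # (B, C, T, 1, 1)
        tokens = feats.flatten(2).transpose(1, 2)   # (B, T, C) temporal tokens
        tokens = self.transformer(tokens)           # attention over frames
        return self.head(tokens.mean(dim=1))        # temporal average + classify


if __name__ == "__main__":
    # Example: a batch of two 16-frame 112x112 RGB clips (UCF101-style input).
    clips = torch.randn(2, 3, 16, 112, 112)
    model = Hybrid3DCNNTransformer(num_classes=101)
    print(model(clips).shape)  # torch.Size([2, 101])
```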


Details

Primary Language English
Subjects Image Processing
Journal Section Research Articles
Authors

Engin Seven 0000-0002-7994-2679

Eylem Yücel Demirel 0000-0003-1979-8860

Early Pub Date November 18, 2025
Publication Date November 21, 2025
Submission Date May 21, 2025
Acceptance Date September 29, 2025
Published in Issue Year 2025 Volume: 9 Issue: 2

Cite

APA Seven, E., & Yücel Demirel, E. (2025). A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51. Journal of Innovative Science and Engineering, 9(2), 327-342. https://doi.org/10.38088/jise.1703936
AMA Seven E, Yücel Demirel E. A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51. JISE. November 2025;9(2):327-342. doi:10.38088/jise.1703936
Chicago Seven, Engin, and Eylem Yücel Demirel. “A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition With Improved Accuracy on UCF101 and HMDB51”. Journal of Innovative Science and Engineering 9, no. 2 (November 2025): 327-42. https://doi.org/10.38088/jise.1703936.
EndNote Seven E, Yücel Demirel E (November 1, 2025) A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51. Journal of Innovative Science and Engineering 9 2 327–342.
IEEE E. Seven and E. Yücel Demirel, “A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51”, JISE, vol. 9, no. 2, pp. 327–342, 2025, doi: 10.38088/jise.1703936.
ISNAD Seven, Engin - Yücel Demirel, Eylem. “A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition With Improved Accuracy on UCF101 and HMDB51”. Journal of Innovative Science and Engineering 9/2 (November 2025), 327-342. https://doi.org/10.38088/jise.1703936.
JAMA Seven E, Yücel Demirel E. A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51. JISE. 2025;9:327–342.
MLA Seven, Engin and Eylem Yücel Demirel. “A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition With Improved Accuracy on UCF101 and HMDB51”. Journal of Innovative Science and Engineering, vol. 9, no. 2, 2025, pp. 327-42, doi:10.38088/jise.1703936.
Vancouver Seven E, Yücel Demirel E. A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51. JISE. 2025;9(2):327-42.


Creative Commons License

The works published in Journal of Innovative Science and Engineering (JISE) are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.