The Journal of China Universities of Posts and Telecommunications ›› 2019, Vol. 26 ›› Issue (2): 52-58. DOI: 10.19682/j.cnki.1005-8885.2019.1006

• Artificial Intelligence •

Video description with subject, verb and object supervision

Wang Yue, Liu Jinlai, Wang Xiaojie   

  1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received:2018-12-06 Revised:2019-04-01 Online:2019-04-30 Published:2019-06-14
  • Contact: Corresponding author: Wang Yue, E-mail: wangyuesophie@bupt.edu.cn
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61273365) and the 111 Project (B08004).

Abstract: Video description aims to generate descriptive natural language for videos. Inspired by the deep neural network (DNN) approach used in machine translation, video description (VD) models apply a convolutional neural network (CNN) to extract video features and a long short-term memory (LSTM) network to generate descriptions. However, some models generate incorrect words and syntax, possibly because previous models rely solely on the LSTM to generate sentences and thus learn insufficient linguistic information. To solve this problem, an end-to-end DNN model incorporating subject, verb and object (SVO) supervision is proposed. Experimental results on a publicly available dataset, i.e. Youtube2Text, show that the proposed model achieves a 58.4% consensus-based image description evaluation (CIDEr) score. It outperforms the mean pool and video description with first feed (VD-FF) models, demonstrating the effectiveness of SVO supervision.
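
The abstract describes the architecture only at a high level, so the PyTorch-style listing below is a minimal sketch of one plausible realization: mean-pooled CNN frame features condition an LSTM caption decoder, while auxiliary classifiers supervise the subject, verb and object, and their losses are added to the caption loss. All module names, dimensions and the loss weight (SVOCaptioner, feat_dim, alpha, etc.) are illustrative assumptions, not the authors' exact model.

# Sketch: mean-pool CNN features -> LSTM decoder with auxiliary SVO heads (assumed design).
import torch
import torch.nn as nn

class SVOCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000,
                 n_subjects=300, n_verbs=200, n_objects=300):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)    # project pooled CNN features
        self.embed = nn.Embedding(vocab_size, hidden_dim)   # word embeddings
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # next-word prediction
        # auxiliary heads that supervise subject, verb and object from the video feature
        self.subj_head = nn.Linear(hidden_dim, n_subjects)
        self.verb_head = nn.Linear(hidden_dim, n_verbs)
        self.obj_head = nn.Linear(hidden_dim, n_objects)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) word ids
        video = self.feat_proj(frame_feats.mean(dim=1))      # mean-pool over frames
        svo_logits = (self.subj_head(video), self.verb_head(video), self.obj_head(video))
        # condition the decoder on the video by feeding it as the first input step
        inputs = torch.cat([video.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)
        word_logits = self.word_head(hidden)                 # (batch, seq_len, vocab_size)
        return word_logits, svo_logits

def joint_loss(word_logits, captions, svo_logits, svo_labels, alpha=0.5):
    # caption cross-entropy plus weighted SVO cross-entropy terms (alpha is assumed)
    ce = nn.CrossEntropyLoss()
    cap_loss = ce(word_logits.reshape(-1, word_logits.size(-1)), captions.reshape(-1))
    svo_loss = sum(ce(logits, labels) for logits, labels in zip(svo_logits, svo_labels))
    return cap_loss + alpha * svo_loss

Training would minimize joint_loss over video-caption pairs whose SVO labels are extracted from the reference sentences; at test time only the caption decoder is used.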

Key words: VD, DNN, CNN, LSTM
