The Journal of China Universities of Posts and Telecommunications ›› 2019, Vol. 26 ›› Issue (2): 52-58. DOI: 10.19682/j.cnki.1005-8885.2019.1006

• Artificial Intelligence •

Video description with subject, verb and object supervision

Wang Yue, Liu Jinlai, Wang Xiaojie   

  1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received:2018-12-06 Revised:2019-04-01 Online:2019-04-30 Published:2019-06-14
  • Contact: Corresponding author: Wang Yue, E-mail: wangyuesophie@bupt.edu.cn
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61273365) and the 111 Project (B08004).

Abstract: Video description aims to generate descriptive natural language for videos. Inspired by the deep neural network (DNN) approach used in machine translation, video description (VD) models apply a convolutional neural network (CNN) to extract video features and a long short-term memory (LSTM) network to generate descriptions. However, some models generate incorrect words and syntax, possibly because previous models rely solely on the LSTM to generate sentences and thus learn insufficient linguistic information. To solve this problem, an end-to-end DNN model incorporating subject, verb and object (SVO) supervision is proposed. Experimental results on a publicly available dataset, i.e. Youtube2Text, show that the proposed model achieves a 58.4% consensus-based image description evaluation (CIDEr) score. It outperforms the mean pool and video description with first feed (VD-FF) models, demonstrating the effectiveness of SVO supervision.
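
The abstract describes the architecture only at a high level, so the PyTorch-style listing below is a minimal sketch of one plausible realization: mean-pooled CNN frame features condition an LSTM caption decoder, while auxiliary classifiers supervise the subject, verb and object, and their losses are added to the caption loss. All module names, dimensions and the loss weight (SVOCaptioner, feat_dim, alpha, etc.) are illustrative assumptions, not the authors' exact model.

# Sketch: mean-pool CNN features -> LSTM decoder with auxiliary SVO heads (assumed design).
import torch
import torch.nn as nn

class SVOCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000,
                 n_subjects=300, n_verbs=200, n_objects=300):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)    # project pooled CNN features
        self.embed = nn.Embedding(vocab_size, hidden_dim)   # word embeddings
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # next-word prediction
        # auxiliary heads that supervise subject, verb and object from the video feature
        self.subj_head = nn.Linear(hidden_dim, n_subjects)
        self.verb_head = nn.Linear(hidden_dim, n_verbs)
        self.obj_head = nn.Linear(hidden_dim, n_objects)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) word ids
        video = self.feat_proj(frame_feats.mean(dim=1))      # mean-pool over frames
        svo_logits = (self.subj_head(video), self.verb_head(video), self.obj_head(video))
        # condition the decoder on the video by feeding it as the first input step
        inputs = torch.cat([video.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)
        word_logits = self.word_head(hidden)                 # (batch, seq_len, vocab_size)
        return word_logits, svo_logits

def joint_loss(word_logits, captions, svo_logits, svo_labels, alpha=0.5):
    # caption cross-entropy plus weighted SVO cross-entropy terms (alpha is assumed)
    ce = nn.CrossEntropyLoss()
    cap_loss = ce(word_logits.reshape(-1, word_logits.size(-1)), captions.reshape(-1))
    svo_loss = sum(ce(logits, labels) for logits, labels in zip(svo_logits, svo_labels))
    return cap_loss + alpha * svo_loss

Training would minimize joint_loss over video-caption pairs whose SVO labels are extracted from the reference sentences; at test time only the caption decoder is used.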

Key words: VD, DNN, CNN, LSTM
