The Journal of China Universities of Posts and Telecommunications, 2022, Vol. 29, Issue 3: 25-33. doi: 10.19682/j.cnki.1005-8885.2022.1013


Multi-level fusion with deep neural networks for multimodal sentiment classification

Zhang Guangwei, Zhao Bing, Li Ruifan   

1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
    2. School of Science, Yanshan University, Qinhuangdao 066004, China
    3. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China


  • Received: 2022-02-28  Revised: 2022-06-22  Online: 2022-06-30  Published: 2022-06-30
  • Contact: Li Ruifan  E-mail: rfli@bupt.edu.cn
  • Supported by:
    This work was supported in part by the National Key Research and Development (R&D) Program of China (2018YFB1403003).


Abstract: The task of multimodal sentiment classification aims to associate multimodal information, such as images and texts, with appropriate sentiment polarities. Features at various levels of the visual and textual modalities can affect human sentiment. However, most existing methods treat features at different levels independently, without an effective method for fusing them. In this paper, we propose a multi-level fusion classification (MFC) model that predicts sentiment polarity by fusing features from different levels and exploiting the dependencies among them. The proposed architecture leverages convolutional neural networks (CNNs) with multiple layers to extract multi-level features from the image and text modalities. Considering the dependencies between low-level and high-level features, a bi-directional (Bi) recurrent neural network (RNN) is adopted to integrate the features learned from different CNN layers. In addition, a conflict detection module is incorporated to address conflicts between the modalities. Experiments on the Flickr dataset demonstrate that the MFC method achieves performance comparable with that of strong baseline methods.

Key words: multimodal fusion, sentiment analysis, deep learning
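
The fusion pipeline sketched in the abstract (multi-layer CNN feature extraction followed by Bi-RNN integration) can be illustrated with the minimal PyTorch sketch below. It is an assumption-laden illustration of the general technique, not the authors' MFC implementation: the layer sizes, module names, single (image-only) modality, and the omission of the conflict detection module are all simplifications introduced here.

    # Minimal sketch of multi-level feature fusion with a Bi-RNN, as the
    # abstract describes. All dimensions and names are illustrative
    # assumptions; the paper's actual MFC model may differ substantially.
    import torch
    import torch.nn as nn

    class MultiLevelFusionSketch(nn.Module):
        def __init__(self, num_classes: int = 2, feat_dim: int = 128):
            super().__init__()
            # Three convolutional stages standing in for low-, mid-, and
            # high-level features; a real model would use a deeper backbone.
            self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            # Project each stage's pooled output to a common dimension so the
            # three stages can be stacked as a length-3 sequence.
            self.proj = nn.ModuleList([nn.Linear(c, feat_dim) for c in (32, 64, 128)])
            self.pool = nn.AdaptiveAvgPool2d(1)
            # A bidirectional GRU integrates the level-wise features, modeling
            # dependencies between low- and high-level representations.
            self.birnn = nn.GRU(feat_dim, feat_dim, bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * feat_dim, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            f1 = self.stage1(x)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            levels = []
            for proj, f in zip(self.proj, (f1, f2, f3)):
                levels.append(proj(self.pool(f).flatten(1)))  # (batch, feat_dim)
            seq = torch.stack(levels, dim=1)                  # (batch, 3, feat_dim)
            _, h = self.birnn(seq)                            # h: (2, batch, feat_dim)
            fused = torch.cat([h[0], h[1]], dim=-1)           # forward + backward states
            return self.classifier(fused)

    # Example: sentiment logits for a batch of four 64x64 RGB images.
    model = MultiLevelFusionSketch()
    logits = model(torch.randn(4, 3, 64, 64))
    print(logits.shape)  # torch.Size([4, 2])

Treating the level-wise features as a short sequence lets the bidirectional GRU read them in both the low-to-high and high-to-low directions, which is the core fusion idea the abstract attributes to the Bi-RNN component.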