Multi-level fusion with deep neural networks for multimodal sentiment classification

The Journal of China Universities of Posts and Telecommunications, 2022, Vol. 29, Issue (3): 25-33. doi: 10.19682/j.cnki.1005-8885.2022.1013

Zhang Guangwei, Zhao Bing, Li Ruifan

1. School of Computer Sciences, Beijing University of Posts and Telecommunications, Beijing 100876, China

2. School of Science, Yanshan University, Qinhuangdao 066004, China

3. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China

Received: 2022-02-28 Revised: 2022-06-22 Online: 2022-06-30 Published: 2022-06-30
Corresponding author: Li Ruifan E-mail: rfli@bupt.edu.cn
Supported by:
This work was supported in part by the National Key Research and Development (R&D) Program of China (2018YFB1403003).

Abstract

The task of multimodal sentiment classification aims to associate multimodal information, such as images and texts, with appropriate sentiment polarities. In both the visual and textual modalities, features at various levels can affect human sentiment. However, most existing methods treat features at different levels independently and lack an effective method for fusing them. In this paper, we propose a multi-level fusion classification (MFC) model that predicts sentiment polarity by fusing features from different levels and exploiting the dependencies among them. The proposed architecture leverages convolutional neural networks (CNNs) with multiple layers to extract features at different levels in the image and text modalities. To account for the dependencies between low-level and high-level features, a bi-directional (Bi) recurrent neural network (RNN) is adopted to integrate the features learned by different CNN layers. In addition, a conflict detection module is incorporated to address conflicts between modalities. Experiments on the Flickr dataset demonstrate that the MFC method achieves performance comparable to strong baseline methods.

Key words: multimodal fusion, sentiment analysis, deep learning
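
The fusion step described in the abstract, a bi-directional RNN that integrates features taken from different CNN layers, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the recurrent cell, the feature dimensions, or how the two directions are combined. The sketch assumes a plain tanh RNN, assumes each level's features have already been projected to a common length, and assumes the final forward and backward states are concatenated. All function names, the random weights, and the stand-in feature values are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rnn_pass(levels, W_in, W_rec, bias):
    """Run a plain tanh RNN over an ordered list of per-level feature vectors."""
    hidden = [0.0] * len(bias)
    states = []
    for feat in levels:
        a = matvec(W_in, feat)
        r = matvec(W_rec, hidden)
        hidden = [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, bias)]
        states.append(hidden)
    return states


def bi_rnn_fuse(levels, fwd, bwd):
    """Concatenate the final state of a low-to-high pass and a high-to-low pass."""
    forward_states = rnn_pass(levels, *fwd)
    backward_states = rnn_pass(levels[::-1], *bwd)
    return forward_states[-1] + backward_states[-1]


def init_params(in_dim, hid_dim, rng):
    """Hypothetical random weights; a trained model would learn these."""
    W_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hid_dim)]
    W_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hid_dim)] for _ in range(hid_dim)]
    bias = [0.0] * hid_dim
    return W_in, W_rec, bias


rng = random.Random(0)
# Stand-in feature vectors, ordered from a low-level to a high-level CNN layer.
level_features = [
    [0.2, -0.1, 0.4, 0.0],
    [0.5, 0.3, -0.2, 0.1],
    [-0.3, 0.8, 0.1, 0.6],
]
fwd_params = init_params(in_dim=4, hid_dim=3, rng=rng)
bwd_params = init_params(in_dim=4, hid_dim=3, rng=rng)
fused = bi_rnn_fuse(level_features, fwd_params, bwd_params)
print(len(fused))  # 6: three forward units plus three backward units
```

Reading the levels in both directions lets the fused vector reflect how low-level cues condition high-level ones and vice versa, which is the dependency the abstract refers to. In a full model the fused vector would feed a sentiment classifier, and the weights would be trained jointly with the CNNs.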

Zhang Guangwei, Zhao Bing, Li Ruifan. Multi-level fusion with deep neural networks for multimodal sentiment classification [J]. The Journal of China Universities of Posts and Telecommunications, 2022, 29(3): 25-33.
