Multi-level fusion with deep neural networks for multimodal sentiment classification

The Journal of China Universities of Posts and Telecommunications, 2022, Vol. 29, Issue (3): 25-33. doi: 10.19682/j.cnki.1005-8885.2022.1013

Zhang Guangwei, Zhao Bing, Li Ruifan

1. School of Computer Sciences, Beijing University of Posts and Telecommunications, Beijing 100876, China

2. School of Science, Yanshan University, Qinhuangdao 066004, China

3. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China

Received: 2022-02-28 Revised: 2022-06-22 Online: 2022-06-30 Published: 2022-06-30
Contact: Li Ruifan E-mail: rfli@bupt.edu.cn
Supported by:
This work was supported in part by the National Key Research and Development (R&D) Program of China (2018YFB1403003).

Abstract

Multimodal sentiment classification aims to associate multimodal information, such as images and texts, with appropriate sentiment polarities. In both the visual and textual modalities, features at various levels can affect human sentiment. However, most existing methods treat these levels of features independently and lack an effective method for fusing them. In this paper, we propose a multi-level fusion classification (MFC) model that predicts sentiment polarity by fusing features from different levels and exploiting the dependencies among them. The proposed architecture uses convolutional neural networks (CNNs) with multiple layers to extract features at different levels in the image and text modalities. To capture the dependencies between low-level and high-level features, a bi-directional (Bi) recurrent neural network (RNN) integrates the features learned by different CNN layers. In addition, a conflict detection module is incorporated to address conflicts between modalities. Experiments on the Flickr dataset demonstrate that the MFC method achieves performance comparable to that of strong baseline methods.

Key words: multimodal fusion, sentiment analysis, deep learning
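
The fusion step described in the abstract, a bi-directional RNN reading features from successive CNN layers, might look roughly like the sketch below. The abstract gives no layer sizes, cell type, or combination rule, so everything specific here is an assumption made for illustration: a plain tanh recurrent cell, random untrained weights, three hypothetical levels of four-dimensional features, and concatenation of the two final hidden states. The function names are hypothetical, and the conflict detection module is not covered.

```python
import math
import random

def rnn_pass(levels, w_in, w_rec):
    """Run a plain tanh RNN over a list of feature vectors; return every hidden state."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in levels:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
        states.append(hidden)
    return states

def bi_rnn_fuse(levels, params_fwd, params_bwd):
    """Fuse level features with one low-to-high pass and one high-to-low pass."""
    fwd = rnn_pass(levels, *params_fwd)
    bwd = rnn_pass(levels[::-1], *params_bwd)
    # Assumed fusion rule: concatenate the final hidden state of each direction.
    return fwd[-1] + bwd[-1]

def random_params(in_dim, hid_dim, rng):
    """Untrained placeholder weights; a real model would learn these."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hid_dim)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hid_dim)] for _ in range(hid_dim)]
    return w_in, w_rec

rng = random.Random(0)
# Hypothetical pooled outputs of three CNN layers, ordered from low level to high level.
levels = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.8, 0.1, 0.0]]
fused = bi_rnn_fuse(levels, random_params(4, 3, rng), random_params(4, 3, rng))
print(len(fused))  # fused vector length is twice the hidden size
```

Reading the levels in both directions lets each direction's final state summarize the whole stack of layers starting from a different end, which is one plausible way to model the dependency between low-level and high-level features.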

Zhang Guangwei, Zhao Bing, Li Ruifan. Multi-level fusion with deep neural networks for multimodal sentiment classification [J]. The Journal of China Universities of Posts and Telecommunications, 2022, 29(3): 25-33.
