論文使用權限 Thesis access permission: unrestricted (fully open both on and off campus)
開放時間 Available:
校內 Campus: available
校外 Off-campus: available
論文名稱 Title: Learning Interpretable Cross-Modal Models with Missing Modality(可解釋缺失跨模態模型的探討與應用)
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 50
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2023-07-27
繳交日期 Date of Submission: 2023-08-24
關鍵字 Keywords: Interpretability, Multi-modal model, Missing modality, LIME, CAM, Deconvolution, Network dissection
統計 Statistics: The thesis has been browsed 153 times and downloaded 4 times.
中文摘要 Abstract (Chinese)
For multimodal learning, most work assumes that the dataset is full-modality without considering real-world conditions; few studies address the problem of missing modalities, and few consider using the features of one modality to compensate for another. In this era of the AI boom, many machine learning and deep learning techniques are widely used, but people often pursue accuracy alone and neglect interpretability: everyone knows how to feed data into a neural network for training, yet few know which features the model actually learns. Our work uses a multimodal model to improve interpretability. It combines tabular information with integrated image features, aiming to make the model both more accurate and easier to explain. Missing modalities can cause multimodal learning to fail, leading to weaker explanatory power; we therefore design the experimental procedure around missing modalities to demonstrate the feasibility of our method. Finally, the experiments employ Local Interpretable Model-agnostic Explanations (LIME) and Class Activation Map (CAM) for local explanation, to study how the rules and the images affect the predicted value. For global explanation, deconvolution and network dissection are used to examine what the model has learned, and the IoU metric is referenced to evaluate whether the concepts learned by the model are logical.
Abstract |
Most work on multimodal learning assumes that the dataset is full-modality, without considering real-world conditions. Few studies have addressed the problem of missing modalities, and few have considered methods that use the features of one modality to compensate for another. In this era of the AI boom, many machine learning and deep learning techniques are widely used, but people often pursue accuracy at the expense of interpretability: we know how to feed data into a neural network, but we do not know which features the model actually learns. Our work uses a multimodal model to improve interpretability. It combines tabular information with integrated image features, aiming to make the model both more accurate and easier to explain. Missing modalities can cause multimodal learning to fail, resulting in poor explanatory power. We design the experimental procedure around missing modalities to demonstrate the feasibility of our method. Finally, the experiments employ Local Interpretable Model-agnostic Explanations (LIME) and Class Activation Map (CAM) for local interpretation, to investigate how the rules and the images affect the predicted value. For global interpretation, we use deconvolution and network dissection to explore what the model has learned, and refer to the IoU metric to evaluate whether the concepts the model has learned are logical.
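To make the explanation tools named above concrete, the following is a minimal NumPy sketch of plain CAM (the local image explanation) and the IoU score used in network dissection (the global explanation). The array shapes, threshold choice, and function names are illustrative assumptions for this sketch, not the thesis's actual implementation.

import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    # Plain CAM: weight the final conv feature maps (C, H, W) by the target
    # class's classifier weights (num_classes, C) and sum over channels.
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0)                 # keep only positive evidence
    return cam / (cam.max() + 1e-8)          # normalize to [0, 1] for overlaying

def dissection_iou(activation_map, concept_mask, threshold):
    # Network-dissection-style score: IoU between one unit's thresholded
    # (upsampled) activation and a binary concept segmentation mask.
    active = activation_map > threshold
    intersection = np.logical_and(active, concept_mask).sum()
    union = np.logical_or(active, concept_mask).sum()
    return intersection / union if union > 0 else 0.0

# Toy usage with random arrays standing in for real model outputs.
feats = np.random.rand(512, 7, 7)            # final conv features of one image
fc_w = np.random.rand(10, 512)               # weights of a 10-class linear head
cam = class_activation_map(feats, fc_w, class_idx=3)

unit_act = np.random.rand(112, 112)          # one unit's activation, upsampled to image size
concept = np.random.rand(112, 112) > 0.7     # binary mask of a visual concept
iou = dissection_iou(unit_act, concept, threshold=np.quantile(unit_act, 0.995))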
目次 Table of Contents |
論文審定書 (Thesis Certification) i
摘要 (Chinese Abstract) ii
Abstract iii
List of Figures vi
List of Tables viii
1. Introduction 1
2. Background and Related Works 3
  2.1 Explainable AI 3
  2.2 Multi-modal Model 5
  2.3 Missing Modality 6
  2.4 Multimodal generative models 6
  2.5 LIME explainer 7
  2.6 CAM based explainer 8
  2.7 Global explainer 10
3. Methodology 11
4. Experiments 16
  4.1 Kaohsiung real estate transaction records dataset 17
    4.1.1 Datasets and pre-processing 17
    4.1.2 Performance Comparison 18
    4.1.3 Importance Ranking 21
    4.1.4 Local Explanation and Result 23
    4.1.5 Global Explanation and Result 24
  4.2 UTKFace dataset 26
    4.2.1 Datasets and pre-processing 26
    4.2.2 Performance Comparison 27
    4.2.3 Importance Ranking 30
    4.2.4 Local Explanation and Result 31
    4.2.5 Global Explanation and Result 32
5. Conclusion and future work 34
Reference 37
Appendix 40
  Appendix A. Kaohsiung real estate transaction records dataset 40
    Figure A1. RMSE comparison of different models in the training set (handling missing modality with our method, AE, and GAN, from left to right) 40
    Figure A2. MAE comparison of different models in the training set (handling missing modality with our method, AE, and GAN, from left to right) 40
  Appendix B. UTKFace dataset 41
    Figure B1. RMSE comparison of different models in the training set (handling missing modality with our method, AE, and GAN, from left to right) 41
    Figure B2. MAE comparison of different models in the training set (handling missing modality with our method, AE, and GAN, from left to right) 41
參考文獻 References |
Baldi, P. (2012). Autoencoders, Unsupervised Learning, and Deep Architectures. Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 37–49. https://proceedings.mlr.press/v27/baldi12a.html
Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2017). Multimodal Machine Learning: A Survey and Taxonomy (arXiv:1705.09406). arXiv. https://doi.org/10.48550/arXiv.1705.09406
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network Dissection: Quantifying Interpretability of Deep Visual Representations (arXiv:1704.05796). arXiv. https://doi.org/10.48550/arXiv.1704.05796
Chattopadhyay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018). Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 839–847. https://doi.org/10.1109/WACV.2018.00097
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929). arXiv. https://doi.org/10.48550/arXiv.2010.11929
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence, 2(11), 665–673. https://doi.org/10.1038/s42256-020-00257-z
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Networks (arXiv:1406.2661). arXiv. https://arxiv.org/abs/1406.2661v1
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation (arXiv:1411.4038). arXiv. https://doi.org/10.48550/arXiv.1411.4038
Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C., & Peng, X. (2021). SMIL: Multimodal Learning with Severely Missing Modality (arXiv:2103.05677). arXiv. http://arxiv.org/abs/2103.05677
Noh, H., Hong, S., & Han, B. (2015). Learning Deconvolution Network for Semantic Segmentation (arXiv:1505.04366). arXiv. https://doi.org/10.48550/arXiv.1505.04366
Pandey, G., & Dukkipati, A. (2016). Variational Methods for Conditional Multimodal Deep Learning (arXiv:1603.01801). arXiv. https://arxiv.org/abs/1603.01801v2
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier (arXiv:1602.04938). arXiv. http://arxiv.org/abs/1602.04938
Samek, W., Wiegand, T., & Müller, K.-R. (2017). Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models (arXiv:1708.08296). arXiv. https://doi.org/10.48550/arXiv.1708.08296
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 128(2), 336–359. https://doi.org/10.1007/s11263-019-01228-7
Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489. https://doi.org/10.1038/nature16961
Suzuki, M., Nakayama, K., & Matsuo, Y. (2016). Joint Multimodal Learning with Deep Generative Models (arXiv:1611.01891). arXiv. http://arxiv.org/abs/1611.01891
Tran, L., Liu, X., Zhou, J., & Jin, R. (2017). Missing Modalities Imputation via Cascaded Residual Autoencoder. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4971–4980. https://doi.org/10.1109/CVPR.2017.528
Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., & Salakhutdinov, R. (2019). Learning Factorized Multimodal Representations (arXiv:1806.06176). arXiv. http://arxiv.org/abs/1806.06176
Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., & Hu, X. (2020). Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks (arXiv:1910.01279). arXiv. https://doi.org/10.48550/arXiv.1910.01279
Wu, M., & Goodman, N. (2018). Multimodal Generative Models for Scalable Weakly-Supervised Learning (arXiv:1802.05335). arXiv. http://arxiv.org/abs/1802.05335
電子全文 Fulltext |
This electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
紙本論文 Printed copies |
Public-access information for printed copies is relatively complete for the 102 academic year (ROC calendar) and later. To inquire about printed copies from the 101 academic year or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience. Available: available