National Sun Yat-sen University, thesis/dissertation, Interactive and Interpretable Topic Refinement for Analyzing Online Vaccine-Related Narratives

Title	Interactive and Interpretable Topic Refinement for Analyzing Online Vaccine-Related Narratives (在線疫苗相關敘述之可互動且可解釋的主題精煉方法)
Department	Department of Information Management
Year, semester	Fall semester of Academic Year 112	Language	English
Degree	Master	Number of pages	61
Author	Ching-Chung Chen (陳靖中)
Advisor	Kang, Yi-Huang (康藝晃)
Convenor	Lee, Pei-Ju (李珮如)
Advisory Committee	Han, Yi (韓毅)
Date of Exam	2024-01-24	Date of Submission	2024-02-02
Keywords	COVID-19, Anti-vaccination, Narrative analysis, Topic modeling, Data visualization, Interactive system, Human-in-the-loop

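Chinese Abstract
Since the outbreak of the COVID-19 coronavirus, governments and researchers worldwide have sought to understand public opinion about the pandemic and the trends in anti-vaccine and pro-vaccine discourse in order to control the pandemic more effectively. However, few visual analytics systems target the pandemic, and most do not let users correct data labels themselves to improve the analysis results. To address this, this study uses a deep learning model and a semi-supervised clustering model to build a topic model that distinguishes anti-vaccine posts from pro-vaccine posts and produces topic analysis results that are easy to interpret. Combined with interactive charts, the result is an interactive and interpretable system that lets users examine in depth the similar and distinct topics and narratives of anti-vaccine and pro-vaccine content. The system also incorporates a constrained clustering algorithm that supports a human-in-the-loop process through the system interface: users can explore the relationships among topics through visualizations, verify whether each post's label is correct, correct labels that may be wrong, and then rebuild the topic model and observe the results, repeating this process to refine the topic analysis. This study uses COVID-19 vaccine-related social media posts as a case study to test the system's ability to identify anti-vaccine and pro-vaccine narratives. The experimental results show that the system helps improve the topic model's evaluation metrics, such as entropy and purity, giving users a more precise understanding of the relationship between anti-vaccine and pro-vaccine topics.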
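The entropy and purity metrics named above are standard external clustering measures. The sketch below shows one common way to compute them for a topic assignment against reference stance labels. It is an illustration only: the function name `purity_and_entropy`, the label strings, and the choice of base-2 logarithms are assumptions, and the thesis may define or weight these metrics differently.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, size-weighted entropy in bits) of a clustering.

    cluster_ids  -- topic/cluster id per post (hypothetical input format)
    class_labels -- reference label per post, e.g. "anti" or "pro"
    """
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")

    # Group the reference labels by the cluster each post was assigned to.
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)

    n = len(cluster_ids)
    majority_total = 0
    weighted_entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        # Purity counts the posts that match their cluster's majority label.
        majority_total += max(counts.values())
        # Shannon entropy of the label distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        weighted_entropy += (size / n) * h

    return majority_total / n, weighted_entropy


if __name__ == "__main__":
    topics = [0, 0, 0, 1, 1, 1]
    stances = ["anti", "anti", "pro", "pro", "pro", "pro"]
    print(purity_and_entropy(topics, stances))
```

Higher purity and lower entropy both indicate topics that mix the two stances less, which is the direction of improvement the abstract reports after refinement.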
Abstract
This research aims to develop highly interpretable models that generate easy-to-explain data representations of social media texts, thereby improving the interpretability of online measurements extracted from user-generated social media text. Such a capacity can benefit our research seeking to measure online engagement and its connection to collective decision-making on societal changes. In this research, we develop an interactive and interpretable framework that allows analysts to identify texts with similar or distinct narratives. We use social media text related to Coronavirus disease 2019 (COVID-19) vaccines as a case study and test the capability of our framework to identify Anti-vaccine and Pro-vaccine narratives. Our framework offers two major advantages. First, it leverages semi-supervised topic modeling with a deep learning architecture to identify topics that distinguish between Anti-vaccine and Pro-vaccine posts. Second, it incorporates a constrained hierarchical clustering method that allows human-in-the-loop topic refinement through the system interface, where analysts can explore the relationships among topics via visual representations, verify the labels of post instances, and update labels that are likely to be incorrect or uncertain. Our evaluation shows that refinement significantly improves the topics' coherence and allows for exploring the relationship between Anti-vaccine and Pro-vaccine topics.
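To make the human-in-the-loop refinement step more concrete, the sketch below shows one way analyst-verified labels could be turned into instance-level constraints and checked against a clustering. It is not the thesis's constrained hierarchical clustering method. The function names, the post ids, and the rule that posts with different verified stances should not share a topic are all assumptions made for illustration, as is the use of -1 as a noise label (the convention in HDBSCAN-style clustering).

```python
from itertools import combinations


def cannot_link_pairs(verified_labels):
    """Build cannot-link pairs from analyst-verified stance labels.

    verified_labels -- dict mapping post id to "anti" or "pro"
    Assumption: two posts with different verified stances should not be
    placed in the same topic.
    """
    ids = sorted(verified_labels)
    return {
        (a, b)
        for a, b in combinations(ids, 2)
        if verified_labels[a] != verified_labels[b]
    }


def violations(assignment, pairs, noise=-1):
    """Return the cannot-link pairs that a clustering violates.

    assignment -- dict mapping post id to cluster id
    noise      -- id used for unclustered posts (assumed to be -1);
                  noise posts never count as sharing a cluster
    """
    return sorted(
        (a, b)
        for a, b in pairs
        if assignment[a] == assignment[b] and assignment[a] != noise
    )


if __name__ == "__main__":
    verified = {"p1": "anti", "p2": "pro", "p3": "anti"}
    pairs = cannot_link_pairs(verified)
    clusters = {"p1": 0, "p2": 0, "p3": 1}
    # Lists the verified pairs that this topic assignment mixes together.
    print(violations(clusters, pairs))
```

In an iterative workflow, a list like this could point the analyst to the posts worth re-inspecting before the topic model is rebuilt.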

Table of Contents
Thesis Approval Form
Thesis Public Release Authorization
Abstract (Chinese)
Abstract
List of Figures
List of Tables
1. Introduction
2. Background and Related Works
  2.1 Interpretable Topic Modeling
  2.2 Interactive System of Topic Exploration
  2.3 Constraint-based Clustering
3. Methodology
  3.1 Topic Modeling via BERTopic
  3.2 Topic Refinement
  3.3 Interactive System
    3.3.1 Data and Algorithm Control
    3.3.2 Overview Map
    3.3.3 Topic Inspection
4. Results
  4.1 Performance Comparison
  4.2 C-HDBSCAN Hyperparameter Selection
5. Conclusion and Future Work
Reference


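Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available: Campus: available; Off-campus: available
etd-0102124-112803.pdf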
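Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available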


Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2453, Thesis Format Review Team │ Mail │ Development and operations: Knowledge Innovation Division, LIS