Title page for etd-0102124-112803
Title
Interactive and Interpretable Topic Refinement for Analyzing Online Vaccine-Related Narratives
Department
Year, semester
Language
Degree
Number of pages
61
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2024-01-24
Date of Submission
2024-02-02
Keywords
Covid-19, Anti-vaccination, Narratives analysis, Topic modeling, Data visualization, Interactive system, Human-in-the-loop
Statistics
This thesis/dissertation has been browsed 102 times and downloaded 0 times.
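Chinese Abstract (English translation)
After the outbreak of COVID-19, governments and researchers worked to understand public opinion about the pandemic and the trends in anti-vaccine and pro-vaccine discourse, in order to control the pandemic more effectively. However, few visual analytics systems target the pandemic, and most do not let users correct data labels themselves to improve the analysis results. To address this, this study uses a deep learning model and a semi-supervised clustering model to build a topic model that distinguishes anti-vaccine posts from pro-vaccine posts and produces easily interpretable topic analysis results. Combined with interactive charts, this yields an interactive and interpretable system through which users can examine in depth the similar or distinct topics and narratives of anti-vaccine and pro-vaccine content. The system also incorporates a constrained clustering algorithm that supports a human-in-the-loop process through the system interface: users can explore the relationships between topics through visualizations, verify whether each post's label is correct, correct labels that may be wrong, and then rebuild the topic model and observe the results, repeating this cycle to refine the topic analysis. This study uses COVID-19 vaccine-related social media posts as a case study to test the system's ability to identify anti-vaccine and pro-vaccine narratives. The experimental results show that the system helps improve the topic model's evaluation metrics, such as entropy and purity, giving users a more precise understanding of the relationship between anti-vaccine and pro-vaccine topics.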
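The abstract names entropy and purity as evaluation metrics but does not define them in this record. The following is a minimal sketch of one common definition of each, assuming every post has a topic (cluster) assignment and a reference stance label; the function name `cluster_purity_entropy` and the sample data are hypothetical, and the thesis may compute these metrics differently.

```python
from collections import Counter, defaultdict
from math import log2


def cluster_purity_entropy(cluster_ids, labels):
    """Return (purity, weighted_entropy) of a clustering against reference labels.

    Purity: fraction of posts that carry the majority label of their cluster
    (higher is better). Entropy: size-weighted average of the label entropy
    inside each cluster, in bits (lower is better).
    These are common textbook definitions, assumed here for illustration.
    """
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("at least one post is required")

    # Group the reference labels by cluster.
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members[cluster].append(label)

    n = len(labels)
    majority_total = 0
    weighted_entropy = 0.0
    for cluster_labels in members.values():
        counts = Counter(cluster_labels)
        size = len(cluster_labels)
        majority_total += max(counts.values())
        entropy = -sum((k / size) * log2(k / size) for k in counts.values())
        weighted_entropy += (size / n) * entropy
    return majority_total / n, weighted_entropy


# Hypothetical example: two topics, each containing a single stance.
purity, entropy = cluster_purity_entropy(
    [0, 0, 1, 1], ["anti", "anti", "pro", "pro"]
)
```

Under these definitions, a topic that mixes anti-vaccine and pro-vaccine posts lowers purity and raises entropy, which is why label corrections that separate the two stances would be expected to move both metrics in the desired direction.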
Abstract
This research develops highly interpretable models that generate easy-to-explain representations of social media text, making measurements derived from user-generated posts easier to interpret. This capacity benefits research that measures online engagement and its connection to collective decision-making on societal change. We develop an interactive and interpretable framework that lets analysts identify texts with similar or distinct narratives. Using social media text related to coronavirus disease 2019 (COVID-19) vaccines as a case study, we test the framework's ability to identify Anti-vaccine and Pro-vaccine narratives. The framework offers two major advantages. First, it leverages semi-supervised topic modeling with a deep learning architecture to identify topics that distinguish between Anti-vaccine and Pro-vaccine posts. Second, it incorporates a constrained hierarchical clustering method that supports human-in-the-loop topic refinement through the system interface, where analysts can explore the relationships among topics through visual representations, verify the labels of individual posts, or update labels that are likely to be incorrect or uncertain. Our evaluation shows that refinement significantly improves topic coherence and supports exploration of the relationship between Anti-vaccine and Pro-vaccine topics.
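The human-in-the-loop step described above can be illustrated with a small sketch of how analyst-verified labels might be turned into pairwise clustering constraints. This is not the thesis's constrained hierarchical clustering method: the function names, the choice to derive only cannot-link constraints (on the assumption that posts with the same stance may still belong to different topics), and the sample data are all hypothetical.

```python
from itertools import combinations


def cannot_link_pairs(verified_labels):
    """Derive cannot-link pairs from analyst-verified stance labels.

    verified_labels: dict mapping post id -> stance label (e.g. "anti", "pro").
    Two posts with different verified stances should not share a topic.
    No must-link pairs are derived, because same-stance posts may
    legitimately fall into different topics (an assumption of this sketch).
    """
    pairs = set()
    for (a, label_a), (b, label_b) in combinations(sorted(verified_labels.items()), 2):
        if label_a != label_b:
            pairs.add((a, b))
    return pairs


def violated_pairs(topic_of, cannot_link):
    """Return the cannot-link pairs that the current topic assignment breaks.

    topic_of: dict mapping post id -> topic id from the current model.
    """
    return [(a, b) for a, b in sorted(cannot_link) if topic_of[a] == topic_of[b]]


# Hypothetical round of refinement: the analyst has verified three posts.
verified = {1: "anti", 2: "pro", 3: "anti"}
constraints = cannot_link_pairs(verified)

# The current topic model places posts 1 and 2 in the same topic.
current_topics = {1: 0, 2: 0, 3: 1}
to_fix = violated_pairs(current_topics, constraints)
```

In an iterative workflow, the violated pairs would indicate where the next run of a constraint-aware clustering algorithm needs to separate posts, and the analyst would then inspect the rebuilt topics and repeat.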
Table of Contents
Thesis Approval Form i
Thesis Public Release Authorization Form ii
Abstract (Chinese) iii
Abstract iv
List of Figures vii
List of Tables viii
1. Introduction 1
2. Background and Related Works 3
2.1 Interpretable Topic Modeling 3
2.2 Interactive System of Topic Exploration 5
2.3 Constraint-based Clustering 6
3. Methodology 8
3.1 Topic Modeling via BERTopic 8
3.2 Topic Refinement 11
3.3 Interactive System 14
3.3.1 Data and Algorithm Control 14
3.3.2 Overview Map 19
3.3.3 Topic Inspection 25
4. Results 32
4.1 Performance Comparison 32
4.2 C-HDBSCAN Hyperparameter Selection 37
5. Conclusion and Future Work 42
Reference 45
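Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available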


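Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. To check public-access information for printed theses from academic year 101 or earlier, please contact the library's printed thesis service desk. We apologize for any inconvenience.
Available: available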
