論文使用權限 Thesis access permission: 校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title: 基於大型語言模型迴路學習的主題建模 Topic Modeling with LLM-in-the-loop Learning
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 56
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2024-07-11
繳交日期 Date of Submission: 2024-08-28
關鍵字 Keywords: 主題演進、主題模型、主題擴散、矩陣分解、深度學習、自編碼器、大型語言模型 Topic Evolution, Topic Modeling, Topic Diffusion, Matrix Factorization, Deep Learning, Autoencoder, Large Language Model
統計 Statistics: 本論文已被瀏覽 167 次,被下載 8 次 The thesis/dissertation has been browsed 167 times and downloaded 8 times.
中文摘要 Chinese Abstract
With rapid technological progress, large amounts of content are produced every day and vast quantities of data circulate across various media, making it hard for people to retrieve the information they need within a limited time. This study proposes a framework for the rapid extraction of document topics that combines a topic model with a large language model (LLM). Taking electronic literature as an example, it uses a topic model based on neural-network matrix factorization to quickly gather the topics embedded in the documents, and uses the LLM to augment and revise the keyword dictionary that characterizes each topic, making the topic-model results more focused and precise. The experimental results confirm that a larger document corpus can be collected on this basis and that the retrained topic model contains a more diverse set of keywords, offering a feasible and practical alternative to the previous practice of asking domain experts to interpret and revise topic-model results.
Abstract
With ongoing technological advances, a large amount of content is produced every day and vast quantities of data are transmitted through various media, making it difficult for people to retrieve the information they need within a limited time. This study proposes a framework for the rapid extraction of document topics that combines a topic model with a large language model (LLM). Using electronic documents as an example, we employ a topic model that performs matrix factorization with a neural network to quickly identify the topics embedded in the documents, and we use the LLM to augment and revise the keyword dictionary that characterizes each topic, making the topic-model results more focused and precise. The experimental results show that a larger document corpus can be collected accordingly and that the retrained topic model contains a more diverse set of keywords. This offers a feasible and practical alternative to the previous need for experts and scholars to interpret and revise topic-model results.
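The abstract describes the topic model as matrix factorization performed by a neural network. Below is a minimal sketch of that idea, assuming a non-negative autoencoder over a TF-IDF document-term matrix; the random matrix X, the layer names, and all hyperparameters are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch: topic modeling as matrix factorization with a neural
# network. Given a document-term matrix X (docs x vocab), the encoder
# output plays the role of the document-topic matrix and the decoder
# weights play the role of the topic-term matrix, so X ~ H @ W, a
# neural analogue of non-negative matrix factorization.
import numpy as np
from tensorflow import keras

n_docs, n_terms, n_topics = 1000, 5000, 20
X = np.random.rand(n_docs, n_terms).astype("float32")  # stand-in for TF-IDF

inputs = keras.Input(shape=(n_terms,))
# ReLU keeps the document-topic activations non-negative.
h = keras.layers.Dense(n_topics, activation="relu", name="doc_topic")(inputs)
# NonNeg constrains the topic-term weights, mirroring NMF's W >= 0.
outputs = keras.layers.Dense(
    n_terms, use_bias=False,
    kernel_constraint=keras.constraints.NonNeg(), name="topic_term",
)(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=64, verbose=0)

# Top keywords per topic come from the largest decoder weights.
topic_term = autoencoder.get_layer("topic_term").kernel.numpy()  # (n_topics, n_terms)
top_idx = np.argsort(-topic_term, axis=1)[:, :10]
```

Under this reading, taking the largest decoder weights per topic yields the keyword lists that the LLM step then refines.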
目次 Table of Contents
論文審定書 Thesis Certification i
誌謝 Acknowledgements ii
摘要 Abstract (Chinese) iii
Abstract iv
目錄 Table of Contents v
圖次 List of Figures vii
表次 List of Tables viii
第一章 緒論 Chapter 1 Introduction 1
1.1 研究背景 Research Background 1
1.2 研究動機 Research Motivation 1
1.3 研究目的 Research Objectives 2
第二章 文獻探討 Chapter 2 Literature Review 3
2.1 主題模型 Topic Model 3
2.2 大型語言模型迴路機器學習 Large Language Model-in-the-Loop Machine Learning 5
2.3 大型語言模型 Large Language Model 6
第三章 研究方法與步驟 Chapter 3 Research Methods and Procedures 7
3.1 研究方法 Research Method 7
3.1.1 LLM Dictionary and Text Corpus Updater 8
3.2 評估標準 Evaluation Criteria 9
3.2.1 評估詞彙擴散程度 Evaluating the Degree of Term Diffusion 9
3.2.2 可視化主題相關性 Visualizing Topic Relationships 11
3.3 研究分析 Research Analysis 12
第四章 實驗結果與討論分析 Chapter 4 Experimental Results and Discussion 13
4.1 資料整理 Data Preparation 13
4.2 研究過程 Research Process 14
4.2.1 Visualization of Topic Relationship and Evolution 14
4.2.2 Term Evolution 16
4.3 研究結果分析 Analysis of Results 17
第五章 研究結論與建議 Chapter 5 Conclusions and Suggestions 18
5.1 研究結論 Conclusions 18
5.2 研究限制 Research Limitations 18
第六章 參考文獻 Chapter 6 References 20
附錄 A Appendix A 23
附錄 B Appendix B 27
附錄 C Appendix C 37
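Section 3.1.1 of the outline names an "LLM Dictionary and Text Corpus Updater". Below is a minimal sketch of the kind of loop that name suggests, assuming a hypothetical ask_llm() helper; the actual prompts, model, and response parsing used in the thesis are not specified here.

```python
# Minimal sketch of an LLM-in-the-loop keyword-dictionary updater:
# for each topic, ask an LLM to suggest related keywords and merge
# them into the topic's dictionary. ask_llm() is a hypothetical
# stand-in for whatever LLM interface is actually used.
from typing import Callable

def update_topic_dictionary(
    topic_keywords: dict[str, list[str]],
    ask_llm: Callable[[str], str],
    max_new_terms: int = 5,
) -> dict[str, list[str]]:
    """Ask an LLM to augment and revise each topic's keyword dictionary."""
    updated = {}
    for topic, keywords in topic_keywords.items():
        prompt = (
            f"The topic '{topic}' is described by these keywords: "
            f"{', '.join(keywords)}. Suggest up to {max_new_terms} additional, "
            "closely related keywords as a comma-separated list."
        )
        reply = ask_llm(prompt)  # hypothetical LLM call
        new_terms = [t.strip() for t in reply.split(",") if t.strip()]
        # Merge while preserving order and dropping duplicates.
        updated[topic] = list(dict.fromkeys(keywords + new_terms))
    return updated

# Example with a stub in place of a real LLM:
# update_topic_dictionary({"topic 1": ["topic", "model"]},
#                         lambda p: "latent dirichlet allocation, nmf")
```

The expanded dictionaries would then be used to retrieve a larger matching corpus and retrain the topic model, closing the LLM-in-the-loop cycle described in the abstract.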
電子全文 Fulltext
This electronic full text is licensed solely for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
紙本論文 Printed copies
Public access information for printed theses is relatively complete from the 102 academic year (2013-14) onward. To inquire about printed theses from the 101 academic year or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available