論文使用權限 Thesis access permission: 校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title: 基於大型語言模型迴路學習的主題建模 Topic Modeling with LLM-in-the-loop Learning
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 56
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2024-07-11
繳交日期 Date of Submission: 2024-08-28
關鍵字 Keywords: 主題演進、主題模型、主題擴散、矩陣分解、深度學習、自編碼器、大型語言模型 Topic Evolution, Topic Modeling, Topic Diffusion, Matrix Factorization, Deep Learning, Autoencoder, Large Language Model
統計 Statistics: 本論文已被瀏覽 167 次,被下載 8 次 The thesis/dissertation has been browsed 167 times and downloaded 8 times.
中文摘要 Chinese Abstract
With rapid technological progress, large amounts of content are produced every day and vast quantities of data circulate across various media, making it hard for people to retrieve the information they need within a limited time. This study proposes a framework for the rapid extraction of document topics that combines a topic model with a large language model (LLM). Taking electronic literature as an example, it uses a topic model based on neural-network matrix factorization to quickly gather the topics embedded in the documents, and uses the LLM to augment and revise the keyword dictionary that characterizes each topic, making the topic-model results more focused and precise. The experimental results confirm that a larger document corpus can be collected on this basis and that the retrained topic model contains a more diverse set of keywords, offering a feasible and practical alternative to the previous practice of asking domain experts to interpret and revise topic-model results.
Abstract
With ongoing technological advances, a large amount of content is produced every day and vast quantities of data are transmitted through various media, making it difficult for people to retrieve the information they need within a limited time. This study proposes a framework for the rapid extraction of document topics that combines a topic model with a large language model (LLM). Using electronic documents as an example, we employ a topic model that performs matrix factorization with a neural network to quickly identify the topics embedded in the documents, and we use the LLM to augment and revise the keyword dictionary that characterizes each topic, making the topic-model results more focused and precise. The experimental results show that a larger document corpus can be collected accordingly and that the retrained topic model contains a more diverse set of keywords. This offers a feasible and practical alternative to the previous need for experts and scholars to interpret and revise topic-model results.
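The abstract describes the topic model as matrix factorization performed by a neural network. Below is a minimal sketch of that idea, assuming a non-negative autoencoder over a TF-IDF document-term matrix; the random matrix X, the layer names, and all hyperparameters are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch: topic modeling as matrix factorization with a neural
# network. Given a document-term matrix X (docs x vocab), the encoder
# output plays the role of the document-topic matrix and the decoder
# weights play the role of the topic-term matrix, so X ~ H @ W, a
# neural analogue of non-negative matrix factorization.
import numpy as np
from tensorflow import keras

n_docs, n_terms, n_topics = 1000, 5000, 20
X = np.random.rand(n_docs, n_terms).astype("float32")  # stand-in for TF-IDF

inputs = keras.Input(shape=(n_terms,))
# ReLU keeps the document-topic activations non-negative.
h = keras.layers.Dense(n_topics, activation="relu", name="doc_topic")(inputs)
# NonNeg constrains the topic-term weights, mirroring NMF's W >= 0.
outputs = keras.layers.Dense(
    n_terms, use_bias=False,
    kernel_constraint=keras.constraints.NonNeg(), name="topic_term",
)(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=64, verbose=0)

# Top keywords per topic come from the largest decoder weights.
topic_term = autoencoder.get_layer("topic_term").kernel.numpy()  # (n_topics, n_terms)
top_idx = np.argsort(-topic_term, axis=1)[:, :10]
```

Under this reading, taking the largest decoder weights per topic yields the keyword lists that the LLM step then refines.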
目次 Table of Contents
論文審定書 Thesis Certification i
誌謝 Acknowledgements ii
摘要 Abstract (Chinese) iii
Abstract iv
目錄 Table of Contents v
圖次 List of Figures vii
表次 List of Tables viii
第一章 緒論 Chapter 1 Introduction 1
1.1 研究背景 Research Background 1
1.2 研究動機 Research Motivation 1
1.3 研究目的 Research Objectives 2
第二章 文獻探討 Chapter 2 Literature Review 3
2.1 主題模型 Topic Model 3
2.2 大型語言模型迴路機器學習 Large Language Model-in-the-Loop Machine Learning 5
2.3 大型語言模型 Large Language Model 6
第三章 研究方法與步驟 Chapter 3 Research Methods and Procedures 7
3.1 研究方法 Research Method 7
3.1.1 LLM Dictionary and Text Corpus Updater 8
3.2 評估標準 Evaluation Criteria 9
3.2.1 評估詞彙擴散程度 Evaluating the Degree of Term Diffusion 9
3.2.2 可視化主題相關性 Visualizing Topic Relationships 11
3.3 研究分析 Research Analysis 12
第四章 實驗結果與討論分析 Chapter 4 Experimental Results and Discussion 13
4.1 資料整理 Data Preparation 13
4.2 研究過程 Research Process 14
4.2.1 Visualization of Topic Relationship and Evolution 14
4.2.2 Term Evolution 16
4.3 研究結果分析 Analysis of Results 17
第五章 研究結論與建議 Chapter 5 Conclusions and Suggestions 18
5.1 研究結論 Conclusions 18
5.2 研究限制 Research Limitations 18
第六章 參考文獻 Chapter 6 References 20
附錄 A Appendix A 23
附錄 B Appendix B 27
附錄 C Appendix C 37
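Section 3.1.1 of the outline names an "LLM Dictionary and Text Corpus Updater". Below is a minimal sketch of the kind of loop that name suggests, assuming a hypothetical ask_llm() helper; the actual prompts, model, and response parsing used in the thesis are not specified here.

```python
# Minimal sketch of an LLM-in-the-loop keyword-dictionary updater:
# for each topic, ask an LLM to suggest related keywords and merge
# them into the topic's dictionary. ask_llm() is a hypothetical
# stand-in for whatever LLM interface is actually used.
from typing import Callable

def update_topic_dictionary(
    topic_keywords: dict[str, list[str]],
    ask_llm: Callable[[str], str],
    max_new_terms: int = 5,
) -> dict[str, list[str]]:
    """Ask an LLM to augment and revise each topic's keyword dictionary."""
    updated = {}
    for topic, keywords in topic_keywords.items():
        prompt = (
            f"The topic '{topic}' is described by these keywords: "
            f"{', '.join(keywords)}. Suggest up to {max_new_terms} additional, "
            "closely related keywords as a comma-separated list."
        )
        reply = ask_llm(prompt)  # hypothetical LLM call
        new_terms = [t.strip() for t in reply.split(",") if t.strip()]
        # Merge while preserving order and dropping duplicates.
        updated[topic] = list(dict.fromkeys(keywords + new_terms))
    return updated

# Example with a stub in place of a real LLM:
# update_topic_dictionary({"topic 1": ["topic", "model"]},
#                         lambda p: "latent dirichlet allocation, nmf")
```

The expanded dictionaries would then be used to retrieve a larger matching corpus and retrain the topic model, closing the LLM-in-the-loop cycle described in the abstract.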
電子全文 Fulltext
This electronic full text is licensed solely for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
紙本論文 Printed copies
Public access information for printed theses is relatively complete from the 102 academic year (2013-14) onward. To inquire about printed theses from the 101 academic year or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available