Master's/Doctoral Thesis etd-0723123-163333: Detailed Record
Title
以大型語言模型建立語言無涉之主題模型
On Building Language-agnostic Topic Model using Large Language Model
Department
Year, semester
Language
Degree
Number of pages
71
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2023-07-07
Date of Submission
2023-08-23
Keywords
cross-lingual topic model, large language model, topic model, singular value decomposition, sentence embedding
Statistics
This thesis has been viewed 352 times and downloaded 0 times.
Abstract (translated from Chinese)
Cross-lingual topic modeling is an important subfield of topic model research. This thesis explores methods for building a language-agnostic topic model with large language models (LLMs) and investigates the limitations of cross-lingual topic modeling. Its main focus is the application of the multilingual sentence embeddings provided by the LLM service "Cohere" in clustering-based topic models. Experimental analysis shows that these multilingual sentence embeddings perform very well in zero-shot classification tasks but poorly in topic modeling.

To address the language interference present in the Cohere sentence embeddings, we propose two main approaches. The first applies singular value decomposition (SVD) to the LLM embeddings and uses the resulting orthogonal matrix U as the topic model's input. The second multiplies the orthogonal matrix U element-wise by the singular values Σ, then uses a t-test to find the dimension with the highest t-value between Chinese and English and removes that dimension. This effectively eliminates the most language-interfering dimension.

Experimental results show that both approaches significantly improve the topic model's performance, enabling it to generate more coherent cross-lingual topic words. By applying the methods proposed in this study, researchers can obtain more meaningful results in downstream tasks and analyses built on cross-lingual topic modeling.
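The following is a minimal sketch of the two de-biasing approaches described above, not the thesis's actual code: it assumes `emb` is an (n_docs × dim) NumPy array of Cohere multilingual sentence embeddings and `lang` is a per-document language-label array; the function names and the use of scipy.stats.ttest_ind are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def svd_unscaled(emb):
    """Approach 1: decompose the embeddings with SVD and use the
    orthogonal matrix U as the topic model's input."""
    U, S, Vt = np.linalg.svd(emb, full_matrices=False)
    return U

def svd_scaled_language_dim_removal(emb, lang):
    """Approach 2: scale U element-wise by the singular values, then
    drop the dimension whose English-vs-Chinese t-value is highest
    (the most language-dependent dimension)."""
    U, S, Vt = np.linalg.svd(emb, full_matrices=False)
    scaled = U * S  # each column of U scaled by its singular value
    t_vals, _ = stats.ttest_ind(scaled[lang == "en"],
                                scaled[lang == "zh"], axis=0)
    # The abstract says "highest t-value"; taking |t| here makes the
    # choice independent of the group order (an assumption).
    worst = np.argmax(np.abs(t_vals))
    return np.delete(scaled, worst, axis=1)
```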
Abstract (English)
This research delves into language-agnostic topic modeling using large language models (LLMs) and explores limitations in cross-lingual topic modeling. The study concentrates on utilizing multilingual sentence embeddings from the LLM "Cohere" in clustering-based topic models. While LLM embeddings excel in zero-shot classification, their performance lags in topic modeling.

To mitigate Cohere's language dependence, two approaches are proposed. The first employs singular value decomposition (SVD) on the LLM embeddings and uses the orthogonal matrix U for topic modeling. The second scales the matrix U element-wise by the singular values Σ and then, following the SVD, eliminates the most language-dependent dimension, namely the one with the highest t-value between English and Chinese.

Experimental results reveal that these methods enhance the topic model's performance, generating coherent topic words across languages. These approaches can yield more insightful outcomes for cross-lingual topic modeling tasks.
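A hedged usage sketch follows, feeding the de-biased representation into a clustering-based topic model. BERTopic stands in here for "clustering-based topic model" and the Cohere model name "embed-multilingual-v2.0" is an assumption about the version used; `docs` (a bilingual document list) and `lang` are assumed to be defined, and none of this is the thesis's actual configuration.

```python
import numpy as np
import cohere
from bertopic import BERTopic

# Fetch multilingual sentence embeddings from Cohere's API.
co = cohere.Client("YOUR_API_KEY")
emb = np.array(co.embed(texts=docs, model="embed-multilingual-v2.0").embeddings)

# De-bias with the t-test dimension removal from the sketch above,
# then fit a clustering-based topic model on the precomputed embeddings.
reduced = svd_scaled_language_dim_removal(emb, lang)
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs, embeddings=reduced)
print(topic_model.get_topic(0))  # top words for topic 0, across languages
```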
Table of Contents
Thesis Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
Table of Contents v
List of Figures viii
List of Tables ix
1 Introduction 1
2 Related Works 8
2.1 Cross-lingual Topic Model 8
2.2 Cluster-based Topic Model 9
2.3 Large Language Model in Cross-lingual Task 10
3 Methodology 11
3.1 Cluster-based Topic Model Approach 11
3.2 Cross-lingual Cluster-based Topic Model 14
3.2.1 Encode Documents Using LLM 14
3.2.2 Eliminate Language-Dependent Dimension in the LLM Embedding 15
4 Experiments 25
4.1 Dataset 25
4.2 Experiment Task and Metric 26
4.2.1 Topic Model 26
4.2.2 Classification Task 31
4.3 Results and Analysis 33
4.3.1 Baseline and Cohere 33
4.3.2 Cohere SVD Unscale 37
4.3.3 Cohere Dimension Reduction 39
4.3.4 Cohere SVD Scale 40
4.3.5 Cohere SVD Scale LanguageDim Removal 41
5 Discussion 42
5.1 Topic Quality 42
5.2 Cross Comparison of the Purity/Reflexibility and Language 44
5.3 Classification Task of the topic distribution θ 46
6 Conclusion 48
7 References 50
8 Appendix 59
8.1 Robust Study for paraphrase-xlm-r-multilingual-v1 59
Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: embargo period defined by the author (user define)
Available:
On campus: available for download from 2025-08-23
Off campus: available for download from 2025-08-23


Printed copies
Public-access information for printed theses is relatively complete for academic year 102 (2013–14) and later. To inquire about printed theses from academic year 101 (2012–13) or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: 2025-08-23
