Master's/Doctoral Thesis etd-0723123-163333: Detailed Record
Title
以大型語言模型建立語言無涉之主題模型
On Building Language-agnostic Topic Model using Large Language Model
Department
Year, semester
Language
Degree
Number of pages
71
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2023-07-07
Date of Submission
2023-08-23
Keywords
cross-lingual topic model, large language model, topic model, singular value decomposition, sentence embedding
Statistics
This thesis has been viewed 352 times and downloaded 0 times.
Abstract (translated from Chinese)
Cross-lingual topic modeling is an important subfield of topic model research. This thesis explores methods for building a language-agnostic topic model with large language models (LLMs) and investigates the limitations of cross-lingual topic modeling. Its main focus is the application of the multilingual sentence embeddings provided by the LLM service "Cohere" in clustering-based topic models. Experimental analysis shows that these multilingual sentence embeddings perform very well in zero-shot classification tasks but poorly in topic modeling.

To address the language interference present in the Cohere sentence embeddings, we propose two main approaches. The first applies singular value decomposition (SVD) to the LLM embeddings and uses the resulting orthogonal matrix U as the topic model's input. The second multiplies the orthogonal matrix U element-wise by the singular values Σ, then uses a t-test to find the dimension with the highest t-value between Chinese and English and removes that dimension. This effectively eliminates the most language-interfering dimension.

Experimental results show that both approaches significantly improve the topic model's performance, enabling it to generate more coherent cross-lingual topic words. By applying the methods proposed in this study, researchers can obtain more meaningful results in downstream tasks and analyses built on cross-lingual topic modeling.
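The following is a minimal sketch of the two de-biasing approaches described above, not the thesis's actual code: it assumes `emb` is an (n_docs × dim) NumPy array of Cohere multilingual sentence embeddings and `lang` is a per-document language-label array; the function names and the use of scipy.stats.ttest_ind are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def svd_unscaled(emb):
    """Approach 1: decompose the embeddings with SVD and use the
    orthogonal matrix U as the topic model's input."""
    U, S, Vt = np.linalg.svd(emb, full_matrices=False)
    return U

def svd_scaled_language_dim_removal(emb, lang):
    """Approach 2: scale U element-wise by the singular values, then
    drop the dimension whose English-vs-Chinese t-value is highest
    (the most language-dependent dimension)."""
    U, S, Vt = np.linalg.svd(emb, full_matrices=False)
    scaled = U * S  # each column of U scaled by its singular value
    t_vals, _ = stats.ttest_ind(scaled[lang == "en"],
                                scaled[lang == "zh"], axis=0)
    # The abstract says "highest t-value"; taking |t| here makes the
    # choice independent of the group order (an assumption).
    worst = np.argmax(np.abs(t_vals))
    return np.delete(scaled, worst, axis=1)
```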
Abstract (English)
This research delves into language-agnostic topic modeling using large language models (LLMs) and explores limitations in cross-lingual topic modeling. The study concentrates on utilizing multilingual sentence embeddings from the LLM "Cohere" in clustering-based topic models. While LLM embeddings excel in zero-shot classification, their performance lags in topic modeling.

To mitigate Cohere's language dependence, two approaches are proposed. The first employs singular value decomposition (SVD) on the LLM embeddings and uses the orthogonal matrix U for topic modeling. The second scales the matrix U element-wise by the singular values Σ and then, following the SVD, eliminates the most language-dependent dimension, namely the one with the highest t-value between English and Chinese.

Experimental results reveal that these methods enhance the topic model's performance, generating coherent topic words across languages. These approaches can yield more insightful outcomes for cross-lingual topic modeling tasks.
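A hedged usage sketch follows, feeding the de-biased representation into a clustering-based topic model. BERTopic stands in here for "clustering-based topic model" and the Cohere model name "embed-multilingual-v2.0" is an assumption about the version used; `docs` (a bilingual document list) and `lang` are assumed to be defined, and none of this is the thesis's actual configuration.

```python
import numpy as np
import cohere
from bertopic import BERTopic

# Fetch multilingual sentence embeddings from Cohere's API.
co = cohere.Client("YOUR_API_KEY")
emb = np.array(co.embed(texts=docs, model="embed-multilingual-v2.0").embeddings)

# De-bias with the t-test dimension removal from the sketch above,
# then fit a clustering-based topic model on the precomputed embeddings.
reduced = svd_scaled_language_dim_removal(emb, lang)
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs, embeddings=reduced)
print(topic_model.get_topic(0))  # top words for topic 0, across languages
```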
Table of Contents
Thesis Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
Table of Contents v
List of Figures viii
List of Tables ix
1 Introduction 1
2 Related Works 8
2.1 Cross-lingual Topic Model 8
2.2 Cluster-based Topic Model 9
2.3 Large Language Model in Cross-lingual Task 10
3 Methodology 11
3.1 Cluster-based Topic Model Approach 11
3.2 Cross-lingual Cluster-based Topic Model 14
3.2.1 Encode Documents Using LLM 14
3.2.2 Eliminate Language-Dependent Dimension in the LLM Embedding 15
4 Experiments 25
4.1 Dataset 25
4.2 Experiment Task and Metric 26
4.2.1 Topic Model 26
4.2.2 Classification Task 31
4.3 Results and Analysis 33
4.3.1 Baseline and Cohere 33
4.3.2 Cohere SVD Unscale 37
4.3.3 Cohere Dimension Reduction 39
4.3.4 Cohere SVD Scale 40
4.3.5 Cohere SVD Scale LanguageDim Removal 41
5 Discussion 42
5.1 Topic Quality 42
5.2 Cross Comparison of the Purity/Reflexibility and Language 44
5.3 Classification Task of the topic distribution θ 46
6 Conclusion 48
7 References 50
8 Appendix 59
8.1 Robust Study for paraphrase-xlm-r-multilingual-v1 59
Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: embargo period defined by the author (user define)
Available:
On campus: available for download from 2025-08-23
Off campus: available for download from 2025-08-23


Printed copies
Public-access information for printed theses is relatively complete for academic year 102 (2013–14) and later. To inquire about printed theses from academic year 101 (2012–13) or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: 2025-08-23
