Thesis/Dissertation etd-0723123-163333: Detailed Information
On Building Language-agnostic Topic Model using Large Language Model
cross-lingual topic model, large language model, topic model, singular value decomposition, sentence-embedding
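To address the language interference present in Cohere sentence embeddings, we propose two main methods. The first applies singular value decomposition (SVD) to the LLM embeddings and uses the resulting orthogonal matrix U as the input to the topic model. The second multiplies the orthogonal matrix U element-wise by Σ, uses a t-test to find the dimension with the highest t-value between Chinese and English, and removes that dimension. This effectively eliminates the dimension that carries the most language interference.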

This research delves into language-agnostic topic modeling using large language models (LLMs) and explores limitations in cross-lingual topic modeling. The study concentrates on utilizing multilingual sentence embeddings from the LLM "Cohere" in clustering-based topic models. While LLM embeddings excel in zero-shot classification, their performance lags in topic modeling.

To mitigate the language dependence of the Cohere embeddings, two approaches are proposed. The first applies singular value decomposition (SVD) to the LLM embeddings and uses the orthogonal matrix U as the input to the topic model. The second takes the element-wise product of U and the singular values Σ, then removes the most language-dependent dimension, identified by a t-test as the dimension with the highest t-value between English and Chinese documents.
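The two approaches can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names (`svd_unscaled`, `welch_t`, `remove_language_dim`), the use of a Welch t-statistic, ranking by absolute t-value, and the synthetic data standing in for Cohere embeddings are all hypothetical choices made for the example.

```python
import numpy as np

def svd_unscaled(embeddings):
    """Thin SVD of a (documents x dims) matrix; returns U and the singular values."""
    u, s, _vt = np.linalg.svd(embeddings, full_matrices=False)
    return u, s

def welch_t(a, b):
    """Column-wise Welch t-statistic between two groups of rows (assumed test variant)."""
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return mean_diff / se

def remove_language_dim(embeddings, is_english, n_remove=1):
    """Scale U by the singular values, then drop the most language-dependent columns."""
    u, s = svd_unscaled(embeddings)
    scaled = u * s  # broadcasts: column j of U is multiplied by singular value j
    t_abs = np.abs(welch_t(scaled[is_english], scaled[~is_english]))
    drop = np.argsort(t_abs)[-n_remove:]  # indices with the largest |t|
    keep = np.setdiff1d(np.arange(scaled.shape[1]), drop)
    return scaled[:, keep], drop

# Synthetic stand-in for embeddings: 200 documents, 16 dimensions,
# with an artificial "language" offset added to the English half.
rng = np.random.default_rng(0)
n_docs, n_dims = 200, 16
is_english = np.arange(n_docs) < n_docs // 2
x = rng.normal(size=(n_docs, n_dims))
x[is_english, 3] += 5.0

u, s = svd_unscaled(x)                               # first approach: feed U to the topic model
reduced, dropped = remove_language_dim(x, is_english)  # second approach
print(u.shape, reduced.shape)
```

In this sketch the first approach would pass `u` to the clustering step, while the second would pass `reduced`. Whether the embeddings are centered beforehand and how many dimensions are removed are left as open parameters here.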

Experimental results reveal these methods enhance the topic model's performance, generating coherent topic words across languages. These approaches can yield more insightful outcomes for cross-lingual topic modeling tasks.
Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (in Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Related Works
2.1 Cross-lingual Topic Model
2.2 Cluster-based Topic Model
2.3 Large Language Model in Cross-lingual Task
3 Methodology
3.1 Cluster-based Topic Model Approach
3.2 Cross-lingual Cluster-based Topic Model
3.2.1 Encode Documents Using LLM
3.2.2 Eliminate Language-Dependent Dimension in the LLM Embedding
4 Experiments
4.1 Dataset
4.2 Experiment Task and Metric
4.2.1 Topic Model
4.2.2 Classification Task
4.3 Results and Analysis
4.3.1 Baseline and Cohere
4.3.2 Cohere SVD Unscale
4.3.3 Cohere Dimension Reduction
4.3.4 Cohere SVD Scale
4.3.5 Cohere SVD Scale LanguageDim Removal
5 Discussion
5.1 Topic Quality
5.2 Cross Comparison of the Purity/Reflexibility and Language
5.3 Classification Task of the topic distribution θ
6 Conclusion
7 References
8 Appendix
8.1 Robust Study for paraphrase-xlm-r-multilingual-v1
References
