Research on the Design of a Cross-Lingual Guided Topic Model
Cross-lingual Word Vector Space, Word Space Mapping, Semi-supervised Topic Model, Guided Topic Model, Cross-lingual Topic Model, Seed Words
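Cross-lingual topic models help users understand the topics and information covered by a multilingual corpus without reading large numbers of documents. However, unsupervised topic models tend to produce topics that are unsuitable for practical applications, and users cannot adjust the model's output to fit their purpose. Incorporating user preferences to turn the topic model into a semi-supervised one is an effective way to address this problem. Existing semi-supervised topic models, however, are monolingual, and the few interactive cross-lingual topic models rely on complete dictionary data; when dictionary entries are too scarce, they lose their cross-lingual character and revert to monolingual topic models. Cb-CLTM (Chang et al., 2021) is an unsupervised probabilistic model that incorporates word embeddings; by exploiting a cross-lingual word vector space, it can generate cross-lingual topics with fewer dictionary resources. This study extends Cb-CLTM into seeded Cb-CLTM, which produces topics according to user preferences: it uses the topic-related words (seed words) supplied by users to build a seed topic center for each topic in the cross-lingual space, guiding the model toward topics related to the seed topics. Our experiments show that, once seed information is added, the model produces more interpretable topics.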
Cross-lingual topic models help users understand large multilingual corpora without reading every document, yet the unsupervised nature of topic models often results in low-quality topics that are impractical for end-user applications. Incorporating human interaction is a promising way to tackle this problem. However, most existing interactive topic models are monolingual, and interactive cross-lingual topic models may degenerate into monolingual topics when dictionary entries are scarce. In this work, we extend Cb-CLTM (Chang et al., 2021), a probabilistic multilingual topic model that integrates word embeddings, to seeded Cb-CLTM. By exploiting a cross-lingual word space, Cb-CLTM can produce cross-lingual topics with fewer dictionary resources. We extend Cb-CLTM to learn topics that reflect users' preferences, expressed as groups of seed words. The word vectors of each seed-word group are averaged to construct a seed topic center in the cross-lingual space, which encourages the model to generate topics related to the seed topics. Extrinsic evaluations were conducted to explore the feasibility of adding interactivity to Cb-CLTM. The experimental results show that the model produces more interpretable topics when seed information is used.
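The construction of seed topic centers described above can be illustrated with a minimal sketch. It is not the Cb-CLTM inference procedure; it only shows, under stated assumptions, how averaging seed-word vectors in a shared cross-lingual space yields one center per seed topic, and how a word can then be compared against those centers. The function names, the toy two-dimensional vectors, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def seed_topic_centers(embeddings, seed_groups):
    """Average the vectors of each seed-word group.

    Seed words missing from the embedding space are skipped; a group
    with no known words produces no center.
    """
    centers = {}
    for topic, words in seed_groups.items():
        vectors = [embeddings[w] for w in words if w in embeddings]
        if vectors:
            centers[topic] = mean_vector(vectors)
    return centers


def nearest_topic(word, embeddings, centers):
    """Return the seed topic whose center is most similar to the word."""
    return max(centers, key=lambda t: cosine(embeddings[word], centers[t]))


# Hypothetical toy vectors standing in for an aligned English-Chinese space.
embeddings = {
    "economy": [0.9, 0.1],
    "經濟": [0.8, 0.2],
    "market": [0.85, 0.05],
    "sports": [0.1, 0.9],
    "運動": [0.2, 0.8],
    "比賽": [0.15, 0.95],
}

# Each seed group mixes words from both languages (an illustrative choice).
seed_groups = {
    "finance": ["economy", "經濟"],
    "sports": ["sports", "運動"],
}

centers = seed_topic_centers(embeddings, seed_groups)
print(nearest_topic("market", embeddings, centers))
print(nearest_topic("比賽", embeddings, centers))
```

Because the centers live in the shared space, a seed group given in one language can still attract words from the other language, provided the two embedding spaces are well aligned.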
Table of Contents
Thesis Approval Certificate
Acknowledgements
Abstract (Chinese)
Abstract
CHAPTER 1 - Introduction
CHAPTER 2 - Related Work
2.1 Guided topic model
2.2 Cross lingual Topic Model
2.3 Guided cross-lingual topic model
CHAPTER 3 - Methodology
3.1 Creating Word Embedding For Monolingual Corpus
3.2 Creating Cross-Lingual Word Vector Space
3.3 Background
3.4 Seeded Center-Based Cross-lingual Topic Model (Seeded Cb-CLTM)
3.5 Seed Words Selection
CHAPTER 4 - Experiment
4.1 Description of Dataset
4.2 Experimental Setup
4.2.1 Creating Cross-lingual word embedding space
4.2.2 Parameter settings of models
4.3 Evaluation metrics
4.3.1 Topic Coherence
4.3.2 Topic Diversity
4.3.3 Alignment quality of Cross-lingual document-topic distribution
4.3.4 Evaluation of clustering quality
4.4 Experimental results and discussion
4.4.1 Topic Coherence Performance
4.4.2 Topic Diversity Performance
4.4.3 Performance of Cross-lingual document-topic distribution
4.4.4 Purity Performance
4.4.5 Training efficiency
4.4.6 Language bias
4.4.7 Availability of monolingual seed words
CHAPTER 5 - Conclusion
Reference
Appendix
