Master's/Doctoral Thesis etd-0731121-145257: Detailed Information
Title page for etd-0731121-145257
Title
A Comparative Study of Cross-Lingual Topic Models
Department
Year, semester
Language
Degree
Number of pages
51
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2021-07-22
Date of Submission
2021-08-31
Keywords
Topic modeling, Cross-lingual topic modeling, Word vector, Expectation-maximization algorithm, Auto-encoding variational Bayes, Dirichlet distribution, Gaussian distribution
Statistics
This thesis/dissertation has been viewed 422 times and downloaded 4 times.
Chinese Abstract
Whereas traditional topic models handle only one language, cross-lingual topic models can analyze texts in multiple languages at the same time, uncovering the latent topic distributions and each topic's keywords in the different languages. Most traditional cross-lingual topic models are trained with statistical methods and require parallel corpus resources, but as the Internet has grown, the analysis of large-scale, non-parallel text has become increasingly important. In recent years, because they do not require parallel corpora, approaches that convert words into vectors have been widely used in topic modeling; through the mapping into a shared space, we can capture word semantics and the relationships between words more precisely.
Among word-embedding-based cross-lingual topic models, we compare the center-based cross-lingual topic model (Chang et al., 2021), which uses statistical methods, with the embedded topic model (Dieng et al., 2020), which uses deep learning, and find that the prior distribution and the inference algorithm are their largest differences: Cb-CLTM uses the expectation-maximization algorithm with a Dirichlet distribution as the topic model's prior, whereas ETM uses auto-encoding variational Bayes (AEVB) as its inference algorithm with a Gaussian prior. Our experiments show that neither model is clearly better or worse than the other; with the deep-learning approach, however, we can analyze large numbers of cross-lingual documents more quickly.
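As a schematic summary of the contrast described above, the generative assumptions can be written as follows. The ETM side follows the notation of Dieng et al. (2020); the Cb-CLTM side is reduced to only what is stated here (a Dirichlet prior on the document-topic proportions, fitted with EM), and the symbol η for the Dirichlet parameter is an illustrative choice.

```latex
% Document-topic proportions \theta_d for document d, with K topics.
% Cb-CLTM (statistical, EM inference):
\theta_d \sim \mathrm{Dirichlet}(\eta)
% ETM (AEVB inference; logistic-normal via a Gaussian latent \delta_d):
\delta_d \sim \mathcal{N}(0, I), \qquad \theta_d = \mathrm{softmax}(\delta_d)
% ETM topic-word distribution from word embeddings \rho and topic embeddings \alpha_k:
\beta_k = \mathrm{softmax}(\rho^{\top} \alpha_k)
```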
Abstract
Cross-lingual topic modeling analyzes corpora across languages, uncovering latent topics and each topic's keywords in the different languages. Most traditional topic models are trained with statistical methods and require parallel corpora. However, as the Internet develops, the analysis of large-scale, non-parallel corpora is becoming essential. In recent years, word-embedding-based topic models, which do not require parallel corpora, have been widely used. By mapping words into a vector space, we can capture semantic regularities and the relationships among words more precisely.
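The mapping into a shared vector space usually starts from aligning monolingual embeddings across languages (the "Cross-Lingual Alignments" step listed in the table of contents). Below is a minimal sketch of one common alignment technique, the orthogonal (Procrustes) mapping of Xing et al. (2015) and Smith et al. (2017); the seed-dictionary matrices and all names here are illustrative assumptions, not the exact procedure used in this thesis.

```python
import numpy as np

def learn_orthogonal_mapping(X_src, Y_tgt):
    """Solve min_W ||X W - Y||_F subject to W orthogonal (Procrustes).

    X_src, Y_tgt: (n_pairs, dim) embeddings of translation pairs drawn
    from a small seed dictionary (an illustrative assumption here).
    """
    # The SVD of the cross-covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(X_src.T @ Y_tgt)
    return u @ vt  # (dim, dim) orthogonal map

# Toy example: random vectors stand in for real monolingual embeddings
# of a bilingual seed dictionary.
rng = np.random.default_rng(0)
dim, n_pairs = 50, 200
X = rng.normal(size=(n_pairs, dim))              # source-language vectors
true_rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
Y = X @ true_rot                                  # target-language vectors
W = learn_orthogonal_mapping(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))           # the rotation is recovered
```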
In this study, we compare two word-embedding-based topic models: the center-based cross-lingual topic model (Cb-CLTM; Chang et al., 2021) and the embedded topic model (ETM; Dieng et al., 2020). The main differences are that Cb-CLTM is inferred with the EM algorithm and uses a Dirichlet distribution as its prior, whereas ETM uses neural networks, with auto-encoding variational Bayes (AEVB) as its inference algorithm and a Gaussian prior. Our experiments show that the two models perform comparably; however, with neural networks we can analyze large-scale cross-lingual corpora more rapidly.
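To make the AEVB side concrete, the sketch below shows one reparameterized training step for an ETM-style model: an encoder maps a document's bag of words to a Gaussian over the latent delta, the topic proportions are softmax(delta), and the topic-word matrix comes from inner products of topic and word embeddings, as in Dieng et al. (2020). The layer sizes, variable names, and the use of PyTorch are assumptions for illustration, not the implementation evaluated in this thesis.

```python
import torch
import torch.nn.functional as F

vocab, n_topics, emb_dim, hidden = 2000, 20, 300, 256

# Word embeddings rho (could be pre-trained, cross-lingually aligned vectors)
# and topic embeddings alpha; both are random placeholders here.
rho = torch.nn.Parameter(torch.randn(vocab, emb_dim) * 0.01)
alpha = torch.nn.Parameter(torch.randn(n_topics, emb_dim) * 0.01)

# Encoder: bag-of-words counts -> Gaussian parameters of delta_d.
encoder = torch.nn.Sequential(torch.nn.Linear(vocab, hidden), torch.nn.Softplus())
to_mu = torch.nn.Linear(hidden, n_topics)
to_logvar = torch.nn.Linear(hidden, n_topics)

params = [rho, alpha] + list(encoder.parameters()) \
         + list(to_mu.parameters()) + list(to_logvar.parameters())
opt = torch.optim.Adam(params, lr=2e-3)

def elbo_step(bow):                       # bow: (batch, vocab) word counts
    h = encoder(bow)
    mu, logvar = to_mu(h), to_logvar(h)
    # Reparameterization trick: sample delta, then theta = softmax(delta).
    delta = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    theta = F.softmax(delta, dim=-1)                  # (batch, n_topics)
    beta = F.softmax(alpha @ rho.t(), dim=-1)         # (n_topics, vocab)
    log_lik = (bow * torch.log(theta @ beta + 1e-10)).sum(-1)
    # KL between N(mu, sigma^2) and the standard-normal prior on delta.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    loss = (kl - log_lik).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

fake_bow = torch.poisson(torch.full((32, vocab), 0.05))   # toy mini-batch
print(elbo_step(fake_bow))
```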
Table of Contents

Thesis Approval Form
Acknowledgements
Chinese Abstract
Abstract
CHAPTER 1 - Introduction
CHAPTER 2 - Related Works
2.1. Cross-lingual Topic Modeling
2.1.1. Document linking
2.1.2. Vocabulary linking
2.1.3. Mixed linking
2.2. Word Embeddings
2.3. Continuous topic model
CHAPTER 3 - Comparisons of Cb-CLTM and ETM
3.1. Variational Autoencoder (VAE)
3.2. Preparations, Cb-CLTM and ETM
3.2.1. Cross-Lingual Alignments
3.2.2. Center-based cross-lingual topic model (Cb-CLTM)
3.2.3. Embedded Topic Model (ETM)
3.3. Comparisons
3.3.1. Difference in Prior Distributions
3.3.2. Difference in Inference Algorithms
CHAPTER 4 - Experiments and Results
4.1. Dataset Description
4.2. Evaluation Metrics
4.3. Parameters
4.4. Coherence Performance
4.5. Topic Diversity
4.6. Quality in Document Representation
CHAPTER 5 - Conclusion
References
References
1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
2. Boyd-Graber, J. L., Hu, Y., & Mimno, D. (2017). Applications of topic models (Vol. 11). Now Publishers Incorporated.
3. Yang, W., Boyd-Graber, J., & Resnik, P. (2019, November). A multilingual topic model for learning weighted topic links across corpora with low comparability. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 1243-1248).
4. Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439-453.
5. Yuan, M., Van Durme, B., & Ying, J. L. (2018, January). Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages. In NeurIPS (pp. 8667-8677).
6. Heyman, G., Vulić, I., & Moens, M. F. (2016). C-BiLDA extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content. Data Mining and Knowledge Discovery, 30(5), 1299-1323.
7. Hu, M., & Liu, B. (2004, August). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 168-177).
8. Mimno, D., Wallach, H., Naradowsky, J., Smith, D. A., & McCallum, A. (2009, August). Polylingual topic models. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 880-889).
9. Chang, C. H., & Hwang, S. Y. (2021). A word embedding-based approach to cross-lingual topic modeling. Knowledge and Information Systems, 63(6), 1529-1555.
10. Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).
11. Jagarlamudi, J., & Daumé, H. (2010). Extracting Multilingual Topics from Unaligned Comparable Corpora. Advances in Information Retrieval, 444–456. Springer Berlin Heidelberg
12. Hao, S., & Paul, M. J. (2018b). Learning Multilingual Topics from Incomparable Corpora. Proceedings of the 27th International Conference on Computational Linguistics, 2595-2609.
13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
14. Harris, Z. S. (1954). Distributional structure. Word, 10(2-3), 146-162.
15. Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
16. Sahlgren, M. (2008). The distributional hypothesis. Italian Journal of Linguistics, 20(1), 33-53.
17. Landauer, T. K. (1984, January). Statistical semantics: Analysis of the potential performance of keyword information systems, and a cure for an ancient problem. Journal of Psycholinguistic Research, 13(6), 495-496.
18. Xun, G., Li, Y., Zhao, W. X., Gao, J., & Zhang, A. (2017, August). A correlated topic model using word embeddings. In IJCAI (pp. 4207-4213).
19. Das, R., Zaheer, M., & Dyer, C. (2015, July). Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 795-804).
20. Batmanghelich, K., Saeedi, A., Narasimhan, K., & Gershman, S. (2016, August). Nonparametric spherical topic modeling with word embeddings. In Proceedings of the conference. Association for Computational Linguistics. Meeting (Vol. 2016, p. 537). NIH Public Access.
21. Reisinger, J., Waters, A., Silverthorn, B., & Mooney, R. J. (2010, January). Spherical topic models. In ICML.
22. Srivastava, A., & Sutton, C. (2017). Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488.
23. Card, D., Tan, C., & Smith, N. A. (2017). A neural framework for generalized topic models. arXiv preprint arXiv:1705.09296.
24. Cong, Y., Chen, B., Liu, H., & Zhou, M. (2017, July). Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In International Conference on Machine Learning (pp. 864-873). PMLR.
25. Zhang, H., Chen, B., Guo, D., & Zhou, M. (2018). WHAI: Weibull hybrid autoencoding inference for deep topic modeling. arXiv preprint arXiv:1803.01328.
26. Mikolov, T., Le, Q. V., & Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
27. Smith, S. L., Turban, D. H., Hamblin, S., & Hammerla, N. Y. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
28. Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1006-1011.
29. Artetxe, M., Labaka, G., & Agirre, E. (2016). Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of EMNLP, pp. 2289–2294.
30. Zhang, Y., Gaddy, D., Barzilay, R., & Jaakkola, T. (2016b). Ten Pairs to Tag – Multilingual POS tagging via coarse mapping between embeddings. In Proceedings of NAACL-HLT, pp. 1307-1317.
31. Zhang, M., Liu, Y., Luan, H., & Sun, M. (2017, July). Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1959-1970).
32. Faruqui, M., & Dyer, C. (2014, April). Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 462-471).
33. Ruder, S., Vulić, I., & Søgaard, A. (2019). A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65, 569-631.
34. Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014, May). UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation. In LREC (pp. 1837-1842).
35. Lazaridou, A., Dinu, G., & Baroni, M. (2015, July). Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 270-280).
36. Klementiev, A., Titov, I., & Bhattarai, B. (2012, December). Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012 (pp. 1459-1474).
37. Lewis, D. D., Yang, Y., Russell-Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361-397.
38. Schwenk, H., & Li, X. (2018). A corpus for multilingual document classification in eight languages. arXiv preprint arXiv:1805.09821.
39. Conneau, A., Lample, G., Ranzato, M. A., Denoyer, L., & Jégou, H. (2017). Word translation without parallel data. arXiv preprint arXiv:1710.04087.
40. Bischof, J., & Airoldi, E. M. (2012). Summarizing topical content with word frequency and exclusivity. In Proceedings of the 29th International Conference on Machine Learning (ICML-12) (pp. 201-208).
41. Hao, S., Boyd-Graber, J., & Paul, M. J. (2018). Lessons from the Bible on modern topics: Low-resource multilingual topic model evaluation. arXiv preprint arXiv:1804.10184.
42. Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.
43. Aletras, N., & Stevenson, M. (2013, March). Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers (pp. 13-22).
44. Fuglede, B., & Topsoe, F. (2004, June). Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings (p. 31). IEEE.
45. Niwattanakul, S., Singthongchai, J., Naenudorn, E., & Wanapu, S. (2013, March). Using of Jaccard coefficient for keywords similarity. In Proceedings of the international multiconference of engineers and computer scientists (Vol. 1, No. 6, pp. 380-384).
46. Boyd-Graber, J., & Blei, D. (2012). Multilingual topic models for unaligned text. arXiv preprint arXiv:1205.2657.
47. Ma, T., & Nasukawa, T. (2016). Inverted bilingual topic models for lexicon extraction from non-parallel data. arXiv preprint arXiv:1612.07215.
48. Gutiérrez, E. D., Shutova, E., Lichtenstein, P., de Melo, G., & Gilardi, L. (2016). Detecting cross-cultural differences using a multilingual topic model. Transactions of the Association for Computational Linguistics, 4, 47-60.
49. Liu, X., Duh, K., & Matsumoto, Y. (2015). Multilingual topic models for bilingual dictionary extraction. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 14(3), 1-22.
50. Guo, J., Che, W., Yarowsky, D., Wang, H., & Liu, T. (2015, July). Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1234-1244).
51. Ono, M., Miwa, M., & Sasaki, Y. (2015). Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 984-989).
Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, in order to avoid infringement.
Thesis access permission: user-defined release date
Available:
Campus: available
Off-campus: available


Printed copies
Availability information for printed copies is relatively complete for theses from academic year 102 (ROC calendar) onward. To inquire about the availability of printed copies from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
