Title page for etd-0809121-204802 (master's/doctoral thesis detailed record)
Title
跨語言引導式主題模型設計之研究
The Research on the Design of a Cross Lingual Guided Topic Model
Department
Year, semester
Language
Degree
Number of pages
58
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2021-07-22
Date of Submission
2021-09-09
Keywords
Cross-lingual Word Vector Space, Word Space Mapping, Semi-supervised Topic Model, Guided Topic Model, Cross-lingual Topic Model, Seed Words
Statistics
This thesis/dissertation has been browsed 500 times and downloaded 7 times.
Abstract (translated from Chinese)
Cross-lingual topic models help users grasp the topics and information covered by a corpus of documents in different languages without reading large numbers of documents. However, unsupervised topic models tend to produce topics that are unsuitable for practical applications, and users have no way to adjust the model's output to fit their purpose. Incorporating user preferences to turn topic modeling into semi-supervised learning is an effective remedy, but today's semi-supervised topic models are monolingual, and the few interactive cross-lingual topic models depend on complete dictionary resources: when dictionary entries are scarce, they lose their cross-lingual character and degenerate into monolingual topic models. Cb-CLTM (Chang et al., 2021) is an unsupervised probabilistic model that integrates word embeddings; by exploiting a cross-lingual word vector space, it can generate cross-lingual topics with fewer dictionary resources. This study extends Cb-CLTM into seeded Cb-CLTM, which produces topics reflecting user preferences: from user-provided seed words for each topic, it builds a seed topic center in the cross-lingual space and guides the model toward topics related to the seed topics. Our experiments show that, once seed information is added, the model produces more interpretable topics.
Abstract
Cross-lingual topic models help users understand large corpora in different languages without reading every document, yet the unsupervised nature of topic models often yields low-quality topics that are impractical for end-user applications. Incorporating human interaction is a favorable approach to tackling this problem. However, most existing interactive topic models are monolingual, and interactive cross-lingual topic models may degenerate into monolingual ones when dictionary entries are scarce. In this work, we extend Cb-CLTM (Chang et al., 2021), a probabilistic multilingual topic model that integrates word embeddings, into seeded Cb-CLTM. Thanks to the cross-lingual word vector space, Cb-CLTM can produce cross-lingual topics with fewer dictionary resources. We extend Cb-CLTM to learn topics based on users' preferences, expressed as groups of user-provided seed words. The average of the seed words' word vectors constructs a seed topic center in the cross-lingual space and encourages the model to generate topics related to the seed topics. Extrinsic evaluations were conducted to explore the possibility of incorporating interactivity into Cb-CLTM. The experimental results show superior performance when seed information is used.
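As a rough illustration of the seeding mechanism described in the abstract, the sketch below (with made-up toy vectors; the actual Cb-CLTM is a full probabilistic model whose inference is not reproduced here) computes a seed topic center as the average of seed-word vectors in a shared cross-lingual space, then ranks words from both languages by cosine similarity to that center:

```python
import numpy as np

# Toy aligned cross-lingual embeddings (hypothetical values; in practice
# these come from a dictionary-based mapping of monolingual embeddings
# into one shared vector space).
emb = {
    "economy": np.array([0.90, 0.10, 0.00]),
    "market":  np.array([0.80, 0.20, 0.10]),
    "經濟":    np.array([0.85, 0.15, 0.05]),  # "economy" in Chinese
    "sports":  np.array([0.10, 0.90, 0.20]),
    "球賽":    np.array([0.15, 0.85, 0.25]),  # "ball game" in Chinese
}

def normalize(v):
    return v / np.linalg.norm(v)

def seed_topic_center(seed_words):
    """Average the normalized vectors of the seed words."""
    return normalize(np.mean([normalize(emb[w]) for w in seed_words], axis=0))

def similarity(word, center):
    """Cosine similarity between a word vector and a seed topic center."""
    return float(normalize(emb[word]) @ center)

# Seed words may come from either language, since all vectors share a space.
center = seed_topic_center(["economy", "經濟"])

# Words closest to the center are the ones a seeded topic would favor.
ranked = sorted(emb, key=lambda w: -similarity(w, center))
```

Here "market" lands near the economy-seeded center while "sports"/"球賽" fall away from it, which is the intuition behind steering topics toward user-provided seeds.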
Table of Contents
Thesis Certification i
Acknowledgements ii
Abstract (Chinese) iii
Abstract iv
CHAPTER 1 - Introduction 1
CHAPTER 2 - Related Work 4
2.1 Guided topic model 4
2.2 Cross-lingual Topic Model 7
2.3 Guided cross-lingual topic model 11
CHAPTER 3 - Methodology 14
3.1 Creating Word Embedding For Monolingual Corpus 14
3.2 Creating Cross-Lingual Word Vector Space 14
3.3 Background 15
3.4 Seeded Center-Based Cross-lingual Topic Model (Seeded Cb-CLTM) 17
3.5 Seed Words Selection 20
CHAPTER 4 - Experiment 23
4.1 Description of Dataset 23
4.2 Experimental Setup 24
4.2.1 Creating Cross-lingual word embedding space 24
4.2.2 Parameter settings of models 25
4.3 Evaluation metrics 27
4.3.1 Topic Coherence 27
4.3.2 Topic Diversity 28
4.3.3 Alignment quality of Cross-lingual document-topic distribution 29
4.3.4 Evaluation of clustering quality 29
4.4 Experimental results and discussion 30
4.4.1 Topic Coherence Performance 30
4.4.2 Topic Diversity Performance 32
4.4.3 Performance of Cross-lingual document-topic distribution 33
4.4.4 Purity Performance 34
4.4.5 Training efficiency 37
4.4.6 Language bias 37
4.4.7 Availability of monolingual seed words 38
CHAPTER 5 - Conclusion 39
References 40
Appendix 47
References
Arora, S., Ge, R. and Moitra, A., 2012, October. Learning topic models--going beyond SVD. In 2012 IEEE 53rd annual symposium on foundations of computer science (pp. 1-10). IEEE.
Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y. and Zhu, M., 2013, May. A practical algorithm for topic modeling with provable guarantees. In International conference on machine learning (pp. 280-288). PMLR.
Arora, S., Ge, R., Kannan, R. and Moitra, A., 2016. Computing a nonnegative matrix factorization---provably. SIAM Journal on Computing, 45(4), pp.1582-1611.
Artetxe, M., Labaka, G. and Agirre, E., 2016, November. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 2289-2294).
Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp.993-1022.
Bouma, G., 2009. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 30, pp.31-40.
Blei, D.M. and McAuliffe, J.D., 2010. Supervised topic models. arXiv preprint arXiv:1003.0783.
Boyd-Graber, J. and Resnik, P., 2010, October. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (pp. 45-55).
Boyd-Graber, J. and Blei, D., 2012. Multilingual topic models for unaligned text. arXiv preprint arXiv:1205.2657.
Boyd-Graber, J., Mimno, D. and Newman, D., 2014. Care and feeding of topic models: Problems, diagnostics, and improvements. Handbook of Mixed Membership Models and Their Applications, pp.225-255.
Chandrasekaran, R., Mehta, V., Valkunde, T. and Moustakas, E., 2020. Topics, trends, and sentiments of tweets about the COVID-19 pandemic: Temporal infoveillance study. Journal of medical Internet research, 22(10), p.e22624.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L. and Blei, D.M., 2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems (pp. 288-296).
Chang, C.H., Hwang, S.Y. and Xui, T.H., 2018, July. Incorporating word embedding into cross-lingual topic modeling. In 2018 IEEE International Congress on Big Data (BigData Congress) (pp. 17-24). IEEE.
Chang, C.H. and Hwang, S.Y., 2021. A word embedding-based approach to cross-lingual topic modeling. Knowledge and Information Systems, 63(6), pp.1529-1555.
Choo, J., Lee, C., Reddy, C.K. and Park, H., 2013. Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE transactions on visualization and computer graphics, 19(12), pp.1992-2001.
Deng, Q., Gao, Y., Wang, C. and Zhang, H., 2020. Detecting information requirements for crisis communication from social media data: An interactive topic modeling approach. International Journal of Disaster Risk Reduction, 50, p.101692.
Gallagher, R.J., Reing, K., Kale, D. and Ver Steeg, G., 2017. Anchored correlation explanation: Topic modeling with minimal domain knowledge. Transactions of the Association for Computational Linguistics, 5, pp.529-542.
Halpern, Y., Choi, Y., Horng, S. and Sontag, D., 2014. Using anchors to estimate clinical state without labeled data. In AMIA Annual Symposium Proceedings (Vol. 2014, p. 606). American Medical Informatics Association.
Halpern, Y., Horng, S. and Sontag, D., 2015. Anchored discrete factor analysis. arXiv preprint arXiv:1511.03299.
Hao, S., Boyd-Graber, J.L. and Paul, M.J., 2018a, June. Lessons from the Bible on modern topics: Adapting topic model evaluation to multilingual and low-resource settings. In Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT (pp. 1-6).
Hao, S. and Paul, M., 2018b, August. Learning multilingual topics from incomparable corpora. In Proceedings of the 27th international conference on computational linguistics (pp. 2595-2609).
Hao, S. and Paul, M.J., 2020. An empirical study on crosslingual transfer in probabilistic topic models. Computational Linguistics, 46(1), pp.95-134.
Hu, Y., Boyd-Graber, J., Satinoff, B. and Smith, A., 2014a. Interactive topic modeling. Machine learning, 95(3), pp.423-469.
Hu, Y., Zhai, K., Eidelman, V. and Boyd-Graber, J., 2014b, June. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1166-1176).
Jagarlamudi, J. and Daumé, H., 2010, March. Extracting multilingual topics from unaligned comparable corpora. In European Conference on Information Retrieval (pp. 444-456). Springer, Berlin, Heidelberg.
Jagarlamudi, J., Daumé III, H. and Udupa, R., 2012, April. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 204-213).
Koehn, P., 2005, September. Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).
Kozachenko, L.F. and Leonenko, N.N., 1987. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2), pp.9-16.
Kraskov, A., Stögbauer, H. and Grassberger, P., 2004. Estimating mutual information. Physical review E, 69(6), p.066138.
Kuang, D., Choo, J. and Park, H., 2015. Nonnegative matrix factorization for interactive topic modeling and document clustering. In Partitional Clustering Algorithms (pp. 215-243). Springer, Cham.
Lau, J.H., Newman, D. and Baldwin, T., 2014, April. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 530-539).
Lee, T.Y., Smith, A., Seppi, K., Elmqvist, N., Boyd-Graber, J. and Findlater, L., 2017. The human touch: How non-expert users perceive, interpret, and fix topic models. International Journal of Human-Computer Studies, 105, pp.28-42.
Lund, J., Cook, C., Seppi, K. and Boyd-Graber, J., 2017, July. Tandem anchoring: A multiword anchor approach for interactive topic modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 896-905).
Manning, C.D., Raghavan, P. and Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press, Ch. 20, pp.405-416.
Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Le, Q.V. and Sutskever, I., 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013c. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Mikolov, T., Yih, W.T. and Zweig, G., 2013d, June. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 746-751).
Mimno, D., Wallach, H., Naradowsky, J., Smith, D.A. and McCallum, A., 2009, August. Polylingual topic models. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 880-889).
Nguyen, T., Hu, Y. and Boyd-Graber, J., 2014, June. Anchors regularized: Adding robustness and extensibility to scalable topic-modeling algorithms. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 359-369).
Nguyen, T., Boyd-Graber, J., Lund, J., Seppi, K. and Ringger, E., 2015. Is your anchor going up or down? Fast and accurate supervised topic models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 746-755).
Ni, X., Sun, J.T., Hu, J. and Chen, Z., 2009, April. Mining multilingual topics from Wikipedia. In Proceedings of the 18th international conference on World wide web (pp. 1155-1156).
Ramage, D., Hall, D., Nallapati, R. and Manning, C.D., 2009, August. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 248-256).
Ross, B.C., 2014. Mutual information between discrete and continuous data sets. PloS one, 9(2), p.e87357.
Ver Steeg, G. and Galstyan, A., 2014. Discovering structure in high-dimensional data through correlation explanation. arXiv preprint arXiv:1406.1222.
Smith, S.L., Turban, D.H., Hamblin, S. and Hammerla, N.Y., 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
Tian, L., Wong, D.F., Chao, L.S., Quaresma, P., Oliveira, F. and Yi, L., 2014, May. UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation. In LREC (pp. 1837-1842).
Ver Steeg, G. and Galstyan, A., 2015, February. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics (pp. 1004-1012). PMLR.
Vulić, I., De Smet, W., Tang, J. and Moens, M.F., 2015. Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications. Information Processing & Management, 51(1), pp.111-147.
Xing, C., Wang, D., Liu, C. and Lin, Y., 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1006-1011).
Yuan, M., Van Durme, B. and Ying, J.L., 2018, January. Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages. In NeurIPS (pp. 8667-8677).
Zhao, B. and Xing, E., 2006, July. BiTAM: Bilingual topic admixture models for word alignment. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions (pp. 969-976).
Fulltext
This electronic fulltext is licensed only for personal, non-commercial searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined availability period
Available:
Campus: available
Off-campus: available


Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (2013) onward. To look up access information for printed theses from academic year 101 (2012) or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
