A Study on Text Analysis Across Languages
multilingual textual data, cross-lingual text analysis, cross-lingual sentiment lexicon induction, cross-lingual topic model, neural network
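With the rapid development of the Internet, information now spreads across national borders, and the same events, products, and brands are more readily discussed online by users in different countries. Cross-lingual text analysis is the core technology for analyzing such multilingual textual data. However, prior work mostly relies on costly cross-lingual resources as references, such as parallel corpora, machine translation, and cross-lingual knowledge bases; these resources are hard to obtain and are ill-suited to domain-specific corpora. This dissertation therefore focuses on two techniques, cross-lingual sentiment analysis and cross-lingual topic modeling, and proposes methods that achieve high accuracy while using low-cost cross-lingual resources. To this end, we adopt a cross-lingual word space; because it requires only a small amount of cross-lingual supervision, we regard it as a low-cost cross-lingual resource and use it as the input reference for two concrete methods: multistep bilingual sentiment lexicon induction (MS-BSLI) and the center-based cross-lingual topic model (Cb-CLTM). MS-BSLI propagates semantic resources from a dominant language (e.g., English) to a low-resource language through the cross-lingual word space, and then adjusts the space to produce a higher-quality bilingual sentiment lexicon. Cb-CLTM uses the cross-lingual word space to extend latent Dirichlet allocation so that it can identify latent cross-lingual topics in a multilingual corpus. Finally, given the rapid development of neural network techniques, we further investigate cross-lingual topic models built with neural networks, in two directions: (1) proposing two neural-network-based cross-lingual topic models, xETM and cProdLDA, and (2) comparing them with the existing neural cross-lingual topic model ZeroShotTM and evaluating how well xETM and cProdLDA capture cross-lingual topics. |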
Abstract |
The rapid development of the Internet facilitates the dissemination of information worldwide. People in different countries express opinions on the same entities, events, and products, which creates demand for analyzing texts across languages. In previous studies, analyzing such multilingual textual data required expensive interlingual resources, such as parallel corpora, machine translators, and knowledge bases, to link the extracted information across languages. In this dissertation, we address sentiment analysis and topic modeling in a cross-lingual context, aiming to achieve high accuracy while requiring fewer resources. Specifically, the dissertation proposes two resource-light methods: multistep bilingual sentiment lexicon induction (MS-BSLI) and the center-based cross-lingual topic model (Cb-CLTM). Both methods rely on cross-lingual word embeddings to bridge the languages and minimize the need for interlingual resources. MS-BSLI propagates the lexical resources of a dominant language (e.g., English) to a low-resource language in order to generate a better bilingual sentiment lexicon. Cb-CLTM extends the generative process of Latent Dirichlet Allocation (LDA) with cross-lingual word embeddings to identify the common hidden topics in a multilingual corpus. Given the growing adoption of neural networks (NNs), we further investigate cross-lingual topic models implemented with NNs. The investigation includes: (1) proposing two NN-based topic models, xETM and cProdLDA, and (2) comparing the performance of three NN-based cross-lingual topic models: xETM, ZeroShotTM, and cProdLDA. |
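To make the idea of propagating a dominant language's sentiment labels through a shared embedding space concrete, the following is a minimal sketch only. It is not MS-BSLI itself: it swaps in a plain nearest-seed lookup by cosine similarity, and the function names (`cosine`, `induce_lexicon`), the two-dimensional vectors, and the word lists are all hypothetical values chosen for illustration. It assumes the English and Chinese vectors already live in one aligned cross-lingual space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def induce_lexicon(seed_lexicon, source_vectors, target_vectors):
    """Give each target-language word the label of its nearest source seed.

    seed_lexicon:   source word -> sentiment label
    source_vectors: source word -> vector in the shared space
    target_vectors: target word -> vector in the same shared space
    """
    induced = {}
    for word, vec in target_vectors.items():
        nearest_seed = max(
            seed_lexicon, key=lambda s: cosine(vec, source_vectors[s])
        )
        induced[word] = seed_lexicon[nearest_seed]
    return induced


# Hypothetical toy data: 2-d vectors assumed to be already aligned.
english_vectors = {"good": [0.9, 0.1], "bad": [-0.8, 0.2]}
english_seeds = {"good": "positive", "bad": "negative"}
chinese_vectors = {"好": [0.8, 0.3], "糟糕": [-0.9, 0.1]}

lexicon = induce_lexicon(english_seeds, english_vectors, chinese_vectors)
print(lexicon)
```

A nearest-seed lookup of this kind depends entirely on how well the two languages are aligned in the shared space, which is why the abstract describes adjusting the space with lexical resources rather than relying on the raw alignment.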
Thesis Approval Form
Acknowledgements
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  DECLARATION
Chapter 2 The Development of Cross-lingual Word Embedding
  2.1 CONSTRUCTION OF MONOLINGUAL WORD EMBEDDING
  2.2 METHODS FOR CROSS-LINGUAL WORD EMBEDDING ALIGNMENT
  2.3 SUMMARY
Chapter 3 A Multistep Approach for Cross-lingual Sentiment Lexicon Construction
  3.1 RELATED WORK
    3.1.1 Sentiment Analysis for Online Reviews
    3.1.2 Bilingual Sentiment Lexicon Induction
  3.2 THE MULTISTEP APPROACH
    3.2.1 Step 1: Generate a Monolingual Word Vector Space
    3.2.2 Step 2: Determine the Language Transformation
    3.2.3 Step 3: Produce a Specialized Word Vector Space Using Lexical Resources
    3.2.4 Step 4: Postmap the Word Vector Space for Unseen Words
    3.2.5 Margin-Based Similarity Search Method
  3.3 EVALUATION
    3.3.1 Experimental Setups
    3.3.2 Experiment 1: Comparison Between Existing Methods and Lexicons
    3.3.3 Experiment 2: Comparison Between Variants of MS-BSLI
    3.3.4 Experiment 3: Sensitivity Analysis of the MSS
  3.4 SUMMARY
Chapter 4 A Word Embedding-based Approach to Cross-lingual Topic Model
  4.1 RELATED WORKS
    4.1.1 Cross-lingual LDA
    4.1.2 Continuous LDA
  4.2 OUR APPROACH
    4.2.1 Background
    4.2.2 Preparing the Cross-lingual Word Embedding
    4.2.3 Center-Based Cross-lingual Topic Model
  4.3 EXPERIMENTAL RESULTS
    4.3.1 Description of Datasets
    4.3.2 Performance Metrics
    4.3.3 Parameter Settings
    4.3.4 Coherence Performance
    4.3.5 Diversity Performance
    4.3.6 Performance in Cross-lingual Document Representation
    4.3.7 Qualitative Analysis
  4.4 SUMMARY
Chapter 5 Neural Network Based Cross-lingual Topic Models
  5.1 BACKGROUND
    5.1.1 Variational Auto-Encoder
    5.1.2 Auto-Encoding Variational Bayes
    5.1.3 Extension to Topic Model: ProdLDA
  5.2 CROSS-LINGUAL NEURAL TOPIC MODELS
    5.2.1 Extended Embedded Topic Model
    5.2.2 ZeroShot Topic Model
    5.2.3 Contextualized ProdLDA
  5.3 EXPERIMENTS
    5.3.1 Experiment Settings
    5.3.2 Experimental Results
  5.4 SUMMARY
Chapter 6 Conclusion
  6.1 FUTURE WORKS
References
Appendix A: Table of Notations |
