博碩士論文 etd-1112122-101919 詳細資訊
Title page for etd-1112122-101919
論文名稱
Title
利用語音合成和對抗性文本鑑別器對語音辨識進行訓練以改進單語言以及語碼轉換下的語音辨識系統
Improving Speech Recognition System under Monolingual and Code-Switching by Training with Speech Synthesis and an Adversarial Text Discriminator
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
85
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2022-11-16
繳交日期
Date of Submission
2022-12-12
關鍵字
Keywords
語音辨識、語音合成、對抗性文本鑑別器、上下文區塊處理、逐塊同步集束搜索、串流處理
automatic speech recognition, text to speech, adversarial text discriminator, contextual block processing, blockwise synchronous beam search, streaming method
統計
Statistics
本論文已被瀏覽 25 次,被下載 3 次。
The thesis/dissertation has been browsed 25 times, has been downloaded 3 times.
中文摘要
本論文中,我們以基於注意力機制的卷積增強變換器 (Convolution-augmented Transformer, Conformer) 架構結合連續時序性分類來建立我們的端到端自動語音辨識 (Automatic Speech Recognition, ASR) 系統,同時使用上下文區塊處理 (Contextual Block Processing) 以及逐塊同步集束搜索 (Blockwise Synchronous Beam Search) 的方法使系統可以達到串流 (Streaming) 處理,並以此架構做為本文的基礎系統;後續基於此基礎系統,我們採用三種方法來提高自動語音辨識系統的效能。我們分別在各個系統上使用單語言以及語碼轉換的資料集進行訓練,並利用遷移學習的方式微調系統,再使用單語言以及語碼轉換測試資料分別測試系統,觀察改進後的系統在單語言以及語碼轉換情況下的結果。首先,我們添加了一個對抗性文本鑑別器模塊對語音辨識模型進行訓練,以糾正辨識結果中的拼寫錯誤。實驗結果表明,加入對抗性文本鑑別器後,單語言以及語碼轉換語音辨識系統的字元錯誤率 (Character Error Rate, CER) 分別從 12.6% 以及 48.7% 下降至 12.3% 以及 45.1%,而單詞錯誤率 (Word Error Rate, WER) 分別從 31.7% 以及 65.7% 下降至 31.4% 以及 65.4%。其次,我們在語音辨識模型中加入了對應語言情境下預訓練的語音合成 (Text to Speech, TTS) 模型。語音合成模型可以將語音辨識模型的輸出結果作為輸入,合成對應的梅爾頻譜圖 (Mel-spectrogram),並近似真實的 (Ground-truth) 梅爾頻譜圖。在加入語音合成模型後,單語言及語碼轉換的字元錯誤率分別從 12.6% 以及 48.7% 下降至 10.0% 以及 43.4%,而單詞錯誤率分別從 31.7% 以及 65.7% 下降至 23.0% 以及 64.3%,這表明預訓練的語音合成系統可以幫助提升語音辨識系統的效能。最後,我們將對應語言情境下的預訓練語音合成模型和對抗性文本鑑別器合併,對語音辨識模型進行訓練。如此不僅可以有效地糾正錯別字,還能繼承語音合成系統對原始語音辨識系統的改進。實驗結果表明,單語言及語碼轉換的字元錯誤率與單詞錯誤率分別達到 9.6% 和 22.0% 以及 41.6% 和 62.1%。
Abstract
In this thesis, we implement our end-to-end automatic speech recognition (ASR) system with a Conformer architecture based on the attention mechanism combined with Connectionist Temporal Classification (CTC), and we employ Contextual Block Processing and Blockwise Synchronous Beam Search so that the system can recognize speech in a streaming fashion. This architecture serves as the baseline of our work, and we improve it with three methods. We train each system on monolingual and code-switching datasets, fine-tune it with transfer learning, and then evaluate it on monolingual and code-switching test data to observe how well the improved systems perform in both settings. First, we add an adversarial text discriminator module to the training of the ASR model to correct typos in the recognition results. The experimental results show that the character error rates (CER) of the monolingual and code-switching ASR systems with the text discriminator drop from 12.6% and 48.7% to 12.3% and 45.1%, respectively, and the word error rates (WER) drop from 31.7% and 65.7% to 31.4% and 65.4%, respectively. Second, we add a pre-trained text-to-speech (TTS) model for the corresponding language to the ASR model. The TTS model takes the ASR output as input, synthesizes the corresponding mel-spectrogram, and approximates the ground-truth mel-spectrogram. With the TTS model, the character error rates for monolingual and code-switching data drop from 12.6% and 48.7% to 10.0% and 43.4%, respectively, while the word error rates drop from 31.7% and 65.7% to 23.0% and 64.3%, which shows that a pre-trained TTS system can help improve the ASR system. Finally, we combine the language-specific pre-trained TTS model with the adversarial text discriminator to train the ASR model. In this way, typos are corrected effectively while the gains contributed by the pre-trained TTS model are retained. According to the experimental results, the character and word error rates reach 9.6% and 22.0% for the monolingual case and 41.6% and 62.1% for the code-switching case, respectively.
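The abstract describes training the ASR model jointly with an adversarial text discriminator and a TTS-based consistency objective (the thesis compares synthesized and ground-truth mel-spectrograms with Soft-DTW, per the table of contents). As a rough illustration only, the following minimal PyTorch-style sketch shows how such a combined objective might be assembled; the discriminator module, the loss weights, and the tensor shapes are hypothetical, and a plain L1 distance stands in for Soft-DTW. It is not the thesis's actual implementation.

```python
# Illustrative sketch (assumptions, not the thesis code): a toy text discriminator
# and a weighted multi-task loss combining CTC/attention ASR terms, a
# generator-side adversarial term, and a TTS mel-spectrogram consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextDiscriminator(nn.Module):
    """Toy discriminator that scores a token sequence as real or ASR-generated."""

    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        emb = self.embed(tokens)              # (B, T, H)
        _, h = self.rnn(emb)                  # h: (1, B, H)
        return self.head(h[-1]).squeeze(-1)   # (B,) real/fake logits


def combined_asr_loss(ctc_loss, att_loss, disc_logits_on_hyp, mel_pred, mel_gt,
                      w_ctc=0.3, w_att=0.7, w_adv=0.1, w_tts=0.1):
    """Weighted sum of the four terms; the weights here are hypothetical."""
    # Generator-side adversarial term: push the discriminator to label the
    # ASR hypothesis as "real" text (target label 1).
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_on_hyp, torch.ones_like(disc_logits_on_hyp))
    # Stand-in for Soft-DTW: simple L1 distance between mel-spectrograms.
    tts_consistency = F.l1_loss(mel_pred, mel_gt)
    return w_ctc * ctc_loss + w_att * att_loss + w_adv * adv + w_tts * tts_consistency


if __name__ == "__main__":
    torch.manual_seed(0)
    disc = TextDiscriminator(vocab_size=100)
    hyp_tokens = torch.randint(0, 100, (2, 12))           # dummy ASR hypotheses
    loss = combined_asr_loss(
        ctc_loss=torch.tensor(3.2), att_loss=torch.tensor(2.1),
        disc_logits_on_hyp=disc(hyp_tokens),
        mel_pred=torch.randn(2, 80, 120), mel_gt=torch.randn(2, 80, 120))
    print(float(loss))
```

The point of the weighted sum is that the adversarial and TTS terms act as auxiliary regularizers on top of the usual hybrid CTC/attention objective, so the ASR model is rewarded both for producing text the discriminator accepts and for producing text from which a plausible mel-spectrogram can be resynthesized.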
目次 Table of Contents
Thesis Certification i
Acknowledgements ii
Chinese Abstract iii
Abstract v
List of Figures x
List of Tables xii
Chapter 1 Introduction 1
1.1 Research Motivation and Goals 1
1.2 Research Contributions 2
1.3 Submitted Papers 3
1.4 Thesis Organization 3
Chapter 2 Baseline End-to-End Speech Recognition System 4
2.1 Text Pre-processing Module 4
2.2 Data Augmentation 5
2.3 Connectionist Temporal Classification (CTC) Model 7
2.4 Introduction to the Attention Mechanism 9
2.4.1 Transformer Architecture 11
2.4.2 Conformer Encoder 14
2.5 End-to-End Speech Recognition System Combined with CTC 15
2.5.1 Loss Function in the Training Stage 16
2.5.2 Scoring Method in the Decoding Stage 18
2.6 Methods for Streaming Processing 18
2.6.1 Contextual Block Processing for the Encoder 18
2.6.2 Blockwise Synchronous Beam Search for the Decoding Stage 21
2.7 Transfer Learning 24
Chapter 3 Improving the End-to-End Speech Recognition System with a Speech Synthesis Model and an Adversarial Text Discriminator 25
3.1 Adversarial Text Discriminator Module 25
3.2 Training the ASR Model with an Adversarial Text Discriminator 27
3.3 End-to-End Speech Synthesis Model 28
3.3.1 Text-to-Pinyin Module 28
3.3.2 Variance Adaptor 29
3.3.3 Length Regulator 31
3.3.4 Post-Net Architecture 35
3.3.5 Encoder and Decoder of the Speech Synthesis Model 35
3.4 Soft Dynamic Time Warping (Soft-DTW) 36
3.5 Training the ASR Model with a Speech Synthesis Model 38
3.6 Cycle-Consistent Generative Adversarial Network Architecture 40
3.7 Training the ASR Model with a Speech Synthesis Model and an Adversarial Text Discriminator 44
Chapter 4 Experimental Setup 46
4.1 Datasets for the ASR Model 47
4.1.1 Monolingual Datasets 47
4.1.2 Code-Switching Datasets 47
4.1.3 FSR-2020 Dataset 48
4.2 Datasets for the Speech Synthesis Model 48
4.2.1 Monolingual Datasets 49
4.2.2 Multilingual Datasets 49
4.3 Experimental Settings of the ASR and Speech Synthesis Models 49
Chapter 5 Experimental Results 52
5.1 Evaluation Methods for the Speech Recognition System 52
5.2 Results of the Baseline ASR System under Monolingual and Code-Switching Conditions 54
5.2.1 Analysis of the Monolingual Test Set on the Monolingual Baseline ASR System 54
5.2.2 Analysis of the Code-Switching Test Set on the Fine-Tuned Baseline ASR System 55
5.3 Results of the System Trained with an Adversarial Text Discriminator under Monolingual and Code-Switching Conditions 55
5.3.1 Analysis of the Monolingual Test Set on the Monolingual System Trained with an Adversarial Text Discriminator 56
5.3.2 Analysis of the Code-Switching Test Set on the Fine-Tuned System Trained with an Adversarial Text Discriminator 56
5.4 Results of the System Trained with a Speech Synthesis Model under Monolingual and Code-Switching Conditions 56
5.4.1 Analysis of the Monolingual Test Set on the Monolingual System Trained with a Speech Synthesis Model 57
5.4.2 Analysis of the Code-Switching Test Set on the Fine-Tuned System Trained with a Speech Synthesis Model 57
5.5 Results of the System Trained with a Speech Synthesis Model and an Adversarial Text Discriminator under Monolingual and Code-Switching Conditions 58
5.5.1 Analysis of the Monolingual Test Set on the Monolingual System Trained with a Speech Synthesis Model and an Adversarial Text Discriminator 58
5.5.2 Analysis of the Code-Switching Test Set on the Fine-Tuned System Trained with a Speech Synthesis Model and an Adversarial Text Discriminator 59
5.6 Analysis of Performance Degradation under Code-Switching 59
5.7 Comparison of Soft-DTW and Mean Absolute Error for Similarity Computation 59
5.8 Experimental Results on FSR-2020 Taiwanese Han Characters 61
Chapter 6 Conclusion and Future Work 63
References 65
電子全文 Fulltext
This electronic full text is licensed only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Information on the public availability of printed theses is relatively complete from academic year 102 (ROC calendar) onward. To check the availability of printed theses from academic year 101 or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
