博碩士論文 etd-1112122-101919 詳細資訊
Title page for etd-1112122-101919
論文名稱
Title
利用語音合成和對抗性文本鑑別器對語音辨識進行訓練以改進單語言以及語碼轉換下的語音辨識系統
Improving Speech Recognition System under Monolingual and Code-Switching by Training with Speech Synthesis and an Adversarial Text Discriminator
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
85
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2022-11-16
繳交日期
Date of Submission
2022-12-12
關鍵字
Keywords
語音辨識、語音合成、對抗性文本鑑別器、上下文區塊處理、逐塊同步集束搜索、串流處理
automatic speech recognition, text to speech, adversarial text discriminator, contextual block processing, blockwise synchronous beam search, streaming method
統計
Statistics
本論文已被瀏覽 25 次,被下載 3 次。
The thesis/dissertation has been browsed 25 times, has been downloaded 3 times.
中文摘要
本論文中,我們以基於注意力機制的卷積增強變換器 (Convolution-augmented Transformer, Conformer) 架構結合連續時序性分類來建立我們的端到端自動語音辨識 (Automatic Speech Recognition, ASR) 系統,同時使用上下文區塊處理 (Contextual Block Processing) 以及逐塊同步集束搜索 (Blockwise Synchronous Beam Search) 的方法使系統可以達到串流 (Streaming) 處理,並以此架構做為本文的基礎系統;後續基於此基礎系統,我們採用三種方法來提高自動語音辨識系統的效能。我們分別在各個系統上使用單語言以及語碼轉換的資料集進行訓練,並利用遷移學習的方式微調系統,再使用單語言以及語碼轉換測試資料分別測試系統,觀察改進後的系統在單語言以及語碼轉換情況下的結果。首先,我們添加了一個對抗性文本鑑別器模塊對語音辨識模型進行訓練,以糾正辨識結果中的拼寫錯誤。實驗結果表明,加入對抗性文本鑑別器後,單語言以及語碼轉換語音辨識系統的字元錯誤率 (Character Error Rate, CER) 分別從 12.6% 以及 48.7% 下降至 12.3% 以及 45.1%,而單詞錯誤率 (Word Error Rate, WER) 分別從 31.7% 以及 65.7% 下降至 31.4% 以及 65.4%。其次,我們在語音辨識模型中加入了對應語言情境下預訓練的語音合成 (Text to Speech, TTS) 模型。語音合成模型可以將語音辨識模型的輸出結果作為輸入,合成對應的梅爾頻譜圖 (Mel-spectrogram),並近似真實的 (Ground-truth) 梅爾頻譜圖。在加入語音合成模型後,單語言及語碼轉換的字元錯誤率分別從 12.6% 以及 48.7% 下降至 10.0% 以及 43.4%,而單詞錯誤率分別從 31.7% 以及 65.7% 下降至 23.0% 以及 64.3%,這表明預訓練的語音合成系統可以幫助提升語音辨識系統的效能。最後,我們將對應語言情境下的預訓練語音合成模型和對抗性文本鑑別器合併,對語音辨識模型進行訓練。如此不僅可以有效地糾正錯別字,還能繼承語音合成系統對原始語音辨識系統的改進。實驗結果表明,單語言及語碼轉換的字元錯誤率與單詞錯誤率分別達到 9.6% 和 22.0% 以及 41.6% 和 62.1%。
Abstract
In this thesis, we implement our end-to-end automatic speech recognition (ASR) system with a Conformer architecture based on the attention mechanism combined with Connectionist Temporal Classification (CTC), and we employ Contextual Block Processing and Blockwise Synchronous Beam Search so that the system can recognize speech in a streaming fashion. This architecture serves as the baseline of our work, and we improve it with three methods. We train each system on monolingual and code-switching datasets, fine-tune it with transfer learning, and then evaluate it on monolingual and code-switching test data to observe how well the improved systems perform in both settings. First, we add an adversarial text discriminator module to the training of the ASR model to correct typos in the recognition results. The experimental results show that the character error rates (CER) of the monolingual and code-switching ASR systems with the text discriminator drop from 12.6% and 48.7% to 12.3% and 45.1%, respectively, and the word error rates (WER) drop from 31.7% and 65.7% to 31.4% and 65.4%, respectively. Second, we add a pre-trained text-to-speech (TTS) model for the corresponding language to the ASR model. The TTS model takes the ASR output as input, synthesizes the corresponding mel-spectrogram, and approximates the ground-truth mel-spectrogram. With the TTS model, the character error rates for monolingual and code-switching data drop from 12.6% and 48.7% to 10.0% and 43.4%, respectively, while the word error rates drop from 31.7% and 65.7% to 23.0% and 64.3%, which shows that a pre-trained TTS system can help improve the ASR system. Finally, we combine the language-specific pre-trained TTS model with the adversarial text discriminator to train the ASR model. In this way, typos are corrected effectively while the gains contributed by the pre-trained TTS model are retained. According to the experimental results, the character and word error rates reach 9.6% and 22.0% for the monolingual case and 41.6% and 62.1% for the code-switching case, respectively.
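The abstract describes training the ASR model jointly with an adversarial text discriminator and a TTS-based consistency objective (the thesis compares synthesized and ground-truth mel-spectrograms with Soft-DTW, per the table of contents). As a rough illustration only, the following minimal PyTorch-style sketch shows how such a combined objective might be assembled; the discriminator module, the loss weights, and the tensor shapes are hypothetical, and a plain L1 distance stands in for Soft-DTW. It is not the thesis's actual implementation.

```python
# Illustrative sketch (assumptions, not the thesis code): a toy text discriminator
# and a weighted multi-task loss combining CTC/attention ASR terms, a
# generator-side adversarial term, and a TTS mel-spectrogram consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextDiscriminator(nn.Module):
    """Toy discriminator that scores a token sequence as real or ASR-generated."""

    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        emb = self.embed(tokens)              # (B, T, H)
        _, h = self.rnn(emb)                  # h: (1, B, H)
        return self.head(h[-1]).squeeze(-1)   # (B,) real/fake logits


def combined_asr_loss(ctc_loss, att_loss, disc_logits_on_hyp, mel_pred, mel_gt,
                      w_ctc=0.3, w_att=0.7, w_adv=0.1, w_tts=0.1):
    """Weighted sum of the four terms; the weights here are hypothetical."""
    # Generator-side adversarial term: push the discriminator to label the
    # ASR hypothesis as "real" text (target label 1).
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_on_hyp, torch.ones_like(disc_logits_on_hyp))
    # Stand-in for Soft-DTW: simple L1 distance between mel-spectrograms.
    tts_consistency = F.l1_loss(mel_pred, mel_gt)
    return w_ctc * ctc_loss + w_att * att_loss + w_adv * adv + w_tts * tts_consistency


if __name__ == "__main__":
    torch.manual_seed(0)
    disc = TextDiscriminator(vocab_size=100)
    hyp_tokens = torch.randint(0, 100, (2, 12))           # dummy ASR hypotheses
    loss = combined_asr_loss(
        ctc_loss=torch.tensor(3.2), att_loss=torch.tensor(2.1),
        disc_logits_on_hyp=disc(hyp_tokens),
        mel_pred=torch.randn(2, 80, 120), mel_gt=torch.randn(2, 80, 120))
    print(float(loss))
```

The point of the weighted sum is that the adversarial and TTS terms act as auxiliary regularizers on top of the usual hybrid CTC/attention objective, so the ASR model is rewarded both for producing text the discriminator accepts and for producing text from which a plausible mel-spectrogram can be resynthesized.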
目次 Table of Contents
Thesis Certification i
Acknowledgements ii
Chinese Abstract iii
Abstract v
List of Figures x
List of Tables xii
Chapter 1 Introduction 1
1.1 Research Motivation and Goals 1
1.2 Research Contributions 2
1.3 Submitted Papers 3
1.4 Thesis Organization 3
Chapter 2 Baseline End-to-End Speech Recognition System 4
2.1 Text Pre-processing Module 4
2.2 Data Augmentation 5
2.3 Connectionist Temporal Classification (CTC) Model 7
2.4 Introduction to the Attention Mechanism 9
2.4.1 Transformer Architecture 11
2.4.2 Conformer Encoder 14
2.5 End-to-End Speech Recognition System Combined with CTC 15
2.5.1 Loss Function in the Training Stage 16
2.5.2 Scoring Method in the Decoding Stage 18
2.6 Methods for Streaming Processing 18
2.6.1 Contextual Block Processing for the Encoder 18
2.6.2 Blockwise Synchronous Beam Search for the Decoding Stage 21
2.7 Transfer Learning 24
Chapter 3 Improving the End-to-End Speech Recognition System with a Speech Synthesis Model and an Adversarial Text Discriminator 25
3.1 Adversarial Text Discriminator Module 25
3.2 Training the ASR Model with an Adversarial Text Discriminator 27
3.3 End-to-End Speech Synthesis Model 28
3.3.1 Text-to-Pinyin Module 28
3.3.2 Variance Adaptor 29
3.3.3 Length Regulator 31
3.3.4 Post-Net Architecture 35
3.3.5 Encoder and Decoder of the Speech Synthesis Model 35
3.4 Soft Dynamic Time Warping (Soft-DTW) 36
3.5 Training the ASR Model with a Speech Synthesis Model 38
3.6 Cycle-Consistent Generative Adversarial Network Architecture 40
3.7 Training the ASR Model with a Speech Synthesis Model and an Adversarial Text Discriminator 44
Chapter 4 Experimental Setup 46
4.1 Datasets for the ASR Model 47
4.1.1 Monolingual Datasets 47
4.1.2 Code-Switching Datasets 47
4.1.3 FSR-2020 Dataset 48
4.2 Datasets for the Speech Synthesis Model 48
4.2.1 Monolingual Datasets 49
4.2.2 Multilingual Datasets 49
4.3 Experimental Settings of the ASR and Speech Synthesis Models 49
Chapter 5 Experimental Results 52
5.1 Evaluation Methods for the Speech Recognition System 52
5.2 Results of the Baseline ASR System under Monolingual and Code-Switching Conditions 54
5.2.1 Analysis of the Monolingual Test Set on the Monolingual Baseline ASR System 54
5.2.2 Analysis of the Code-Switching Test Set on the Fine-Tuned Baseline ASR System 55
5.3 Results of the System Trained with an Adversarial Text Discriminator under Monolingual and Code-Switching Conditions 55
5.3.1 Analysis of the Monolingual Test Set on the Monolingual System Trained with an Adversarial Text Discriminator 56
5.3.2 Analysis of the Code-Switching Test Set on the Fine-Tuned System Trained with an Adversarial Text Discriminator 56
5.4 Results of the System Trained with a Speech Synthesis Model under Monolingual and Code-Switching Conditions 56
5.4.1 Analysis of the Monolingual Test Set on the Monolingual System Trained with a Speech Synthesis Model 57
5.4.2 Analysis of the Code-Switching Test Set on the Fine-Tuned System Trained with a Speech Synthesis Model 57
5.5 Results of the System Trained with a Speech Synthesis Model and an Adversarial Text Discriminator under Monolingual and Code-Switching Conditions 58
5.5.1 Analysis of the Monolingual Test Set on the Monolingual System Trained with a Speech Synthesis Model and an Adversarial Text Discriminator 58
5.5.2 Analysis of the Code-Switching Test Set on the Fine-Tuned System Trained with a Speech Synthesis Model and an Adversarial Text Discriminator 59
5.6 Analysis of Performance Degradation under Code-Switching 59
5.7 Comparison of Soft-DTW and Mean Absolute Error for Similarity Computation 59
5.8 Experimental Results on FSR-2020 Taiwanese Han Characters 61
Chapter 6 Conclusion and Future Work 63
References 65
電子全文 Fulltext
This electronic full text is licensed only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Information on the public availability of printed theses is relatively complete from academic year 102 (ROC calendar) onward. To check the availability of printed theses from academic year 101 or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
