Master's/Doctoral Thesis etd-1005124-092503: Detailed Information
Title page for etd-1005124-092503
Title
設計與實現應用於Transformer 機器學習模型之 Multi-Head Self-Attention運算加速器
Design and Implementation of the Multi-Head Self-Attention Accelerator for Transformer Machine Learning Model
Department
Year, semester
Language
Degree
Number of pages
79
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2024-10-15
Date of Submission
2024-11-05
Keywords
Transformer, Multi-Head Self-Attention, Half-Precision Floating-Point Quantization, Thresholding, Matrix Compression
Statistics
The thesis/dissertation has been browsed 176 times and downloaded 0 times.
Abstract (Chinese)
In recent years, machine learning has been widely applied in many fields and has seen considerable progress. The Transformer is at the core of modern machine learning, well suited to natural language processing (NLP) and computer vision, and many well-known machine learning models, such as ChatGPT, BERT, and ViT, are built on it.
The most important and most heavily used computation in the Transformer is Multi-Head Self-Attention, so this thesis designs an accelerator circuit for Multi-Head Self-Attention to effectively improve the Transformer's computational efficiency. Multi-Head Self-Attention is divided into six parts: the Q, K, V Generator, Scaled Attention Score Maker, Attention Weights Maker, Matrix Compression, Context Vector Maker, and Linear Projection. Appropriate bus widths and numbers of multipliers are then chosen according to the parameter sizes, and several hardware acceleration techniques together with pipelining are used to implement the Multi-Head Self-Attention accelerator.
The accelerator circuit is designed and implemented mainly through the following techniques:
1. Half-precision floating-point quantization: Compared with single-precision floating point, half precision requires less storage and less computation; compared with integer quantization, it offers higher accuracy and a wider representable range.
2. Double Local Buffer: While the current data is being processed, the next data is fetched, so computation and data loading proceed in parallel, improving efficiency.
3. Pipeline: The Q Generator is computed in parallel with the Scaled Attention Score Maker, Attention Weights Maker, and Matrix Compression modules, improving efficiency.
4. Thresholding: A minimum value is set for the largest matrix, the Attention Weights; elements below this minimum are treated as zero, which does not affect the correctness of the results.
5. Matrix Compression: The Attention Weights matrix has the highest sparsity in the Multi-Head Self-Attention computation, so it is compressed with Matrix Compression to reduce the amount of matrix multiplication.
The above architecture is designed and implemented in Verilog, and its functionality is verified with testbench simulation results and a C model. The circuit is synthesized with the TSMC 40 nm process, giving a circuit area of 1,773,429 μm², a power consumption of 31.62 mW, and an operating frequency of 100 MHz. In terms of performance, it is 364 times faster than an Intel Core i9-13900F CPU and 8.58 times faster than an NVIDIA GeForce RTX 3070 GPU.
Abstract
In recent years, machine learning has been widely applied in various fields and has seen many promising developments. Among these developments, the Transformer is a core component of modern machine learning, well suited to applications in Natural Language Processing (NLP) and computer vision. Many well-known machine learning models, such as ChatGPT, BERT, and ViT, are based on the Transformer architecture.
The most important and most widely used computation within the Transformer is the Multi-Head Self-Attention mechanism. Therefore, this thesis aims to design an accelerator circuit specifically for Multi-Head Self-Attention to effectively enhance the computational efficiency of the Transformer. The Multi-Head Self-Attention architecture is divided into six parts: the Q, K, V Generator, Scaled Attention Score Maker, Attention Weights Maker, Matrix Compression, Context Vector Maker, and Linear Projection. Based on the parameter sizes, appropriate bus widths and numbers of multipliers are selected, while various hardware acceleration techniques and pipeline designs are employed to implement the Multi-Head Self-Attention accelerator.
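To make this data path concrete, the following minimal single-head reference model is given in C, in the spirit of the C validation model mentioned later in this abstract; the function names, row-major layout, and dimensions are illustrative assumptions rather than the accelerator's actual interfaces or parameters.

#include <math.h>
#include <stddef.h>

/* C[M x N] = A[M x K] * B[K x N], all matrices row-major. */
void matmul(const float *A, const float *B, float *C, size_t M, size_t K, size_t N)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* W[L x L]: scaled scores Q*K^T / sqrt(d_k) followed by a row-wise softmax,
 * i.e. the Scaled Attention Score Maker and Attention Weights Maker steps. */
void attention_weights(const float *Q, const float *K, float *W, size_t L, size_t d_k)
{
    float scale = 1.0f / sqrtf((float)d_k);
    for (size_t i = 0; i < L; ++i) {
        float max = -INFINITY, sum = 0.0f;
        for (size_t j = 0; j < L; ++j) {
            float s = 0.0f;
            for (size_t k = 0; k < d_k; ++k)
                s += Q[i * d_k + k] * K[j * d_k + k];   /* dot(Q_i, K_j) */
            W[i * L + j] = s * scale;
            if (W[i * L + j] > max) max = W[i * L + j];
        }
        for (size_t j = 0; j < L; ++j) {                /* numerically stable softmax */
            W[i * L + j] = expf(W[i * L + j] - max);
            sum += W[i * L + j];
        }
        for (size_t j = 0; j < L; ++j)
            W[i * L + j] /= sum;
    }
}

/* One head: input X[L x d_model], projections Wq/Wk/Wv[d_model x d_k],
 * output context vectors ctx[L x d_k]. */
void attention_head(const float *X, const float *Wq, const float *Wk, const float *Wv,
                    float *Q, float *K, float *V, float *W, float *ctx,
                    size_t L, size_t d_model, size_t d_k)
{
    matmul(X, Wq, Q, L, d_model, d_k);       /* Q Generator */
    matmul(X, Wk, K, L, d_model, d_k);       /* K Generator */
    matmul(X, Wv, V, L, d_model, d_k);       /* V Generator */
    attention_weights(Q, K, W, L, d_k);      /* attention weights */
    matmul(W, V, ctx, L, L, d_k);            /* Context Vector Maker */
}

In the full multi-head form, several such heads run on separate projections and their context vectors are concatenated before the Linear Projection.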
In this thesis, the accelerator circuit is primarily designed and implemented using the following methods:
1. Half-Precision Floating-Point Quantization: Compared to single-precision floating point, half-precision floating point has the advantages of smaller storage space and reduced computational load. Compared to integer quantization, it provides higher accuracy and a wider range of representable values (a conversion sketch in C follows this list).
2. Double Local Buffer: While the current data is being computed, the next data is fetched at the same time, so computation and data fetching execute in parallel, which improves overall efficiency.
3. Pipeline: The Q Generator runs in parallel with the Scaled Attention Score Maker, Attention Weights Maker, and Matrix Compression modules to improve overall efficiency.
4. Thresholding: A minimum threshold is set for the largest matrix, the Attention Weights. If an element falls below this threshold, it is treated as zero, which does not affect the accuracy of the result.
5. Matrix Compression: In the Multi-Head Self-Attention computation, the matrix with the highest sparsity is the Attention Weights matrix. Therefore, the Attention Weights are compressed using Matrix Compression to reduce the computational load of matrix multiplications (a combined thresholding and compression sketch in C follows this list).
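As a sketch of item 1, the following C helper converts a single-precision value to the IEEE 754 binary16 format by re-biasing the exponent and truncating the mantissa. It is a simplified illustration only (round-to-nearest and subnormal outputs are not handled) and is an assumption for explanation, not the thesis's quantization hardware.

#include <stdint.h>
#include <string.h>

/* Simplified float32 -> IEEE 754 binary16 conversion (truncating mantissa). */
uint16_t f32_to_f16_trunc(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                                 /* reinterpret the float bits */

    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);          /* sign bit */
    int32_t  e    = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;  /* re-bias exponent (8 -> 5 bits) */
    uint16_t man  = (uint16_t)((x >> 13) & 0x03FFu);          /* keep the top 10 mantissa bits */

    if (e <= 0)  return sign;                 /* underflow: flush to signed zero (no subnormals) */
    if (e >= 31) return sign | 0x7C00u;       /* overflow, Inf, or NaN: map to signed infinity */
    return sign | (uint16_t)(e << 10) | man;
}

For example, 1.0f maps to 0x3C00 and 0.5f to 0x3800; each value occupies 16 bits instead of 32, which is where the storage saving comes from.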
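The next sketch, also in C, illustrates how items 4 and 5 can work together: attention weights below a threshold are dropped, the surviving entries of a row are stored as (column, value) pairs, and the context-vector computation then multiplies only the retained entries. The pair format, names, and threshold handling are illustrative assumptions, not the thesis's exact hardware compression format.

#include <stddef.h>

/* One retained (non-zero) attention weight: its column index and value. */
typedef struct { int col; float w; } nz_t;

/* Thresholding + compression of one row of attention weights:
 * weights below `threshold` are treated as zero and simply skipped. */
size_t compress_row(const float *weights, size_t n, float threshold, nz_t *out)
{
    size_t nnz = 0;
    for (size_t j = 0; j < n; ++j)
        if (weights[j] >= threshold)
            out[nnz++] = (nz_t){ (int)j, weights[j] };
    return nnz;                                /* number of retained entries */
}

/* Context vector for one row: only the retained weights contribute multiplications,
 * so the work scales with the number of survivors instead of the full row length. */
void context_row(const nz_t *nz, size_t nnz, const float *V, size_t d_v, float *ctx)
{
    for (size_t k = 0; k < d_v; ++k)
        ctx[k] = 0.0f;
    for (size_t i = 0; i < nnz; ++i)
        for (size_t k = 0; k < d_v; ++k)
            ctx[k] += nz[i].w * V[(size_t)nz[i].col * d_v + k];
}

If, say, only 20% of a row survives the threshold, the multiply count for that row drops by roughly the same factor.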
In this thesis, the aforementioned architecture is designed and implemented in Verilog, with functionality verified through testbench simulation results and a C reference model. Circuit synthesis is performed using TSMC's 40 nm process, resulting in a circuit area of 1,773,429 μm², a power consumption of 31.62 mW, and an operating frequency of 100 MHz. In terms of performance, the accelerator is 364 times faster than an Intel Core i9-13900F CPU and 8.58 times faster than an NVIDIA GeForce RTX 3070 GPU.
Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Transformer
2.1.1 Overview
2.1.2 Multi-Head Self-Attention
2.1.3 Transformer Computation
2.2 Related Work on Attention
2.2.1 FreFlex: A High-Performance Processor for Convolution and Attention Computations via Sparsity-Adaptive Dynamic Frequency Boosting
2.2.2 Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU
Chapter 3 Design Concept
Chapter 4 Multi-Head Self-Attention Accelerator Design
4.1 Half-Precision Floating-Point Quantization
4.2 Overall Accelerator Architecture
4.3 Double Local Buffer
4.4 Q, K, V Generator
4.5 Scaled Attention Score Maker
4.6 Attention Weights Maker
4.7 Matrix Compression
4.8 Context Vector Maker
4.9 Linear Projection
Chapter 5 Performance Analysis and Verification
5.1 Accelerator Synthesis Results
5.2 Accelerator Simulation Verification
5.3 Accelerator Waveform Verification
5.3.1 Overall Waveform Verification
5.3.2 K Generator Waveform Verification
5.3.3 Q Generator Waveform Verification
5.3.4 Context Vector Maker Waveform Verification
5.3.5 Linear Projection
5.4 Accelerator Performance Analysis
5.4.1 Analysis of Quantization Storage and Computational Efficiency
5.4.2 Analysis of Bus Width and Number of Multipliers
5.5 Comparison of the Accelerator with CPU and GPU Execution
Chapter 6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," arXiv:1706.03762, Jun. 2017.
[2] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, pp. 179-211, 1990.
[3] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artificial Intelligence and Statistics (AISTATS), vol. 9, pp. 249-256, 2010. Available: https://proceedings.mlr.press/v9/glorot10a.html
[4] S. Shanmuga Sundaram, Y. Khodke, Y. Li, S.-J. Jang, S.-S. Lee, and M. Kang, "FreFlex: A high-performance processor for convolution and attention computations via sparsity-adaptive dynamic frequency boosting," IEEE Journal of Solid-State Circuits, vol. 59, no. 3, pp. 855-866, Mar. 2024, doi: 10.1109/JSSC.2023.3341348.
[5] J. Jiang, J. Du, D. Huang, Z. Chen, Y. Lu, and X. Liao, "Full-stack optimizing transformer inference on ARM many-core CPU," IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 7, pp. 2221-2235, Jul. 2023, doi: 10.1109/TPDS.2023.3280805.
[6] International Organization for Standardization and International Electrotechnical Commission, "Information technology — Language independent arithmetic — Part 1: Integer and floating point arithmetic," ISO/IEC 10967-1:2012, 2012.
[7] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv:1607.06450, Jul. 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, Dec. 2015.
[9] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv:1508.07909, Aug. 2015.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv:1810.04805, 2018.
[11] TensorFlow, "Neural machine translation with a Transformer and Keras," GitHub, https://github.com/tensorflow/text/blob/master/docs/tutorials/transformer.ipynb. Accessed: Oct. 9, 2024.
[12] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. 14th Int. Conf. Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, 2011. Available: https://proceedings.mlr.press/v15/glorot11a.html
[13] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," 2001.
[14] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012.
[15] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv:1301.3781, 2013.
[17] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Cognitive Science, vol. 34, no. 8, pp. 1816-1853, 2010.
[18] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114, 2013.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[20] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473, 2014.
[21] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," arXiv:1710.03740, 2017.
Fulltext
This electronic full text is licensed only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined availability date
Available:
On campus: available for download from 2026-11-05
Off campus: available for download from 2026-11-05

Printed copies
Information on the public availability of printed copies is relatively complete for academic year 102 (2013) and later. For printed copies from academic year 101 or earlier, please contact the printed thesis service desk of the Library and Information Services Office. We apologize for any inconvenience.
Available: 2026-11-05
