Title page for etd-0101121-135906
Title
Deep Neural Network Acceleration Chip Hardware Design and Implementation for Multi-Task, High-Performance, and Energy-Efficient Edge Computing
Department
Year, semester
Language
Degree
Number of pages
99
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2021-01-26
Date of Submission
2021-02-01
Keywords
Deep Neural Network, Hardware Accelerators, Neural Network Hardware Accelerator, Depthwise Separable Convolution, Convolution, Neural Network Model Compression
Statistics
This thesis has been viewed 504 times and downloaded 0 times.
Abstract
This thesis studies both the algorithm-level development of deep neural network (DNN) models for edge computing and the design of a DNN hardware acceleration chip. On the algorithm side, lightweight network models are used to trim the model to a certain degree, which reduces the hardware computation load and meets the real-time requirements of edge computing. The hardware design targets a multi-function, low-power, high-performance, low-area architecture capable of real-time operation. The hardware architecture is built around tiling and supports both standard convolution and depthwise separable convolution, as well as different kernel sizes and strides. Together with the dataflow design, the data-reuse scheme, and the processing element (PE) design, it reduces memory accesses and thereby achieves high computational performance and lower power consumption.
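For orientation, the computation that the tiling, dataflow, data-reuse, and PE design all revolve around is the standard convolution loop nest. The following minimal Python sketch is only a reference model of that loop nest; the tensor layout, loop order, and sizes are illustrative assumptions and are not taken from the thesis.

```python
# Plain-Python reference model of a standard 2-D convolution layer.
# Layout assumption (illustrative only): ifmap[C_in][H][W], weights[C_out][C_in][K][K].

def conv2d(ifmap, weights, stride=1):
    c_in, h, w = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    c_out, k = len(weights), len(weights[0][0])
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    ofmap = [[[0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for co in range(c_out):                # output channels
        for y in range(h_out):             # output rows
            for x in range(w_out):         # output columns
                acc = 0
                for ci in range(c_in):     # input channels
                    for ky in range(k):    # kernel rows
                        for kx in range(k):
                            acc += ifmap[ci][y * stride + ky][x * stride + kx] \
                                   * weights[co][ci][ky][kx]
                ofmap[co][y][x] = acc      # one output pixel per inner accumulation
    return ofmap
```

Roughly speaking, tiling corresponds to blocking the y/x loops, and the parallelism options listed in Section 3.4 (input channel, output channel, kernel window, output window) correspond to unrolling the ci, co, ky/kx, and y/x loops across the PE array.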
In recent years, Edge AI has become a research focus for companies and academic groups alike. To achieve high real-time performance, network model compression and lightweight network models are both practical approaches. This thesis therefore not only analyzes the network on the software side but also adds depthwise separable convolution support to the hardware chip design, so that the chip can execute not only commonly used convolutional networks but also lightweight neural networks, bringing it closer to the needs of edge computing.
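The computational saving that motivates adding depthwise separable convolution support can be checked with a simple multiply-accumulate (MAC) count. The layer shape below is an arbitrary illustrative assumption, not a configuration reported in the thesis.

```python
# MAC counts: standard convolution vs. depthwise separable convolution
# (a depthwise K x K filter per channel followed by a pointwise 1 x 1 convolution).

def macs_standard(c_in, c_out, k, h_out, w_out):
    return c_in * c_out * k * k * h_out * w_out

def macs_separable(c_in, c_out, k, h_out, w_out):
    depthwise = c_in * k * k * h_out * w_out   # one K x K filter per input channel
    pointwise = c_in * c_out * h_out * w_out   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

if __name__ == "__main__":
    std = macs_standard(c_in=64, c_out=128, k=3, h_out=56, w_out=56)
    sep = macs_separable(c_in=64, c_out=128, k=3, h_out=56, w_out=56)
    print(f"standard: {std:,} MACs, separable: {sep:,} MACs, reduction: {std / sep:.1f}x")
```

For this assumed shape the reduction is roughly 8x, the usual 1/C_out + 1/K^2 factor for depthwise separable layers.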
The main design focus of this thesis is to partition the input image with a tile architecture. By analyzing the number of DRAM reads and the input SRAM storage area as functions of the tile size, the most suitable tile size is determined. Dataflow and parallelism analyses are also used to minimize the memory accesses of the DNN hardware circuit and to achieve high computational performance. Through the PE design and the way the kernel is partitioned, data-loading time is hidden from the computation, and the PEs reach 100% hardware utilization for all supported kernel sizes.
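The tile-size trade-off mentioned above can be illustrated with a first-order cost model: a larger tile requires a larger input SRAM, while a smaller tile makes the overlapping halo rows and columns between neighbouring tiles be refetched from DRAM more often. The sketch below is a simplified model for intuition only, under assumed stride-1, single-buffered conditions; it is not the exact analysis performed in the thesis.

```python
import math

# First-order model: an output tile of T x T pixels needs a (T+K-1) x (T+K-1)
# input patch, so adjacent tiles reload (K-1)-wide halo regions from DRAM.

def tile_costs(h_out, w_out, c_in, k, tile, bytes_per_px=1):
    patch = tile + k - 1                                        # input patch edge for one tile
    n_tiles = math.ceil(h_out / tile) * math.ceil(w_out / tile)
    dram_bytes = n_tiles * patch * patch * c_in * bytes_per_px  # total input traffic
    sram_bytes = patch * patch * c_in * bytes_per_px            # input buffer for one tile
    return dram_bytes, sram_bytes

if __name__ == "__main__":
    for t in (4, 8, 16, 32, 56):
        dram, sram = tile_costs(h_out=56, w_out=56, c_in=64, k=3, tile=t)
        print(f"tile {t:2d}: DRAM input traffic ~{dram:8,d} B, input SRAM ~{sram:8,d} B")
```

Sweeping the tile size with such a model exposes the knee point where further DRAM savings no longer justify the extra on-chip SRAM area, which is the kind of trade-off the tile-size analysis in the thesis resolves.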
Another design focus of this thesis is a hardware padding circuit. It greatly reduces the hardware computation time and the time spent rearranging data, while also lowering memory accesses and the power consumption of the hardware.
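To make the padding point concrete, the sketch below shows the standard output-size relation with padding and an on-the-fly boundary check that injects zeros instead of storing and re-reading a padded copy of the feature map; avoiding that extra copy is what saves rearrangement time and memory accesses. This is a generic illustration under assumed conventions, not the padding circuit described in Section 4.7.

```python
def conv_out_size(n, k, stride, pad):
    """Output size of a convolution: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

def read_padded(ifmap, row, col, pad):
    """Read pixel (row, col) of the *virtually* padded map: coordinates that fall
    in the padding region return 0 on the fly, so no padded copy is ever stored."""
    r, c = row - pad, col - pad
    if 0 <= r < len(ifmap) and 0 <= c < len(ifmap[0]):
        return ifmap[r][c]
    return 0

if __name__ == "__main__":
    # 'same' padding for a 3x3 kernel at stride 1 keeps a 56-pixel edge at 56 pixels
    assert conv_out_size(56, k=3, stride=1, pad=1) == 56
    ifmap = [[1, 2], [3, 4]]
    print([read_padded(ifmap, r, c, pad=1) for r in range(4) for c in range(4)])
```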
Table of Contents
Thesis Approval Certificate i
Abstract (Chinese) ii
Abstract iii
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Thesis Outline 2
Chapter 2 Background and Related Work 3
2.1 Common Neural Network Operations 3
2.1.1 Network Architectures for Image Processing 3
2.1.2 Network Architectures for Speech Processing 7
2.1.3 Activation Functions 8
2.1.4 Pooling Layers 8
2.2 Classic Neural Network Models 9
2.2.1 Neural Network Models for Image Applications 9
2.2.2 Lightweight Neural Network Models 11
2.2.3 Neural Networks for Object Detection 13
2.2.4 Generative Adversarial Networks 14
2.3 Classic DNN Accelerator Architectures 15
2.3.1 DianNao [11] 15
2.3.2 TPU [12] 15
2.3.3 Eyeriss [13] 15
2.3.4 DNA [14] 16
2.3.5 FlexFlow [15] 16
2.3.6 Stripes [16] 16
2.3.7 Reported Results of Classic DNN Accelerators 17
Chapter 3 Software and Hardware Design and Analysis for Edge Computing 18
3.1 Analysis of Neural Network Models for Edge Computing 19
3.1.1 Analysis of Depthwise Separable Convolution 19
3.2 Tile Partitioning Analysis for the DNN Hardware Accelerator 22
3.3 Dataflow Analysis for the DNN Hardware Accelerator 26
3.3.1 1D Array System Dataflow 26
3.3.2 2D Array System Dataflow 29
3.4 Parallelism Selection and Analysis for the DNN Hardware Accelerator 34
3.4.1 Input Channel Parallel 34
3.4.2 Output Channel Parallel 34
3.4.3 Kernel Window Parallel 35
3.4.4 Output Window Parallel 36
3.4.5 Parallelism Dependencies of the DNN Hardware Accelerator 36
Chapter 4 DNN Acceleration Hardware Design and Implementation 39
4.1 Tile Design 40
4.1.1 Tile Size Design for Different Output Sizes, Strides, and Kernel Sizes 40
4.2 Dataflow and Parallelism Selection 43
4.3 On-Chip Memory Design 45
4.3.1 Input Buffer Design 47
4.3.2 Weight Buffer Design 48
4.3.3 Output Buffer Design 49
4.4 Temporary Buffer Design 51
4.5 Line Buffer Design 52
4.5.1 Line Buffer Support for Different Tile Sizes 54
4.5.2 Line Buffer Support for Different Kernel Sizes 55
4.5.3 Line Buffer Support for Different Convolution Types 56
4.5.4 Line Buffer Support for Different Strides 58
4.6 Processing Element (PE) Design 59
4.6.1 PE Hardware Utilization Considerations 59
4.6.2 PE Operation for Different Kernel Sizes 60
4.6.3 PE Design and Critical Path 62
4.6.4 PE Dataflow for Different Convolution Types 63
4.7 Hardware Padding Design 64
4.7.1 Padding Analysis 64
4.7.2 Padding Hardware Design 65
4.7.3 Padding Insertion Method 70
4.8 Batch Normalization & Activation 72
4.9 System Controller 73
Chapter 5 Experimental Results and Comparison with Prior Work 74
5.1 Analysis of Experimental Results 74
5.2 Comparison with Prior Work 78
5.3 Conclusions of the Comparison 80
Chapter 6 Conclusion and Future Work 81
6.1 Conclusion 81
6.2 Future Work 82
References 83
References
[1] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[2] X. Zhang, et al., “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” CVPR, 2017.

[3] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size,” arXiv preprint arXiv:1602.07360, 2017.

[4] F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800-1807.

[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.

[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, pp. 211-252, 2015.
[7] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014.

[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91.

[10] A. Radford, L. Metz, and S. Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv preprint arXiv:1511.06434, 2016.

[11] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.

[12] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, 2017, pp. 1-12.

[13] Y. Chen, T. Krishna, J. S. Emer and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Jan. 2017.

[14] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu and S. Wei, "Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 8, pp. 2220-2233, Aug. 2017.

[15] W. Lu, G. Yan, J. Li, S. Gong, Y. Han and X. Li, "FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks," 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, 2017, pp. 553-564.

[16] P. Judd, J. Albericio and A. Moshovos, "Stripes: Bit-Serial Deep Neural Network Computing," in IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 80-83, 1 Jan.-June 2017.

[17] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” arXiv preprint arXiv:1312.6229, 2014.

[18] C. Szegedy et al., "Going deeper with convolutions," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1-9.

[19] K. He, et al., “Deep Residual Learning for Image Recognition,” CVPR, 2016.

[20] L. Cavigelli and L. Benini, "Origami: A 803-GOp/s/W Convolutional Network Accelerator," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461-2475, Nov. 2017.

[21] K. Guo et al., "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35-47, Jan. 2018.
[22] J. Yan, S. Yin, F. Tu, L. Liu and S. Wei, "GNA: Reconfigurable and Efficient Architecture for Generative Network Acceleration," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2519-2529, Nov. 2018.

[23] M. Hanif, R. V. Putra, M. Tanvir, R. Hafiz, S. Rehman, and M. Shafique, “MPNA: A Massively-Parallel Neural Array Accelerator with Dataflow Optimization for Convolutional Neural Networks,” arXiv preprint arXiv:1810.12910, 2018.

[24] V. Sze, Y. Chen, T. Yang and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," in Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017.

[25] Y. Chen, T. Yang, J. Emer and V. Sze, "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, June 2019.

[26] S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 243-254.

[27] Y. Jiang, J. Ren, X. Xie and C. Zhang, "Hardware Implementation of Depthwise Separable Convolution Neural Network," 2020 IEEE 15th International Conference on Solid-State & Integrated Circuit Technology (ICSICT), Kunming, 2020, pp. 1-3.
Fulltext
This electronic full text is licensed to users only for personal, non-commercial retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined availability date
Available:
Campus: available for download from 2026-02-01
Off-campus: available for download from 2026-02-01

Printed copies
Public access information for printed copies is relatively complete only from academic year 102 onward. To inquire about printed copies from academic year 101 or earlier, please contact the printed-thesis service desk of the Library and Information Office. We apologize for any inconvenience.
Available: 2026-02-01
