Thesis access permission: user-defined availability period
Available:
Campus: available
Off-campus: available
Title: 深度類神經網路硬體加速器之架構設計與實作 (Architecture Design and Implementation of Deep Neural Network Hardware Accelerators)
Department:
Year, semester:
Language:
Degree:
Number of pages: 89
Author:
Advisor:
Convenor:
Advisory Committee:
Date of Exam: 2018-08-20
Date of Submission: 2018-09-03
Keywords: CNN hardware accelerator, deep neural network (DNN), convolutional neural network (CNN), machine learning
Statistics: This thesis has been browsed 5,839 times and downloaded 4 times.
Chinese Abstract
Deep neural networks (DNNs) are widely used in computer vision and achieve excellent results in applications such as image classification and object detection. However, DNN computation involves massive data movement and high computational complexity, which pose severe challenges for power consumption and performance. To achieve real-time processing on embedded systems (e.g., smartphones and intelligent automotive systems) while limiting power consumption, DNN acceleration has largely moved toward dedicated FPGA and ASIC hardware. This thesis proposes a memory-access analysis method and a DNN hardware accelerator design. For different SRAM sizes and dataflows, the analysis determines, before the hardware is designed, which data reuse scheme requires fewer memory accesses and therefore lower overall energy. Using mixed input/output reuse, we design a DNN hardware accelerator that computes 32 output maps in parallel and accelerates the convolutional layers of the representative DNN model VGG16 without affecting recognition accuracy. In a TSMC 40 nm process, the circuit operates at 515 MHz with 280 KB of internal memory and reaches a peak performance of 139 Gop/s when accelerating VGG16. Compared with Eyeriss [21], the design requires more on-chip memory, but achieves lower total energy for data movement and computation as well as shorter computation time.
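To make the pre-design analysis step concrete, the sketch below is a minimal, illustrative model of that kind of comparison; the conv_dram_accesses function, its tiling rules, and the 70,000-word SRAM budget are assumptions for illustration only, not the simulator developed in the thesis. It estimates off-chip word accesses for one convolutional layer under an input-reuse versus an output-reuse dataflow, which is the comparison used to decide which reuse scheme needs fewer memory accesses at a given SRAM size.

```python
# Minimal sketch (not the thesis simulator): estimate off-chip (DRAM) word
# accesses of one convolutional layer under two data reuse schemes, given an
# on-chip SRAM budget. Layer shape, SRAM size, and cost formulas are
# illustrative assumptions.
import math

def conv_dram_accesses(H, W, C, M, K, sram_words, reuse):
    """Rough DRAM word-access count for a stride-1, same-padded conv layer."""
    ifmap = H * W * C        # input feature-map words
    wts   = K * K * C * M    # weight words
    ofmap = H * W * M        # output feature-map words

    if reuse == "output":
        # Hold as many output maps (partial sums) on chip as fit, and
        # re-stream the whole input once per pass over the output channels.
        maps_per_pass = max(1, sram_words // (H * W))
        passes = math.ceil(M / maps_per_pass)
        return passes * ifmap + wts + ofmap
    if reuse == "input":
        # Hold a tile of the input on chip, and re-fetch all weights once
        # per input tile.
        tiles = max(1, math.ceil(ifmap / sram_words))
        return ifmap + tiles * wts + ofmap
    raise ValueError(f"unknown reuse scheme: {reuse}")

if __name__ == "__main__":
    # Example: VGG16 conv1_1 (224x224x3 input, 64 output maps, 3x3 kernels)
    for scheme in ("input", "output"):
        acc = conv_dram_accesses(224, 224, 3, 64, 3, sram_words=70_000, reuse=scheme)
        print(f"{scheme:>6} reuse: ~{acc / 1e6:.2f} M word accesses")
```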
Abstract
Deep Neural Networks (DNNs), widely used in computer vision applications, deliver superior performance in image classification and object detection. However, the huge amount of data movement and the computational complexity are two challenges when DNNs run on embedded systems, where real-time processing and power consumption are the major design considerations. Hardware DNN accelerators are therefore usually implemented on FPGAs or as ASICs. In this thesis, we develop a memory-access analysis method and design a DNN hardware accelerator with fewer memory accesses and lower power consumption. Using a mixed input/output reuse method, we design a DNN hardware accelerator with 32 processing elements (PEs) that accelerates the computation of the VGG16 convolutional layers. The accelerator achieves a maximum frequency of 515 MHz with an internal SRAM size of 280 KB in TSMC 40 nm process technology. Its peak performance is 139 GOP/s, with better computation speed and energy than Eyeriss [21].
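As a quick consistency check on the quoted figures, and assuming the usual convention that one multiply-accumulate (MAC) counts as two operations (an assumption, since the abstract does not state how operations are counted), the peak throughput and clock frequency imply how many MACs the datapath must complete per cycle:

```python
# Relate the reported peak performance and clock frequency to MACs per cycle,
# assuming one MAC = two operations (multiply + add).
clock_hz = 515e6   # reported maximum clock frequency
peak_ops = 139e9   # reported peak performance in operations per second

ops_per_cycle  = peak_ops / clock_hz   # ~270 operations per cycle
macs_per_cycle = ops_per_cycle / 2     # ~135 MACs per cycle

print(f"~{ops_per_cycle:.0f} ops/cycle, i.e. ~{macs_per_cycle:.0f} MACs/cycle at peak")
```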
Table of Contents
Approval Sheet i
Abstract (Chinese) ii
Abstract iii
Table of Contents iv
List of Figures vi
List of Tables ix
Chapter 1 Introduction 1
  1.1 Motivation 1
  1.2 Thesis Organization 3
Chapter 2 Background and Related Work 4
  2.1 Representative DNN Models 4
  2.2 Representative DNN Hardware Accelerators 8
Chapter 3 Classification of DNN Computation 11
  3.1 CNN Algorithm 11
  3.2 Data Partitioning 15
  3.3 Classification of DNN Architectures 20
  3.4 Processing Elements 25
  3.5 Sparsity Considerations 26
Chapter 4 DNN Analysis 27
  4.1 Flow Overview 27
  4.2 Accuracy Analysis 28
  4.3 Data Partitioning Analysis 34
  4.4 Memory Access Simulator 40
Chapter 5 DNN Hardware Accelerator Design 45
  5.1 DNN Hardware Accelerator 45
    5.1.1 Input SRAM 46
    5.1.2 Input Buffer 47
    5.1.3 Processing Element 51
    5.1.4 Weight SRAM 53
    5.1.5 Output SRAM 56
    5.1.6 Activation 57
    5.1.7 Pooling 57
    5.1.8 System Controller (Finite-State Machine) States 59
  5.2 Dual-Precision DNN Hardware Accelerator 64
Chapter 6 Results and Analysis 67
  6.1 Logic Synthesis Results and Analysis 67
  6.2 Comparison with Prior Work 72
Chapter 7 Conclusions and Future Work 75
  7.1 Conclusions 75
  7.2 Future Work 75
References 76
References
[1] J. Lemley, S. Bazrafkan, and P. Corcoran, “Deep Learning for Consumer Devices and Services,” IEEE Consumer Electronics Mag., pp. 48-56, Apr. 2017.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, MA, USA, 2016.
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, pp. 436-444, May 2015.
[4] (LeNet) Y. LeCun, et al., “Gradient-Based Learning Applied to Document Recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2012, pp. 1097-1105.
[6] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. Int. Conf. Learn. Represent. (ICLR), May 2015, pp. 1-14.
[7] (AlexNet) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012.
[8] (VGG) K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” Proc. Intl. Conf. Learning Representations (ICLR), Sept. 2015.
[9] (GoogLeNet) C. Szegedy, et al., “Going Deeper with Convolutions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015.
[10] (ILSVRC) O. Russakovsky, et al., “ImageNet Large Scale Visual Recognition Challenge,” Intl. Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211-252, Dec. 2015.
[11] (ImageNet) J. Deng, et al., “ImageNet: A Large-Scale Hierarchical Image Database,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[12] (ResNet) K. He, et al., “Deep Residual Learning for Image Recognition,” CVPR, 2016.
[13] (ResNeXt) S. Xie, et al., “Aggregated Residual Transformations for Deep Neural Networks,” CVPR, 2017.
[14] (R-CNN) R. Girshick, et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” CVPR, 2014.
[15] (Fast R-CNN) R. Girshick, “Fast R-CNN,” arXiv:1504.08083; Proc. Intl. Conf. Computer Vision (ICCV), 2015.
[16] (SSD) W. Liu, et al., “SSD: Single Shot MultiBox Detector,” Proc. European Conf. Computer Vision (ECCV), 2016.
[17] (DaDianNao) Y. Chen, et al., “DaDianNao: A Machine Learning Supercomputer,” Proc. IEEE/ACM Intl. Symp. Microarchitecture (MICRO), pp. 609-622, 2014.
[18] (ShiDianNao) Z. Du, et al., “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” Proc. Intl. Symp. Computer Architecture (ISCA), pp. 92-104, 2015.
[19] (Cambricon-X) S. Zhang, et al., “Cambricon-X: An Accelerator for Sparse Neural Networks,” MICRO, 2016.
[20] (TPU) N. P. Jouppi, et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” Proc. Intl. Symp. Computer Architecture (ISCA), 2017.
[21] (Eyeriss) Y.-H. Chen, et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 1, pp. 127-138, Jan. 2017.
[22] (PS-ConvNet) B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS,” IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 4, pp. 903-914, Apr. 2017.
[23] (Origami) L. Cavigelli and L. Benini, “Origami: A 803 GOp/s/W Convolutional Network Accelerator,” IEEE Trans. Circuits and Systems for Video Technology (TCSVT), DOI 10.1109/TCSVT, 2017.
[24] (Angel-Eye) Guo, et al., “Angel-Eye: A Complete Design Flow for Mapping CNN onto Embedded FPGA,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems (TCAD), DOI 10.1109/TCAD, 2017.
[25] Qiu, et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” Proc. ACM/SIGDA Intl. Symp. Field-Programmable Gate Arrays (FPGA), pp. 26-35, Feb. 2016.
[26] (EIE) Han, et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” Proc. Intl. Symp. Computer Architecture (ISCA), pp. 243-254, 2016.
[27] (DNA) F. Tu, et al., “Deep Convolutional Neural Network Architecture with Reconfigurable Computation Patterns,” IEEE Trans. VLSI Systems (TVLSI), vol. 25, no. 8, pp. 2220-2233, Aug. 2017.
[28] (ZeNA) D. Kim, J. Ahn, and S. Yoo, “ZeNA: Zero-Aware Neural Network Accelerator,” IEEE Design & Test, DOI 10.1109/MDAT, 2017.
[29] M. Horowitz, “Energy Table for 45 nm Process,” Stanford VLSI Wiki. [Online]. Available: https://sites.google.com/site/seecproject
[30] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” arXiv preprint arXiv:1703.09039, 2017.
[31] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-Layer CNN Accelerators,” in MICRO, 2016.
[32] (Cnvlutin) J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” in ISCA, 2016.
[33] (Caffe) Y. Jia, “Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding,” http://caffe.berkeleyvision.org/, 2013.
[34] (Caffe) Y. Jia, et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” Proc. ACM Intl. Conf. Multimedia, pp. 675-678, 2014.
[35] (TensorFlow) M. Abadi, et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” arXiv:1603.04467v2, 2016.
[36] Y. Jia, “Caffe Model Zoo,” https://github.com/BVLC/caffe/wiki/Model-Zoo, 2015.
Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Printed copies
Availability information for printed copies is relatively complete for academic year 102 (2013-2014) and later. To look up availability information for printed copies from academic year 101 (2012-2013) or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available.