Thesis access permission: user-defined availability period
Available:
Campus: available
Off-campus: available
Title: 深度類神經網路硬體加速器之架構設計與實作 (Architecture Design and Implementation of Deep Neural Network Hardware Accelerators)
Department:
Year, semester:
Language:
Degree:
Number of pages: 89
Author:
Advisor:
Convenor:
Advisory Committee:
Date of Exam: 2018-08-20
Date of Submission: 2018-09-03
Keywords: CNN hardware accelerator, deep neural network (DNN), convolutional neural network (CNN), machine learning
Statistics: This thesis has been browsed 5,839 times and downloaded 4 times.
Chinese Abstract
Deep neural networks (DNNs) are widely used in computer vision and achieve excellent results in applications such as image classification and object detection. However, DNN computation involves massive data movement and high computational complexity, which pose severe challenges for power consumption and performance. To achieve real-time processing on embedded systems (e.g., smartphones and intelligent automotive systems) while limiting power consumption, DNN acceleration has largely moved toward dedicated FPGA and ASIC hardware. This thesis proposes a memory-access analysis method and a DNN hardware accelerator design. For different SRAM sizes and dataflows, the analysis determines, before the hardware is designed, which data reuse scheme requires fewer memory accesses and therefore lower overall energy. Using mixed input/output reuse, we design a DNN hardware accelerator that computes 32 output maps in parallel and accelerates the convolutional layers of the representative DNN model VGG16 without affecting recognition accuracy. In a TSMC 40 nm process, the circuit operates at 515 MHz with 280 KB of internal memory and reaches a peak performance of 139 Gop/s when accelerating VGG16. Compared with Eyeriss [21], the design requires more on-chip memory, but achieves lower total energy for data movement and computation as well as shorter computation time.
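To make the pre-design analysis step concrete, the sketch below is a minimal, illustrative model of that kind of comparison; the conv_dram_accesses function, its tiling rules, and the 70,000-word SRAM budget are assumptions for illustration only, not the simulator developed in the thesis. It estimates off-chip word accesses for one convolutional layer under an input-reuse versus an output-reuse dataflow, which is the comparison used to decide which reuse scheme needs fewer memory accesses at a given SRAM size.

```python
# Minimal sketch (not the thesis simulator): estimate off-chip (DRAM) word
# accesses of one convolutional layer under two data reuse schemes, given an
# on-chip SRAM budget. Layer shape, SRAM size, and cost formulas are
# illustrative assumptions.
import math

def conv_dram_accesses(H, W, C, M, K, sram_words, reuse):
    """Rough DRAM word-access count for a stride-1, same-padded conv layer."""
    ifmap = H * W * C        # input feature-map words
    wts   = K * K * C * M    # weight words
    ofmap = H * W * M        # output feature-map words

    if reuse == "output":
        # Hold as many output maps (partial sums) on chip as fit, and
        # re-stream the whole input once per pass over the output channels.
        maps_per_pass = max(1, sram_words // (H * W))
        passes = math.ceil(M / maps_per_pass)
        return passes * ifmap + wts + ofmap
    if reuse == "input":
        # Hold a tile of the input on chip, and re-fetch all weights once
        # per input tile.
        tiles = max(1, math.ceil(ifmap / sram_words))
        return ifmap + tiles * wts + ofmap
    raise ValueError(f"unknown reuse scheme: {reuse}")

if __name__ == "__main__":
    # Example: VGG16 conv1_1 (224x224x3 input, 64 output maps, 3x3 kernels)
    for scheme in ("input", "output"):
        acc = conv_dram_accesses(224, 224, 3, 64, 3, sram_words=70_000, reuse=scheme)
        print(f"{scheme:>6} reuse: ~{acc / 1e6:.2f} M word accesses")
```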
Abstract
Deep Neural Networks (DNNs), widely used in computer vision applications, deliver superior performance in image classification and object detection. However, the huge amount of data movement and the computational complexity are two challenges when DNNs run on embedded systems, where real-time processing and power consumption are the major design considerations. Hardware DNN accelerators are therefore usually implemented on FPGAs or as ASICs. In this thesis, we develop a memory-access analysis method and design a DNN hardware accelerator with fewer memory accesses and lower power consumption. Using a mixed input/output reuse method, we design a DNN hardware accelerator with 32 processing elements (PEs) that accelerates the computation of the VGG16 convolutional layers. The accelerator achieves a maximum frequency of 515 MHz with an internal SRAM size of 280 KB in TSMC 40 nm process technology. Its peak performance is 139 GOP/s, with better computation speed and energy than Eyeriss [21].
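As a quick consistency check on the quoted figures, and assuming the usual convention that one multiply-accumulate (MAC) counts as two operations (an assumption, since the abstract does not state how operations are counted), the peak throughput and clock frequency imply how many MACs the datapath must complete per cycle:

```python
# Relate the reported peak performance and clock frequency to MACs per cycle,
# assuming one MAC = two operations (multiply + add).
clock_hz = 515e6   # reported maximum clock frequency
peak_ops = 139e9   # reported peak performance in operations per second

ops_per_cycle  = peak_ops / clock_hz   # ~270 operations per cycle
macs_per_cycle = ops_per_cycle / 2     # ~135 MACs per cycle

print(f"~{ops_per_cycle:.0f} ops/cycle, i.e. ~{macs_per_cycle:.0f} MACs/cycle at peak")
```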
Table of Contents
Approval Sheet i
Abstract (Chinese) ii
Abstract iii
Table of Contents iv
List of Figures vi
List of Tables ix
Chapter 1 Introduction 1
  1.1 Motivation 1
  1.2 Thesis Organization 3
Chapter 2 Background and Related Work 4
  2.1 Representative DNN Models 4
  2.2 Representative DNN Hardware Accelerators 8
Chapter 3 Classification of DNN Computation 11
  3.1 CNN Algorithm 11
  3.2 Data Partitioning 15
  3.3 Classification of DNN Architectures 20
  3.4 Processing Elements 25
  3.5 Sparsity Considerations 26
Chapter 4 DNN Analysis 27
  4.1 Flow Overview 27
  4.2 Accuracy Analysis 28
  4.3 Data Partitioning Analysis 34
  4.4 Memory Access Simulator 40
Chapter 5 DNN Hardware Accelerator Design 45
  5.1 DNN Hardware Accelerator 45
    5.1.1 Input SRAM 46
    5.1.2 Input Buffer 47
    5.1.3 Processing Element 51
    5.1.4 Weight SRAM 53
    5.1.5 Output SRAM 56
    5.1.6 Activation 57
    5.1.7 Pooling 57
    5.1.8 System Controller (Finite-State Machine) States 59
  5.2 Dual-Precision DNN Hardware Accelerator 64
Chapter 6 Results and Analysis 67
  6.1 Logic Synthesis Results and Analysis 67
  6.2 Comparison with Prior Work 72
Chapter 7 Conclusions and Future Work 75
  7.1 Conclusions 75
  7.2 Future Work 75
References 76
References
[1] J. Lemley, S. Bazrafkan, and P. Corcoran, “Deep Learning for Consumer Devices and Services,” IEEE Consumer Electronics Mag., pp. 48-56, Apr. 2017.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, MA, USA, 2016.
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, pp. 436-444, May 2015.
[4] (LeNet) Y. LeCun, et al., “Gradient-Based Learning Applied to Document Recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2012, pp. 1097-1105.
[6] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. Int. Conf. Learn. Represent. (ICLR), May 2015, pp. 1-14.
[7] (AlexNet) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012.
[8] (VGG) K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” Proc. Intl. Conf. Learning Representations (ICLR), Sept. 2015.
[9] (GoogLeNet) C. Szegedy, et al., “Going Deeper with Convolutions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015.
[10] (ILSVRC) O. Russakovsky, et al., “ImageNet Large Scale Visual Recognition Challenge,” Intl. Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211-252, Dec. 2015.
[11] (ImageNet) J. Deng, et al., “ImageNet: A Large-Scale Hierarchical Image Database,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[12] (ResNet) K. He, et al., “Deep Residual Learning for Image Recognition,” CVPR, 2016.
[13] (ResNeXt) S. Xie, et al., “Aggregated Residual Transformations for Deep Neural Networks,” CVPR, 2017.
[14] (R-CNN) R. Girshick, et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” CVPR, 2014.
[15] (Fast R-CNN) R. Girshick, “Fast R-CNN,” arXiv:1504.08083; Proc. Intl. Conf. Computer Vision (ICCV), 2015.
[16] (SSD) W. Liu, et al., “SSD: Single Shot MultiBox Detector,” Proc. European Conf. Computer Vision (ECCV), 2016.
[17] (DaDianNao) Y. Chen, et al., “DaDianNao: A Machine Learning Supercomputer,” Proc. IEEE/ACM Intl. Symp. Microarchitecture (MICRO), pp. 609-622, 2014.
[18] (ShiDianNao) Z. Du, et al., “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” Proc. Intl. Symp. Computer Architecture (ISCA), pp. 92-104, 2015.
[19] (Cambricon-X) S. Zhang, et al., “Cambricon-X: An Accelerator for Sparse Neural Networks,” MICRO, 2016.
[20] (TPU) N. P. Jouppi, et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” Proc. Intl. Symp. Computer Architecture (ISCA), 2017.
[21] (Eyeriss) Y.-H. Chen, et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 1, pp. 127-138, Jan. 2017.
[22] (PS-ConvNet) B. Moons and M. Verhelst, “An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS,” IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 4, pp. 903-914, Apr. 2017.
[23] (Origami) L. Cavigelli and L. Benini, “Origami: A 803 GOp/s/W Convolutional Network Accelerator,” IEEE Trans. Circuits and Systems for Video Technology (TCSVT), DOI 10.1109/TCSVT, 2017.
[24] (Angel-Eye) Guo, et al., “Angel-Eye: A Complete Design Flow for Mapping CNN onto Embedded FPGA,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems (TCAD), DOI 10.1109/TCAD, 2017.
[25] Qiu, et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” Proc. ACM/SIGDA Intl. Symp. Field-Programmable Gate Arrays (FPGA), pp. 26-35, Feb. 2016.
[26] (EIE) Han, et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” Proc. Intl. Symp. Computer Architecture (ISCA), pp. 243-254, 2016.
[27] (DNA) F. Tu, et al., “Deep Convolutional Neural Network Architecture with Reconfigurable Computation Patterns,” IEEE Trans. VLSI Systems (TVLSI), vol. 25, no. 8, pp. 2220-2233, Aug. 2017.
[28] (ZeNA) D. Kim, J. Ahn, and S. Yoo, “ZeNA: Zero-Aware Neural Network Accelerator,” IEEE Design & Test, DOI 10.1109/MDAT, 2017.
[29] M. Horowitz, “Energy Table for 45 nm Process,” Stanford VLSI Wiki. [Online]. Available: https://sites.google.com/site/seecproject
[30] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” arXiv preprint arXiv:1703.09039, 2017.
[31] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-Layer CNN Accelerators,” in MICRO, 2016.
[32] (Cnvlutin) J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” in ISCA, 2016.
[33] (Caffe) Y. Jia, “Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding,” http://caffe.berkeleyvision.org/, 2013.
[34] (Caffe) Y. Jia, et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” Proc. ACM Intl. Conf. Multimedia, pp. 675-678, 2014.
[35] (TensorFlow) M. Abadi, et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” arXiv:1603.04467v2, 2016.
[36] Y. Jia, “Caffe Model Zoo,” https://github.com/BVLC/caffe/wiki/Model-Zoo, 2015.
Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Printed copies
Availability information for printed copies is relatively complete for academic year 102 (2013-2014) and later. To look up availability information for printed copies from academic year 101 (2012-2013) or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available.