Thesis access permission: user-defined availability
Available:
Campus: available
Off-campus: available
Title: Design and Implementation of a Highly Reconfigurable Multi-Precision Neural Network Hardware Accelerator (具高重組性之多位元精確度卷積神經網路硬體設計與實作)
Department:
Year, semester:
Language:
Degree:
Number of pages: 78
Author:
Advisor:
Convenor:
Advisory Committee:
Date of Exam: 2019-09-04
Date of Submission: 2019-08-29
Keywords: deep neural network (DNN), neural network hardware accelerator, dynamically reconfigurable neural network architecture, multi-precision DNN accelerator, convolution
Statistics: The thesis has been viewed 5676 times and downloaded 1 time.
Abstract (Chinese)
Deep neural network (DNN) hardware accelerators aim to handle the massive computation of DNNs so that embedded devices can achieve low-power, real-time processing. However, the limited memory capacity of embedded edge devices forces large amounts of data movement during DNN computation, which poses a major challenge for low-power, real-time designs. DNN algorithms can reduce the bit width of the arithmetic units without hurting image-classification accuracy, but to preserve that accuracy the required bit width may differ from layer to layer. A fixed-bit-width accelerator therefore has to be designed for the worst-case bit width, or the final classification accuracy degrades. To improve efficiency at low bit precisions while meeting the power and area constraints of embedded devices, this thesis proposes three main design considerations and methods: (1) an analysis method that jointly considers image tiling, data reuse, computational parallelism, and SRAM size to decide how images are preprocessed and in what order the convolutions are computed; the final design adopts output reuse with a 32x32 tile size, which greatly reduces transfers between internal and external memory and thus the power consumption; (2) a processing element with dynamically adjustable bit precision, built around 8-bit input data and 2-bit kernel weights, which supports a different precision in each layer and raises throughput without losing recognition accuracy; (3) an analysis, taking the bandwidth between internal and external memory into account, of which bit precision each convolution layer needs for a given kernel size and stride so that the accelerator runs most efficiently. Combining these considerations, the proposed accelerator supports most convolution operations in current use, and, when data-transfer time is ignored, its utilization for convolution layers of different kernel sizes and strides is close to 100%. The accelerator is implemented in Verilog and synthesized in a TSMC 40 nm process. It operates at 200 MHz with 150 KB of internal memory, using 8-bit data and 2-bit weights, and achieves 1073 Gop/s on VGG16 and 608 Gop/s on AlexNet.
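The trade-off behind design point (1) can be made concrete with a rough traffic model. The sketch below is a minimal illustration, not the thesis's actual analysis: the layer shape, the no-padding convolution, and the assumption that an output-reuse schedule re-fetches the needed input patch and all kernels once per tile while writing each output exactly once are all assumptions of this example.

```python
# Rough external-memory traffic estimate for one conv layer under an
# output-reuse schedule: each 32x32 output tile keeps its partial sums
# on chip, so outputs are written once, while the input patch and the
# kernels are re-fetched per tile.  Illustrative only.

def conv_traffic_output_reuse(H, W, C_in, C_out, K, stride=1,
                              tile=32, act_bits=8, wgt_bits=2):
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    n_tiles = -(-out_h // tile) * -(-out_w // tile)   # ceil division

    # Input patch needed to produce one tile x tile output block.
    in_tile = (tile - 1) * stride + K
    input_bytes = n_tiles * in_tile * in_tile * C_in * act_bits / 8

    # Kernels are re-read once per tile in this simple schedule.
    weight_bytes = n_tiles * K * K * C_in * C_out * wgt_bits / 8

    # Output reuse: every output value leaves the chip exactly once.
    output_bytes = out_h * out_w * C_out * act_bits / 8

    return input_bytes + weight_bytes + output_bytes

# Example: a VGG16-like 3x3 layer, 224x224x64 input, 64 output channels.
print(f"{conv_traffic_output_reuse(224, 224, 64, 64, 3) / 2**20:.1f} MiB")
```

Varying the tile size and the reuse choice (input, weight, or output reuse) in such a model is one way to see how the 32x32 output-reuse design point keeps external-memory traffic low for a given SRAM budget.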
Abstract |
Hardware acceleration of deep neural networks (DNNs) aims to cope with their enormous computational complexity so that embedded systems can achieve low-power, real-time processing. However, the memory capacity of embedded devices is limited, which causes a large amount of data transfer during DNN computation and makes speeding up DNN inference on edge devices a major challenge. The bit width of DNN operations can be reduced without affecting classification accuracy, but to prevent accuracy loss the required bit width varies significantly across layers. A fixed-bit-width accelerator applied to all DNN layers therefore either offers limited benefit, if designed for the worst-case bit width, or degrades accuracy. To improve the efficiency of operations with different bit widths on low-power, area-limited embedded systems, this thesis proposes three major design methods: (a) an analysis method that jointly considers feature-map tiling, data reuse, parallelism, and SRAM size; (b) a DNN accelerator built from processing elements operating on 8-bit input data and 2-bit kernel weights that efficiently supports different precision modes in different layers; (c) an analysis, given the bandwidth between external and internal memory, of the best bit width for each kernel size and stride so that the DNN operations are accelerated efficiently. Combining these considerations, we design a DNN hardware accelerator that supports almost all convolution operations in current use. The utilization of its processing elements is close to 100% for convolution layers of different kernel sizes and strides when data-transfer time is neglected. The accelerator is implemented in Verilog and synthesized with TSMC 40 nm process technology. The design achieves a maximum frequency of 200 MHz with an internal SRAM of only 150 KB. Using 8-bit data and 2-bit kernel weights, the performance is 1,173 Gop/s on VGG-16 and 608 Gop/s on AlexNet.
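One way to read design point (b) is that a narrow 8-bit-by-2-bit multiplier can still serve layers that need wider weights if each weight is processed as 2-bit slices over several cycles. The sketch below models that bit-slicing idea in software; it is a hypothetical illustration of the general technique (as used in bit-serial designs such as Stripes or UNPU), not the thesis's actual processing-element microarchitecture, and the function names are invented for this example.

```python
# Model of a multi-precision MAC built from an 8-bit x 2-bit multiplier:
# wider signed weights are processed as 2-bit slices over multiple cycles,
# with each narrow partial product shifted into an accumulator.
# Hypothetical illustration of bit slicing, not the thesis's RTL.

def split_weight(w, bits):
    """Split a signed `bits`-wide weight into 2-bit slices, LSB first.
    The top slice is signed (two's complement); the rest are unsigned."""
    assert bits % 2 == 0 and -(1 << bits - 1) <= w < (1 << bits - 1)
    u = w & ((1 << bits) - 1)              # two's-complement encoding
    slices = [(u >> i) & 0b11 for i in range(0, bits, 2)]
    if slices[-1] >= 2:                    # re-sign the top slice
        slices[-1] -= 4
    return slices

def mac_multi_precision(activation, weight, weight_bits):
    """Multiply an 8-bit activation by a `weight_bits`-wide weight using
    only 8-bit x 2-bit products, as a serial shift-and-accumulate."""
    acc = 0
    for i, s in enumerate(split_weight(weight, weight_bits)):
        acc += (activation * s) << (2 * i)   # narrow product, then shift
    return acc

# Sanity check against a full-precision multiply.
for a in range(-128, 128, 17):
    for w in range(-128, 128, 13):
        assert mac_multi_precision(a, w, 8) == a * w
```

Under this model a 2-bit (ternary) weight layer finishes in a single pass and keeps the full throughput of the narrow multiplier, while an 8-bit-weight layer takes four passes, which is the kind of per-layer precision/throughput trade-off the accelerator is designed to exploit.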
Table of Contents
Thesis Verification Letter
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
  1.1  Research Motivation
  1.2  Thesis Outline
Chapter 2  Background and Related Work
  2.1  Representative DNN Models
    2.1.1  LeNet
    2.1.2  AlexNet
    2.1.3  VGG
    2.1.4  Inception v1-v3
    2.1.5  ResNet, ResNeXt
    2.1.6  MobileNet
    2.1.7  U-Net
    2.1.8  YOLO Series
  2.2  Representative DNN Hardware Accelerators
    2.2.1  SCSD (Single Cycle Single Data)
    2.2.2  MCSD (Multiple Cycle Single Data)
    2.2.3  SCMD (Single Cycle Multiple Data)
Chapter 3  DNN Analysis
  3.1  Data Tiling Analysis
    3.1.1  Tiles of Different Shapes and Sizes
    3.1.2  Parallel Computation
    3.1.3  Data Reuse
    3.1.4  Relationship between Data Reuse and Parallelism
  3.2  Analysis of Processing-Element Multipliers for Different Kernel Sizes
  3.3  Analysis of Bit-Level (SCMD) Multi-Precision Processing-Element Design
    3.3.1  Precision Selection Considerations
    3.3.2  Parallelization Considerations
    3.3.3  SRAM Size Analysis Considering Parallelism and Multi-Precision Design
  3.4  Line Buffer vs. SRAM Access Analysis
  3.5  Dataflow Throughput Analysis
Chapter 4  DNN Hardware Accelerator Design
  4.1  DNN Hardware Design
  4.2  Buffer Design
    4.2.1  Input Buffer
    4.2.2  Kernel Buffer
    4.2.3  Output Buffer
  4.3  Line Buffer Design
  4.4  Processing Element
  4.5  Dataflow Design
Chapter 5  Results and Analysis
  5.1  Synthesis Results and Analysis
  5.2  Comparison with Related Work
Chapter 6  Conclusion and Future Work
  6.1  Conclusion
  6.2  Future Work
References
References
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, 2015.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, 2012, pp. 1097-1105.
[4] K. Fukushima and S. Miyake, "Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition," IEEE Transactions on Systems, Man, and Cybernetics, 1983, pp. 267-285.
[5] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," ICLR, 2014.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, 2015, pp. 211-252.
[9] Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[11] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5987-5995.
[12] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 448-456.
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818-2826.
[14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861, 2017.
[15] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[17] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[18] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
[19] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," Annual International Symposium on Computer Architecture (ISCA), 2015.
[20] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, 2017.
[21] L. Cavigelli and L. Benini, "Origami: A 803-GOp/s/W convolutional network accelerator," IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[22] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017.
[23] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks," IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017.
[24] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," International Symposium on Microarchitecture (MICRO), 2016.
[25] J. Albericio, P. Judd, A. Delmás, S. Sharify, and A. Moshovos, "Bit-pragmatic deep neural network computing," International Symposium on Microarchitecture (MICRO), 2017.
[26] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H. Yoo, "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE Journal of Solid-State Circuits, 2019.
[27] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, "Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," International Symposium on Computer Architecture (ISCA), 2018.
[28] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu, L. Liu, and S. Wei, "A high energy efficient reconfigurable hybrid neural network processor for deep learning applications," IEEE Journal of Solid-State Circuits, 2018.
[29] S. Yin, P. Ouyang, J. Yang, T. Lu, X. Li, L. Liu, and S. Wei, "An energy-efficient reconfigurable processor for binary- and ternary-weight neural networks with flexible data bit width," IEEE Journal of Solid-State Circuits, 2019.
[30] M. Horowitz, "Energy table for 45nm process," Stanford VLSI wiki.
[31] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[32] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," arXiv:1609.07061, 2016.
[33] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv:1605.04711, 2016.
[34] S. Hsiao, P. Wu, J. Chen, and K. Chen, "Dual-precision acceleration of convolutional neural network computation with mixed input and output data reuse," IEEE International Symposium on Circuits and Systems (ISCAS), 2019.
Fulltext
The electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the text without authorization.
Printed copies
Availability information for printed copies is relatively complete from academic year 102 of the ROC calendar (2013-2014) onward. To look up availability information for printed copies from academic year 101 (2012-2013) or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available