Title page for thesis etd-0312122-173706
論文名稱 Title: 使用對數資料表示法之深度神經網路硬體加速器 / Deep Neural Network Hardware Accelerators using Logarithmic Data Representation
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 73
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2022-03-28
繳交日期 Date of Submission: 2022-04-12
關鍵字 Keywords: machine learning, deep neural network, neural network hardware accelerator, logarithmic quantization, convolution
統計 Statistics: The thesis has been viewed 257 times and downloaded 0 times.
中文摘要 Chinese Abstract (translated)
Deep neural networks (DNNs) have been widely applied to image recognition in recent years, with especially strong results in image classification and object detection. To broaden their applications, running inference on embedded edge devices for real-time processing has become a trend, but DNN computation on such devices faces two major problems. First, embedded edge devices have limited memory capacity, so computation incurs a large amount of data movement and therefore excessive power consumption. Second, network layers keep getting deeper to improve accuracy, which raises computational complexity. This thesis therefore applies logarithmic quantization to reduce the bit width of inputs and weights without retraining, quantizing each layer individually with one of three logarithm bases and keeping the average accuracy loss within 1.5%. The smaller bit width greatly reduces data movement and shrinks the on-chip memory, substantially lowering area. In addition, because the data are log-quantized, the processing element (PE) is redesigned to replace multipliers with adders and shifters, resolving the computational-complexity problem; compared with an 8-bit fixed-point multiplier, power consumption drops by 17%. After considering image tiling, data reuse, and parallelization strategies, we implement a DNN hardware accelerator in which each layer can compute with a different logarithm base. Synthesized in TSMC 40 nm technology at a 200 MHz clock, with 26.8 KB of on-chip memory and 5-bit inputs and weights, the accelerator achieves a peak performance of 51.2 GOPS, an area efficiency of 62.9 GOPS/MGE, and a power efficiency of 1190.7 GOPS/W when running the VGG16 model.
Abstract
Deep neural networks (DNNs) have been widely applied to image classification and object detection. Many DNN accelerators are implemented on resource-limited embedded systems, aiming to achieve real-time processing on edge devices, but two main problems arise during DNN computation. First, the memory capacity of embedded devices is limited, which causes a large number of DRAM accesses and high power consumption. Second, DNNs are made deeper to increase accuracy, which raises computational complexity. We therefore apply logarithmic quantization to encode inputs and weights into a smaller bit-width data representation without retraining, and each layer can choose one of three logarithm bases; this greatly reduces DRAM accesses while keeping the average classification accuracy loss below 1.5%. In the design of the processing elements (PEs), we replace multipliers with adders and shifters to reduce computational complexity, yielding a 17% power reduction compared with an 8-bit fixed-point multiplier-based design. After analyzing the impact of tiling, data reuse, and parallelism, we design a log-based DNN hardware accelerator that achieves a peak performance of 51.2 GOPS, an area efficiency of 62.9 GOPS/MGE, and a power efficiency of 1190.7 GOPS/W at a 200 MHz clock frequency.
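As a rough, non-authoritative sketch of the two ideas in the abstract (per-layer logarithmic quantization without retraining, and multiplier-free shift-and-add arithmetic in the PE), the Python model below illustrates how such a scheme can work. The candidate bases 2, 2^(1/2), and 2^(1/4), the 5-bit code layout (1 sign bit plus a 4-bit exponent), and all function names are illustrative assumptions, not the thesis's actual encoding.

```python
import numpy as np

# Illustrative model of per-layer logarithmic quantization and a
# multiplier-free multiply-accumulate. Assumptions (not from the thesis):
# candidate bases 2, sqrt(2), 2**(1/4); 5-bit codes = 1 sign bit + 4-bit
# signed exponent; activations are plain fixed-point integers.

BASES = [2.0, 2.0 ** 0.5, 2.0 ** 0.25]   # assumed per-layer base candidates
EXP_BITS = 4                              # 4-bit exponent + 1 sign bit = 5 bits

def log_quantize(x, base, exp_bits=EXP_BITS):
    """Round |x| to the nearest integer power of `base`, keeping the sign."""
    lo, hi = -(2 ** (exp_bits - 1)), 2 ** (exp_bits - 1) - 1
    mag = np.maximum(np.abs(x), 1e-38)            # avoid log(0)
    e = np.clip(np.round(np.log(mag) / np.log(base)), lo, hi).astype(int)
    return np.sign(x) * base ** e, e

def choose_base(weights):
    """Per-layer base selection: pick the base with the lowest quantization
    MSE on the pretrained weights, so no retraining is needed."""
    errs = [np.mean((weights - log_quantize(weights, b)[0]) ** 2)
            for b in BASES]
    return BASES[int(np.argmin(errs))]

def shift_add_mac(acc, x_int, w_sign, w_exp):
    """Multiplier-free MAC for a base-2 log-quantized weight
    w = w_sign * 2**w_exp: the multiply collapses to a barrel shift and
    the accumulate is a plain add/subtract (the adder-plus-shifter PE idea)."""
    p = x_int << w_exp if w_exp >= 0 else x_int >> (-w_exp)
    return acc + p if w_sign >= 0 else acc - p

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.05, size=1024)          # toy layer weights
    base = choose_base(w)                          # per-layer log base
    wq, e = log_quantize(w, base)
    print(f"chosen base {base:.4f}, mean |error| {np.mean(np.abs(w - wq)):.5f}")
    # Base-2 example: a shift plus a subtract replaces the multiplier.
    print(shift_add_mac(0, x_int=7 << 8, w_sign=-1, w_exp=-3))  # -> -224
```

For a non-integer base such as √2, the integer part of the exponent still maps to a shift, and the fractional part can be folded into a fixed shift-and-add constant (e.g. √2·x ≈ x + (x>>2) + (x>>3) + (x>>5)); how the thesis's PE actually realizes its three bases is not detailed in this listing.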
目次 Table of Contents
Thesis Approval Certificate i
Chinese Abstract ii
Abstract iii
Table of Contents iv
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Thesis Organization 3
Chapter 2 Background and Related Work 4
2.1 Representative DNN Models 4
2.1.1 AlexNet 4
2.1.2 VGG 5
2.1.3 Inception v1-v4 5
2.1.4 ResNet 6
2.2 Representative DNN Hardware Accelerators 7
2.2.1 Eyeriss 7
2.2.2 Flexflow 7
2.2.3 TPU 7
2.2.4 DNA and GNA 8
2.2.5 DNPU and UNPU 8
2.2.6 Thinker 9
2.3 DNN Hardware Accelerators Using Logarithmic Data Representation 11
2.3.1 NeuroMAX 11
2.3.2 QUEST 12
2.3.3 Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators [12] 13
2.3.4 Efficient Hardware Acceleration of CNNs using Logarithmic Data Representation with Arbitrary log-base [26] 14
Chapter 3 DNN Analysis 16
3.1 Bit-Width Analysis and Accuracy Considerations 17
3.1.1 Linear Quantization 17
3.1.2 Logarithmic Quantization 18
3.1.3 Comparison of the Two Quantization Methods 20
3.1.4 Accuracy Comparison of Logarithmic Quantization 22
3.2 Tiling Analysis 25
3.3 Data Reuse Analysis 27
3.3.1 Input Reuse 27
3.3.2 Output Reuse 28
3.3.3 Weight Reuse 28
3.3.4 Comparison of the Three Data Reuse Schemes 29
3.4 Parallelism Analysis 32
3.4.1 Input Channel Parallel (ICP) 32
3.4.2 Output Channel Parallel (OCP) 32
3.4.3 Parallelism Selection 33
Chapter 4 Hardware Accelerator Design 40
4.1 Accelerator Architecture 40
4.2 On-Chip Memory 41
4.2.1 Input Buffer 41
4.2.2 Weight Buffer 41
4.2.3 Output Buffer 42
4.3 Processing Element 43
4.4 Quantizer Unit 46
4.5 Pooling 46
4.6 Activation 47
4.7 System Controller Flow 48
Chapter 5 Results and Analysis 51
5.1 Logic Synthesis Results 51
5.1.1 Area Analysis 52
5.1.2 External Memory Access Analysis 53
5.1.3 Power Analysis 54
5.2 Comparison with Related Work 55
5.2.1 Comparison with Log-Quantized DNN Hardware Accelerators 56
5.2.2 Comparison with FPGA-Based Log-Quantized DNN Hardware Accelerators 57
5.2.3 Comparison with Linearly Quantized DNN Hardware Accelerators 58
Chapter 6 Conclusion and Future Work 59
6.1 Conclusion 59
6.2 Future Work 59
References 59
參考文獻 References
[1] Cavigelli, L., & Benini, L., “Origami: A 803-GOp/s/W convolutional network accelerator,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 27, No. 11, pp. 2461-2475, 2016
[2] Chang, H.-J., “Design and Implementation of Sparsity-Aware Low-Power Neural Network Hardware Accelerators,” 2019
[3] Chen, Y.-H., “Design and Implementation of a Highly Reconfigurable Multi-Precision Neural Network Hardware Accelerator,” 2019
[4] Chen, Y.-H., Krishna, T., Emer, J. S., & Sze, V., “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, Vol. 52, No. 1, pp. 127-138, 2016
[5] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L., “ImageNet: A large-scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009
[6] Fukushima, K., & Miyake, S., “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” Competition and Cooperation in Neural Nets, pp. 267-285, Springer, 1982
[7] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., & Dally, W. J., “EIE: Efficient inference engine on compressed deep neural network,” ACM SIGARCH Computer Architecture News, Vol. 44, No. 3, pp. 243-254, 2016
[8] He, K., Zhang, X., Ren, S., & Sun, J., “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016
[9] Horowitz, M., “Energy table for 45nm process,” In Stanford VLSI wiki, 2014
[10] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., & Borchers, A., “In-datacenter performance analysis of a tensor processing unit,” Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1-12, 2017
[11] Krizhevsky, A., Sutskever, I., & Hinton, G. E., “ImageNet classification with deep convolutional neural networks,” Commun. ACM, Vol. 60, No. 6, pp. 84–90, https://doi.org/10.1145/3065386, 2017
[12] Kudo, T., Ueyoshi, K., Ando, K., Hirose, K., Uematsu, R., Oba, Y., Ikebe, M., Asai, T., Motomura, M., & Takamaeda-Yamazaki, S., “Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators,” 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 237-243, 2018
[13] Lee, J., Kim, C., Kang, S., Shin, D., Kim, S., & Yoo, H.-J., “UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision,” 2018 IEEE International Solid-State Circuits Conference-(ISSCC), pp. 218-220, 2018
[14] Lu, W., Yan, G., Li, J., Gong, S., Han, Y., & Li, X., “Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks,” 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 553-564, 2017
[15] Miyashita, D., Lee, E. H., & Murmann, B., “Convolutional neural networks using logarithmic data representation,” arXiv preprint arXiv:1603.01025, pp. 1-8, 2016
[16] Qureshi, M. A., & Munir, A., “NeuroMAX: a high throughput, multi-threaded, log-based accelerator for convolutional neural networks,” 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1-9, 2020
[17] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., & Bernstein, M., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, Vol. 115, No. 3, pp. 211-252, 2015
[18] Shin, D., Lee, J., Lee, J., Lee, J., & Yoo, H.-J., “DNPU: An energy-efficient deep-learning processor with heterogeneous multi-core architecture,” IEEE Micro, Vol. 38, No. 5, pp. 85-93, 2018
[19] Simonyan, K., & Zisserman, A., “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014
[20] Sze, V., Chen, Y.-H., Yang, T.-J., & Emer, J. S., “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, Vol. 105, No. 12, pp. 2295-2329, 2017
[21] Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A., “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” Thirty-First AAAI Conference on Artificial Intelligence, 2017
[22] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A., “Going deeper with convolutions,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015
[23] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z., “Rethinking the inception architecture for computer vision,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016
[24] Tu, F., Yin, S., Ouyang, P., Tang, S., Liu, L., & Wei, S., “Deep convolutional neural network architecture with reconfigurable computation patterns,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 8, pp. 2220-2233, 2017
[25] Ueyoshi, K., Ando, K., Hirose, K., Takamaeda-Yamazaki, S., Hamada, M., Kuroda, T., & Motomura, M., “QUEST: Multi-purpose log-quantized DNN inference engine stacked on 96-MB 3-D SRAM using inductive coupling technology in 40-nm CMOS,” IEEE Journal of Solid-State Circuits, Vol. 54, No. 1, pp. 186-196, 2018
[26] Vogel, S., Liang, M., Guntoro, A., Stechele, W., & Ascheid, G., “Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base,” Proceedings of the International Conference on Computer-Aided Design, pp. 1-8, 2018
[27] Wu, P.-H., “Architecture Design and Implementation of Deep Neural Network Hardware Accelerators,” 2018
[28] Yan, J., Yin, S., Tu, F., Liu, L., & Wei, S., “GNA: Reconfigurable and efficient architecture for generative network acceleration,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 37, No. 11, pp. 2519-2529, 2018
[29] Yin, S., Ouyang, P., Tang, S., Tu, F., Li, X., Zheng, S., Lu, T., Gu, J., Liu, L., & Wei, S., “A high energy efficient reconfigurable hybrid neural network processor for deep learning applications,” IEEE Journal of Solid-State Circuits, Vol. 53, No. 4, pp. 968-982, 2017
[30] Yin, S., Ouyang, P., Yang, J., Lu, T., Li, X., Liu, L., & Wei, S., “An energy-efficient reconfigurable processor for binary- and ternary-weight neural networks with flexible data bit width,” IEEE Journal of Solid-State Circuits, Vol. 54, No. 4, pp. 1120-1136, 2018

電子全文 Fulltext
This electronic full text is licensed for individual, non-profit retrieval, reading, and printing for academic research purposes only. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: user-defined release date
開放時間 Available:
Campus: available for download from 2027-04-12
Off-campus: available for download from 2027-04-12

紙本論文 Printed copies
Information on the public availability of printed copies is relatively complete from academic year 102 (2013-2014) onward. To inquire about printed theses from academic year 101 (2012-2013) or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: 2027-04-12
