Thesis record etd-0706121-070307
Title page for etd-0706121-070307

Title (Chinese): 多教師知識蒸餾方法應用於自我蒸餾模型訓練
Title (English): Multi-Teacher Knowledge Distillation Method On Self-Distillation Model Training
Department:
Year, semester:
Language:
Degree:
Number of pages: 65
Author:
Advisor:
Convenor:
Advisory Committee:
Date of Exam: 2021-07-23
Date of Submission: 2021-08-06
Keywords: Model Compression, Knowledge Distillation, Multi-Teacher Knowledge Distillation, Self-Distillation, Deep Learning, Image Recognition
Statistics: This thesis has been viewed 1797 times and downloaded 0 times.
Chinese Abstract
  In recent years, with the rise of convolutional neural networks, many image recognition techniques have been advancing rapidly. However, the pursuit of higher recognition accuracy keeps increasing the computational cost of models, so that they either cannot run on many small devices or take too long to compute. Model compression is one way to address this problem, and among model compression algorithms there are various forms of knowledge distillation, such as data-free knowledge distillation, multi-teacher knowledge distillation, cross-modal knowledge distillation, and self-distillation.
  In 2019, Zhang, L. et al. [1] proposed a new self-distillation method that splits a model into blocks: the shallower blocks reach lower accuracy, while the deeper blocks reach higher accuracy, which gives the model a very flexible structure. Multi-teacher knowledge distillation, in turn, absorbs the knowledge of multiple teacher models to help train the student model.
  Building on the above, this study proposes a multi-teacher knowledge distillation method applied to self-distillation model training. The method treats the knowledge of the different blocks as different teachers and applies multi-teacher knowledge distillation to the student model. In addition, this study considers that at different training stages the student model should learn from the multiple teachers with different weights, and that as the student model improves, its own predictions should also be taken into account. We therefore propose a new adaptive weighting scheme and, in combination with treating the student model as one of the teachers, remove the need to hand-tune the teachers' learning rate in the knowledge distillation algorithm, which reduces the time spent on experimental design.
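  A schematic form of the combined objective described above, reconstructed only from this abstract and the loss terms listed in the table of contents (cross entropy, KD loss, feature loss, adaptive weight); the notation, in particular the adaptive weights \(\alpha_i\), is illustrative rather than the thesis's exact formulation:

\[
\mathcal{L}_{\text{total}}
= \mathcal{L}_{\text{CE}}(z_s, y)
+ \sum_{i=1}^{K+1} \alpha_i \, T^{2}\,
  \mathrm{KL}\!\left(\sigma(z_i/T)\,\middle\|\,\sigma(z_s/T)\right)
+ \lambda \sum_{i=1}^{K} \alpha_i \,\lVert f_i - f_s \rVert_2^{2},
\qquad \sum_{i=1}^{K+1} \alpha_i = 1,
\]

where \(z_s\) and \(f_s\) are the student's logits and features, \(z_i\) and \(f_i\) come from the \(i\)-th teacher block, the student itself counts as the \((K+1)\)-th teacher, \(\sigma\) is the softmax at temperature \(T\), and the weights \(\alpha_i\) are adapted during training instead of being fixed by hand.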
  Finally, in the experiments we studied the effect of different numbers of teacher models and the degree of knowledge transfer under different amounts of training data. Regarding the number of teachers, our method improves significantly over the averaging method with both a single teacher and multiple teachers. Regarding knowledge transfer under different amounts of data, our method also transfers knowledge better than the averaging method when less data is available. These results show that treating the student model as an additional teacher clearly helps knowledge distillation.
Abstract
  In recent years, the pursuit of higher model accuracy has kept increasing the amount of computation a model requires, with the result that many models cannot run on small devices. Model compression is one way to address this problem, and model compression algorithms include different forms of knowledge distillation, such as multi-teacher knowledge distillation and self-distillation.
  Zhang, L. et al. [1] proposed a new self-distillation method in 2019. They divided the model into different blocks: the shallower blocks reach lower accuracy, while the deeper blocks reach higher accuracy, which gives the model a very flexible structure. A multi-teacher knowledge distillation algorithm, in turn, can draw on the knowledge of multiple teacher models to train the student model.
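  As a rough illustration of this block-splitting idea, the following PyTorch sketch attaches an auxiliary classifier to each stage of a ResNet-50 backbone so that every block produces its own logits and features; the class name, channel sizes, and classifier heads are assumptions made for illustration, not the architecture actually used in this thesis.

import torch
import torch.nn as nn
from torchvision import models

class SelfDistilledResNet50(nn.Module):
    # Illustrative sketch: split a ResNet-50 into four blocks and give each
    # block its own classifier, so shallow blocks act as weaker classifiers
    # and deep blocks as stronger ones, in the spirit of Zhang et al. [1].
    def __init__(self, num_classes=100):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        channels = [256, 512, 1024, 2048]  # ResNet-50 stage output channels
        self.heads = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(c, num_classes))
            for c in channels])

    def forward(self, x):
        x = self.stem(x)
        logits, features = [], []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            features.append(x)      # per-block feature maps
            logits.append(head(x))  # per-block predictions
        return logits, features     # logits[-1] comes from the deepest (most accurate) block

  Training each head against both the ground-truth labels and the deepest head's soft labels is what lets the shallow blocks "distill themselves" from the deeper ones.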
  In summary, this research proposes a multi-teacher knowledge distillation method applied to self-distillation model training. Our method treats the knowledge of the different blocks as different teachers and applies a multi-teacher knowledge distillation algorithm to the student model. In addition, we consider that at different training stages the student model should weight the multiple teachers it learns from differently. We therefore propose a new adaptive weighting method and, combined with treating the student model itself as a teacher, the algorithm no longer needs a hand-set learning rate for the teacher models, which reduces the time needed for experimental design.
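  A minimal sketch of how such a loss could be assembled in PyTorch, with the student added as one more teacher; the adaptive-weight rule shown here (a softmax over each teacher's fit to the current batch) is an assumption for illustration, since the thesis's exact weighting formula is not reproduced on this page.

import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels, T=4.0):
    # Hypothetical sketch: combine the hard-label loss with soft-label losses
    # from several teachers, treating the student's own (detached) predictions
    # as one more teacher. The weighting rule below is illustrative only.
    all_teachers = list(teacher_logits_list) + [student_logits.detach()]

    # Adaptive weights: teachers that fit the current batch better get larger
    # weights; the weights sum to 1 and are recomputed every batch.
    scores = torch.stack([-F.cross_entropy(t, labels) for t in all_teachers])
    weights = F.softmax(scores, dim=0)

    # Hard-label term plus temperature-scaled KL terms, one per teacher.
    loss = F.cross_entropy(student_logits, labels)
    for w, t in zip(weights, all_teachers):
        loss = loss + w * (T * T) * F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(t / T, dim=1),
            reduction="batchmean")
    return loss

  Here teacher_logits_list would hold the per-block outputs of a frozen, self-distilled teacher network (for example, the logits returned by the sketch above), and because the weights are recomputed from the data at every batch, no separate teacher learning rate has to be tuned by hand.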
  Finally, for different numbers of teacher models, whether with a single teacher or with multiple teachers, our method outperforms the averaging method. For the degree of knowledge transfer under different data volumes, our method also outperforms the averaging method when only a small amount of data is available. This shows that adding the student model as a teacher brings a significant improvement to knowledge distillation.
Table of Contents
Thesis Approval Form i
Thesis Open Access Authorization ii
Acknowledgments iii
Chinese Abstract iv
Abstract v
Table of Contents vi
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1. Research Background 1
1.2. Research Motivation 2
1.3. Research Objectives 3
Chapter 2 Literature Review 4
2.1. Convolutional Neural Networks for Image Recognition 4
2.1.1. LeNet 4
2.1.2. AlexNet 4
2.1.3. VGG 5
2.1.4. GoogLeNet 6
2.1.5. ResNet 6
2.1.6. ResNeXt 7
2.1.7. SENet 7
2.2. Model Compression 8
2.2.1. Lightweight Models 8
2.2.2. Pruning 10
2.2.3. Low-Rank Factorization 11
2.2.4. Knowledge Distillation 12
2.3. Knowledge Distillation 13
2.3.1. Multi-Teacher Knowledge Distillation 13
2.3.2. Self-Distillation 15
Chapter 3 Methodology 17
3.1. Model Architecture 17
3.2. Loss 1: Cross Entropy 19
3.3. Loss 2: KD Loss 19
3.4. Loss 3: Feature Loss 20
3.5. Loss 4: Adaptive Weight 21
3.6. Total Loss 22
Chapter 4 Experimental Results 24
4.1. Datasets 24
4.1.1. CIFAR-100 24
4.1.2. Tiny ImageNet 25
4.2. Experimental Design 26
4.3. Evaluation Method 26
4.4. Teacher Model Training (CIFAR-100) 27
4.4.1. ResNet50-SD 27
4.4.2. ResNet152-SD 28
4.5. Experimental Parameter Settings (CIFAR-100) 29
4.5.1. Multi-Teacher Knowledge Distillation - SMTKD 29
4.5.2. Multi-Teacher Knowledge Distillation - SMTKDS 31
4.6. Effect of the Number of Teacher Models (CIFAR-100) 31
4.6.1. Multi-Teacher Knowledge Distillation - MTKD(Avg) 32
4.6.2. Multi-Teacher Knowledge Distillation - SMTKD 33
4.6.3. Multi-Teacher Knowledge Distillation - SMTKDS 34
4.6.4. Comparison of Methods 36
4.7. Degree of Knowledge Transfer under Different Data Volumes (CIFAR-100) 37
4.7.1. ResNet50-SD 37
4.7.2. Multi-Teacher Knowledge Distillation - MTKD(Avg) 38
4.7.3. Multi-Teacher Knowledge Distillation - SMTKD 40
4.7.4. Multi-Teacher Knowledge Distillation - SMTKDS 41
4.7.5. Comparison of Methods 42
4.8. Teacher Model Training (Tiny ImageNet) 44
4.8.1. ResNet152-SD 44
4.9. Degree of Knowledge Transfer under Different Data Volumes (Tiny ImageNet) 45
4.9.1. Multi-Teacher Knowledge Distillation - MTKD(Avg) 45
4.9.2. Multi-Teacher Knowledge Distillation - SMTKDS 47
4.9.3. Comparison of Methods 48
4.10. Discussion of Experimental Results 50
Chapter 5 Conclusion 51
5.1. Conclusion 51
5.2. Future Work 52
References 53
References
1. Zhang, L., et al. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. in Proceedings of the IEEE International Conference on Computer Vision. 2019.
2. Hu, J., L. Shen, and G. Sun. Squeeze-and-excitation networks. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
3. Huang, G., et al. Densely connected convolutional networks. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
4. He, K., et al. Deep residual learning for image recognition. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
5. Song, L., et al., A deep multi-modal CNN for multi-instance multi-label image classification. IEEE Transactions on Image Processing, 2018. 27(12): p. 6025-6038.
6. Chen, Z.-M., et al. Multi-label image recognition with graph convolutional networks. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
7. Bochkovskiy, A., C.-Y. Wang, and H.-Y.M. Liao, YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934, 2020.
8. He, K., et al. Mask r-cnn. in Proceedings of the IEEE international conference on computer vision. 2017.
9. Law, H. and J. Deng. Cornernet: Detecting objects as paired keypoints. in Proceedings of the European Conference on Computer Vision (ECCV). 2018.
10. Zhang, H., et al. Co-occurrent features in semantic segmentation. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
11. Takikawa, T., et al. Gated-scnn: Gated shape cnns for semantic segmentation. in Proceedings of the IEEE International Conference on Computer Vision. 2019.
12. Howard, A.G., et al., Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
13. Zhang, X., et al. Shufflenet: An extremely efficient convolutional neural network for mobile devices. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
14. Ding, X., et al., Lossless CNN Channel Pruning via Gradient Resetting and Convolutional Re-parameterization. arXiv preprint arXiv:2007.03260, 2020.
15. Srinivas, S. and R.V. Babu, Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
16. Gupta, S., et al. Deep learning with limited numerical precision. in International Conference on Machine Learning. 2015.
17. Han, S., H. Mao, and W.J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
18. Romero, A., et al., Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
19. Wang, T., et al. Distilling object detectors with fine-grained feature imitation. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
20. Chen, H., et al. Data-free learning of student networks. in Proceedings of the IEEE International Conference on Computer Vision. 2019.
21. Fang, G., et al., Data-Free Adversarial Distillation. arXiv preprint arXiv:.11006, 2019.
22. You, S., et al. Learning from multiple teacher networks. in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.
23. Nguyen, L.T., K. Lee, and B. Shim, Stochasticity and Skip Connection Improve Knowledge Transfer.
24. Roheda, S., et al. Cross-modality distillation: A case for conditional generative adversarial networks. in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018. IEEE.
25. Tian, Y., D. Krishnan, and P. Isola, Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
26. Luan, Y., et al., MSD: Multi-Self-Distillation Learning via Multi-classifiers within Deep Neural Networks. arXiv preprint arXiv:.09418, 2019.
27. LeCun, Y., et al., Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 86(11): p. 2278-2324.
28. Krizhevsky, A., I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. in Advances in neural information processing systems. 2012.
29. Russakovsky, O., et al., Imagenet large scale visual recognition challenge. International journal of computer vision, 2015. 115(3): p. 211-252.
30. Simonyan, K. and A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
31. Szegedy, C., et al. Going deeper with convolutions. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
32. Xie, S., et al. Aggregated residual transformations for deep neural networks. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
33. Cheng, Y., et al., A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
34. Choudhary, T., et al., A comprehensive survey on model compression and acceleration. Artificial Intelligence Review, 2020: p. 1-43.
35. Iandola, F.N., et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
36. Li, H., et al., Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
37. Krizhevsky, A. and G. Hinton, Learning multiple layers of features from tiny images. 2009.
38. Zhao, C., et al. Variational convolutional neural network pruning. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
39. Jaderberg, M., A. Vedaldi, and A. Zisserman, Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
40. Tai, C., et al., Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.
41. Hinton, G., O. Vinyals, and J. Dean, Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
42. Zhang, J., et al. Class-incremental learning via deep model consolidation. in The IEEE Winter Conference on Applications of Computer Vision. 2020.
43. Gou, J., et al., Knowledge Distillation: A Survey. arXiv preprint arXiv:2006.05525, 2020.
44. paperswithcode. Available from: https://paperswithcode.com/.
45. Kwon, J., et al., ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks. arXiv preprint arXiv:2102.11600, 2021.
46. Le, Y. and X. Yang, Tiny imagenet visual recognition challenge. CS 231N, 2015. 7: p. 7.
47. Rame, A., R. Sun, and M. Cord, MixMo: Mixing Multiple Inputs for Multiple Outputs via Deep Subnetworks. arXiv preprint arXiv:.06132, 2021.
48. Tseng, C.-H., et al., UPANets: Learning from the Universal Pixel Attention Networks. arXiv preprint arXiv:.08640, 2021.

Fulltext
This electronic fulltext is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China (Taiwan), and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: release time defined by the author (user define)
Available:
Campus: available
Off-campus: available


Printed copies
Public access information for printed theses is relatively complete only from academic year 102 (2013) onward. To inquire about the access status of printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
