National Sun Yat-sen University, thesis/dissertation, Semi-supervised Cyber Attack Relation Extraction via Self-Training

Title	Semi-supervised Cyber Attack Relation Extraction via Self-Training (應用半監督式自訓練演算法提取網路攻擊關聯)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 110	Language	Chinese
Degree	Master	Number of pages	74
Author	Yu-Chin Chou (周郁津)
Advisor	Chen, Chia-Mei (陳嘉玫)
Convenor	Lai, Gu-Hsin (賴谷鑫)
Advisory Committee	Yang, Huei-Fang (楊惠芳); Ou, Ya-Hui (歐雅惠); Lin, Hsiao-Chung (林孝忠)
Date of Exam	2022-07-19	Date of Submission	2022-08-18
Keywords	APT, CTI, NLP, Relation Extraction, Self-Training, Pseudo Label, Cyber Security Knowledge Graph
Statistics	This thesis has been browsed 286 times and downloaded 0 times.

Abstract
With the rapid development of emerging hardware and software technologies, networks play an increasingly important role in enterprises, and this heavy dependence on networks also raises the severity of cyber security threats. Most attacks today are Advanced Persistent Threat (APT) attacks. As APT groups attack more often and with more sophisticated techniques, enterprises find these attacks difficult to prevent. Making effective use of cyber threat intelligence (CTI) lets enterprises deploy and plan comprehensive defensive countermeasures in advance so that they can respond to complex APT attacks. With the rapid growth of CTI in recent years, many threat intelligence platforms and open-source intelligence sources now provide attack information on malware, vulnerabilities, and threat behaviors. However, the volume of CTI produced every day creates a big data problem: collecting and analyzing it manually consumes a great deal of manpower and time. How to filter out and apply the needed CTI in a short time has therefore become a crucial issue for enterprises and organizations. To address these issues, this study proposes a cyber attack relation extraction system called "CyberRex" (Cyber Attack Relation Extraction). Within a semi-supervised learning framework, it uses natural language processing (NLP) and the pre-trained BERT model to obtain sentence features and extract the relations between entities in cyber attack events. Pseudo labels are generated through self-training and added to the training dataset for iterative training, which addresses the shortage of labeled training data. Finally, the system produces a list of entity-relation triples and a cyber security knowledge graph that security analysts can use for visual querying and research.
The experimental results show that CyberRex's relation extraction model achieves an F1-score of 73%, which indicates that it can make effective use of unlabeled data to identify cyber attack relations, helping enterprises quickly and comprehensively understand the key points of cyber threat intelligence and trace attacks to their source.
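The self-training loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: a toy nearest-centroid classifier stands in for the BERT-based relation classifier, and the names `centroid_fit`, `predict_proba`, and `self_train`, the 0.9 confidence threshold, the round limit, and the relation labels "uses" and "targets" are all assumptions made for the example.

```python
import math
from collections import Counter


def centroid_fit(samples):
    """Fit one centroid per label (toy stand-in for fine-tuning a classifier)."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_proba(centroids, features):
    """Softmax over negative squared distances to each label centroid."""
    scores = {
        label: -sum((a - b) ** 2 for a, b in zip(features, centroid))
        for label, centroid in centroids.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    """Iteratively add confident pseudo-labeled samples to the training set.

    threshold and max_rounds are illustrative values, not taken from the thesis.
    """
    train = list(labeled)
    pool = list(unlabeled)
    model = centroid_fit(train)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for features in pool:
            probs = predict_proba(model, features)
            label, confidence = max(probs.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                confident.append((features, label))  # accept as pseudo label
            else:
                remaining.append(features)  # keep for a later round
        if not confident:
            break  # nothing new to learn from
        train.extend(confident)
        pool = remaining
        model = centroid_fit(train)  # retrain on labeled + pseudo-labeled data
    return model, train, pool


# Hypothetical 2-D "sentence features" with made-up relation labels.
labeled = [([0.0, 0.0], "uses"), ([4.0, 4.0], "targets")]
unlabeled = [[0.2, 0.1], [3.9, 4.2], [2.0, 2.0]]
model, train, pool = self_train(labeled, unlabeled)
print(len(train), pool)
```

In this toy run, the two unlabeled points near a centroid are pseudo-labeled and added to the training set, while the ambiguous midpoint never reaches the threshold and stays unlabeled. A stricter threshold admits fewer but cleaner pseudo labels, and a looser one admits more samples at the cost of more label noise.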

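Table of Contents
Thesis Approval Form
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
Chapter 2 Literature Review
  2.1 Background and Related Work
  2.2 Cyber Threat Intelligence (CTI)
  2.3 Natural Language Processing (NLP)
    2.3.1 Relation Extraction Models
  2.4 Neural Networks
    2.4.1 BERT
  2.5 Semi-supervised Learning
    2.5.1 Self-Training
    2.5.2 Model-Agnostic Meta-Learning (MAML)
Chapter 3 Methodology
  3.1 Data Collection Module
  3.2 Sentence Tokenization Module
  3.3 Language Translation Module
  3.4 Entities and Relation Annotation Module
  3.5 Relation Extraction Module
  3.6 Pseudo Label Generation Module
Chapter 4 System Evaluation
  4.1 Experiment 1: Model Evaluation
    4.1.1 Comparison of Semantic Feature Extraction Methods
    4.1.2 Effect of Different Optimizers and Parameter Settings on System Performance
  4.2 Experiment 2: Pseudo Label Generation Evaluation
    4.2.1 Use of the Unlabeled Dataset
    4.2.2 Pseudo Label Selection Mechanism
  4.3 Experiment 3: Comparison with Semi-supervised Baseline Models
  4.4 Experiment 4: Cyber Security Relation Extraction
    4.4.1 CyberRex Cyber Security Relation Extraction
    4.4.2 Comparison of Relation Extraction Systems
  4.5 Summary
Chapter 5 Contributions and Future Work
References
Appendix 1

List of Figures
Figure 2-1. ATT&CK Enterprise Matrix
Figure 2-2. Neural network architecture
Figure 2-3. BERT pre-trained model architecture
Figure 2-4. The pre-training and fine-tuning stages
Figure 2-5. Steps of the self-training method
Figure 2-6. Steps of the MAML algorithm
Figure 3-1. System architecture
Figure 3-2. Data collection flow
Figure 3-3. The Brat annotation tool
Figure 3-4. Relation extraction model architecture
Figure 3-5. Pseudo label generation module architecture
Figure 4-1. Line chart of 10 incremental training iterations
Figure 4-2. Cyber security knowledge graph

List of Tables
Table 3-1. Example Chinese sentences
Table 3-2. Entity type definitions
Table 4-1. Confusion matrix
Table 4-2. Entity type counts by source
Table 4-3. Relation counts by source
Table 4-4. Experimental environment
Table 4-5. Summary of experiments
Table 4-6. Dataset splits and relation types
Table 4-7. Ratio and number of labeled and unlabeled training samples in the TACRED training set
Table 4-8. Ratio and number of labeled and unlabeled training samples in the CyberRex training set
Table 4-9. Comparison of semantic extraction methods (TACRED) (unit: %)
Table 4-10. Comparison of semantic extraction methods (CyberRex) (unit: %)
Table 4-11. Proportion and number of unlabeled data in the CyberRex training set
Table 4-12. Comparison of optimizers (unit: %)
Table 4-13. Comparison of numbers of epochs (unit: %)
Table 4-14. Best parameter settings for the relation extraction model
Table 4-15. TACRED with different proportions of unlabeled training samples (unit: %)
Table 4-16. CyberRex with different proportions of unlabeled training samples (unit: %)
Table 4-17. TACRED: average number of labeled and unlabeled training samples and selected pseudo labels
Table 4-18. CyberRex: average number of labeled and unlabeled training samples and selected pseudo labels
Table 4-19. 10 incremental training iterations (unit: %)
Table 4-20. F1-score of models trained with different thresholds (unit: %)
Table 4-21. Comparison of semi-supervised learning methods (TACRED)
Table 4-22. Comparison of semi-supervised learning methods (CyberRex)
Table 4-23. Case examples
Table 4-24. Colors assigned to entity types
Table 4-25. Comparison of relation extraction systems
Table 4-26. Examples from CyberRex and CARE
Table 4-27. Article length and annotation time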


Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release date. Available on campus: 2027-08-18. Available off campus: 2027-08-18.
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: 2027-08-18



Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452, Thesis Format Review Team │ Development and operations: Knowledge Innovation Division, LIS