National Sun Yat-sen University (國立中山大學), thesis/dissertation: Extracting Cyber Attack Relations by Using Deep Learning (應用深度學習提取網路攻擊關聯)

Title	Extracting Cyber Attack Relations by Using Deep Learning (應用深度學習提取網路攻擊關聯)
Department	Department of Information Management (資訊管理學系)
Year, semester	Fall semester of Academic Year 110 (ROC calendar, i.e. 2021-2022)	Language	Chinese
Degree	Master	Number of pages	70
Author	Fang-Hsuan Hsu (許芳瑄)
Advisor	Chen, Chia-Mei (陳嘉玫)
Convenor	Kuo, Wen-Chung (郭文中)
Advisory Committee	Chiang, Ming-chao (江明朝); Lai, Gu-Hsin (賴谷鑫); Lin, Jun-Wu (林俊吾)
Date of Exam	2021-12-07	Date of Submission	2022-01-05
Keywords	APT, CTI, NLP, Relation Extraction, Pre-Trained Model

中文摘要
隨著網路科技的發達，網路攻擊數量逐年增加，現今以APT（Advanced Persistent Threat）攻擊為主，常使組織難以防範。為了提升組織資安防禦力，彙整網路威脅情資（Cyber Threat Intelligence，簡稱CTI）變得十分重要。其中，由於中國人口龐大，擁有豐富的 CTI ，因此視為重要的CTI來源。但由於中文CTI為非結構化資料，若透過人工處理是費時又費工的過程，因此，使用自然語言處理（Natural Language Processing，簡稱 NLP）將其擷取為結構化資料，藉此提取資訊輔助資安人員進行判斷。基於上述，本研究提出名為 CARE（Cyber Attack Relation Extraction）的網路攻擊關聯提取系統，目的為找出攻擊實體間的關聯性。首先，蒐集中國網路威脅情資並以不同的前處理方式處理中文文章。之後，利用 BERT 預訓練模型取得句子特徵，再經過深度學習的方式提取網路攻擊中實體與實體間的關聯。最後，將產生實體關聯列表並且將結果儲存至圖形資料庫，以協助資安人員分析網路攻擊達到加強自我防禦。實驗結果顯示， CARE 的關聯提取模型擁有 97% 的 F1-score ，證實有效判斷網路攻擊關聯，達到自動提取之目的。
Abstract
As network technology advances, the number of cyber attacks increases year by year. Advanced Persistent Threat (APT) attacks now predominate, and organizations find them difficult to defend against. To strengthen organizational security defenses, collecting and organizing Cyber Threat Intelligence (CTI) has become important. China, with its large population and abundant CTI, is regarded as an important CTI source. However, Chinese CTI is unstructured data, and processing it manually is time-consuming and labor-intensive. Natural Language Processing (NLP) is therefore used to convert it into structured data and to extract information that helps information security analysts make decisions. This study proposes a cyber attack relation extraction system called CARE (Cyber Attack Relation Extraction), which aims to identify the relations between attack entities. First, Chinese cyber threat intelligence is collected, and the Chinese articles are processed with different pre-processing methods. Next, a BERT pre-trained model is used to obtain sentence features, and deep learning is then used to extract the relations between entities in cyber attacks. Finally, a list of entity relations is generated and the results are stored in a graph database to help information security analysts analyze cyber attacks and strengthen their own defenses. Experimental results show that CARE's relation extraction model achieves an F1-score of 97%, indicating that it effectively identifies cyber attack relations and achieves automated extraction.
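The final stage described in the abstract, turning a list of entity relations into a graph, can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `build_relation_graph`, the entity names, and the relation labels `uses` and `targets` are all hypothetical, and a plain in-memory adjacency mapping stands in for the graph database, whose product and schema the abstract does not specify.

```python
from collections import defaultdict


def build_relation_graph(triples):
    """Group (head, relation, tail) triples into an adjacency mapping.

    Each head entity maps to a list of (relation, tail) edges.
    Duplicate edges are kept only once, in first-seen order.
    """
    graph = defaultdict(list)
    for head, relation, tail in triples:
        edge = (relation, tail)
        if edge not in graph[head]:  # skip edges already recorded
            graph[head].append(edge)
    return dict(graph)


# Hypothetical triples for illustration only; not taken from the thesis data.
example_triples = [
    ("ExampleGroup", "uses", "ExampleMalware"),
    ("ExampleMalware", "targets", "ExampleOrg"),
    ("ExampleGroup", "uses", "ExampleMalware"),  # repeated relation
]

relation_graph = build_relation_graph(example_triples)
for head, edges in relation_graph.items():
    for relation, tail in edges:
        print(f"{head} -[{relation}]-> {tail}")
```

An adjacency mapping keyed by the head entity keeps all outgoing relations of an attacker or tool in one place, which is roughly the lookup an analyst would perform on the stored graph.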

Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
Chapter 2 Literature Review
  2.1 Related Background Research
  2.2 Advanced Persistent Threat (APT)
  2.3 Cyber Threat Intelligence (CTI)
  2.4 Natural Language Processing (NLP)
    2.4.1 Relation Extraction Models
  2.5 Pre-Trained Models
    2.5.1 BERT Pre-Trained Model
    2.5.2 RoBERTa Pre-Trained Model
    2.5.3 ALBERT Pre-Trained Model
Chapter 3 Research Method
  3.1 Data Collection Module
  3.2 Sentence Tokenization Module
  3.3 Language Translation Module
  3.4 Entities Annotation Module
  3.5 Relation Annotation Module
  3.6 Relation Extraction Model
Chapter 4 System Evaluation
  4.1 Experiment 1: Comparison of Dataset Splits
  4.2 Experiment 2: Comparison of Relation Extraction Model Parameters
  4.3 Experiment 3: Testing Datasets from Different Sources
  4.4 Experiment 4: Cybersecurity Relation Extraction
Chapter 5 Research Contributions and Future Work
References
Appendix 1

List of Figures
Figure 1-1. Relation illustration
Figure 2-1. ATT&CK matrix [1]
Figure 2-2. BERT pre-trained model architecture [2]
Figure 2-3. BERT input [3]
Figure 2-4. The pre-training and fine-tuning stages
Figure 3-1. CARE system architecture
Figure 3-2. Data collection process
Figure 3-3. Annotation process
Figure 3-4. Brat annotation tool
Figure 3-5. Relation extraction process
Figure 3-6. Sequence, entities, and relations
Figure 3-7. Relation extraction model architecture
Figure 3-8. Example relation graph
Figure 4-1. Method for determining relations
Figure 4-2. Cybersecurity attack relation graph

List of Tables
Table 2-1. Training comparison of the BERT, RoBERTa, and ALBERT models
Table 3-1. Example Chinese sentences
Table 3-2. Symbol definitions
Table 3-3. Example relation graph
Table 4-1. Confusion matrix
Table 4-2. Entity category counts by source
Table 4-3. Relation counts by source
Table 4-4. Experimental environment
Table 4-5. Summary of experiments
Table 4-6. Data counts for Experiment 1
Table 4-7. Parameter settings for Experiment 1
Table 4-8. Results of Experiment 1
Table 4-9. Information on each pre-trained model
Table 4-10. Best parameter settings for each pre-trained model
Table 4-11. Best results for each pre-trained model
Table 4-12. Precision, Recall, and F1-score per class for BERT
Table 4-13. Precision, Recall, and F1-score per class for RoBERTa
Table 4-14. Precision, Recall, and F1-score per class for ALBERT
Table 4-15. Data sources
Table 4-16. Number of training relations in each sub-experiment
Table 4-17. Parameter settings for Experiment 3
Table 4-18. Results of Experiment 3
Table 4-19. Precision, Recall, and F1-score per class for sub-experiment 3.2
Table 4-20. Precision, Recall, and F1-score per class for sub-experiment 3.4
Table 4-21. Cases
Table 4-22. Comparison of system results

參考文獻 References
[1] MITRE. "MITRE ATT&CK." https://attack.mitre.org/ (accessed: Nov. 23, 2021).
[2] H. Yu, Y. Cao, G. Cheng, et al., "Relation extraction with BERT-based pre-trained model," in 2020 International Wireless Communications and Mobile Computing (IWCMC), 2020: IEEE, pp. 1382-1387.
[3] J. Devlin, M.-W. Chang, K. Lee, et al., "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[4] TREND MICRO. "A CONSTANT STATE OF FLUX : Trend Micro 2020 Annual Cybersecurity Report." https://www.trendmicro.com/vinfo/us/security/research-and-analysis/threat-reports/roundup/a-constant-state-of-flux-trend-micro-2020-annual-cybersecurity-report (accessed: Sep. 11, 2021).
[5] GeeksforGeeks. "Top 10 Cyber Threats World is Facing in 2021." https://www.geeksforgeeks.org/top-10-cyber-threats-world-is-facing-in-2021/ (accessed: Sep. 11, 2021).
[6] CYWARE SOCIAL. "TOP 10 COUNTRIES WITH MOST HACKERS IN THE WORLD." https://cyware.com/news/top-10-countries-with-most-hackers-in-the-world-42e1c94e (accessed: Oct. 1, 2021).
[7] N. Ismail. "The value of data: forecast to grow 10-fold by 2025." https://www.information-age.com/data-forecast-grow-10-fold-2025-123465538/ (accessed: Sep. 11, 2021).
[8] Wei-Chih Chao, "Leverage Text Analysis in discovering Cyber Threat Intelligence from Hacker Forums," 2019.
[9] G. Husari, X. Niu, B. Chu, et al., "Using entropy and mutual information to extract threat actions from cyber threat intelligence," in 2018 IEEE International Conference on Intelligence and Security Informatics (ISI), 2018: IEEE, pp. 1-6.
[10] I. Tenney, D. Das, and E. Pavlick, "BERT rediscovers the classical NLP pipeline," arXiv preprint arXiv:1905.05950, 2019.
[11] S. Samtani, R. Chinn, H. Chen, et al., "Exploring emerging hacker assets and key hackers for proactive cyber threat intelligence," Journal of Management Information Systems, vol. 34, no. 4, pp. 1023-1053, 2017.
[12] W. Xiong, E. Legrand, O. Åberg, et al., "Cyber security threat modeling based on the MITRE Enterprise ATT&CK Matrix," Software and Systems Modeling, pp. 1-21, 2021.
[13] R. Kwon, T. Ashley, J. Castleberry, et al., "Cyber Threat Dictionary Using MITRE ATT&CK Matrix and NIST Cybersecurity Framework Mapping," in 2020 Resilience Week (RWS), 2020: IEEE, pp. 106-112.
[14] R. Kissel, Glossary of key information security terms. Diane Publishing, 2011.
[15] Q. Zou, A. Singhal, X. Sun, et al., "Automatic recognition of advanced persistent threat tactics for enterprise security," in Proceedings of the Sixth International Workshop on Security and Privacy Analytics, 2020, pp. 43-52.
[16] Z. Syed, A. Padia, T. Finin, et al., "UCO: A unified cybersecurity ontology," in Workshops at the thirtieth AAAI conference on artificial intelligence, 2016.
[17] G. Kim, C. Lee, J. Jo, et al., "Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network," International journal of machine learning and cybernetics, vol. 11, no. 10, pp. 2341-2355, 2020.
[18] Jing-Yun Kan, "Extracting Cyber Threat Intelligence by Using Information Retrieval," 2020.
[19] A. Pingle, A. Piplai, S. Mittal, et al., "Relext: Relation extraction using deep learning approaches for cybersecurity knowledge graph improvement," in Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019, pp. 879-886.
[20] T. Kwiatkowski, J. Palomaki, O. Redfield, et al., "Natural questions: a benchmark for question answering research," Transactions of the Association for Computational Linguistics, vol. 7, pp. 453-466, 2019.
[21] C. Chu and R. Wang, "A survey of domain adaptation for neural machine translation," arXiv preprint arXiv:1806.00258, 2018.
[22] S. Minaee, N. Kalchbrenner, E. Cambria, et al., "Deep Learning--based Text Classification: A Comprehensive Review," ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1-40, 2021.
[23] A. Celikyilmaz, E. Clark, and J. Gao, "Evaluation of text generation: A survey," arXiv preprint arXiv:2006.14799, 2020.
[24] W. S. El-Kassas, C. R. Salama, A. A. Rafea, et al., "Automatic text summarization: A comprehensive survey," Expert Systems with Applications, vol. 165, p. 113679, 2021.
[25] W. Hersh, "Information retrieval," in Biomedical Informatics: Springer, 2021, pp. 755-794.
[26] T. H. Nguyen, "Deep learning for information extraction," New York University, 2018.
[27] X. Han, T. Gao, Y. Lin, et al., "More data, more relations, more context and more openness: A review and outlook for relation extraction," arXiv preprint arXiv:2004.03186, 2020.
[28] S. Soderland, D. Fisher, J. Aseltine, et al., "CRYSTAL: Inducing a conceptual dictionary," arXiv preprint cmp-lg/9505020, 1995.
[29] D. Zeng, K. Liu, S. Lai, et al., "Relation classification via convolutional deep neural network," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 2335-2344.
[30] Y. Shen and X.-J. Huang, "Attention-based convolutional neural network for semantic relation extraction," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 2526-2536.
[31] L. Wang, Z. Cao, G. De Melo, et al., "Relation classification via multi-level attention cnns," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1298-1307.
[32] J. Lee, S. Seo, and Y. S. Choi, "Semantic relation classification via bidirectional lstm networks with entity-aware attention using latent entity typing," Symmetry, vol. 11, no. 6, p. 785, 2019.
[33] X. Qiu, T. Sun, Y. Xu, et al., "Pre-trained models for natural language processing: A survey," Science China Technological Sciences, pp. 1-26, 2020.
[34] T. Mikolov, K. Chen, G. Corrado, et al., "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[35] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532-1543.
[36] M. E. Peters, W. Ammar, C. Bhagavatula, et al., "Semi-supervised sequence tagging with bidirectional language models," arXiv preprint arXiv:1705.00108, 2017.
[37] A. Radford, J. Wu, R. Child, et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[38] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is all you need," in Advances in neural information processing systems, 2017, pp. 5998-6008.
[39] Y. Wu, M. Schuster, Z. Chen, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[40] S. Wu and Y. He, "Enriching pre-trained language model with entity information for relation classification," in Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 2361-2364.
[41] 金立達. "小資料福音！BERT在極小資料下帶來顯著提升的開源實現." https://www.itread01.com/iepil.html (accessed: Sep. 18, 2021).
[42] J. Hou, X. Li, H. Yao, et al., "Bert-based chinese relation extraction for public security," IEEE Access, vol. 8, pp. 132367-132375, 2020.
[43] Y. Liu, M. Ott, N. Goyal, et al., "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[44] Z. Lan, M. Chen, S. Goodman, et al., "Albert: A lite bert for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[45] L.-H. Lee, M.-C. Hung, C.-H. Lu, et al., "Classification of Tweets Self-reporting Adverse Pregnancy Outcomes and Potential COVID-19 Cases Using RoBERTa Transformers," in Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task, 2021, pp. 98-101.
[46] X. Yang, Z. Yu, Y. Guo, et al., "Clinical Relation Extraction Using Transformer-based Models," arXiv preprint arXiv:2107.08957, 2021.
[47] U. Naseem, A. G. Dunn, M. Khushi, et al., "Benchmarking for biomedical natural language processing tasks with a domain specific albert," arXiv preprint arXiv:2107.04374, 2021.
[48] B. Muthukadan. "Selenium with Python." https://selenium-python.readthedocs.io/ (accessed: Nov. 15, 2020).
[49] M. Aiken, "An updated evaluation of Google Translate accuracy," Studies in linguistics and literature, vol. 3, no. 3, pp. 253-260, 2019.
[50] G. Husari, E. Al-Shaer, M. Ahmed, et al., "Ttpdrill: Automatic and accurate extraction of threat actions from unstructured text of cti sources," in Proceedings of the 33rd Annual Computer Security Applications Conference, 2017, pp. 103-115.
[51] C. T. I. T. Committee. "Structured Threat Information Expression (STIX)." https://oasis-open.github.io/cti-documentation/ (accessed: Sep. 18, 2021).
[52] P. Stenetorp, S. Pyysalo, G. Topić, et al., "BRAT: a web-based tool for NLP-assisted text annotation," in Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 102-107.
[53] T. Wolf, L. Debut, V. Sanh, et al., "Huggingface's transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.

Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined availability date
Available: Campus: 2027-01-05; Off-campus: 2027-01-05
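Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Available: 2027-01-05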
