Semi-supervised Cyber Attack Relation Extraction via Self-Training
APT, CTI, NLP, Relation Extraction, Self-Training, Pseudo Label, Cyber Security Knowledge Graph
With the rapid development of emerging hardware and software technologies, networks play an increasingly important role in enterprises, and this heavy dependence on networks also increases the severity of cyber security threats. Most attacks today are APT (Advanced Persistent Threat) attacks. As APT groups escalate both the frequency and the sophistication of their attacks, enterprises find them difficult to defend against. Effective use of cyber threat intelligence (CTI) therefore helps enterprises plan and deploy comprehensive defensive countermeasures in advance so that they can respond to complex APT attacks.
Cyber threat intelligence has developed rapidly in recent years, and many threat intelligence platforms and open source intelligence sources have emerged to provide attack information such as malware, vulnerabilities, and threat behaviors. However, the volume of cyber threat intelligence that these platforms generate every day has created a big data problem: collecting and analyzing it manually consumes a great deal of manpower and time. How to filter out the information that is actually needed is therefore a crucial issue for enterprises and organizations.
To address these issues, this study proposes a cyber attack relation extraction system called “CyberRex” (Cyber Attack Relation Extraction). Within a semi-supervised learning framework, it uses natural language processing (NLP) and a pre-trained BERT model to obtain sentence features and extract the relations between entities in cyber attack events. Pseudo-labels are generated through self-training and added to the training dataset for iterative training, which addresses the shortage of labeled training data. Finally, the system produces a list of entity relation triples and a cyber security knowledge graph that information security analysts can use for visual querying and research. The experimental results show that CyberRex's relation extraction model achieves an F1-score of 73%, indicating that it can make effective use of unlabeled data to identify cyber attack relations and thereby help enterprises quickly and comprehensively understand the key points of cyber threat intelligence and attack attribution.
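The self-training step described in the abstract can be illustrated with a minimal sketch. It is not the CyberRex implementation: a toy nearest-centroid classifier stands in for the BERT-based relation extraction model, the feature vectors are made-up 2-D points rather than sentence features, and the names `self_train`, `train_centroids`, `predict_proba`, the relation labels `"uses"` and `"targets"`, and the default `threshold=0.9` are all assumptions chosen for illustration. It only shows the general loop: train on labeled data, pseudo-label the unlabeled samples whose predicted confidence meets a threshold, add them to the training set, and retrain.

```python
import math


def train_centroids(xs, ys):
    """Toy stand-in for the relation classifier: one mean vector per label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to each label centroid."""
    scores = {
        y: -sum((a - b) ** 2 for a, b in zip(x, c)) for y, c in centroids.items()
    }
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.9, max_rounds=10):
    """Iteratively add confident pseudo-labels to the training set and retrain."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    model = train_centroids(xs, ys)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for x in pool:
            probs = predict_proba(model, x)
            label, confidence = max(probs.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                # Confident prediction: keep it as a pseudo-label.
                xs.append(x)
                ys.append(label)
                added += 1
            else:
                # Not confident enough: leave it unlabeled for a later round.
                remaining.append(x)
        pool = remaining
        if added == 0:
            break  # Nothing new was pseudo-labeled, so retraining would not help.
        model = train_centroids(xs, ys)
    return model, ys, pool


# Hypothetical data: two labeled samples and three unlabeled ones.
model, labels, leftover = self_train(
    [(0.0, 0.0), (4.0, 4.0)],
    ["uses", "targets"],
    [(0.5, 0.2), (3.8, 4.1), (2.0, 2.0)],
)
```

In this toy run the two unlabeled points close to a labeled example receive pseudo-labels, while the ambiguous point halfway between the classes stays in the unlabeled pool because its confidence never reaches the threshold.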
Table of Contents
Thesis Approval Form
Chinese Abstract
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
Chapter 2 Literature Review
2.1 Related Background Research
2.2 Cyber Threat Intelligence (CTI)
2.3 Natural Language Processing (NLP)
2.3.1 Relation Extraction Model
2.4 Neural Networks
2.4.1 BERT
2.5 Semi-Supervised Learning
2.5.1 Self-Training Method
2.5.2 Model-Agnostic Meta-Learning (MAML)
Chapter 3 Research Method
3.1 Data Collection Module
3.2 Sentence Tokenization Module
3.3 Language Translation Module
3.4 Entities and Relation Annotation Module
3.5 Relation Extraction Module
3.6 Pseudo Label Generation Module
Chapter 4 System Evaluation
4.1 Experiment 1: Model Evaluation
4.1.1 Comparison of Semantic Feature Extraction Methods
4.1.2 Effect of Different Optimizers and Parameter Settings on System Performance
4.2 Experiment 2: Pseudo Label Generation Evaluation
4.2.1 Use of the Unlabeled Dataset
4.2.2 Pseudo Label Selection Mechanism
4.3 Experiment 3: Comparison with Semi-Supervised Learning Baseline Models
4.4 Experiment 4: Cyber Security Relation Extraction
4.4.1 CyberRex Cyber Security Relation Extraction
4.4.2 Comparison of Relation Extraction Systems
4.5 Summary
Chapter 5 Research Contributions and Future Work
Appendix 1

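Figure 2-1. ATT&CK Enterprise Matrix
Figure 2-2. Neural Network Architecture
Figure 2-3. BERT Pre-trained Model Architecture
Figure 2-4. The Pre-Training and Fine-Tuning Stages
Figure 2-5. Steps of the Self-Training Method
Figure 2-6. Steps of the MAML Algorithm
Figure 3-1. System Architecture
Figure 3-2. Data Collection Process
Figure 3-3. Brat Annotation Tool
Figure 3-4. Relation Extraction Model Architecture
Figure 3-5. Pseudo Label Generation Module Architecture
Figure 4-1. Line Chart of 10 Incremental Training Iterations
Figure 4-2. Cyber Security Knowledge Graph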

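Table 3-1. Chinese Sentence Examples
Table 3-2. Entity Type Definitions
Table 4-1. Confusion Matrix
Table 4-2. Entity Type Counts by Source
Table 4-3. Relation Counts by Source
Table 4-4. Experimental Environment
Table 4-5. Summary of Experiments
Table 4-6. Dataset Splits and Relation Types
Table 4-7. Proportions and Numbers of Labeled and Unlabeled Training Samples in the TACRED Training Set
Table 4-8. Proportions and Numbers of Labeled and Unlabeled Training Samples in the CyberRex Training Set
Table 4-9. Comparison of Semantic Extraction Methods (TACRED) (unit: %)
Table 4-10. Comparison of Semantic Extraction Methods (CyberRex) (unit: %)
Table 4-11. Proportions and Numbers of Unlabeled Data in the CyberRex Training Set
Table 4-12. Comparison of Optimizers (unit: %)
Table 4-13. Comparison of Epochs (unit: %)
Table 4-14. Best Parameter Settings for the Relation Extraction Model
Table 4-15. TACRED with Different Proportions of Unlabeled Training Samples (unit: %)
Table 4-16. CyberRex with Different Proportions of Unlabeled Training Samples (unit: %)
Table 4-17. TACRED Labeled and Unlabeled Training Samples and Average Number of Selected Pseudo Labels
Table 4-18. CyberRex Labeled and Unlabeled Training Samples and Average Number of Selected Pseudo Labels
Table 4-19. 10 Incremental Training Iterations (unit: %)
Table 4-20. F1-score of Models Trained with Different Thresholds (unit: %)
Table 4-21. Comparison of Semi-Supervised Learning Methods (TACRED)
Table 4-22. Comparison of Semi-Supervised Learning Methods (CyberRex)
Table 4-23. Cases
Table 4-24. Colors Assigned to Entity Types
Table 4-25. Comparison of Relation Extraction Systems
Table 4-26. Example Output of CyberRex and CARE
Table 4-27. Article Length and Annotation Time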

References
