國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,基於自然語言處理挖掘新興網路威脅情資之研究,A Study of Discovering Emergent Cyber Threat Intelligence Based on Natural Language Processing

論文名稱 Title	基於自然語言處理挖掘新興網路威脅情資之研究 A Study of Discovering Emergent Cyber Threat Intelligence Based on Natural Language Processing
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	111 學年度第 2 學期 The spring semester of Academic Year 111	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	52
研究生 Author	張孫杰 Sun-Jie Zhang
指導教授 Advisor	陳嘉玫 Chen, Chia-Mei
召集委員 Convenor	鄭炳強 Jeng, Bing-Chiang
口試委員 Advisory Committee	林耕霈, 韓毅, 吳東興 Lin, Keng-Pei; Han, Yi; Wu, Dong-Shing
口試日期 Date of Exam	2023-06-29	繳交日期 Date of Submission	2023-07-07
關鍵字 Keywords	威脅情資、文字探勘、主題偵測、事件偵測、新興分析 CTI, Text Mining, Topic Detection, Event Detection, Emergent Analysis

中文摘要
資訊科技技術持續進步，企業運用資訊科技改善工作效率、提升企業競爭力。運用資訊科技提供自動化與智慧化的營運模式之同時，企業也面臨伴隨而來的網路安全風險。為應對不斷演進的網路威脅，傳統上仰賴被動式的防禦機制，如建置防火牆、入侵偵測系統等防禦策略，已無法面對日益複雜和隱蔽的網路威脅。因此企業改採取主動式防禦策略，蒐集多來源的資安威脅資訊，提前建立相對應的防護措施。隨著不斷演進的攻擊趨勢，為應對高頻且複雜的網路攻擊事件，資安人員須取得多樣化的網路威脅情資，從中發現新興網路攻擊事件，以便快速部署資安防護措施。資安人員從新興資安新聞發現新興網路攻擊事件，以取得第一手的資安情資，但此過程須耗費大量人工及時間。因此，開發自動化系統來自動收集、分析並彙整新興威脅情資已然成為必要，可幫助資安人員快速應對網路威脅。本研究提出新興網路威脅情資挖掘系統 TTID（Trending Threat Intelligence Discovery；簡稱 TTID）自動蒐集各大網站資安新聞，利用自然語言處理、機器學習及新興事件偵測技術，挖掘出新興資安事件。現實世界的情況下進行偵測，短期偵測的 F1 高達 91%，長期偵測的 F1 為 87%，讓資安人員能夠快速發現新興資安事件。此外，自動化系統可將收集到的情資進行彙整，提供更全面的新興網路威脅情報。
Abstract
As information technology continues to advance, businesses increasingly rely on it to improve operational efficiency and competitiveness, and they face the cybersecurity risks that come with it. To counter evolving network threats, enterprises have traditionally relied on passive defense strategies such as firewalls and intrusion detection systems. As network threats become more complex and covert, these defenses are no longer sufficient, and proactive defense strategies that gather security information from multiple sources and establish corresponding protective measures in advance have become essential. To cope with frequent and complex network attacks, cybersecurity professionals must acquire diverse threat intelligence so that they can quickly identify emerging attack events and deploy appropriate security measures. They typically discover such events from recent security news, a process that requires significant manual effort and time. An automated system that collects, analyzes, and aggregates emerging threat intelligence is therefore necessary to help cybersecurity professionals respond rapidly to network threats. This research proposes TTID (Trending Threat Intelligence Discovery), an automated system that collects cybersecurity news from major websites and applies Natural Language Processing (NLP), machine learning, and emerging event detection techniques to uncover emerging security incidents. In real-world testing, TTID achieved an F1 score of 91% for short-term detection and 87% for long-term detection, enabling cybersecurity professionals to promptly identify emerging security events. In addition, the system consolidates the collected intelligence to provide a more comprehensive view of emerging network threats.
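The abstract does not specify how TTID scores emerging events, so the following is only a minimal illustrative sketch of one common idea behind emerging-keyword detection: rank terms by how much their document frequency rises in the current time window relative to earlier windows. The function names (`tokenize`, `emerging_terms`), the add-one smoothing, the `min_count` threshold, and the sample headlines are all assumptions made for illustration and are not taken from the thesis.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens of two or more characters."""
    return re.findall(r"[a-z0-9][a-z0-9-]+", text.lower())


def emerging_terms(past_docs, current_docs, min_count=2, top_k=5):
    """Rank terms by the ratio of current to past document frequency.

    Hypothetical scoring, not the thesis's method: a term scores high when it
    appears in many current documents but in few (or no) past documents.
    """
    # Document frequency: count each term at most once per document.
    past_df = Counter(t for d in past_docs for t in set(tokenize(d)))
    curr_df = Counter(t for d in current_docs for t in set(tokenize(d)))
    n_past = max(len(past_docs), 1)
    n_curr = max(len(current_docs), 1)

    scores = {}
    for term, count in curr_df.items():
        if count < min_count:
            continue  # ignore terms that are too rare in the current window
        curr_rate = count / n_curr
        # Add-one smoothing keeps unseen past terms from dividing by zero.
        past_rate = (past_df.get(term, 0) + 1) / (n_past + 1)
        scores[term] = curr_rate / past_rate

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    # Made-up headlines, used only to exercise the function.
    past = [
        "phishing campaign targets banks",
        "ransomware hits hospital",
        "phishing emails rise",
    ]
    current = [
        "log4shell exploit spreads",
        "log4shell patch released",
        "phishing kit sold",
    ]
    for term, score in emerging_terms(past, current):
        print(term, round(score, 3))
```

In this toy setting a term that recurs in the current window but never appeared before outranks a term that was already common. A fuller system in the spirit of the abstract would presumably apply such scoring to clustered news topics rather than raw tokens.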

目次 Table of Contents
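Thesis Approval Form
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
Chapter 2 Literature Review
2.1 Related Background Research
2.2 Text Feature Extraction
2.2.1 Doc2Vec
2.2.2 SBERT
2.3 Topic Clustering
2.3.1 UMAP
2.3.2 HDBSCAN
2.3.3 K-means++
2.3.4 Hierarchical Clustering
2.4 Event Detection
2.4.1 Emerging Event Detection
2.4.2 Emerging Keywords
Chapter 3 Methodology
3.1 Data Collection Module
3.2 Data Preprocessing Module
3.3 Data Analysis Module
3.4 Emergence Analysis Module
3.4.1 Emerging Term Extraction
3.4.2 Emerging Keyword Selection
Chapter 4 System Evaluation
4.1 Data Sources
4.2 Evaluation Metrics
4.3 Experiment 1: Stemming/Lemmatization
4.4 Experiment 2: Word Embedding
4.5 Experiment 3: Emerging Event Detection
4.5.1 Emerging Keyword Selection
4.6 Experiment 4: Comparison with Related Event Detection
4.6.1 TTID vs. ESED
4.6.2 Event Detection on the TWCERT Newsletter
Chapter 5 Conclusion and Future Work
References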


電子全文 Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release date. Available: Campus: available for download from 2028-07-07. Off-campus: available for download from 2028-07-07.
紙本論文 Printed copies
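Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Available: 2028-07-07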
