National Sun Yat-sen University, thesis/dissertation, Using Machine Learning and Ensemble Learning to Predict Imbalanced Classification of Bacteremia in Emergency Room

Title	Using Machine Learning and Ensemble Learning to Predict Imbalanced Classification of Bacteremia in Emergency Room
Department	Department of Applied Mathematics
Year, semester	Spring semester of Academic Year 112 (2023-24)	Language	Chinese
Degree	Master	Number of pages	82
Author	游詰云 (Jie-Yun You)
Advisor	張福春 (Chang, Fu-Chuen)
Convenor	鍾思齊 (Chung, Szu-Chi)
Advisory Committee	于松桓 (Yu, Sung-Huan); 張中 (Chang Chung)
Date of Exam	2024-07-04	Date of Submission	2024-07-12
Keywords	Bacteremia, Machine Learning, Class Imbalance, Class Overlap, Ensemble Learning
Statistics	The thesis/dissertation has been browsed 313 times and downloaded 6 times.

Abstract
Bacteremia is confirmed by blood culture, which allows clinicians to provide effective treatment. However, blood cultures are slow: results typically take 24 to 72 hours. To speed up diagnosis, clinicians often judge from experience whether a patient has bacteremia, yet studies have shown that inappropriate empirical antibiotic treatment can increase mortality. This study uses machine learning to help emergency physicians determine whether a patient has bacteremia. We collected 37,009 records from the emergency department of Kaohsiung Veterans General Hospital, of which 3,763 were positive and 33,246 were negative. To address this class imbalance, we applied under-sampling, over-sampling, hybrid sampling, and weight adjustment. Class overlap further complicates the data and makes it hard for traditional machine learning algorithms to generalize, so we also used bagging, boosting, voting, stacking, and balanced ensemble algorithms. Models were evaluated with the confusion matrix, F1 score, Matthews correlation coefficient (MCC), and AUROC. On the 37,009 records, SMOTE-ENN hybrid sampling and weight adjustment, together with the voting and stacking algorithms, performed best: the F1 score reached 0.42, MCC reached 0.349, and AUROC reached up to approximately 0.827, effectively handling class imbalance and class overlap. In future work, the model can be further optimized and more features and algorithms explored to improve predictive performance, better meet clinical needs, and help clinicians decide whether a patient has bacteremia, so that patients receive better treatment and care.
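The class counts and metrics named in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not taken from the thesis. It assumes the common "balanced" weighting heuristic, n_samples / (n_classes * class_count), which the abstract does not confirm as the weight adjustment method actually used. The confusion-matrix counts are hypothetical values chosen for demonstration, and the function names are made up for this example.

```python
import math

# Class counts reported in the abstract.
N_POS = 3763
N_NEG = 33246
N_TOTAL = N_POS + N_NEG


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this is a common heuristic for weight adjustment;
    the abstract does not state which scheme the thesis uses.
    """
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


weights = balanced_class_weights({"positive": N_POS, "negative": N_NEG})
imbalance_ratio = N_NEG / N_POS  # negatives per positive

# Hypothetical confusion matrix, for illustration only (not thesis results).
tp, fp, fn, tn = 40, 60, 35, 865

print(f"positive rate:   {N_POS / N_TOTAL:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
print(f"class weights:   {weights}")
print(f"F1:  {f1_score(tp, fp, fn):.3f}")
print(f"MCC: {mcc(tp, fp, fn, tn):.3f}")
```

With roughly nine negatives per positive, a classifier that always predicts "negative" would be about 90% accurate while finding no cases. This is why metrics such as F1 and MCC, which penalize missed positives, are more informative here than plain accuracy.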

Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
  1.3 Research Objectives
  1.4 Research Framework
Chapter 2 Literature Review
  2.1 Data Cleaning
    2.1.1 Missing Value Handling
    2.1.2 Outlier Handling
  2.2 Encoders
    2.2.1 Label Encoding
    2.2.2 One-Hot Encoding
    2.2.3 Target Encoding
  2.3 Feature Scaling
    2.3.1 Min-Max Normalization
    2.3.2 Z-Score Standardization
  2.4 Feature Selection
    2.4.1 Pearson's Chi-Squared Test
    2.4.2 Student's t-Test
  2.5 Hyperparameter Selection
    2.5.1 Grid Search
    2.5.2 Random Search
  2.6 Cross-Validation
    2.6.1 K-Fold Cross-Validation
    2.6.2 Stratified Cross-Validation
Chapter 3 Research Methods
  3.1 Data Description
  3.2 Research Workflow
  3.3 Data Cleaning
  3.4 Stratified Cross-Validation
  3.5 Feature Selection
  3.6 Methods for Handling Class Imbalance
    3.6.1 Random Under-Sampling
    3.6.2 Cluster-Based Under-Sampling
    3.6.3 SMOTE Over-Sampling
    3.6.4 BorderlineSMOTE Over-Sampling
    3.6.5 SMOTEENN Hybrid Sampling
    3.6.6 Weight Adjustment
  3.7 Algorithms
    3.7.1 Logistic Regression
    3.7.2 K-Nearest Neighbors
    3.7.3 Bayesian Classifier
    3.7.4 Support Vector Machine
    3.7.5 Decision Tree
    3.7.6 Random Forest
    3.7.7 Boosting Algorithms
    3.7.8 Voting Algorithm
    3.7.9 Stacking Algorithm
    3.7.10 Balanced Ensemble Algorithms
  3.8 Evaluation Metrics
    3.8.1 Confusion Matrix
    3.8.2 F1 Score
    3.8.3 MCC
    3.8.4 AUROC
Chapter 4 Results
Chapter 5 Conclusion
  5.1 Summary
  5.2 Future Work
References


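Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: unrestricted (fully open on and off campus)
Available: Campus: available; Off-campus: available
File: etd-0612124-153516.pdf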
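Printed copies
Public information on printed copies is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
Available: available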
