National Sun Yat-sen University, thesis/dissertation, Discovering concept drift based on propensity score

Title	Discovering concept drift based on propensity score (基於傾向分數的概念飄移發現方法)
Department	Department of Information Management
Year, semester	The spring semester of Academic Year 111	Language	English
Degree	Master	Number of pages	44
Author	Hong-Ren Yen (顏宏任)
Advisor	Kang, Yi-Huang (康藝晃)
Convenor	Wang, Hei-Chia (王惠嘉)
Advisory Committee	Yang, Huei-Fang (楊惠芳)
Date of Exam	2023-07-27	Date of Submission	2023-08-29
Keywords	Treatment effect, Temporal data, Confounder, Propensity score, Concept drift
Statistics	The thesis/dissertation has been browsed 399 times and downloaded 15 times.

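Abstract (translated from the Chinese)
Machine learning models are now used in many fields to support decision-making. Once a model is deployed, however, concept drift follows: as time passes or policies change, the model gradually becomes unusable. We therefore need to check the model's usefulness at regular intervals and, once problems appear, correct or replace it. Existing concept drift methods all focus on detection and improvement. Although they partly solve the problem, they often ignore the influence of confounders on the independent and dependent variables, which produces erroneous results when detecting concept drift. To address this, this thesis proposes a method that applies propensity scores to concept drift: adding propensity scores mitigates the influence of confounders, so that the estimated treatment effect can quantify the data effectively.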
Abstract
Machine learning models are widely used in many fields to assist decision-making. Once these models are deployed, however, concept drift arises: over time, or because of policy changes, a model gradually becomes less effective and less reliable. It is therefore necessary to monitor the usefulness of a model regularly and, once problems are detected, to adjust or replace it. Existing concept drift methods focus primarily on detection and improvement techniques, which only partially address the problem. They often overlook the influence of confounding factors on the relationship between the independent and dependent variables, which leads to erroneous results in concept drift detection. To address this issue, this thesis proposes Concept Drift using Propensity Score (CDPS). Incorporating propensity scores mitigates the impact of confounding factors, so that treatment effects can be quantified more accurately from the data.
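The idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's CDPS implementation: the record does not describe the actual pipeline, so everything below is an assumption made for illustration. The sketch uses inverse probability of treatment weighting (IPTW) instead of matching, a plain gradient-descent logistic regression as the propensity model, simulated data with a single confounder, and a hypothetical rule that flags drift when the estimated treatment effect changes by more than a fixed threshold between consecutive time windows. The names `fit_propensity`, `iptw_effect`, `simulate_window`, and `threshold` are all hypothetical.

```python
import numpy as np

def fit_propensity(X, t, steps=300, lr=0.5):
    """Estimate P(t = 1 | X) with logistic regression fitted by gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - t) / len(t)        # gradient of the mean log-loss
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def iptw_effect(X, t, y):
    """Weighted difference in mean outcome (treated minus control) using IPTW."""
    e = np.clip(fit_propensity(X, t), 0.01, 0.99)  # clip to avoid extreme weights
    treated = np.sum(t * y / e) / np.sum(t / e)
    control = np.sum((1 - t) * y / (1 - e)) / np.sum((1 - t) / (1 - e))
    return treated - control

def simulate_window(rng, n, effect):
    """Simulated time window: x confounds both treatment assignment and outcome."""
    x = rng.normal(size=n)
    t = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
    y = effect * t + x + rng.normal(scale=0.5, size=n)
    return x.reshape(-1, 1), t, y

rng = np.random.default_rng(0)
# Assumed scenario: the true treatment effect shifts from 1.0 to 3.0 in the last window.
windows = [simulate_window(rng, 4000, eff) for eff in (1.0, 1.0, 3.0)]

effects = [iptw_effect(X, t, y) for X, t, y in windows]
naive = [y[t == 1].mean() - y[t == 0].mean() for _, t, y in windows]

threshold = 1.0  # hypothetical drift threshold on the change in estimated effect
drift_flags = [bool(abs(b - a) > threshold) for a, b in zip(effects, effects[1:])]

print("naive differences:", np.round(naive, 2))
print("IPTW effects:     ", np.round(effects, 2))
print("drift between consecutive windows:", drift_flags)
```

In this simulation the confounder raises both the chance of treatment and the outcome, so the unadjusted difference in means overstates the effect in every window, while the weighted estimate stays close to the simulated value. Comparing adjusted estimates across windows is one simple way to separate a real change in the treatment effect from a change that only reflects confounding.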

Table of Contents
Thesis Approval Form
Abstract (Chinese)
Abstract
List of Figures
List of Tables
1. Introduction
2. Background
  2.1 Concept drift detection
    2.1.1 Error-based drift detection
    2.1.2 Distribution-based drift detection
    2.1.3 Explain-based drift detection
    2.1.4 Unsupervised-based drift detection
    2.1.5 Ensemble-based drift detection
    2.1.6 Neural network-based drift detection
  2.2 Propensity score
    2.2.1 Identifying potentially confounding factors
    2.2.2 Computing the propensity score
    2.2.3 Applying the matching method
    2.2.4 Evaluating the performance of the matching process
3. Methodology
  3.1 Data segmentation
  3.2 Propensity score process
  3.3 Estimate the treatment effect
  3.4 Detect concept drift
4. Experiment
  4.1 Experiment description
  4.2 Dataset
    4.2.1 KMUH dataset
    4.2.2 GEO dataset
    4.2.3 Electricity dataset
    4.2.4 Artificial dataset
  4.3 Experiment result
    4.3.1 KMUH dataset
    4.3.2 GEO dataset
    4.3.3 Electricity dataset
    4.3.4 Artificial dataset
    4.3.5 Discussion
5. Conclusion
References
Appendix

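Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available: Campus: available; Off-campus: available
etd-0729123-205631.pdf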
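Printed copies
Public information on printed copies is relatively complete from Academic Year 102 onward. To inquire about printed copies from Academic Year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available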

Office of Library and Information Services, National Sun Yat-sen University │ Contact us: 2453, Thesis Format Review Team │ Development and operations: Knowledge Innovation Division, LIS