國立中山大學 National Sun Yat-sen University, 學位論文 thesis/dissertation: 基於概率流程模型的時間序列異常偵測 Temporal Anomaly Detection using Probabilistic Process Models

論文名稱 Title	基於概率流程模型的時間序列異常偵測 Temporal Anomaly Detection using Probabilistic Process Models
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	109 學年度第 2 學期 Second (spring) semester of Academic Year 109 (2020–2021)	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	47
研究生 Author	沈育嬋 Yu-Chan Shen
指導教授 Advisor	康藝晃 Kang, Yi-Huang
召集委員 Convenor	胡雅涵 Hu, Ya-Han
口試委員 Advisory Committee	林耕霈 Lin, Keng-Pei
口試日期 Date of Exam	2021-07-23	繳交日期 Date of Submission	2021-08-19
關鍵字 Keywords	重複測量資料、混合模型、隱半馬可夫模型、流程發現、異常偵測 Repeated Measures Data, Mixed Model, Hidden Semi-Markov Model, Process Discovery, Anomaly Detection

中文摘要
生活中有許多具有階層性的現象，例如：同一位醫師治療多位病患，而同一位病患有多次的生理量測數值，其中階層由高至低依序為醫師、病患、量測數值，如此的重複測量資料不僅具階層性，還考量了時間因素。針對這種資料，本研究試圖解決以下三個問題：第一、各群體的資料是否隨著時間而遵循一定的模式改變？如何找出其中的變化模式？第二、如何偵測該數值變化過程出現的異常？第三、如何解釋其中的機制，包含變化模式的意義以及為何異常會發生？考量到重複測量資料中資料點間的相依性，本研究使用廣義線性混合模型樹，並結合隱半馬可夫模型，以發掘系統潛在的變化模式，即流程發現，至於異常偵測，依照資料集中資訊量的多寡分別使用粒子群演算法、最大概似估計或廣義 Jensen–Shannon 散度判別該資料點是否異常，最後，可由混合模型樹的規則進行模型解釋。因此，本研究期望提出的模型可用來偵測時間序列的異常並幫助面臨相關問題的人們做出決策。
Abstract
Many real-world phenomena are hierarchical. For example, one doctor treats multiple patients, and each patient has multiple physiological measurements, giving a hierarchy that runs from doctors to patients to measurements. Such repeated measures data are not only hierarchical but also involve a time factor. For this kind of data, our research attempts to answer three questions. First, does the data of each group change over time according to a specific pattern, and how can these patterns be found? Second, how can anomalies in the changing process be detected? Third, how can the underlying mechanisms be explained, including the meaning of a changing pattern and why anomalies occur? To account for the dependence among data points in repeated measures data, we combine generalized linear mixed model trees with the hidden semi-Markov model to discover the underlying changing patterns of a system, that is, process discovery. For anomaly detection, we use particle swarm optimization, maximum likelihood estimation, or the generalized Jensen–Shannon divergence to judge whether a data point is anomalous, depending on the amount of information in the dataset. Finally, the model can be interpreted through the rules of the mixed-effect trees. We hope the proposed model can be used to detect anomalies in temporal data and help those who face related problems make decisions.
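As a rough illustration of the divergence named in the abstract, the sketch below computes the generalized Jensen–Shannon divergence in the form given by Lin (1991): the entropy of the weighted mixture minus the weighted sum of the individual entropies. It is not the thesis's implementation. The function names, the example distributions, and the `THRESHOLD` value are all assumptions made for illustration, and how the thesis builds its distributions and decision rule is not stated in this record.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def generalized_jsd(distributions, weights=None):
    """Generalized Jensen-Shannon divergence (Lin, 1991):
    H(sum_i w_i * P_i) - sum_i w_i * H(P_i), with weights summing to 1."""
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights by default
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    k = len(distributions[0])
    # Weighted mixture of all input distributions, category by category.
    mixture = [
        sum(w * p[j] for w, p in zip(weights, distributions)) for j in range(k)
    ]
    weighted_entropies = sum(
        w * shannon_entropy(p) for w, p in zip(weights, distributions)
    )
    return shannon_entropy(mixture) - weighted_entropies


# Hypothetical discrete distributions over three states (illustrative only).
reference = [0.7, 0.2, 0.1]
similar = [0.6, 0.3, 0.1]
shifted = [0.1, 0.2, 0.7]

# Hypothetical cut-off; a real threshold would have to be chosen from data.
THRESHOLD = 0.1

similar_is_anomalous = generalized_jsd([reference, similar]) > THRESHOLD
shifted_is_anomalous = generalized_jsd([reference, shifted]) > THRESHOLD
```

With equal weights and two distributions, this reduces to the ordinary Jensen–Shannon divergence, which is bounded by 1 bit; under the assumed threshold, only the `shifted` distribution is flagged.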

目次 Table of Contents
論文審定書 Thesis Approval Form
摘要 Abstract (Chinese)
Abstract
List of Figures
List of Tables
1. Introduction
2. Background and Related Work
2.1. Correlated Data
2.2. Generalized Linear Mixed Model (GLMM)
2.3. Hidden Semi-Markov Model (HSMM)
2.4. Classification Tree Hidden Semi-Markov Model (CTHSMM)
3. Methodology
3.1. Process Discovery Using MMT-HSMM
3.2. Outlier Detection Using PSO and MLE
3.3. Anomaly Detection Using Generalized Jensen–Shannon Divergence
3.4. Model Interpretability Using Tree Rules
4. Experiment and Discussion
4.1. Introduction to Dataset
4.2. Experiment Setup
4.3. Leaf Encoding with GLMM Trees
4.4. Comparison of Outlier Definition
5. Conclusion
6. References


電子全文 Fulltext
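This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: 校內校外完全公開 unrestricted (fully open on and off campus)
開放時間 Available: 校內 Campus: 已公開 available; 校外 Off-campus: 已公開 available
etd-0719121-103014.pdf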
紙本論文 Printed copies
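Public-access information for printed copies is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available: 已公開 available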
