Title page for etd-0719121-103014
Title
基於概率流程模型的時間序列異常偵測
Temporal Anomaly Detection using Probabilistic Process Models
Department
Year, semester
Language
Degree
Number of pages
47
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2021-07-23
Date of Submission
2021-08-19
Keywords
Repeated Measures Data, Mixed Model, Hidden Semi-Markov Model, Process Discovery, Anomaly Detection
Statistics
This thesis/dissertation has been viewed 522 times and downloaded 69 times.
Abstract
Many real-world phenomena are hierarchical. For example, one doctor treats multiple patients, and each patient has multiple physiological measurements; the hierarchy, from the highest level to the lowest, is doctor, patient, and measurement. Such repeated measures data are not only hierarchical but also involve time. For this kind of data, our research attempts to answer three questions. First, does the data of each group change over time according to a specific pattern, and how can these patterns be found? Second, how can anomalies in the changing process be detected? Third, how can the underlying mechanisms be explained, including the meaning of a changing pattern and the reason anomalies occur?
To account for the dependence among data points in repeated measures data, we combine generalized linear mixed model trees with a hidden semi-Markov model to discover the underlying changing patterns of a system, that is, process discovery. For anomaly detection, depending on the amount of information available in the dataset, we use particle swarm optimization, maximum likelihood estimation, or the generalized Jensen–Shannon divergence to judge whether a data point is anomalous. Finally, the model can be interpreted through the rules of the mixed-effects trees. We hope that the proposed model can be used to detect anomalies in temporal data and to help those who face related problems make decisions.
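The abstract names the generalized Jensen–Shannon divergence as one of the anomaly criteria but this record does not show how the thesis computes or thresholds it. As a rough illustration only, the sketch below implements the standard weighted definition for discrete distributions, JS_w(P_1, ..., P_n) = H(sum_i w_i P_i) - sum_i w_i H(P_i), where H is the Shannon entropy. The function names, the use of base-2 logarithms, and the example distributions are assumptions made for this sketch, not details taken from the thesis.

```python
import math
from typing import Optional, Sequence


def shannon_entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def generalized_jsd(
    dists: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted Jensen-Shannon divergence of n discrete distributions.

    Hypothetical helper for illustration: entropy of the weighted mixture
    minus the weighted average of the individual entropies.
    """
    n = len(dists)
    if n == 0:
        raise ValueError("at least one distribution is required")
    if weights is None:
        weights = [1.0 / n] * n  # assumption: uniform weights by default
    if len(weights) != n:
        raise ValueError("need exactly one weight per distribution")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    k = len(dists[0])
    if any(len(d) != k for d in dists):
        raise ValueError("all distributions must share the same support size")

    # Mixture distribution: weighted average of the inputs, per outcome.
    mixture = [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(k)]
    weighted_entropy = sum(w * shannon_entropy(d) for w, d in zip(weights, dists))
    return shannon_entropy(mixture) - weighted_entropy


if __name__ == "__main__":
    # Made-up state distributions for two reference windows and one new window.
    reference_a = [0.70, 0.20, 0.10]
    reference_b = [0.65, 0.25, 0.10]
    new_window = [0.10, 0.20, 0.70]
    print(generalized_jsd([reference_a, reference_b]))
    print(generalized_jsd([reference_a, reference_b, new_window]))
```

A larger value means the distributions differ more, so a window whose inclusion raises the divergence sharply would be a candidate anomaly; where to place such a cutoff is a modeling decision that this record does not specify.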
Table of Contents
Thesis Approval Form
Abstract (Chinese)
Abstract
List of Figures
List of Tables
1. Introduction
2. Background and Related Work
2.1. Correlated Data
2.2. Generalized Linear Mixed Model (GLMM)
2.3. Hidden Semi-Markov Model (HSMM)
2.4. Classification Tree Hidden Semi-Markov Model (CTHSMM)
3. Methodology
3.1. Process Discovery Using MMT-HSMM
3.2. Outlier Detection Using PSO and MLE
3.3. Anomaly Detection Using Generalized Jensen–Shannon Divergence
3.4. Model Interpretability Using Tree Rules
4. Experiment and Discussion
4.1. Introduction to Dataset
4.2. Experiment Setup
4.3. Leaf Encoding with GLMM Trees
4.4. Comparison of Outlier Definition
5. Conclusion
6. References
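Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available

Printed copies
Public availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available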
