國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以SMILES結構增強二元屬性資料之藥物副作用預測,Using SMILES structure to enhance the prediction of drug side effect

論文名稱 Title	以SMILES結構增強二元屬性資料之藥物副作用預測 Using SMILES structure to enhance the prediction of drug side effect
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	109 學年度第 1 學期 The fall semester of Academic Year 109	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	65
研究生 Author	林暉 Hui Lin
指導教授 Advisor	李偉柏 Lee Wei-Po
召集委員 Convenor	許育峰 Yu‐Feng Hsu
口試委員 Advisory Committee	楊宗憲 Tsung-Hsien Yang
口試日期 Date of Exam	2020-12-04	繳交日期 Date of Submission	2020-12-19
關鍵字 Keywords	一維卷積網路、SMILES、多模態模型、深度學習、藥物副作用預測 deep learning, multi-model neural networks, 1-dimension convolutional neural networks, drug side-effect prediction, SMILES

中文摘要
藥物開發研究一直受到高度的關注，不管是治療所需的藥物，亦或是預防性的疫苗，人類在用藥的需求只會隨著醫療知識的進步不斷提高，然而開發的過程是冗長且每一個步驟都必須非常謹慎的，以免用藥變成用毒，沒解決問題反而損害身體。儘管開發過程順利且按照規章流程，但是每年還是有許多用藥人受到副作用的影響，嚴重者甚至死亡，例如近期在韓國施打的賽諾菲（Sanofi）流感疫苗，統計到10月底已經死亡83人。如此高開發成本的條件下，副作用問題仍然層出不窮，導致更多的醫療資源被浪費，所以如何有效地找出潛在副作用變成藥物開發不可或缺的步驟。在網路普及後資料更容易被收集與整合，藥物相關資料也越來越豐富且多元，然而現今的研究大多還是以某種資料搭配特定的模型訓練，所以本研究主要探討利用不同的模型來萃取不同型態或種類的資料，以提高副作用的預測能力。除了使用多模態模型外，本研究為了改善藥物資料不平衡的問題，將藥物以已知副作用的數量區分成兩個模型訓練，避免整個訓練被部分極度不平衡的資料影響。最後實驗研究過程中，發現大部分的論文並沒有針對雙字元素作特別的編碼，例如鈉（Na）、氯（Cl）、鈣（Ca），這樣會混淆萃取出的特徵與副作用之間的關聯，所以本研究也有調整編碼方式，讓雙字元素可以正確被編碼。就實驗結果而言，多模態模型的預測能力比個別訓練的模型還要好，在資料依照副作用數量分段訓練後，相對平衡的資料其訓練後預測結果也提升不少，而調整雙字元素的編碼後，盡管就數據上只有其中一個資料集搭配特定模型有顯著影響，但是在資料量增加後可能會有更明顯的效果，這部份值得探討。另外在實驗過程中發現利用已知副作用來預測未知副作用可以得到最好的結果，證明副作用之間可能是有高度關聯的且高機率併發的，這部分也值得繼續研究。
Abstract
Drug development has always received considerable attention. As medical knowledge advances, demand for medicines, whether therapeutic drugs or preventive vaccines, will continue to grow. Development is nevertheless lengthy, and every step must be performed with great care so that a medicine does not become a poison that harms the body instead of curing the ailment. Even when development proceeds smoothly and in accordance with regulations, many patients are still adversely affected by side effects every year, and in severe cases the outcome is fatal. For example, 83 deaths had been recorded by the end of October 2020 following the Sanofi flu vaccine recently administered in South Korea. Despite high development costs, side effects continue to emerge and waste further medical resources, so effectively identifying potential side effects has become an indispensable step in drug development. With the spread of the Internet, data have become easier to collect and integrate, and drug-related data are increasingly abundant and diverse. Nevertheless, most current research still trains one specific model on one particular kind of data. This study therefore explores using different models to extract features from different types of data in order to improve side-effect prediction. In addition to using multi-model (multimodal) neural networks, this study addresses the imbalance in drug data: drugs are divided by their number of known side effects and trained in two separate models, so that the training as a whole is not dominated by a portion of extremely imbalanced data. Finally, during the experiments it was found that most published studies do not specifically encode two-character element symbols, such as sodium (Na), chlorine (Cl), and calcium (Ca), which confuses the relationship between the extracted features and the side effects. This study therefore adjusts the encoding method so that two-character elements are encoded correctly. In the experimental results, the multi-model neural networks predict better than the individually trained models. After the data are trained in segments according to the number of side effects, predictions on the relatively balanced data also improve considerably. After adjusting the encoding of two-character elements, a significant effect was observed for only one dataset combined with a specific model, but the effect may become more pronounced as the amount of data increases, which is worth exploring. Moreover, the experiments showed that using known side effects to predict unknown side effects yields the best results, suggesting that side effects may be highly correlated and likely to co-occur; this is also worth further study.
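The two-character element issue described in the abstract can be illustrated with a small tokenizer. This is a minimal sketch under stated assumptions, not the thesis's actual implementation: the function names, the padding scheme, and the short element list (Na, Cl, Ca from the abstract, plus Br as an added assumption) are all hypothetical. It contrasts a naive per-character split with a split that keeps two-character symbols together, then maps tokens to integer IDs of the kind a 1-dimensional convolutional network could consume.

```python
import re

# Hypothetical, deliberately small set of two-character element symbols.
# Na, Cl, and Ca come from the abstract; Br is an added assumption.
TWO_CHAR_ELEMENTS = ("Cl", "Br", "Na", "Ca")

# Alternatives are tried left to right, so two-character symbols win
# over the single-character fallback ".".
_TOKEN_RE = re.compile("|".join(TWO_CHAR_ELEMENTS) + "|.")


def tokenize_chars(smiles):
    """Naive split: every character is a token, so 'Cl' becomes 'C', 'l'."""
    return list(smiles)


def tokenize_elements(smiles):
    """Split that keeps the listed two-character element symbols intact."""
    return _TOKEN_RE.findall(smiles)


def build_vocab(token_lists):
    """Assign integer IDs in order of first appearance; 0 is kept for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to IDs, truncate to max_len, and right-pad with zeros."""
    ids = [vocab[t] for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    salt = "[Na+].[Cl-]"  # sodium chloride
    print(tokenize_chars(salt))     # 'N', 'a' and 'C', 'l' are split apart
    print(tokenize_elements(salt))  # 'Na' and 'Cl' stay single tokens
```

With the naive split, the chlorine in `[Cl-]` shares the token `C` with every carbon atom, which is the kind of confusion the abstract refers to. The element list is kept short on purpose: outside square brackets, pairs such as `Co` or `Sn` can also be read as an aliphatic atom followed by an aromatic one, so a fuller tokenizer would need bracket-aware rules.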

目次 Table of Contents
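論文審定書 Thesis Approval Certificate	i
致謝 Acknowledgments	ii
摘要 Abstract (Chinese)	iii
Abstract	iv
目錄 Table of Contents	vi
圖次 List of Figures	viii
表次 List of Tables	x
第一章、緒論 Chapter 1. Introduction	1
  1.1 研究背景 Research Background	1
  1.2 研究動機與目的 Research Motivation and Objectives	4
  1.3 研究貢獻 Research Contributions	5
第二章、文獻探討 Chapter 2. Literature Review	6
  2.1 深度學習（Deep Learning, DL）	6
    2.1.1 多層感知器（Multilayer Perceptron, MLP）	6
    2.1.2 卷積神經網路（Convolutional Neural Networks, CNN）	7
    2.1.3 遞迴神經網路（Recurrent Neural Networks, RNN）	8
    2.1.4 長短期記憶模型（Long Short-Term Memory, LSTM）	8
    2.1.5 多模態神經網路（Multi-Model Neural Networks, MMNN）	9
  2.2 藥物結構使用 Use of Drug Structures	9
  2.3 藥物副作用預測 Drug Side-Effect Prediction	12
    2.3.1 單標籤副作用預測 Single-Label Side-Effect Prediction	12
    2.3.2 多標籤副作用預測 Multi-Label Side-Effect Prediction	13
第三章、研究方法 Chapter 3. Methodology	20
  3.1 資料取得與處理 Data Acquisition and Processing	20
  3.2 標籤不平衡處理 Handling Label Imbalance	21
  3.3 多層感知器（Multilayer Perceptron, MLP）	22
  3.4 卷積神經網路（Convolutional Neural Networks, CNN）	23
  3.5 多模態神經網路（Multi-Model Neural Networks, MMNN）	25
  3.6 已知副作用預測未知副作用 Predicting Unknown Side Effects from Known Side Effects	28
第四章、研究結果 Chapter 4. Results	31
  4.1 實驗參數設置 Experimental Parameter Settings	31
  4.2 資料觀察 Data Observations	32
  4.3 多層感知器（Multilayer Perceptron, MLP）	33
  4.4 卷積神經網路（Convolutional Neural Networks, CNN）	34
  4.5 模型的分段與融合 Model Segmentation and Fusion	37
  4.6 已知副作用預測潛在未知副作用 Predicting Potential Unknown Side Effects from Known Side Effects	40
第五章、結論 Chapter 5. Conclusion	45
  5.1 總結 Summary	45
  5.2 未來展望 Future Work	46
參考文獻 References	48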


電子全文 Fulltext
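This electronic full text is licensed to users solely for personal, non-profit retrieval, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission	自定論文開放時間 User-defined
開放時間 Available	校內 Campus：已公開 available	校外 Off-campus：已公開 available
etd-1119120-175136.pdf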
紙本論文 Printed copies
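Public-access information for printed theses is relatively complete from Academic Year 102 onward. To check the public-access information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available	已公開 available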
