Using SMILES structure to enhance the prediction of drug side effects
deep learning, multi-model neural networks, one-dimensional convolutional neural networks, drug side-effect prediction, SMILES
Drug development research has always received considerable attention. As medical knowledge advances, demand for medicines, whether drugs for treatment or preventive vaccines, will continue to increase. However, it must be ensured that medicines do not act as poisons and damage the body instead of curing the ailment. Consequently, most drug development processes are lengthy, and every step must be performed with the utmost care. Even when development proceeds smoothly and in accordance with regulations, many drug users are adversely affected by side effects every year, and in severe cases the side effects can be fatal. For example, 83 deaths had been reported by the end of October 2020 following administration of a Sanofi flu vaccine in South Korea. Despite high development costs, side effects continue to emerge and waste enormous medical resources. As a result, effective identification of potential side effects has become an indispensable step in drug development.
Owing to the popularity of the Internet, data are now easier to collect and integrate, and drug-related information is becoming increasingly abundant and diverse. Nevertheless, most current research still trains a specific model on a specific type of data. This study therefore explores the use of different models to extract features from different types of data in order to improve the prediction of drug side effects. In addition to using multi-model neural networks, this study aims to mitigate the imbalance in drug data: drugs are divided into two groups according to their number of known side effects, and each group is trained with a separate model, so that a few extremely unbalanced samples do not affect the entire training process. Finally, during the experiments it was found that most existing studies do not specifically encode double-character elements, such as sodium (Na), chlorine (Cl), and calcium (Ca). When such elements are read one character at a time, the relationship between the extracted features and the side effects becomes confused. To solve this problem, the proposed model adjusts the encoding method so that double-character elements are encoded correctly. The experimental results show that multi-model neural networks predict better than individually trained single models. When the data are trained in segments according to the number of side effects, the prediction results improve significantly even for relatively balanced data. After the encoding of double-character elements was adjusted, however, a significant effect was observed on only one of the datasets and only for a specific model. Although the results did not improve significantly overall, they may do so with a larger amount of data, which is worth exploring.
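To illustrate the double-character encoding issue, the following is a minimal Python sketch of a SMILES tokenizer that keeps two-letter element symbols together. It is not the encoding used in this study: the regular expressions, the function names (tokenize_smiles, build_vocab, encode), and the padding and unknown-token ids are all assumptions made for illustration. The sketch relies on two SMILES conventions: outside square brackets only Cl and Br are two-letter atoms, whereas inside brackets an uppercase letter followed by a lowercase letter is a single element symbol such as Na or Ca.

```python
import re

# Assumed patterns (illustrative only, not taken from the study).
_BRACKET = re.compile(r"\[[^\]]*\]")   # a whole bracket atom, e.g. [Na+]
_OUTSIDE = re.compile(r"Cl|Br|.")      # outside brackets: only Cl and Br are two-letter
_INSIDE = re.compile(r"[A-Z][a-z]|.")  # inside brackets: Na, Ca, Cl, ... stay whole

PAD_ID, UNK_ID = 0, 1                  # hypothetical reserved ids


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, keeping two-letter elements intact."""
    tokens = []
    pos = 0
    for match in _BRACKET.finditer(smiles):
        # Text before the bracket atom uses the "outside" rule.
        tokens.extend(_OUTSIDE.findall(smiles[pos:match.start()]))
        # The bracket atom itself uses the "inside" rule.
        tokens.extend(_INSIDE.findall(match.group()))
        pos = match.end()
    tokens.extend(_OUTSIDE.findall(smiles[pos:]))
    return tokens


def build_vocab(smiles_list):
    """Assign an integer id to every token seen in the given SMILES strings."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for smiles in smiles_list:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles, vocab, max_len):
    """Map a SMILES string to a fixed-length id sequence (truncate or pad)."""
    ids = [vocab.get(t, UNK_ID) for t in tokenize_smiles(smiles)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    print(tokenize_smiles("[Na+].[Cl-]"))   # sodium chloride
    print(tokenize_smiles("CSc1ccccc1Cl"))  # S followed by aromatic c is not scandium
```

A purely character-level encoding would split the sodium in [Na+] into N and a, so the nitrogen token would also appear in compounds that contain no nitrogen. The fixed-length integer sequences returned by encode could then be fed to an embedding layer ahead of a one-dimensional convolutional network.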
Moreover, the experiments revealed that using known side effects to predict unknown side effects yields the best results. This suggests that side effects are highly correlated and likely to occur together, which is also worth studying.
