|
1. INTRODUCTION
The development of new drugs is one of the main ways to treat existing diseases and alleviate pain. However, new drug development is very expensive and the development cycle is long [1-3], and the failure rate is high: most drug candidate compounds cannot pass all stages of clinical trials [4]. Since the 1990s, only about 20 drugs have been approved for marketing by the US Food and Drug Administration (FDA) each year, and this number is still decreasing year by year [5]. According to statistics from CMR International & IMS Health, new drug research and development shows a clear trend: time costs and investment are gradually increasing, while output is significantly decreasing [6]. Therefore, drug research and development remains a globally important topic [7]. To address this challenge and accelerate the drug development process, drug repositioning [8] methods have emerged. Drug repositioning (also called drug repurposing) is a drug development strategy that takes drugs already approved for specific diseases and applies them to new therapeutic areas. This approach can accelerate the drug discovery process, reduce research and development risks and costs, and identify treatments for disease areas that are not covered by existing drugs [9]. It is estimated that drug repositioning can shorten the time of drug discovery from 10-17 years to 3-12 years [10]. Although drug repositioning has succeeded in some cases, most successes so far were discovered by chance, usually through the unexpected emergence of new efficacy or side effects during drug development or after marketing [11]. This emphasizes the need for more systematic and controllable methods that make drug repositioning more efficient, accelerate the drug development process, and reduce uncertainty. Existing computational approaches include methods based on similarity networks: the network-based method studied by Chiang et al.
[12], the random walk with restart prediction model on heterogeneous networks proposed by Chen et al. [13], the heterogeneous network model TL-HGBI [14], the MBiRW network method [15], and the TP-NRWRH method [16]. There are also machine learning methods: the LRSSL matrix calculation method [17], the PreDR method [18-19], and the method of Mehmet Gönen et al. [20], who established similarity matrices for drugs and protein targets and used Bayesian machine learning algorithms to predict new associations between drugs and target proteins. To improve drug screening techniques, explore new uses of older drugs, and provide patients with more treatment options, this article proposes a drug repositioning method based on network pharmacology. The idea is to treat the human protein-protein interaction (PPI) network as a map. When the body is diseased, the disease-causing factors produce local perturbations on the map; these factors aggregate in a corresponding region of the network and form a disease module. Similarly, the effect of a drug can be seen as a local perturbation that forms a drug module. To treat a specific disease effectively, the target proteins of the drug should lie within or near the corresponding disease module on the map [21-22]. Combining this concept with network pharmacology, the relationship between drug targets and disease-specific proteins is quantified through distance and similarity, thereby predicting potential drug-disease relationships. The main innovations of this method are as follows: (1) Two distance calculation methods were designed to calculate the distance between the drug module and the disease module in the PPI network: one uses the shortest path length, and the other combines node degree with the shortest path length. (2) A similarity calculation method was designed that converts the distance calculated in (1) into a similarity through a custom transformation.
2. DATASET INTRODUCTION
This article uses drug data, disease data, and gene interaction data from the public dataset [23]. The composition of the dataset is shown in Table 1, and the sources of the data are described below.
Table 1. Data set
The drug data used in this experiment is drug target gene data sourced from DrugBank [24]. DrugBank is a public database that combines detailed drug data (i.e. chemistry, pharmacology, and pharmaceutics) with comprehensive drug target information (i.e. sequence, structure, and action pathways). At present, there are over 1710 drugs approved by the FDA for marketing. This article selects a total of 1877 FDA approved drugs with at least one target gene. The disease-gene relationship data is sourced from Phenopedia [25], which collects disease-centered research data and forms part of an online encyclopedia of human genome epidemiology. Phenopedia provided genetic data for 3255 diseases. The human protein–protein interactome is sourced from the public dataset [26]. This updated version of the human interactome consists of 217160 protein interactions (edges) among 15970 unique proteins (nodes).
3. NETWORK MODEL PERFORMANCE EVALUATION
This article uses the classic evaluation indicator, the area under the receiver operating characteristic (ROC) curve (AUC). To calculate the AUC, the ROC curve is drawn first; the area under the curve is the AUC value, which lies between 0 and 1, and a larger value indicates better prediction performance. The ROC curve is drawn on a two-dimensional plane. Its horizontal axis is the false positive rate (FPR), the proportion of all negative samples that the classifier predicts as positive. Its vertical axis is the true positive rate (TPR), the proportion of all positive samples that the classifier predicts as positive.
The false positive rate and true positive rate are calculated as follows:
$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}$$
Here, a sample that is positive and predicted to be positive is a true positive (TP); a sample that is positive but predicted to be negative is a false negative (FN); a sample that is negative but predicted to be positive is a false positive (FP); and a sample that is negative and predicted to be negative is a true negative (TN).
4. ALGORITHM MODEL
This article proposes a network-based drug repositioning algorithm, as shown in Figure 1. The network model takes drug target gene data, disease gene data, and the human protein–protein interactome as inputs, and constructs a predicted drug-disease association network through module distance calculation, reference distance distribution, similarity calculation, clustering analysis, and other steps. The model is detailed as follows. In the network model, distance calculation is performed using the human protein–protein interactome as the background. As shown in Figure 2, the interactome is represented as a network with proteins as nodes and protein interactions as edges. The drug target genes and disease genes are nodes in this network and are grouped into drug modules and disease modules, which are located at different positions in the network. This article calculates the distance between the drug module and the disease module with two methods, one based on the shortest path length and one combining node degree with the shortest path length, and constructs a drug-disease network. The formulas are given in equations (1) and (2). The closer the value of equation (1) is to 1, the closer the drug module is to the disease module; the closer the value of equation (2) is to 2, the closer the drug module is to the disease module.
In these formulas, d(T, S) is the distance between the drug targets t in drug module T and the disease genes s in disease module S; G is the network; Dmax(G) is the diameter of graph G; dis(t, s) is the shortest path length between nodes t and s; deg_t is the degree of node t, and deg_G is the maximum degree in graph G. To make the calculated distances more meaningful, this article constructs a reference distance distribution, which corresponds to the expected distance between two randomly selected groups of proteins in the human interactome that have the same size and degree distribution as the original disease protein set and drug target set. The random selection is repeated 1000 times, and a Z value is calculated from the mean μ and standard deviation σ of the reference distance distribution, Z = (d(T, S) − μ) / σ. Based on the calculation results, this article sets a threshold of 0.95 and removes values with Z<0.95 to ensure more meaningful experimental results. To better capture the association between drugs and diseases, the algorithm uses a custom transformation that converts the distance between modules into a similarity in the range [0, 1]. After obtaining the drug-disease similarity, the algorithm applies the ClusterONE clustering method to the similarities between drug and disease modules, so that members of the same cluster are more similar to each other than to members of other clusters. ClusterONE treats the similarities as a weighted graph, where each node represents a drug or disease, each edge represents a possible interaction, and the edge weight represents the confidence of the interaction. The algorithm calculates a score f(v) for each disease module through a formula, representing the cohesion of the module.
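The cohesion score can be illustrated with a short sketch. It assumes the usual ClusterONE form f = W_in / (W_in + W_out + P), which matches the three quantities named in the text; the function name `cohesion`, the edge-list layout, and the toy drug and disease labels are hypothetical and only serve the example.

```python
def cohesion(edges, cluster, penalty=0.0):
    """Cohesion of a cluster, assuming f = W_in / (W_in + W_out + P).

    edges   : iterable of (node_u, node_v, weight) tuples (undirected)
    cluster : collection of node names forming the cluster
    penalty : the penalty term P, passed in as a plain number
    """
    members = set(cluster)
    # Total weight of edges with both endpoints inside the cluster.
    w_in = sum(w for u, v, w in edges if u in members and v in members)
    # Total weight of edges with exactly one endpoint inside the cluster.
    w_out = sum(w for u, v, w in edges if (u in members) != (v in members))
    denom = w_in + w_out + penalty
    return w_in / denom if denom > 0 else 0.0


# Toy similarity graph (hypothetical weights).
toy_edges = [
    ("drugA", "dis1", 0.9),
    ("drugB", "dis1", 0.7),
    ("drugB", "dis2", 0.4),
]
toy_cluster = {"drugA", "drugB", "dis1"}

score_no_penalty = cohesion(toy_edges, toy_cluster)
score_with_penalty = cohesion(toy_edges, toy_cluster, penalty=2.0)
print(score_no_penalty, score_with_penalty)
```

A larger penalty lowers the score of every cluster, which discourages clusters that are held together by only a little internal weight.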
The cohesion score follows the ClusterONE definition:
$$f(v) = \frac{W_{in}}{W_{in} + W_{out} + P}$$
where W_in represents the total weight of all edges within the cluster, W_out represents the total weight of edges connecting the cluster to the rest of the network, and P is the penalty term. The underlying assumption is that when a drug and a disease are in the same cluster, the drug is more likely to be effective for the disease. Therefore, the algorithm uses the cohesion f(v) to reward associations between drugs and diseases in the same cluster: for such pairs, the similarity is adjusted to better reflect their potential association. In the final step, the algorithm uses a sigmoid function to convert the adjusted similarity of each drug-disease pair into a probability, where x is the adjusted similarity and c and d are adjustable parameters. In summary, the algorithm generates a list of predicted probabilities for the relationships between small-molecule drugs and disease modules. This list can be viewed as a weighted bipartite network, in which one group of nodes represents drugs and the other represents diseases. Each drug-disease pair is connected by a weighted edge that reflects the degree of association between the drug and the disease.
5. EXPERIMENTAL DESIGN AND RESULT ANALYSIS COMPARISON
5.1 Experimental Design
To evaluate the effectiveness of the algorithm, we selected 5 human diseases as input data: Rheumatoid Arthritis, Diabetes Mellitus, Heart Arrest, Hypertension, and Multiple Sclerosis. The medications for these five diseases share some commonalities, which is conducive to experimental analysis. In addition, relatively complete data are available on the genes related to these diseases, and their symptoms are well documented in the literature. This experiment mainly used two distance calculation methods.
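As a rough illustration of the distance step and the reference distribution, the sketch below works on a toy network. The exact forms of equations (1) and (2) are not reproduced in this text, so the normalizations used here are assumptions chosen only to match the stated ranges (method 1 at most 1, method 2 at most 2); all function names and the toy graph are hypothetical, and the degree-matched sampling is one simple way to keep size and degree distribution fixed.

```python
import random
import statistics
from collections import deque


def bfs_lengths(adj, source):
    """Shortest-path length (number of edges) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Longest shortest path; assumes a connected graph (fine for a toy example)."""
    return max(max(bfs_lengths(adj, n).values()) for n in adj)


def path_score(adj, targets, disease_genes, d_max):
    """Method 1 (assumed form): 1 - mean over targets of dis(t, nearest s) / Dmax(G)."""
    total = 0.0
    for t in targets:
        dist = bfs_lengths(adj, t)
        total += min(dist[s] for s in disease_genes) / d_max
    return 1.0 - total / len(targets)


def path_degree_score(adj, targets, disease_genes, d_max):
    """Method 2 (assumed form): method 1 plus the mean of deg_t / deg_G, so at most 2."""
    max_deg = max(len(nbrs) for nbrs in adj.values())
    deg_term = sum(len(adj[t]) / max_deg for t in targets) / len(targets)
    return path_score(adj, targets, disease_genes, d_max) + deg_term


def degree_matched_sample(adj, nodes, rng):
    """Draw a random node set with the same size and degree sequence as `nodes`."""
    by_degree = {}
    for n, nbrs in adj.items():
        by_degree.setdefault(len(nbrs), []).append(n)
    chosen = set()
    for n in sorted(nodes):  # sorted for reproducibility
        candidates = [c for c in by_degree[len(adj[n])] if c not in chosen]
        chosen.add(rng.choice(candidates))
    return chosen


def z_score(adj, targets, disease_genes, score_fn, n_iter=1000, seed=0):
    """Standardize the observed score against degree-matched random module pairs."""
    rng = random.Random(seed)
    d_max = diameter(adj)
    observed = score_fn(adj, targets, disease_genes, d_max)
    reference = [
        score_fn(adj, degree_matched_sample(adj, targets, rng),
                 degree_matched_sample(adj, disease_genes, rng), d_max)
        for _ in range(n_iter)
    ]
    sigma = statistics.pstdev(reference)
    return (observed - statistics.mean(reference)) / sigma if sigma > 0 else 0.0


# Toy network (hypothetical): a chain A-B-C-D-E with an extra node F attached to B.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

drug_targets, disease_genes = {"B", "C"}, {"D", "E"}
d_max = diameter(adj)
s1 = path_score(adj, drug_targets, disease_genes, d_max)
s2 = path_degree_score(adj, drug_targets, disease_genes, d_max)
z = z_score(adj, drug_targets, disease_genes, path_score, n_iter=200)
print(s1, s2, z)
```

On a network the size of the human interactome, computing the diameter by running a search from every node is expensive, so in practice it would be computed once and cached.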
The first method is the path length method: in the human PPI network, the distance between the drug gene nodes in the drug module and the disease gene nodes in the disease module is calculated according to formula (1). The closer the result is to 1, the closer the drug module is to the disease module. The second method combines path length with node degree, as shown in formula (2), where node degree is the number of connections between a gene node and other gene nodes in the network. The maximum value calculated by this method is 2, and the value is significantly correlated with the potential association between drugs and diseases. After the distance between the drug module and the disease module is calculated, the remaining steps of the model are applied in sequence to obtain the probability of an association between the small-molecule drug and the disease. The probability lies in [0, 1]; the closer it is to 1, the more likely the small-molecule drug is to affect the disease. Many new associations appear in the predicted association table, which suggests that some already approved drug molecules may be able to treat additional diseases. This will contribute to subsequent drug development.
5.2 Result Analysis and Comparison
To analyze the performance of the network model, the two methods (the path length method and the path length combined with node degree method) were evaluated with ROC curves, as shown in Figures 4 and 5. The known drug-disease associations were downloaded from the literature [17]; a label of 1 corresponds to a known drug-disease association, and 0 indicates no known association (absent or potential). For the specified p-value threshold, the true positive rate (i.e.
sensitivity) is calculated as the proportion of known associations that are correctly predicted, while the false positive rate (i.e. 1 − specificity) is calculated as the proportion of unknown associations that are predicted as associations. ROC curves are drawn under different thresholds based on these indicators, and the corresponding area under the curve (AUC) is calculated. The higher the AUC, the better the algorithm distinguishes between the two classes (known and unknown drug-disease associations). According to the experimental results, the AUC of the first method reached 0.803, while the AUC of the second method reached 0.829. This means that the method has a high probability of distinguishing the positive class (known drug-disease associations) from the negative class (unknown drug-disease associations). Among network-based drug repositioning methods for predicting direct drug-disease associations, the SAveRUNNER algorithm published by Giulia Fiscon et al. [23] is a classic drug repurposing algorithm based on network medicine. It uses a network-based similarity measure to quantify the interplay between drug modules and disease modules in the human interactome, and constructs a bipartite drug-disease network that prioritizes associations between drugs and diseases located in the same network neighborhood. SAveRUNNER performs well in identifying well-known drug indications and is highly practical, as it can quickly highlight potential new indications for drugs that are already approved and used in clinical practice. Its performance in identifying known drug-disease relationships reached an AUC of 0.73. Compared with SAveRUNNER, the model studied in this article achieves a higher AUC, exceeding 0.80, which indicates improved performance in predicting drug-disease relationships.
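The ROC and AUC evaluation described in this section can be sketched in a few lines. This is a generic threshold sweep with trapezoidal integration, not the evaluation script used in the experiments; the function names and the toy labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low.

    labels : 1 for a known drug-disease association, 0 for an unknown one
    scores : predicted association probabilities, same order as labels
    Assumes at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Pairs with the same score are added together so that ties form one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: three known associations and three unknown ones.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(round(toy_auc, 4))
```

In the toy example, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9; this pair-ranking reading is a convenient way to interpret values such as 0.803 and 0.829.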
6. SUMMARY
With the continuous growth of drug and disease data, traditional experimental verification methods are inefficient. Therefore, computational methods are increasingly used as auxiliary means in drug development, and drug repositioning provides strong technical support for new drug development. This article takes human protein-protein interactome data as the background and converts it into an undirected network graph. Module distance calculation, construction of a reference distance distribution, clustering analysis, and other methods are used to study and predict potential associations between drugs and diseases, providing a theoretical basis for drug research and development. With the progress of scientific research and the updating of medical data, we believe that drug repositioning technology will develop further.
REFERENCES
Paul S M, Mytelka D S, Dunwiddie C T, et al.,
“How to improve R&D productivity: the pharmaceutical industry’s grand challenge,” Nature Reviews Drug Discovery, 9(3), 203–214 (2010). https://doi.org/10.1038/nrd3078
Kola I, Landis J., “Can the pharmaceutical industry reduce attrition rates?,” Nature Reviews Drug Discovery, 3(8), 711–715 (2004). https://doi.org/10.1038/nrd1470
Dickson M, Gagnon J P., “Key factors in the rising cost of new drug discovery and development,” Nature Reviews Drug Discovery, 3(5), 417–429 (2004). https://doi.org/10.1038/nrd1382
Yazdanian M, Briggs K, Jankovsky C, et al., “The “high solubility” definition of the current FDA Guidance on Biopharmaceutical Classification System may be too strict for acidic drugs,” Pharmaceutical Research, 21(2), 293 (2004). https://doi.org/10.1023/B:PHAM.0000016242.48642.71
Gagnon M A, Lexchin J., “The cost of pushing pills: a new estimate of pharmaceutical promotion expenditures in the United States,” PLoS Medicine, 5(1), e1 (2008). https://doi.org/10.1371/journal.pmed.0050001
Lipinski C A, Lombardo F, Dominy B W, et al., “Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings,” Advanced Drug Delivery Reviews, 64(1-3), 4–17 (2012). https://doi.org/10.1016/j.addr.2012.09.019
Famm K, Litt B, Tracey K J, et al., “Drug discovery: A jump-start for electroceuticals,” Nature, 496(7444), 159–161 (2013). https://doi.org/10.1038/496159a
Pushpakom S, Iorio F, Eyers PA, Escott KJ, Hopper S, Wells A, et al., “Drug repurposing: progress, challenges and recommendations,” Nat Rev Drug Discov., 18(1), 41–58 (2019). https://doi.org/10.1038/nrd.2018.168
Chong C R, Sullivan D J Jr., “New uses for old drugs,” Infect Dis Clin North Am, 353(1), 653–664 (1989).
Hurle M R, Yang L, Xie Q, et al., “Computational Drug Repositioning: From Data to Therapeutics,” Clinical Pharmacology & Therapeutics, 93(4), 335–341 (2013). https://doi.org/10.1038/clpt.2013.1
Ashburn T T, Thor K B., “Drug repositioning: identifying and developing new uses for existing drugs,” Nature Reviews Drug Discovery, 3(8), 673–683 (2004). https://doi.org/10.1038/nrd1468
Chiang A P, Butte A J., “Systematic Evaluation of Drug-Disease Relationships to Identify Leads for Novel Drug Uses,” Clinical Pharmacology & Therapeutics, 86(5), 507–510 (2009). https://doi.org/10.1038/clpt.2009.103
Chen X, Liu M X, Yan G Y., “Drug-target interaction prediction by random walk on the heterogeneous network,” Molecular BioSystems, 8(7), 1970 (2012). https://doi.org/10.1039/c2mb00002d
Wang W, Yang S, Zhang X, et al., “Drug repositioning by integrating target information through a heterogeneous network model,” Bioinformatics, 30(20), 2923–2930 (2014). https://doi.org/10.1093/bioinformatics/btu403
Luo H, Wang J, Li M, et al., “Drug repositioning based on comprehensive similarity and Bi-Random Walk algorithm,” Bioinformatics, 32(17), btw228 (2016). https://doi.org/10.1093/bioinformatics/btw228
Liu H, Song Y, Guan J, et al., “Inferring new indications for approved drugs via random walk on drug-disease heterogenous networks,” BMC Bioinformatics, 17(17 Supplement) (2016).
Liang X, Zhang P, Yan L, et al., “LRSSL: predict and interpret drug-disease associations based on data integration using sparse subspace learning,” Bioinformatics, btw770 (2017).
Wang Y, Chen S, Deng N, et al., “Drug Repositioning by Kernel-Based Integration of Molecular Structure, Molecular Activity, and Phenotype Data,” PLOS ONE, 8(11), e78518 (2013). https://doi.org/10.1371/journal.pone.0078518
Boser B E, Guyon I M, et al., “A training algorithm for optimal margin classifiers,” in Proceedings of Annual ACM Workshop on Computational Learning Theory, 144–152 (2008).
Gönen M, et al., “Drug susceptibility prediction against a panel of drugs using kernelized Bayesian multitask learning,” Bioinformatics (Oxford, England) (2014).
Fang J, Zhang P, Wang Q, Zhou Y, Chiang C-W, Chen R, et al., “Network-based Translation of GWAS Findings to Pathobiology and Drug Repurposing for Alzheimer’s Disease,” medRxiv (2020).
Zhou Y, Hou Y, Shen J, Huang Y, Martin W, Cheng F., “Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2,” Cell Discov, 6(1), 1–18 (2020). https://doi.org/10.1038/s41421-020-0153-3
Fiscon G, Conte F, Farina L, et al., “SAveRUNNER: a network-based algorithm for drug repurposing and its application to COVID-19,” (2020).
Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, et al., “DrugBank 5.0: a major update to the DrugBank database for 2018,” Nucleic Acids Res, 46(Database issue), D1074 (2018). https://doi.org/10.1093/nar/gkx1037
Yu W, Clyne M, Khoury MJ, Gwinn M., “Phenopedia and Genopedia: disease-centered and gene-centered views of the evolving knowledge of human genetic associations,” Bioinformatics, 26(1), 145–146 (2010). https://doi.org/10.1093/bioinformatics/btp618
Cheng F, Desai RJ, Handy DE, Wang R, Schneeweiss S, Barabasi A-L, et al., “Network-based approach prediction and population-based validation of in silico drug repurposing,” Nat Commun, 9(1), 2691 (2018). https://doi.org/10.1038/s41467-018-05116-5
|