Current Advances in Computational Strategies for Drug Discovery in Leishmaniasis (Tropical Diseases Due to Protozoa and Helminths) Part 1

Introduction

Leishmaniasis is a complex disease caused by several species of the genus Leishmania. It ranges in severity from cutaneous and mucocutaneous lesions to the chronic visceral form, which can be fatal if not treated adequately. The disease occurs in 98 countries, 85 of which are developing or poor countries. One of the main problems in leishmaniasis is the limited number of drug options, together with the adverse effects they can cause, including death (Ahasan., et al. 1996; Sundar & Chakravarty 2010; Oliveira., et al. 2011). In addition, there are reports of treatment failure due to increased parasite resistance to the first-choice drugs, the antimonials (Faraut-Gambarelli., et al. 1997; Goyeneche-Patino., et al. 2008). Second-choice drugs, such as amphotericin B, pentamidine, paromomycin and, more recently, miltefosine, also have toxic effects that require hospital management (Maltezou 2008; Oliveira., et al. 2011). Miltefosine, the only orally administered drug for leishmaniasis, has not been tested in many Leishmania species. Recently, central nervous system toxicity was reported for liposomal amphotericin B therapy used to treat cutaneous leishmaniasis (Glasser & Murray 2011).

In the search for new drug targets in Leishmania, several proteins have been proposed, mainly on the basis of their known function, expression level and localization, or because they are involved in important metabolic processes in the parasite. Topoisomerases (Das., et al. 2008), kinases (de Azevedo & Soares 2009) and proteins localized or targeted to lysosomes (Carrero-Lerida., et al. 2009) are some of the potential Leishmania drug targets. However, none of these protein targets has yet led to new drugs that can replace the existing therapies.

The massive genome sequencing of many medically important microorganisms, together with protein structure and drug databases and the development of new computational tools, now allows molecular targets and new drugs to be searched for in a more rigorous manner. Three Leishmania genomes, L. major, L. infantum and L. braziliensis (Peacock., et al. 2007), have been sequenced and annotated, and a fourth species, L. mexicana, and some L. major strains are in the process of being sequenced (GeneDB, http://www.genedb.org; University of Washington Genome Sequencing Center, http://genome.wustl.edu/gsc/gschmpg.html). These genomes and their annotated proteins can be used in a rational manner to predict novel drug targets and provide a basis for developing new drugs.

The computational prediction of drugs, in addition to the evaluation of drugs already synthesized and used in other diseases, must be coupled with automated in vitro methodologies for assessing these compounds. In the case of Leishmania, transgenic parasites expressing GFP (Varela., et al. 2009) or luciferase (Lang., et al. 2005), coupled with techniques such as flow cytometry or fluorometry, can be used to rapidly evaluate potential anti-leishmanial drugs. The WHO program for training in tropical diseases research has created a network based on reporter gene technology to foster the search for drugs not only against leishmaniasis but also against other diseases with limited therapeutic options.

Selection of drug targets

An initial step in the drug discovery process involves the search for and selection of the drug target. This target is frequently a protein that is essential for the organism's survival or critical for regulating a particular signaling pathway. In the specific case of parasites, inhibition of the target protein should impair or delay parasite viability. The classical approach to finding a new essential protein that can act as a potential target is experimental characterization using gene knockout or knock-down strategies in the target organism. Besides essentiality, some targets are selected because they are specific to the pathogen; for example, the ergosterol pathway is present in fungi and Leishmania spp., whereas humans only contain the enzymes required for the synthesis of cholesterol. This is why this pathway has been exploited in the search for drugs against mycotic pathogens and also Leishmania. However, the experimental approach employing RNA interference (RNAi) is not feasible because Leishmania species do not carry the machinery for RNAi (Peacock., et al. 2007), with the exception of Leishmania braziliensis, where some RNAi-associated genes have been found. In addition, the essentiality of a particular protein can change dramatically depending on the parasite stage. With all these constraints, a rational alternative for choosing effective targets is a more systematic study of the biology of the parasite, with the aim of uncovering important mechanisms that are not evident from the descriptive study of isolated proteins. A starting point for this "systems view" of parasite biology in the case of Leishmania was the sequencing of its genome in 2005 (Ivens., et al. 2005). Since then, more high-throughput data have been generated, not at the same rate as for other organisms but with important applications for drug discovery in tropical diseases. This leads to an important issue of data analysis, where computational tools can help to narrow the ocean of possible drug candidates for this disease, making the experimental setup more efficient and less costly. In the following sections, we describe the current computational methods that can be applied to find new drug targets, with special application to the Leishmania parasite.

Selection of targets by homology searching

The simplest approach to finding a drug target is a homology search against essential proteins. Essentiality data are available at the genome-wide level for several organisms (Forsyth., et al. 2002; Kamath., et al. 2003; Hu., et al. 2007). In model organisms such as yeast, the phenotypic effects of deleting particular genes have been shown (Giaever., et al. 2002) and, more recently, genetic interactions have been studied on a large scale (Costanzo., et al. 2010). These data have been used to elucidate redundancy and possibly some synergistic effects among genes. Therefore, it is possible to find orthologs in the organism of interest that could be essential by comparing its sequences against the list of essential genes in model organisms. The Database of Essential Genes (http://tubic.tju.edu.cn/deg/) (Zhang & Lin 2009) provides information on essential genes in prokaryotes and eukaryotes, and also allows a BLAST search with the protein of interest. This resource is useful for an exploratory search of the essentiality of a particular protein. Another important resource, for drug target data, is the DrugBank database (http://www.drugbank.ca/) (Knox., et al. 2011), which can be used to extract drug-target interactions along with additional pharmacological data. The same strategy can be employed in this case, with the advantage that the homology search will also return possible drug candidates that can be tested on the protein found to be homologous to the DrugBank target.

This methodology has been applied in Pseudomonas aeruginosa (Sakharkar., et al. 2004) with the aim of detecting new drug targets, because this bacterium is an important problem in nosocomial settings due to the rapid generation of resistance. In Leishmania, drug targets can also be identified by this approach. Tools like BLAST or PSI-BLAST can be employed, with PSI-BLAST being more sensitive for detecting distant relationships among proteins (Altschul., et al. 1997). However, some false positives can still occur due to alignments that are optimal according to the algorithm but not biologically meaningful. The E value helps to detect those alignments that are significant. As an example, running a PSI-BLAST search with the Leishmania major proteome against the DrugBank database, one can find among the potential Leishmania orthologs of known targets the protein LmjF36.2430, which is similar to the sterol 14-alpha demethylase in fungi. Drugs such as miconazole are known inhibitors of this enzyme. Interestingly, the protein LmjF19.0450 belongs to the group of protein kinases conserved in other Leishmania species; it is constitutively expressed and has significant similarity to other kinase targets in cancer. These are simple cases of how a homology search can generate a list of potential drug targets using existing genomic data. The main advantage of this methodology is that it offers a quick overview of potential targets and of possible second uses for existing drugs. In addition, the STITCH 2 database (http://stitch.embl.de/) (Kuhn., et al. 2010) compiles known and predicted drug-target relationships jointly with biological information about targets in a network-based view.
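The E-value filtering step described above can be sketched in a few lines. This is only a minimal illustration, assuming that the search results were saved in the 12-column tabular format produced by BLAST+ with `-outfmt 6`; the cutoff of 1e-5, the function name and the subject identifiers (`target_A` and so on) are hypothetical and are not taken from the study.

```python
import csv
import io

# Column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(handle, max_evalue=1e-5):
    """Return, per query, the hit with the lowest E-value at or below max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        if evalue > max_evalue:
            continue  # alignment not considered significant
        query = hit["qseqid"]
        if query not in best or evalue < best[query][1]:
            best[query] = (hit["sseqid"], evalue)
    return best


# Hypothetical hits; the numbers and subject names are placeholders.
example = io.StringIO(
    "LmjF36.2430\ttarget_A\t38.5\t450\t260\t8\t10\t459\t5\t450\t2e-80\t250\n"
    "LmjF36.2430\ttarget_B\t25.0\t120\t85\t3\t40\t159\t60\t179\t0.5\t30.1\n"
    "LmjF19.0450\ttarget_C\t41.2\t300\t170\t4\t1\t300\t12\t311\t1e-60\t200\n"
)
hits = significant_hits(example)
for query, (subject, evalue) in sorted(hits.items()):
    print(query, subject, evalue)
```

A stricter or looser cutoff changes the balance between missed homologs and false positives, so the threshold is best chosen with the size of the searched database in mind.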

Despite its simplicity, the homology search strategy has some caveats. Proteins inside the cell perform specific functions depending on their interactions, and these interactions can vary between species. Even if sequences are highly related, pathway conservation is not necessarily present. In addition, temporal regulation is important, as not all the interactions are active at the same time, which can further complicate the analysis. These problems highlight the importance of detecting targets by incorporating more detailed information about the molecular interactions.

Selection of targets by topological analysis of protein networks

In order to better understand complex pathogens such as Leishmania and to improve the efficiency of the drug discovery process, it is crucial to gain deeper knowledge about how protein interactions are established and how these interactions are regulated. This is a central issue for a more accurate definition of essentiality and biological robustness. These interactions can be described as a network, a representation commonly used to describe complex systems. The protein interaction network (interactome) describes all possible molecular interactions among proteins. The interactome is composed of nodes, which represent the molecular components (in this case proteins), and edges, which are the interactions between components (Fig. 1). Depending on the biological function of the node, other types of networks can also be constructed; for example, gene networks in which transcription factors are nodes that regulate other genes by binding (edges), and metabolic networks in which the nodes are enzymes connected by the production of metabolites. The study of networks comes from a mathematical discipline called graph theory, and the analysis of the interaction patterns in the network is defined as network topology (Barabasi & Oltvai 2004).

Fig. 1. Schematic representation of a protein network. Yellow circle corresponds to a hub protein, green circles correspond to bottleneck proteins connecting several sub-networks. Lines connecting circles represent the edges of the network.
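The node-and-edge description of an interactome maps directly onto a simple data structure. The following sketch, which uses an invented toy network with hypothetical protein names, stores an undirected network as an adjacency map and derives the degree of each node and the degree distribution.

```python
from collections import Counter, defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical toy interactome: H is highly connected, B links two sub-networks.
edges = [("H", "P1"), ("H", "P2"), ("H", "P3"), ("H", "B"),
         ("B", "Q1"), ("Q1", "Q2"), ("Q1", "Q3")]
network = build_network(edges)

# Degree = number of interaction partners of each protein.
degree = {node: len(partners) for node, partners in network.items()}
# Degree distribution = how many proteins have each degree.
degree_distribution = Counter(degree.values())
print(degree)
print(sorted(degree_distribution.items()))
```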

To detect protein interactions in biological systems, large-scale methods have been developed that can map all possible pairwise interactions. Yeast two-hybrid is a popular technique of this kind, which was used to construct the first interactome (Uetz., et al. 2000). The technique involves the fusion of a protein with the DNA-binding domain subunit of a transcription factor. This protein is called the bait. The second protein is fused to an activator domain subunit and is called the prey. If the bait and the prey interact, the two transcription factor subunits come together and the expression of the reporter gene is activated (Osman 2004). The most important limitation of this method is the high number of false positives. However, recent evidence has shown that a combination of experimental methods reduces the number of false interactions (Dreze., et al. 2010).

The initial studies of the yeast interactome revealed that the network was not organized randomly, and that its organization pattern was in fact similar to that of other experimentally observed networks. This particular network structure was called scale-free. It was elucidated by analyzing the number of interactions (the degree) of each protein in the yeast interactome, which showed that a few nodes were much more highly connected than the rest and that such nodes occurred at relatively low frequency in the network. In a scale-free structure the node degree follows a power law distribution, which describes the probability of a node having a certain degree. An interesting consequence of a scale-free structure is that the network is robust against random deletion of nodes, but susceptible to the deletion of highly connected nodes, or hubs (Jeong., et al. 2001). Hubs can be detected by measuring the connectivity, or degree, of each node in the network. In addition, the scale-free network is also susceptible to deletion of other types of nodes that are not highly connected but control the flux through the network; these nodes are called bottlenecks (Yu., et al. 2007). Scaffold proteins are a classical example of bottleneck nodes (Good., et al. 2011); these proteins facilitate communication between signalling pathways very efficiently, although they are sometimes not highly connected. Deleting a bottleneck node will disrupt cellular homeostasis by destroying communication between processes in the cell. This network biology approach is an important step towards a systems-level understanding of the biology of parasites like Leishmania, and it is very useful for detecting essential nodes that may constitute potential new drug targets.
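The difference between hubs and bottlenecks can be made concrete with a small calculation. The sketch below computes betweenness centrality, a common measure for bottleneck detection, with Brandes' algorithm for an unweighted, undirected network. The toy network and its node names are hypothetical; in it, node B has only two partners but lies on every path between the two sub-networks.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalised betweenness (Brandes' algorithm), undirected and unweighted."""
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v precedes w on a shortest path
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to the source.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1.0 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Every undirected path was counted once from each endpoint.
    return {node: value / 2.0 for node, value in centrality.items()}


# Hypothetical toy network: H is a hub, B bridges the H and K sub-networks.
edges = [("H", "P1"), ("H", "P2"), ("H", "P3"), ("H", "P4"), ("H", "B"),
         ("B", "K"), ("K", "Q1"), ("K", "Q2")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

degree = {node: len(partners) for node, partners in adjacency.items()}
betweenness = betweenness_centrality(adjacency)
for node in ("H", "B", "K"):
    print(node, degree[node], betweenness[node])
```

In this example B has a lower degree than K but a higher betweenness, which is the pattern expected for a bottleneck that a degree-only ranking would miss.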

Construction of the Leishmania protein interaction network

The analysis of the Leishmania protein network could lead to the discovery of new and effective drug targets. However, current protein interaction data in Leishmania have only focused on a few specific proteins, and at this time, no yeast two-hybrid data is available for this organism. Despite this limitation, the use of a computationally-predicted protein network from orthology-based methods is a good first step for the exploration of drug targets that may be more informative than a traditional homology search. The results described in the next section will focus on the current status of the predicted Leishmania major interactome and will give some directions for future experimental studies for network and target validation.

Even when protein domain sequences are conserved, multiple combinations of these domains enable an organism to rewire the interactome in different ways. This can overcome the problem of the context of the targets that influences essentiality, and enables new hubs or protein targets to be detected. A common disadvantage is the bias towards detection of conserved interactions, which could be a caveat in the case of organism-specific interactions that may also be important for survival. These specific interactions will only be detected when more data become available, which will also allow existing predictions to be validated. In our recent study (Florez., et al. 2010), the protein interaction network in Leishmania major was predicted using only the parasite protein sequences and several protein interaction databases, in particular iPfam (Finn., et al. 2005), PSIMAP (Park., et al. 2005) and PEIMAP. These databases included protein-protein interactions defined by analysis of structures of protein complexes and experimental data extracted from the literature, including high-throughput experiments. From the structures, the interacting structural domains were mapped to the sequence, using the domain definitions of Pfam (Finn., et al. 2006) and SCOP (Hubbard., et al. 1997). These two databases contained domain information with a systematic classification into protein families. In this particular case, the physical distance between adjacent domains within a complex was used as the criterion for defining an interaction, and the resulting interactions were stored in the iPfam and PSIMAP databases. This strategy has been used in other organisms such as fungi and bacteria (He., et al. 2008; Kim., et al. 2008). The domain interaction analysis generated more diversity in the detection of possible interactions because modular exchange of protein domains allowed the network to be rewired even if the isolated sequence of the domain was conserved. However, despite the high accuracy of this method, the prediction of protein interactions was limited because crystallized protein complexes were not abundant. The PEIMAP database was also used; it included sequences of protein interaction pairs detected by several methods, including co-immunoprecipitation (co-IP) and yeast two-hybrid.

To construct the Leishmania major network, protein sequences were extracted from the GeneDB database. This database included genomic and proteomic information on pathogens, including protozoan parasites. The protein sequences were aligned to the interacting domain pairs using PSI-BLAST against the SCOP 1.71 database with an E-value cutoff of 0.0001, as described previously (Kim., et al. 2008). The PSI-BLAST tool was used for the alignments because it had the advantage of detecting small conserved sequences, such as small domains that would otherwise be missed by the standard BLASTP. The same strategy was applied for the alignments concerning the iPfam database. In this case, the domain assignment for the Leishmania proteins was carried out using the Pfam database (release 18.0), with the hmmpfam tool employed for the alignments. The final set of predicted interactions was obtained by a homology search over the PEIMAP database using BLASTP, with a minimal cutoff of 40% sequence identity and 70% length coverage. The PEIMAP database included protein-protein interaction (PPI) information from six source databases: DIP (Xenarios., et al. 2000), BIND (Bader., et al. 2001), IntAct (Hermjakob., et al. 2004), MINT (Zanzoni., et al. 2002), HPRD (Peri., et al. 2004), and BioGrid (Stark., et al. 2006).
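The homology-based transfer of known interactions can be illustrated as follows. This is a simplified sketch rather than the pipeline used in the study: it assumes that length coverage is measured as aligned length divided by query length (the text does not say which sequence the 70% refers to), and all protein identifiers and function names are hypothetical.

```python
from itertools import product


def passes_cutoff(identity, aligned_length, query_length,
                  min_identity=40.0, min_coverage=0.70):
    """Apply the identity and length-coverage thresholds to a single hit."""
    coverage = aligned_length / query_length  # assumption: coverage of the query
    return identity >= min_identity and coverage >= min_coverage


def transfer_interactions(hits, known_pairs):
    """Map known interacting pairs onto query proteins through accepted hits.

    hits: iterable of (query, subject, identity, aligned_length, query_length)
    known_pairs: iterable of (subject_a, subject_b) interactions from a database
    """
    homologs = {}
    for query, subject, identity, aligned, query_length in hits:
        if passes_cutoff(identity, aligned, query_length):
            homologs.setdefault(subject, set()).add(query)
    predicted = set()
    for a, b in known_pairs:
        for qa, qb in product(homologs.get(a, ()), homologs.get(b, ())):
            if qa != qb:
                predicted.add(tuple(sorted((qa, qb))))
    return predicted


# Hypothetical BLASTP hits and one known interaction between proteins X and Y.
hits = [
    ("Lmj_1", "X", 55.0, 280, 300),  # accepted: 55% identity, 93% coverage
    ("Lmj_2", "Y", 42.0, 350, 400),  # accepted: 42% identity, 88% coverage
    ("Lmj_3", "Y", 80.0, 100, 400),  # rejected: only 25% coverage
]
predicted = transfer_interactions(hits, [("X", "Y")])
print(predicted)
```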

Filtering interactions by using a combined confidence score

As discussed earlier, the reliability of this analysis and its bias to certain types of protein interactions was dependent on the experimental method employed. Therefore, it was necessary to combine results from different databases to increase the coverage and the confidence of the predicted interactions. In the Leishmania major interactome, we used a simple scoring system to identify high confidence interactions. A previous study classified the experimental methods according to their reliability (Chua., et al. 2006), and we used this data in addition to the significance of the sequence alignments to calculate the confidence of the interactions. This scoring system was called the ‘combined score’ method, and it was applied for the confidence calculations in the STRING database (von Mering., et al. 2005). This database is useful for searching predicted protein interactions detected by other methods, although the definitions are beyond the scope of this topic. The score was calculated according to the formula (1):

$$\text{score} = 1 - \prod_{i \in E} (1 - R_i)^{n} \qquad (1)$$

where score was the confidence value, ranging from 0 to 1 with 1 corresponding to 100% accuracy; E was the set of methods under analysis (PEIMAP, PSIMAP, iPfam); Ri was the reliability of method i; and n was the number of interactions predicted by method i. The results of these calculations represented pairs of interactions with their respective confidence. With this information, it was possible to select those interactions that fulfilled a particular confidence threshold. In this case, a confidence score of 0.7 was chosen to select the core Leishmania major network. The threshold can vary depending on how strongly supported the interactions need to be. For us, a confidence value of 0.7 gave a smooth fit to the power law distribution, which was an important condition for reliable detection of hubs and bottlenecks.
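A short worked example may help to show how such a score behaves. The sketch assumes that the methods are combined as one minus the product of (1 - Ri) raised to n over the methods supporting a pair, which is the form described by Chua et al. (2006) and is consistent with the definitions above. The reliability values and protein identifiers are invented for illustration and are not those used in the study.

```python
from math import prod  # available from Python 3.8


def combined_score(evidence, reliability):
    """Assumed form: score = 1 - product over methods of (1 - R_i) ** n_i.

    evidence: {method: number of times the pair was predicted by that method}
    reliability: {method: reliability R_i between 0 and 1}
    """
    return 1.0 - prod((1.0 - reliability[m]) ** n for m, n in evidence.items())


# Hypothetical reliabilities; the real values depend on the experimental methods.
reliability = {"PEIMAP": 0.6, "PSIMAP": 0.5, "iPfam": 0.5}

# Hypothetical protein pairs and the methods that predicted them.
interactions = {
    ("LmjF_A", "LmjF_B"): {"PSIMAP": 1, "iPfam": 1},   # 1 - 0.5 * 0.5 = 0.75
    ("LmjF_A", "LmjF_C"): {"PEIMAP": 1},               # 1 - 0.4 = 0.60
    ("LmjF_B", "LmjF_D"): {"PEIMAP": 2, "PSIMAP": 1},  # 1 - 0.16 * 0.5 = 0.92
}
scores = {pair: combined_score(ev, reliability) for pair, ev in interactions.items()}

# Keep only the interactions that reach the 0.7 confidence threshold.
core = {pair for pair, score in scores.items() if score >= 0.7}
print(sorted(core))
```

The score rises with every additional independent line of evidence, so a pair supported weakly by two methods can outrank a pair supported by a single, more reliable one.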

Topological analysis of the network

Topological metrics such as the clustering coefficient and the mean shortest path help to describe global characteristics of the network. They measure the density of the connections within the network. Densely connected networks are characterized by modular components, which also maintain the robustness of the network against failures. Biological networks tend to have a modular structure (Jeong., et al. 2001), and one additional way to test the reliability of the predicted network is to compare its clustering coefficient and mean shortest path with those of randomly generated networks with the same number of nodes and edges. These metrics should be statistically different between predicted and random networks. In the case of the Leishmania network, 1,000 random networks were generated and their metrics were calculated and compared with those of the original network.
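The comparison against random networks can be sketched on a toy example. The code below computes the average clustering coefficient and the mean shortest path, generates 1,000 random networks with the same number of nodes and edges, and reports a z-score. The uniform random-edge model, the z-score as the comparison statistic and the six-node network are assumptions made for illustration; the text does not state which random model or test was applied.

```python
import random
from collections import deque
from itertools import combinations
from statistics import mean, pstdev


def to_adjacency(nodes, edges):
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def clustering_coefficient(adjacency):
    """Average local clustering coefficient (0 for nodes with fewer than 2 partners)."""
    values = []
    for partners in adjacency.values():
        k = len(partners)
        if k < 2:
            values.append(0.0)
            continue
        links = sum(1 for a, b in combinations(partners, 2) if b in adjacency[a])
        values.append(2.0 * links / (k * (k - 1)))
    return mean(values)


def mean_shortest_path(adjacency):
    """Mean breadth-first distance over all connected pairs of nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else float("nan")


def random_network(nodes, n_edges, rng):
    """Random network with the same number of nodes and edges."""
    return to_adjacency(nodes, rng.sample(list(combinations(nodes, 2)), n_edges))


def z_score(observed, random_values):
    spread = pstdev(random_values)
    return (observed - mean(random_values)) / spread if spread else float("inf")


# Hypothetical modular network: two triangles joined by the edge c-d.
nodes = list("abcdef")
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
observed = to_adjacency(nodes, edges)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_cc = [clustering_coefficient(random_network(nodes, len(edges), rng))
             for _ in range(1000)]
cc_z = z_score(clustering_coefficient(observed), random_cc)
print(clustering_coefficient(observed), mean_shortest_path(observed), cc_z)
```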

The power law fit used to define the scale-free structure can be calculated using the plug-in Network Analyzer v.2.6.1 (Assenov., et al. 2008), available in the Cytoscape platform (Shannon., et al. 2003). This platform provides an advanced environment for network visualization and analysis. Network topology metrics, such as betweenness centrality and connectivity, were calculated using the Hubba server (http://hub.iis.sinica.edu.tw/Hubba) (Lin., et al. 2008). A plug-in version of this tool for Cytoscape was recently made available. For the calculation of the metrics, the confidence scores of the interactions were used, so that detection could be focused on the nodes most likely to be essential within the group of highly supported interactions. From this analysis, a list of potential targets was selected. However, it was possible that some of the proteins detected were also conserved in sequence and function among several organisms, including humans. This becomes a problem if drugs targeting some of these proteins interfere with important biological processes in humans, generating unwanted toxic effects. To avoid this, an additional filter was applied to the list of predicted targets: the Leishmania proteins were aligned to the human proteins, and proteins that were conserved between the two species were excluded.
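For readers who want to see what a power law fit involves, the following sketch estimates the exponent by ordinary least squares on the log-transformed degree distribution. This is one simple and widely used approach; whether it matches the internals of the plug-in is an assumption, and the degree list is synthetic.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Fit count(k) = a * k ** (-gamma) by least squares in log-log space.

    Returns (a, gamma, r_squared).
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(counts[k]) for k in counts]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy) if syy else 1.0
    return 10 ** intercept, -slope, r_squared


# Synthetic degrees that follow count(k) = 1000 * k ** -2 exactly.
degrees = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
a, gamma, r_squared = fit_power_law(degrees)
print(a, gamma, r_squared)
```

An r_squared close to 1 indicates that the degree distribution is well described by a power law, which is the kind of smooth fit used here as a criterion when choosing the confidence threshold.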

Prediction of protein function from network clusters

An important feature of network analysis was the prediction of protein function. The normal procedure for inferring function involved a homology search of the unknown protein against a curated protein database such as UniProt (http://www.uniprot.org/). On some occasions, the detection of protein function was not feasible because no significant similarity could be found. When this approach failed, protein interaction network analysis helped to uncover potential functions. The prediction of protein function from network analysis rested on the assumption, suggested by experimental data, that interacting proteins tend to have related functions. This implied that it was possible to predict the function of neighboring nodes by clustering the network into modules and knowing the function of some of the nodes within each module. This analysis was carried out on the Leishmania network using the Markov Clustering (MCL) algorithm (Enright., et al. 2002), which has been demonstrated to be a robust and fast algorithm for detecting clusters or modules in protein networks (Brohee & van Helden 2006). The algorithm was implemented in the NeAT tool (Brohee., et al. 2008). For proteins of unknown function in the GeneDB database, we predicted their possible biological roles by evaluating the Gene Ontology terms for biological processes using the BinGO plug-in available in Cytoscape.
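The idea that cluster membership can suggest a function is illustrated below with a deliberately simple majority vote. This is not the enrichment analysis performed with BinGO, which applies a statistical test; it is a swapped-in simplification. The clusters, protein names, annotations and the 0.6 support threshold are hypothetical.

```python
from collections import Counter


def predict_functions(clusters, annotations, min_support=0.5):
    """Give each unannotated protein the most common annotation in its cluster.

    clusters: list of lists of protein names (for example, MCL output)
    annotations: {protein: biological process} for proteins of known function
    min_support: minimum fraction of annotated members sharing the top term
    """
    predictions = {}
    for members in clusters:
        known = [annotations[p] for p in members if p in annotations]
        if not known:
            continue  # nothing to transfer in a cluster without annotations
        term, count = Counter(known).most_common(1)[0]
        if count / len(known) < min_support:
            continue  # annotations disagree too much to transfer a function
        for protein in members:
            if protein not in annotations:
                predictions[protein] = term
    return predictions


# Hypothetical clusters and annotations.
clusters = [["k1", "k2", "u1"], ["h1", "t1", "u2"], ["u3", "u4"]]
annotations = {
    "k1": "protein phosphorylation",
    "k2": "protein phosphorylation",
    "h1": "nucleosome assembly",
    "t1": "transport",
}
predictions = predict_functions(clusters, annotations, min_support=0.6)
print(predictions)
```

Only u1 receives a prediction here: the second cluster has conflicting annotations and the third has none, which mirrors the practical limits of function transfer in sparsely annotated networks.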

Selection of candidate drug targets from the network analysis

We constructed a protein-protein interaction (PPI) map by combining the results generated by PEIMAP, iPfam and PSIMAP (Fig. 2). The number of interactions detected for each database is given in Table 1. By merging the data from the different approaches, bias towards a specific class of interactions was avoided. The predicted network also contained isolated sub-networks, which were difficult to analyze. These sub-networks appeared as a consequence of the inability to assign domains or of the lack of homology of those proteins to the known pairs of interacting proteins. These sub-networks could be investigated by further experimental validation of the network. The total number of high-confidence predicted interactions was 33,861 for 1,366 nodes.

Number of proteins	PEIMAP nodes	PEIMAP edges	PSIMAP nodes	PSIMAP edges	iPfam nodes	iPfam edges
8,335	718	14,839	3,184	158,984	2,336	50,398

Table 1. Number of nodes and predicted interactions for each database.

Using the topological metrics of connectivity and betweenness centrality, we identified 384 potential targets. From these targets, those that had homology to human proteins were eliminated. This substantially reduced the number of potential targets, in exchange for an expected higher specificity of drug effects. As explained earlier, toxicity is a very important issue when designing or searching for a drug, since many clinical trials fail because of undesired and severe side effects. After this filter, the final number of targets was reduced to 142. Further filters can be applied to this list to select the targets that are most attractive for drug design (Table 2).
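The two-stage selection described above, topological ranking followed by removal of human homologs, can be sketched as follows. The text does not state the cutoffs that produced the 384 candidates, so the top-fraction rule used here is purely an assumption, and all protein names and metric values are hypothetical.

```python
def select_targets(degree, betweenness, human_homologs, top_fraction=0.1):
    """Keep nodes ranking high by degree or betweenness, then drop human homologs.

    top_fraction is a hypothetical cutoff, not a value taken from the study.
    """
    def top(metric):
        ranked = sorted(metric, key=metric.get, reverse=True)
        keep = max(1, round(len(ranked) * top_fraction))
        return set(ranked[:keep])

    candidates = top(degree) | top(betweenness)  # hubs or bottlenecks
    return sorted(candidates - set(human_homologs))


# Hypothetical metrics for a ten-protein network.
degree = {"p1": 9, "p2": 7, "p3": 2, "p4": 3, "p5": 1,
          "p6": 1, "p7": 2, "p8": 1, "p9": 1, "p10": 1}
betweenness = {"p1": 30.0, "p2": 5.0, "p3": 22.0, "p4": 4.0, "p5": 0.0,
               "p6": 0.0, "p7": 1.0, "p8": 0.0, "p9": 0.0, "p10": 0.0}
human_homologs = {"p2"}  # proteins with a significant hit in the human proteome

targets = select_targets(degree, betweenness, human_homologs, top_fraction=0.2)
print(targets)
```

Here p2 is a hub but is discarded because of its human homolog, while p3 is retained as a bottleneck despite its low degree.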

From the group of targets, 91 kinases were predicted as essential proteins in the network with no homology to the human kinome. Kinases are very important regulators of signaling in the cell, and in the case of Leishmania, kinases are crucial to enable the different metabolic changes needed to adapt to a human host. Perhaps, with intensive pharmacological investigation, drugs that are very successful in treating cancer (e.g., Gleevec) could be used against Leishmania parasites. One particular example from the group of predicted kinases detected in the network is the protein LMPK (LmjF36.6470). This protein has been shown to be essential in Leishmania mexicana (Wiese 1998) and it has conserved orthologs in other species such as L. amazonensis, L. major, L. tropica, L. aethiopica, L. donovani, L. infantum, and L. braziliensis (Wiese & Gorcke 2001). Therefore, this kinase is an interesting candidate for experimental validation, and possibly its upstream and downstream interacting partners could also be inhibited by a combination of drugs. In addition, one of the challenges in this disease is to find a broad-spectrum drug with therapeutic effects on the several Leishmania species that cause different forms of leishmaniasis. Further analysis of this target can help to identify drugs or combinations of drugs that are active against amastigotes, the stage responsible for the disease in mammals. Three ABC transporters that were Leishmania specific (LmjF34.0670, LmjF27.0470 and LmjF32.2060) were also predicted as essential. They confer resistance to antimonials and pentamidine by extruding the drug from the cell (Perez-Victoria., et al. 2002). Based upon our analysis, these proteins could also be interesting drug targets due to their role in the homeostasis of the intracellular parasite environment.

Fig. 2. Visualization of the predicted Leishmania major interactome

GeneDB ID	UniProt ID	Description
LmjF15.0770	Q4QFA8	Protein kinase
LmjF07.0250	Q4QIR9	Protein kinase
LmjF11.0330	Q4QH47	PIF1 helicase-like protein
LmjF35.2450	Q4FWM4	Hypothetical protein conserved
LmjF25.1990	Q4Q9P0	Protein kinase
LmjF21.0853	Q4QCC1	Hypothetical protein conserved
LmjF27.1800	Q4FYE1	Protein kinase-like
LmjF35.1000	Q4FX16	Casein kinase I
LmjF26.0660	Q4Q9C8	Protein disulfide isomerase
LmjF25.2050	Q4Q9N4	Helicase-like protein

Table 2. Top 10 list of predicted targets from the L. major interactome.

It has been shown that modular organization is a prevalent feature in biology, and this modular organization of pathways can be used to infer protein function (Rives & Galitski 2003). We detected 63 clusters or modules in the network, and assigned potential biological processes to 263 proteins with no prior functional description. Examining the proportion of predicted targets by biological process, 64% of the predicted targets were predicted to participate in protein phosphorylation (GO:0006468). In addition, 8% were predicted to be involved in nucleosome assembly (GO:0006334), 4% in nucleic acid metabolic process (GO:0006139), 4% in electron transport (GO:0006118), 4% in transport processes (GO:0006810), and 2% in protein amino acid alkylation (GO:0008213). The remaining 14% of target proteins were distributed across processes with one protein per process. This result highlights the importance of protein kinases as the main protein class to characterize and explore as drug targets in Leishmania parasites.