Abstract
Regulatory networks provide control over complex cell behavior in all kingdoms of life. Here we describe a statistical model, based on representing proteins as collections of domains or motifs, which predicts unknown molecular interactions within these biological networks. Using known proteinprotein interactions of Saccharomyces cerevisiae as training data, we were able to predict the links within this network with only 7% falsenegative and 10% falsepositive error rates. We also use Markov chain Monte Carlo simulation for the prediction of networks with maximum probability under our model. This model can be applied across species, where interaction data from one (or several) species can be used to infer interactions in another. In addition, the model is extensible and can be analogously applied to other molecular data (e.g., DNA sequences).
RECENT achievements in genome sequencing, coupled with advances in cellular biology, have raised hopes for an imminent leap forward in our understanding of the regulatory machinery of life. However, we have yet to make the transition from a linear onedimensional sequence of genes to an integrated multidimensional model of metabolic and regulatory networks. Despite their importance, relatively little is understood, with a major complication being the general lack of data on the mechanism, rate, and even existence, of interactions between known genes and proteins. In fact, only recently have sufficient data sets become available to provide support for the analysis of such largescale networks (Uetzet al. 2000; Xenarioset al. 2000).
Molecularinteraction networks feature proteins, nucleic acids, and small molecules as primary players. Since genes are passive carriers of information, and because there are relatively few enzymatic or structural RNA molecules, the majority of important biological functions are carried out by proteins. Being linear sequences of amino acids at the level of primary structure, at the functional level, proteins can be broken down into segments that correspond to functional domains or conserved motifs. Like amino acids, these domains are discrete “letters,” combinations of which give rise to the diversity of protein form and function.
In this work, we assume that the existence of a network connection between proteins, which may or may not involve a physical interaction between them, is a function of the domain composition of each. For convenience of description, we treat nonprotein network nodes as singledomain proteins. If we move along a network pathway, a domain of an upstream protein may favor interaction with a domain of a downstream protein. In addition to a physical connection, the term “interaction” in our model can represent more general relationships between domains, e.g., information flow. Furthermore, we assume that once a given pair has proven effective, nature will tend to reuse it in other networks within the same organism, as well as in other organisms. Thus the model we describe here is based on quantifying, from data taken from known networks, the frequency with which a domain in one protein is observed immediately upstream or downstream of domains in another protein. We then use this information to infer the probability of unknown interactions. Below, we describe our model and how its parameters are estimated, verify its validity with crossvalidation, and show sample applications to real biological networks.
MODEL DESCRIPTION
The model we now describe assigns probabilities to all possible networks formed from a fixed number of vertices. Note that, rather than trying to reproduce the actual genesis of regulatory networks in evolution, our model has the more modest purpose of providing each network with a probability value in such a way that networks having more features typical of real networks have higher probabilities.
We represent a network as an oriented graph, G = <V, E>, where the vertices, V, correspond to proteins, and the edges, E, correspond to interactions between proteins. Each vertex of the network is composed of one or more domains or motifs, which are identified through comparison with existing databases of protein domains (e.g., Pfam; Batemanet al. 2000). We use the frequency of separate occurrence of domains d_{m} and d_{n} in two connected vertices of a known network to infer probabilities of “attraction” p(d_{m},d_{n}) (i.e., that an oriented edge will be formed) between these domains. As described in detail later, these probabilities are used to determine the probability of individual proteinprotein interactions.
This model has two independent stochastic steps, and the probability of an individual network emerges as a product of the probabilities associated with these two stages. In the first stage, every pair of proteins i and j may be connected to each other with an “attraction” probability p_{ij} (we explain how to compute this probability later) or not connected with probability (1 − p_{ij}). We can imagine this process as being performed by a machine that, for every pair of vertices, tosses imaginary biased coins, each coin specific to a particular pair of proteins. If it is heads, an edge between the two vertices is formed; if it is tails, it does not form. The coin is biased by prior information about the domains in each of the vertices, leading to some edges having a probability >0.5 (attraction) and some edges <0.5 (repulsion). For a network with V vertices, there are 2^{V V} possible networks with oriented edges (e.g., a network we describe later consists of 11 vertices and thus has 10^{36} possible configurations). We then define the probability of a single network with the particular edge set E as
In the second stage, networks are sorted into a finite number of bins, each corresponding to a particular “network topology.” In this case, “network topology” is defined as the particular distribution of edges coming into and out of each vertex of the network. Note that for a given number of vertices, it is possible to have a large number of edge configurations that are characterized by the same topology, so each bin represents a collection of networks with identical topologies. The number of incoming edges, or indegree, of a vertex in an oriented graph is the number of oriented edges that end at this vertex. Similarly, the outdegree of a vertex is the number of oriented edges that start at this vertex. For a pair of proteins connected with a single oriented edge, the upstream protein has a single outgoing edge while the downstream protein possesses a single incoming edge. For each network we compute the number of vertices that have outdegree zero,
It is not difficult to verify that the product of the former two stochastic steps would give the probability of sampling any given network:
Although the second stage of our networkgenerating process may appear to the reader as artificial and even unnecessary, it is not. As we elaborate below, real biological networks have a very characteristic topology that distinguishes them from the vast majority of arbitrary random networks. Therefore, in a situation where information about protein domain interactions is far from being complete, a restriction on acceptable network topology is used to improve the prediction ability of the algorithm that we develop here.
Computing probabilities of protein interactions: In this model we consider proteins as “bags of domains,” where each individual pair of domains, d_{m} and d_{n}, has a probability of getting attracted, p(d_{m}, d_{n}). If p(d_{m}, d_{n}) > 0.5, the domains “attract” each other, while for p(d_{m}, d_{n}) < 0.5, we can say that domains “repel” each other. Considering a pair of multidomain proteins i and j, where ν_{i} and ν_{j} are the sets of domains for each protein (a domain of each type occurs in ν_{i} no more than once, even if the ith protein has multiple domains of the same kind), we assume that the probability of attraction (edge probability) between these proteins is given in terms of domain attraction probabilities as
Estimating probabilities of attraction between domains: Facts on interactions between proteins published in the research literature have strikingly different reliabilities. This is in part due to the fact that it is uncommon to publish the negative results of an experiment. As a result, the presence of an interaction between proteins is usually backed by multiple experiments while the absence of interaction may correspond to a failed experiment or just the absence of experiments at all (the only exclusion from this observation is exhaustive twohybrid screening, where all results, both positive and negative, are reported). Therefore we decided to estimate the probabilities of “attraction” between two domains in such a way that the absence of a connection
is treated as the absence of data, while the counts of known connections are used to estimate the probabilities. That is, for domains d_{m} and d_{n}, we compute the attraction probability as
As a result, this formulation assigns probabilities >50% to edges that have known connections and probabilities equal to 50% to edges that have no known connections. In the absence of experimental observations (k_{m} = k_{n} = k_{mn} = 0), the probability of an edge forming between any two domains is exactly 50%, which in turn leads to a probability of 50% for forming an edge between any two proteins, regardless of the number of domains in each of them. In the absence of data, all networks can be assigned a nonzero probability. Note that, while the model allows for it, the current methodology (specifically, Equation 6) does not generate probabilities of <0.5 for domaindomain interactions. While we could have expanded the probability range from 0 to 1, the compressed scale of 0.5–1 does not affect the results, and a future version of this model, combined with the collection of appropriate data (e.g., negative experimental results, appropriate twohybrid data, etc.), will use the range of 0–0.5 for modeling “repulsive” effects between domains.
To summarize, this model assigns a probability to every possible network with V  vertices. This probability is based on both local and global network properties. At the local level, the probability of a vertex having an interaction with another vertex is dependent on the domain composition of each. If, as previously determined by training data, the set of domains in one protein is likely to be attracted to that of another protein, the probability of an edge existing between the two vertices increases to a value >0.5. If no information is available about the likelihood of interaction for the set of domains contained in both upstream and downstream vertices, the probability of an edge forming between the two is taken to be 0.5. At a global scale, the probability of the network (based solely on local properties) is modified on the basis of how well it represents real biological networks. Networks with topologies (distribution of incoming and outgoing edges per vertex) that are more biologically realistic are given higher likelihoods. The probability of any given network is the product of both the local and global probabilities.
Estimating parameters relevant to the topology of real networks: When we estimated the parameters
APPLICATION TO REAL NETWORKS
To test the efficacy of this model, we needed networks with large numbers of known proteinprotein interactions. A complication in this process arises in that, even if a large number of interactions are known, not all of them have a defined domain composition. For this work, we used Saccharomyces cerevisiae proteinprotein interactions taken from the DIP (http://dip.doembi.ucla.edu/; Xenarioset al. 2000). We determined the domains involved in each interaction by analyzing protein sequences with hmmpfam (Batemanet al. 2000), a publicly available software tool that referenced 2015 domains at the time of this analysis. We analyzed a total of 642 proteinprotein interactions (all with at least 1 domain) and then used them to determine the domaindomain interaction probabilities. Data (in this case a list of undirected proteinprotein interactions) used for studying the effect of vertex removal on network edge distributions were taken from the Fields Lab home page (http://depts.washington.edu/sfields/).
The yeast protein network is scale free: We know that the observed powerlaw behavior for the distribution of edge types within the network implies a scalefree system. To provide another means of verification, we determined the value γ for a large network (1823 vertices). We then ran a bootstrap procedure for 200 iterations, where 30 vertices were randomly removed from the network and the value of γ and 95% confidence intervals were determined for each. After this was completed, 60 vertices were removed and the process was repeated. This was repeated until the final 200 iterations with 113 total vertices in the network. The effect of vertex removal on γ is shown in Figure 6 (mean of γ and 95% confidence intervals displayed) and shows that this network is remarkably scale invariant. This implies that knowledge of the topology of a small part of a network should provide a reliable means of estimating the complete network's topology.
Crossvalidation: We used crossvalidation to determine the effectiveness of the model in predicting the overall network configuration. Crossvalidation is a general technique for evaluating the efficiency (and hence, validity) of statistical algorithms. It typically involves dividing a dataset into two disjoint subsets, one of which is used for training, with the other being used for model validation. We used the jackknife version of crossvalidation, where the training set consists of the complete network minus a single specified edge. We then compared the likelihood of the complete graph (the model validation set of data) to that of the complete graph minus one edge. If the likelihood of the full network was greater than the likelihood of the reduced network, we considered the edge to be positively predicted. This step was performed iteratively until all edges were considered. Analysis of the test network indicated that the model predicted 93% of the known 642 edges used in the test; the remaining 7% represent false negatives. We similarly estimated the rate of false positives as ~10% by starting with the full known network and attempting to add a single edge between unconnected vertices. Note that this measure of false positives assumes that all edges not included in the true network should not exist. Currently, we cannot determine which, if any, of the falsepositive edges correspond to true—but currently unknown—connections. While we should be able to achieve even greater accuracy by including more data in the training set, these results suggest that the model is valid and is capable of making reasonably accurate predictions.
Markov chain Monte Carlo: For nearly all species, many interactions within a biological pathway are currently unknown. As our model allows us to compute the probability for any possible arrangement of edges that connect a set of vertices, we implemented a Markov chain Monte Carlo (MCMC) simulation approach (Hastings 1970; Gilkset al. 1996), which allowed us to compute posterior probabilities for all edges while effectively sampling from the astronomically large number of possible networks.
We implemented a reversiblejump methodology (Green 1995) typical for Bayesian model selection, treating different networks as alternative statistical models. We chose a uniform prior distribution over all networks, because, without additional information, we have no reason to prefer one network over another. Starting with an arbitrary network, the algorithm either adds or removes, with equal probability, a defined number of edges. Edges to be added or deleted are respectively sampled from the pool of edges that are included or excluded from the current network, with the probability of selecting any given edge dependent on only the number of edges from which to choose. Adding or removing edges in this manner, the system jumps from network X to a new network Y. The proposed new state Y is sampled from the proposal distribution, q(ba). The new network Y is then accepted with probability
As a smallscale example, we selected a group of 11 yeast proteins known to interact with at least one other member of the group and attempted to predict these edges (Figure 7). The probabilities of a given edge, based on domaindomain interactions alone, are shown in Figure 7a. Note that all edges except (7, 1) (xaxis, yaxis) are found in the original data. The posterior probability estimated through simulation is shown in Figure 7b, and all known edges except (10, 1) are predicted reliably. This result is not merely a sampled version of Figure 7a; rather, it incorporates the constraints imposed by the edge distributions on the topology of the network. Thus edges (7, 1) and (10, 1) are not supported with high confidence due to their low domaindomain interaction probabilities and to the influence of the edge distributions. The effect of topology constraints can also be seen where regions of low probability [e.g., the vicinity of (4, 8)] are associated with proteins that already have a highprobability edge; addition of a second edge is unlikely. The nonsymmetrical pattern is due to differences between the outgoing and incoming edge distributions. Although they are easily differentiated from unlikely edges, all likely edges have relatively low posterior probabilities.
For very small systems, a significant amount of information can be gained simply from looking at the edge probabilities between a given set of vertices, with very little additional information coming from topology information. However, use of the MCMC method described here should be particularly valuable for the prediction of large networks, where large amounts of protein interaction data with complicated domain architectures (such as those of higher organisms) and a computationally intensive number of network topologies are the norm.
As a further example of the application of domaindomain interaction information, we selected 10 proteins known to function in the human apoptosis pathway from the KEGG database (Gotoet al. 1997). As is obvious from Figure 8, few edges were supported by yeast training data; however, the most strongly predicted interaction was of Apaf1 interacting with itself. A search of the signaltransduction literature revealed that Apaf1 does, in fact, selfassociate (Huet al. 1998; Benedictet al. 2000). We were not hitherto aware of this association, and it was not described within KEGG. We believe these results to be quite encouraging. While we were not able to predict the known network, this example is remarkable given the small amount of domaindomain interaction data available for training; it demonstrates the potential application of this method to predicting interactions across species. Accumulation of interaction from more complicated organisms should greatly enhance these predictions.
DISCUSSION
Based on the simple concepts of domain composition and network topology, this model allows us to characterize and predict both known and unknown protein interactions within a given species and potentially across species. Markov chain Monte Carlo techniques described earlier provide a computationally feasible way to calculate the posterior probability of a network, given data as
In the study of regulatory pathways, this model could significantly reduce the number of required experiments by identifying a few most likely hypotheses. Such experimental analysis is itself an empirical way of validating the model, and we can likewise design experiments for this validation. Improvements could include additional interaction data and the introduction of more domains for assignment to protein segments. We are currently enhancing the model by allowing the introduction of repulsion effects, which are implemented by allowing probabilities of <0.5 for domaindomain interactions. This information can be gathered from experiments (past and future) as well as from experts in the field. Also, the creation of pseudodomains for characterizing nonprotein substances and small molecules would allow their analysis within the network.
Despite the lack of data on various molecular parameters (e.g., rate constants), modeling at this level of detail may provide significant benefits. For example, von Dassow et al. (2000) recently described a nonlinear differentialequation model for the simulation of the segment polarity network within Drosophila. Surprisingly, they found that the performance of this network was not dependent on the value of specific kinetic parameters but rather achieved stability through the topology of the network itself.
Of special interest is the finding that the connectivity of vertices appears to follow a powerlaw distribution, exhibiting scalefree behavior. Such behavior implies that the points where a newly added protein is connected to the network will occur preferentially with proteins having greater numbers of preestablished connections (i.e., a “rich get richer” phenomenon). This phenomenon has been observed within metabolic networks, and, most recently, studies by Jeong and colleagues also demonstrated the scalefree nature of the proteinprotein interaction network within yeast described here (Jeong et al. 2000, 2001). The presence of a large number of connections may indicate a fundamentally more important, or more versatile, protein function, a possible realworld example being the protein p53. It would be particularly intriguing to consider what types of evolutionary mechanisms would give rise to particular network topologies. With the continuing accumulation of cellular and molecular data, modeling approaches such as ours should provide a more comprehensive picture of such molecular networks and their role in biological function.
Footnotes

Communicating editor: N. Takahata
 Received January 23, 2001.
 Accepted August 6, 2001.
 Copyright © 2001 by the Genetics Society of America