nexoncn.com

文档资料共享网 文档搜索专家

文档资料共享网 文档搜索专家

当前位置：首页 >> >> Information based clustering

1

Information based clustering

Noam Slonim, Gurinder Singh Atwal, Gaˇ sper Tkaˇ cik, and William Bialek

Joseph Henry Laboratories of Physics, and Lewis–Sigler Institute for Integrative Genomics Princeton University, Princeton, New Jersey 08544 {nslonim, gatwal, gtkacik, wbialek}@princeton.edu

arXiv:q-bio/0511043v1 [q-bio.QM] 26 Nov 2005

In an age of increasingly large data sets, investigators in many di?erent disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for de?ning a cluster ”prototype,” does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures non–linear relations. We apply this approach to di?erent domains and ?nd that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures.

The idea that complex data can be grouped into clusters or categories is central to our understanding of the world, and this structure arises in many diverse contexts (e.g., Fig. 1). In popular culture we group ?lms or books into genres, in business we group companies into sectors of the economy, in biology we group the molecular components of cells into functional units or pathways, and so on. Typically these groupings are ?rst constructed by hand using speci?c but qualitative knowledge; e.g., Dell and Apple belong in the same group because they both make computers. The challenge of clustering is to ask whether these qualitative groupings can be derived automatically from objective, quantitative data. Is our intuition about sectors of the economy derivable, for example, from the dynamics of stock prices? Are the functional units of the cell derivable from patterns of gene expression under di?erent conditions (1,2)? The literature on clustering, even in the context of gene expression, is vast (3). Our goal here is not to suggest yet another clustering algorithm, but rather to focus on questions about the formulation of the clustering problem. We are led to an approach, grounded in information theory, that should have wide applicability. Our intuition about clustering starts with the obvious notion that similar elements should fall within the same cluster while dissimilar ones should not. But clustering also achieves data compression—instead of identifying each data point individually, we can identify points by the cluster to which they belong, ending up with a simpler and shorter description of the data. Rate–distortion theory (4,5) formulates precisely the tradeo? between these two considerations, searching for assignments to clusters such that the number of bits used to describe the data is minimized while the average similarity between each data point and its cluster representative (or prototype) is maximized. A well known limitation of this formulation

(as in most approaches to clustering) is that one needs to specify the similarity measure in advance, and quite often this choice is made arbitrarily. Another issue, which attracts less attention, is that the notion of a representative or ”cluster prototype” is inherent to this formulation although it is not always obvious how to de?ne this concept. Our approach provides plausible answers to both these concerns, with further interesting consequences.

Theory

Theoretical Formulation. Imagine that there are N elements (i = 1, 2, · · · , N ) and Nc clusters (C = 1, 2, · · · , Nc ) and that we have assigned elements i to clusters C according to some probabilistic rules, P (C |i), that serve as the variables in our analysis.1 If we reach into a cluster and pull out elements at random, we would like these elements to be as similar to one another as possible. Similarity usually is de?ned among pairs of elements (e.g., the closeness of points in some metric space), but as noted below we also can construct more collective measures of similarity among r > 2 elements; perhaps surprisingly we will see that that this more general case can be analyzed at no extra cost. Leaving aside for the moment the question of how to measure similarity, let us assume that computing the similarity among r elements i1 , i2 , · · · , ir returns a similarity measure s(i1 , i2 , · · · , ir ).

1

Conventionally, one distinguishes “hard” clustering, in which each element is assigned to exactly one cluster, and “soft” clustering in which the assignments are probabilistic, described by a conditional distribution P (C |i); we consider here the more general soft clustering with hard clustering emerging as a limiting case.

2 The average similarity among elements chosen out of a single cluster is

N N

s(C ) =

i1 =1

···

ir =1

P (i1 |C ) · · · P (ir |C )s(i1 , · · · , ir ), (1)

or average; instead, we measure directly the similarity of elements within each cluster; moreover, we can consider collective rather than pairwise measures of similarity. A more rigorous treatment detailing the relation between Eq. (4) and the conventional rate–distortion functional will be presented elsewhere. Optimal Solution. In general it is not possible to ?nd an explicit solution for the P (C |i) that maximize F . However, if we assume that F is di?erentiable with respect to the variables P (C |i), equating the derivative to zero yields after some algebra a set of implicit, self–consistent equations that any optimal solution must obey: P (C |i) = P (C ) exp Z (i; T ) 1 [rs(C ; i) ? (r ? 1)s(C )] , (5) T

where P (i|C ) is the probability to ?nd element i in cluster C . This average similarity corresponds to a scenario where one chooses the elements {i1 , · · · , ir } at random out of a cluster C , independently of each other; other formulations might also be plausible. From Bayes’ rule we have P (i|C ) = P (C |i)P (i)/P (C ), where P (C ) is the total probability of ?nding any element in cluster C , P (C ) = i P (C |i)P (i). In many cases the elements i occur with equal probability so that P (i) = 1/N . We further consider this case for simplicity, although it is not essential. The intuition about the “goodness” of the clustering is expressed through the average similarity over all the clusters,

Nc

where Z (i; T ) is a normalization constant and s(C ; i) is the expected similarity between i and r ? 1 members of cluster C ,

N N

s =

C =1

P (C )s(C ).

(2) s(C ; i) =

···

i1 =1 ir?1 =1

P (i1 |C ) · · ·

(6)

For the special case of pairwise “hard” clustering we ob1 1 tain s h = N C,i,j |C | s(i, j), where |C | is the size of cluster C . This simpler form was shown in (6) to satisfy basic invariance and robustness criteria. The task then is to choose the assignment rules P (C |i) that maximize s , while, as in rate–distortion theory, simultaneously compressing our description of the data as much as possible. To implement this intuition we maximize s while constraining the information carried by the cluster identities (5), 1 I (C ; i) = N

N Nc

P (ir?1 |C )s(i1 , · · · , ir?1 , i). The derivation of these equations from the optimization of F is reminiscent of the derivation of the rate–distortion (5) or information bottleneck (7) equations. This simple form is valid when the similarity measure is invariant under permutations of the arguments. In the more general case we have P (C |i) = P (C ) exp Z (i; T ) 1 [ T

r

s(C ; i(r ) ) ? (r ? 1)s(C )] ,

′

r ′ =1

i=1 C =1

P (C |i) . P (C |i) log P (C )

(3)

Thus, our mathematical formulation of the intuitive clustering problem is to maximize the functional F = s ? T I (C ; i), (4)

where the Lagrange multiplier T enforces the constraint on I (C ; i). Notice that, as in other formulations of the clustering problem, F resembles the free energy in statistical mechanics, where the temperature T speci?es the tradeo? between energy and entropy like terms. This formulation is intimately related to conventional rate–distortion theory. In rate–distortion clustering one is given a ?xed number of bits with which to describe the data, and the goal is to use these bits so as to minimize the distortion between the data elements and some representatives of these data. In practice the bits specify membership in a cluster, and the representatives are prototypical or average patterns in each cluster. Here we see that we can formulate a similar tradeo? with no need to introduce the notion of a representative

(7) ′ where s(C ; i(r ) ) is the expected similarity between i and r ? 1 members of cluster C when i is the r′ argument of s. An obvious feature of Eq. (5) is that element i should be assigned to cluster C with higher probability if it is more similar to the other elements in the cluster. Less obvious is that this similarity has to be weighed against the mean similarity among all the elements in the cluster. Thus, our approach automatically embodies the intuitive principle that “tightly knit” groups are more di?cult to join. We emphasize that we did not explicitly impose this property, but rather it emerges directly from the variational principle of maximizing F ; most other clustering methods do not capture this intuition. The probability P (C |i) in Eq. (5) has the form of a Boltzmann distribution, and increasing similarity among elements of a cluster plays the role of lowering the energy; the temperature T sets the scale for converting similarity di?erences into probabilities. As we lower this temperature there are a sequence of “phase transitions” to solutions with more distinct clusters that achieve greater mean similarity in each cluster (8). For a ?xed

3 number of clusters, reducing the temperature yields more deterministic P (C |i) assignments. Algorithm. Although Eq. (5) is an implicit set of equations, we can turn this self–consistency condition into an iterative algorithm that ?nds an explicit numerical solution for P (C |i) that corresponds to a (perhaps local) maximum of F . Fig. 2 presents a pseudo-code of the algorithm for the case r = 2. Extending the algorithm for the general case of more than pairwise relations (r > 2) is straightforward. In principle we repeat this procedure for di?erent initializations and choose the solution which maximizes F = s ? T I (C ; i). We refer to the algorithm described here as Iclust . We emphasize that we utilize this algorithm mainly because it emerges directly out of the theoretical analysis. Other procedures that aim to optimize the same target functional are certainly plausible and we expect future research to elucidate the potential (dis)advantages of the di?erent alternatives. Information as a Similarity Measure. In formulating the clustering problem as the optimization of F , we have used, as in rate–distortion theory, the generality of information theory to provide a natural measure for the cost of dividing the data into more clusters, but the similarity measure remains arbitrary and commonly is believed to be problem speci?c. Is it possible to use information theory to address this issue as well? To be concrete, consider the case where the elements i are genes and we are trying to measure the relation between gene expression patterns across a variety of conditions ? = 1, 2, · · · , M ; gene i has expression level ei (?) under condition ?. We imagine that there is some real distribution of conditions that cells encounter during their lifetime, and an experiment with a ?nite set of conditions provides samples out of this distribution. Then, for each gene we can de?ne the probability density of expression levels, Pi (e) = 1 M

M

log2

Pij (e1 , e2 ) bits. Pi (e1 )Pj (e2 )

This measure is naturally extended to the multi– information among multiple variables (9), or genes: Ii1 ,i2 ,···,ir =

(r )

dr e Pi1 i2 ···ir (e1 , e2 , · · · , er ) · · · log2

(11)

Pi1 i2 ···ir (e1 , e2 , · · · , er ) bits. Pi1 (e1 )Pi2 (e2 ) · · · Pir (er )

δ (e ? ei (?)),

?=1

(8)

which should become smooth as M → ∞. Similarly we can de?ne the joint probability density for the expression levels of r genes i1 , i2 , · · · , ir , Pi1 ···ir (e1 , · · · , er ) = 1 M

M

δ (e1 ?ei1 (?)) · · · δ (er ?eir (?)).

?=1

We recall that the mutual information is the unique measure of relatedness between a pair of variables that obeys several simple and desirable requirements independent of assumptions about the form of the underlying probability distributions (4). In particular, the mutual (and multi–) information are independent of invertible transformations on the individual variables. For example, the mutual information between the expression levels of two genes is identical to the mutual information between the log of the expression levels: there is no need to ?nd the “right” variables with which to represent the data. The absolute scale of the information measure also has a clear meaning. For example, if two genes share more than one bit of information then the underlying biological mechanisms must be more subtle than just turning expression on and o?. In addition, the mutual information re?ects any type of dependence among variables while ordinary correlation measures typically ignore nonlinear dependences. While these theoretical advantages are well known, in practice information theoretic quantities are notoriously di?cult to estimate from ?nite data. For example, although the distributions in Eq’s. (8,9) become smooth in the limit of many samples (M → ∞), with a ?nite amount of data one needs to regularize or discretize the distributions, and this could introduce artifacts. Although there is no completely general solution to these problems, we have found that in practice the di?culties are not as serious as one might have expected. Using an adaptation of the “direct” estimation method originally developed in the analysis of neural coding (10), we have found that one can obtain reliable estimates of mutual (and sometimes multi–) information values for a variety of data types, including gene expression data (11); see the supplementary material for details. In particular, experiments which explore gene expression levels under > 100 conditions are su?cient to estimate the mutual information between pairs of genes with an accuracy of ? 0.1 bits.2

(9) Given the joint distributions of expression levels, information theory provides natural measures of the relations among genes. For r = 2, we can identify the relatedness of genes i and j with the mutual information between the expression levels, s(i, j) = Ii,j = de1 de2 Pij (e1 , e2 ) · · · (10)

2

It should be noted that in applications where there is a natural similarity measure it might be advantageous to use this measure directly. Furthermore, in situations where the number of observations is not su?cient for non–parametric estimates of the information relations, other heuristic similarity measures should be employed or one could use parametric models for the underlying distributions. Notice, though, that these alternative measures can be incorporated into the algorithm in Fig. 2.

4 To summarize, we have suggested a purely information theoretic approach to clustering and categorization: relatedness among elements is de?ned by the mutual (or multi–) information, and optimal clustering is de?ned as the best tradeo? between maximizing this average relatedness within clusters and minimizing the number of bits required to describe the data. The result is a formulation of clustering that trades bits of similarity against bits of descriptive power, with no further assumptions. A freely available web implementation, of the clustering algorithm and the mutual information estimation procedure can be found at http://www.genomics.princeton.edu/biophysics-theory . Ontologies (15). Almost all of our clusters were signi?cantly enriched in particular annotations. We compared our performance to 18 di?erent conventional clustering algorithms that are routinely applied to this data type (16). We employed the clustering software, available at http://bonsai.ims.u-tokyo.ac.jp?mdehoon/software/cluster/ , to implement the conventional algorithms. In Fig. 5 we see that our clusters obtained the highest average coherence, typically by a signi?cant margin. Moreover, even when the competing algorithms cluster the log2 of expression (ratio) pro?les—a common regularization used in this application with no formal justi?cation— our results are comparable or superior to all of the alternatives. Instead of imposing a hierarchical structure on the data, as done in many popular clustering algorithms, here we directly examine the relations between solutions at di?erent numbers of clusters that were found independently .4 In Fig. 6 we see that an approximate hierarchy emerges as a result rather than as an implicit assumption, where some functional modules (e.g., the “ribosome cluster”, C18 ) are better preserved than others. Our attention is drawn also to the cluster C7 , which is found repeatedly at di?erent numbers of clusters. Speci?cally, at the solution with 20 clusters, among the 114 repressed genes in C7 , 69 have an uncharacterized molecular function; this level of concentration has a probability of ? 10?15 to have arisen by chance. One might have suspected that almost every process in the cell has a few components that have not been identi?ed, and hence that as these processes are regulated there would be a handful of unknown genes that are regulated in concert with many genes of known function. At least for this cluster, our results indicate a di?erent scenario where a signi?cant portion of tightly co–expressed genes remain uncharacterized to date. Stock Prices. To emphasize the generality of our approach we consider a very di?erent data set, the day– to–day fractional changes in price of the stocks in the Standard and Poor’s (S & P) 500 list (available at http://www.standardandpoors.com ), during the trading days of 2003. We cluster these data exactly as in our analysis of gene expression data. The resulting tradeo? curves are shown in Fig. 4B, and again we focus on the four solutions where s already saturates. To determine the coherence of the ensuing clusters we

Results

Gene Expression. As a ?rst test case we consider experiments on the response of gene expression levels in yeast to various forms of environmental stress (12). Previous analysis identi?ed a group of ? 300 stress–induced and ? 600 stress–repressed genes with “nearly identical but opposite patterns of expression in response to the environmental shifts” (13), and these genes were termed the environmental stress response (ESR) module. In fact, based on this observation, these genes were excluded from recent further analysis of the entire yeast genome (14). Nonetheless, as we shall see next, our approach automatically reveals further rich and meaningful substructure in these data. As seen in Fig. 3A, di?erences in expression pro?les within the ESR module indeed are relatively subtle. However, when considering the mutual information relations (Fig. 3B) a relatively clear structure emerges. We have solved our clustering problem for r = 2 and various numbers of clusters and temperatures. The resulting concave tradeo? curves between s and I (C ; i) are shown in Fig. 4A. We emphasize that we generate not a single solution, but a whole family of solutions describing structure at di?erent levels of complexity. With the number of clusters ?xed, s gradually saturates as the temperature is lowered and the constraint on I (C ; i) is relaxed. For the sake of brevity we focused our analysis on the four solutions for which the saturation of s is relatively clear (1/T = 25). At this temperature, ? 85% of the genes have nearly deterministic assignments to one of the clusters [P (C |i) > 0.9 for a particular C ]. As an illustration, three of the twenty clusters found at this temperature are in fact the clusters presented in Fig. 1. We have assessed the biological signi?cance of our results by considering the distribution of gene annotations across the clusters and estimating the corresponding clusters’ coherence 3 with respect to all three Gene

4

3

Speci?cally, the coherence of a cluster (14) is de?ned as the percentage of elements in this cluster which are annotated by an

annotation that was found to be signi?cantly enriched in this cluster (P–val < 0.05, with the Bonferroni correction for multiple hypotheses). See the supplementary material for a detailed discussion regarding the statistical validation of our results. In standard agglomerative or hierarchical clustering one starts with the most detailed partition of singleton clusters and obtains new solutions through merging of clusters. Consequently, one must end up with a tree-like hierarchy of clustering partitions, regardless of whether the data structure actually supports this description.

5 used the Global Industry Classi?cation Standard (GICS) (available at http://wrds.wharton.upenn.edu ) which classi?es companies at four di?erent levels: sector, industry group, industry, and sub-industry. Thus each company is assigned four annotations, which are organized in a hierarchical tree, somewhat similar to the Gene Ontology hierarchical annotation (15). As before, our average coherence performance is comparable to or superior to all the other 18 clustering algorithms we examined (Fig. 5). Almost all our clusters, at various levels of Nc , exhibit a surprisingly high degree of coherence with respect to the “functional labels” that correspond to the di?erent (sub) sectors of the economy. The four independent solutions, at Nc = {5, 10, 15, 20} and 1/T = 35, naturally form an approximate hierarchy (see Fig. 10 of Supporting Material). We have analyzed in detail the results for Nc = 20 and 1/T = 35 where selections from three of the derived clusters are shown in Fig. 1. Eight of the clusters are found to be perfectly (100%) coherent, capturing subtle di?erences between industrial sectors. For example, two of the perfectly coherent clusters segregate companies into either investment banking and asset management (e.g., Merill Lynch) or commercial regional banks (e.g., PNC). Even in clusters with less than perfect coherence we are able to observe and explain relationships between intra-cluster companies above and beyond what the annotations may suggest. For example, one cluster is enriched with “Hotel Resorts and Cruise Line” companies at a coherence level of 30%. Nonetheless, the remaining companies in this cluster seem also to be tied with the tourism industry, like the Walt Disney Co., banks which specialise in credit card issuing and so on. Movie Ratings. Finally, we consider a third test case of yet another di?erent nature: movie ratings provided by more than 70, 000 viewers (the EachMovie database, see http://www.research.digital.com/SRC/eachmovie/ ). Unlike the previous cases, the data here is already naturally quantized since only six possible ratings were permitted. We proceed as before to cluster the 500 movies that received the maximal number of votes. The resulting tradeo? curves are presented in Fig. 4C. Few clusters are preserved amongst the solutions at di?erent numbers of Nc , suggesting that a hierarchical structure may not be a natural representation of the data. Cluster coherence was determined with respect to the genre labels provided in the database: action, animation, art-foreign, classic, comedy, drama, family, horror, romance, and thriller. Fig. 5 demonstrates that our results are superior to all the other 18 standard clustering algorithms. We have analyzed in detail the results for Nc = 20 and 1/T = 40 where, once again, selections from three of the derived clusters are shown in Fig. 1. The clusters indeed re?ect the various genres, but also seem to capture subtle distinctions between sets of movies belonging to the same genre. For example, two of the clusters are both enriched in the action genre, but one group consists mainly of science-?ction movies and the other consists of movies in contemporary settings. Details of all three applications are given in a separate technical report, deposited on ArXiv as http://arxiv.org/abs/q-bio.QM/0511042 .

Discussion

Measuring the coherence of clusters corresponds to asking if the automatic, objective procedure embodied in our optimization principle does indeed recover the intuitive labeling constructed by human hands. Our success in recovering functional categories in di?erent systems using exactly the same principle and practical algorithm is encouraging. It should be emphasized that our approach is not a model of each system and that there is no need for making data–dependent decisions in the representation of the data, nor in the de?nition of similarity. Most clustering algorithms embody—perhaps implicitly—di?erent models of the underlying statistical structure.5 In principle, more accurate models should lead to more meaningful clusters. However, the question of how to construct an accurate model obviously is quite involved, raising further issues that often are addressed arbitrarily before the cluster analysis begins. Moreover, as is clear from Fig. 5, an algorithm or model which is successful in one data type might fail completely in a di?erent domain; even in the context of gene expression, successful analysis of data taken under one set of conditions does not necessarily imply success in a di?erent set of conditions, even for the same organism. Our use of information theory allows us to capture the relatedness of di?erent patterns independent of assumptions about the nature of this relatedness. Correspondingly, we have a single approach which achieves high performance across di?erent domains. Finally, our approach can succeed where other methods would fail qualitatively. Conventional algorithms search for linear or approximately linear relations among the di?erent variables, while our information theoretic approach is responsive to any type of dependencies, including strongly nonlinear structures. In addition, while the cluster analysis literature has focused thus far on pairwise relations and similarity measures, our approach sets a sound theoretical framework for analyzing complex data based on higher order relations. Indeed, it was recently demonstrated, both in principle (17) and in practice (18), that in some situations the data structure is obscured at the pairwise level, but clearly manifests itself only at

5

For example, the K –means algorithm corresponds to maximizing the likelihood of the data on the assumption that these are generated through a mixture of spherical Gaussians.

6 higher levels. The question of how common such data are, as well as the associated computational di?culties in analyzing such higher order relations, is yet to be explored.

[18] Bowers, P. M., Cokus, S. J., Eisenberg, D. & Yeates, T. O. (2004) Science 306, 2246–2249.

Acknowledgments

We thank O Elemento and E Segal for their help in connection with the analysis of the ESR data, and C Callan, D Botstein, N Friedman, R Schapire and S Tavazoie for their helpful comments on early versions of the manuscript. This work was supported in part by the National Institutes of Health Grant P50 GM071508. GT was supported in part by the Burroughs–Wellcome Graduate Training Program in Biological Dynamics.

References [1] Brown, P. O. & Botstein, D. (1999) Nature Gen 21, 33– 37. [2] Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc Nat Acad Sci (USA) 95, 14863–14868. [3] Jain, A. K., Murty, M. N. & Flynn, P. J. (1999) ACM Computing Surveys 31, 264–323. [4] Shannon, C. E. (1948) The Bell Sys Tech J 27, 379–423 & 623–656. [5] Cover, T. M. & Thomas, J. A. (1991) Elements of Information Theory (John Wiley and Sons, New York). [6] Puzicha, J., Hofmann, T. & Buhmann, J. M. (2000) Pattern Recognition 33, 617–634. [7] Tishby, N., Pereira, F. C. & Bialek, W. (1999) in Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing , eds. Hajek, B. & Sreenivas, R. S. (Urbana, Illinois), pp. 368-377. [8] Rose, K. (1998) Proc IEEE 86, 2210–2239. [9] Studen, M. & Vejnarova, J. (1998) in Learning in Graphical Models , ed. Jordan, M. I. (Kluwer Academic Publishers, Dordrecht), pp. 261–298. [10] Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R. & Bialek, W. (1998) Phys Rev Lett 80, 197–200. [11] Slonim, N., Atwal, G. S., Tkaˇ cik, G. & Bialek, W. (2005) http://arxiv.org/abs/cs.IT/0502017 . [12] Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel–Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000) Mol Biol Cell 11, 4241–4257. [13] Gasch, A. P. (2002) in Topics in Current Genetics , eds. Hohmann, S. & Mager, P. (Springer-Verlag, Heidelberg), Vol. 1, pp. 11–70. [14] Segal, E., Shapira, M., Regev, A., Pe’er, D., Botstein, D., Koller, D. & Friedman, N. (2003) Nature Gen 34, 166–176. [15] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig J. T. et al. (2000) Nature Gen 25 25–29. [16] de Hoon, M. J. L., Imoto, S., Nolan, J. & Miyano, S. (2004) Bioinformatics 20, 1453–1454. [17] Schneidman, E., Still, S., Berry II, M. J. & Bialek, W. (2003) Phys Rev Lett 91, 238701(4).

7

Clusters of genes V T U C9 C9 C & &R I E F FFH# B 7 7 ¤ 6 & H & FF BF F 7 7 P & §S # FF 7 §(F §&F 77& Q# &FF I §&I F F " ?2? P?? P &??? 0 ?2 ? 0W? 0W? 0D? ? D?1 ) 2 ?2 ? 2 ??? 0 0 ? ?! ? ? ?? ??? ? ?? )0 ? ?! 1 ? ?7'# ?0? )? ??23?

… … …

…

C 9@ ? ? ?? ?? ?¤ § ? ? ??¨? ? ? ? ? ????

Clusters of stocks

…

A 9 "? !? ¤ #?? $% ??? & ? ¤ ???

C C

…

A '( ? §? )§0? $% ?2 ???1 $?% ¤ 5 $% 3 40? 60 ¨2??7 11?? 80!%

C

…

A C9 03 ?2 ?? 01????? $ ? ) & 2 #0!! ??11 0

…

9 & 2 ! ' # !?? ? 3 " §2 ?2 ? B1 ? ?? §?C ??D?? & ? E ! 0

Clusters of movies

…

@ ? ??? ?? " ?2? ? ? ?0 4?1 7 §2 § ? 0? ? ?? ? # ? ?0 FG #??

C

FIG. 1 Examples of clusters in three di?erent data sets. For each cluster, a sample of ?ve typical items is presented. All clusters were found through the same automatic procedure.

8

Input:

Pairwise similarity matrix, s(i1 , i2 ), ? i1 = 1, ..., N, i2 = 1, ..., N . Trade-o? parameter, T . Requested number of clusters, Nc . Convergence parameter, ? .

Output:

A (typically “soft”) partition of the N elements into Nc clusters.

Initialization:

m=0. P (m) (C |i) ← A random (normalized) distribution ? i = 1, ..., N .

While True

For every i = 1, ..., N : ? P (m+1) (C |i) ← P (m) (C ) exp ? P (m+1) (C |i) ← ? m←m+1. If ? i = 1, ..., N, ? C = 1, ..., Nc we have |P (m+1) (C |i) ? P (m) (C |i)| ≤ ? , Break.

P (m+1) (C |i)

Nc P (m+1) (C ′ |i) C ′ =1

1 T

[2s(m) (C ; i) ? s(m) (C )] , ? C = 1, ..., Nc .

, ? C = 1, ..., Nc .

FIG. 2 Pseudo-code of the iterative algorithm for the case of pairwise relations (r = 2).

9

FIG. 3 ESR data and information relations. (A) Expression pro?les of the ? 900 genes in the yeast ESR module across the 173 microarray stress experiments (12). (B) Mutual information relations (in bits) among the ESR genes. In both panels the genes are sorted according to the solution with 20 clusters and a relatively saturated s . Inside each cluster, genes are sorted according to their average mutual information relation with other cluster members.

0.75

0.2

A

0.7 0.25 0.65

< S > (bits)

B

C

0.18 0.16

< S > (bits)

< S > (bits)

0.14 0.12 0.1

5 clusters

0.6 0.55 0.5 0.45

5 clusters 10 clusters 15 clusters 20 clusters

0.2

0.15

5 clusters 10 clusters 15 clusters

0.08 0.06 0.04

4

10 clusters 15 clusters 20 clusters

0.1 4 0 1 2 I(C; i) (bits)

20 clusters

0

1

2 I(C; i) (bits)

3

3

0

1

2 I(C; i) (bits)

3

4

FIG. 4 Tradeo? curves in all three applications. In every panel, each curve describes the solutions obtained for a particular number of clusters. Di?erent points along each curve correspond to di?erent local maxima of F at di?erent T values. (A) 1 Tradeo? curves for the ESR data with T = {5, 10, 15, 20, 25}. In Fig. 6 we explore the possible hierarchical relations between 1 1 = {15, 20, 25, 30, 35}. (C) Tradeo? the four saturated solutions at T = 25. (B) Tradeo? curves for the S&P 500 data with T 1 curves for the EachMovie data with T = {20, 25, 30, 35, 40}.

10

100

ESR

S&P 500

EachMovie

75

Coherence

50

25

0

K-medians K-medians Hierarchical Hierarchical K-medians K-means K-means K-means Hierarchical

FIG. 5 Comparison of coherence results of our approach (yellow) with conventional clustering algorithms (16): K –means (green); K –medians (blue); Hierarchical (red). For the hierarchical algorithms, four di?erent variants are tried: complete, average, centroid, and single linkage, respectively from left to right. For every algorithm, three di?erent similarity measures are applied: Pearson correlation (left); absolute value of Pearson correlation (middle); Euclidean distance (right). The white bars in the ESR data correspond to applying the algorithm to the log2 transformation of the expression ratios. In all cases, the results are averaged over all the di?erent numbers of clusters that we tried: Nc = 5, 10, 15, 20. For the ESR data coherence is measured with respect to each of the three Gene Ontologies and the results are averaged.

11

tRNA aminoacyl. for prot. transl.

Transcr. From Pol II promoter

Regul. of redox homeostasis

S phase of mitotic cell cycle

Translational elongation

Nucleobase … metabolism

Carbohydrate biosynthesis

Heat shock protein activity

Nucleoside … metabolism

Pyruvate dehydro. bypass

Oxidoreductase activity

Carbohydrate metabolism

Vacuolar acidification

Ribosome biogenesis

Protein biosynthesis

Cell comunication

tRNA aminacyl. for prot. transl.

60

c12

12

16

c10

36

17

c2

35

c3

38

c20

39

c6

25

c8

34

c11

16

c13 c1

13

10

c17

32

c5 c4

47

49 123

c19 c7 c16

88

c9

56

RNA metabolism

?

?

122

c18

c15 c14

22

17

Cytoplasm

Mitochondrion

Nucleus

Nucleolus

Ribosome

Inclusion > 90% Inclusion > 75% Inclusion > 60% Inclusion > 45%

1 FIG. 6 Relations between the optimal solutions with Nc = {5, 10, 15, 20} at T = 25 for the ESR data. Every cluster is connected to the cluster in the next – less detailed – partition that absorbs its most signi?cant portion. The edge type indicates the level of inclusion. The independent solutions form an approximated hierarchical structure. At the upper level the clusters are sorted as in Fig. 3. The number above every cluster indicates the number of genes in it, and the text title corresponds to the most enriched GO biological–process annotation in this cluster. The titles of the ?ve clusters at the lower level are their most enriched GO cellular-component annotation. Most clusters were enriched with more than one annotation, hence the short titles sometimes are too concise. Red and green clusters represent clusters with a clear majority of stress–induced or stress–repressed genes, respectively.

Protein biosynthesis

赞助商链接

更多相关文档:

更多相关标签:

相关文档

- Information-theoretic co-clustering
- K-ANMI A Mutual Information Based Clustering Algorithm for Categorical Data
- K-ANMI A Mutual Information Based Clustering Algorithm for Categorical Data
- Information-Theoretic Co-clustering
- Fuzzy clustering ensemble based on mutual information
- A HIERARCHICAL CLUSTERING BASED ON MUTUAL INFORMATION MAXIMIZATION
- Information based clustering Supplementary material
- Fuzzy clustering and Bayesian information criterion based threshold estimation for robust VAD
- Information Theoretic Angle-Based Spectral Clustering A Theoretical Analysis and an Algorit
- Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
- Fuzzy clustering ensemble based on mutual information
- Color Image Information Hiding Based on Perceptual color clustering
- Keep it Simple A Case-Base Maintenance Policy Based on Clustering and Information Theory
- The effectiveness of query-based hierarchic clustering of documents for information retriev
- PR-Supervised feature selection by clustering using conditionalmutual information-based

文档资料共享网 nexoncn.com
copyright ©right 2010-2020。

文档资料共享网内容来自网络，如有侵犯请联系客服。email:zhit325@126.com