nexoncn.com

文档资料共享网 文档搜索专家

文档资料共享网 文档搜索专家

当前位置：首页 >> >> Hierarchical Clustering Based on Mutual Information

arXiv:q-bio/0311039v2 [q-bio.QM] 1 Dec 2003

Hierarchical Clustering Based on Mutual Information

Alexander Kraskov, Harald St¨gbauer, Ralph G. Andrzejak, and Peter Grassberger o John-von-Neumann Institute for Computing, Forschungszentrum J¨ lich,D-52425 J¨ lich, Germany u u February 6, 2008

vised or unsupervised. In the following we will be interested only in exclusive unsupervised classi?cation. Motivation: Clustering is a frequently used con- This type of classi?cation is usually called clustering cept in variety of bioinformatical applications. We or cluster analysis. present a new method for hierarchical clustering of An instance of a clustering problem consist of a set data called mutual information clustering (MIC) al- of objects and a set of properties (called characterisgorithm. It uses mutual information (MI) as a sim- tic vector) for each object. The goal of clustering is ilarity measure and exploits its grouping property: the separation of objects into groups using only the The MI between three objects X, Y, and Z is equal characteristic vectors. Indeed, in general only certo the sum of the MI between X and Y , plus the MI tain aspects of the characteristic vectors will be relbetween Z and the combined object (XY ). evant, and extracting these relevant features is one Results: We use this both in the Shannon (prob- ?eld where mutual information (MI) plays a major abilistic) version of information theory, where the role [Tishby et al., 1999], but we shall not deal with “objects” are probability distributions represented this here. Cluster analysis organizes data either as by random samples, and in the Kolmogorov (algo- a single grouping of individuals into non-overlapping rithmic) version, where the “objects” are symbol se- clusters or as a hierarchy of nested partitions. The quences. We apply our method to the construction of ?rst approach is called partitional clustering (PC), mammal phylogenetic trees from mitochondrial DNA the second one is hierarchical clustering (HC). One sequences and we reconstruct the fetal ECG from the of the main features of HC methods is the visual imoutput of independent components analysis (ICA) pact of the dendrogram which enables one to see how applied to the ECG of a pregnant woman. objects are being merged into clusters. From any Availability: The programs for estimation of MI HC one can obtain a PC by restricting oneself to a and for clustering (probabilistic version) are available “horizontal” cut through the dendrogram, while one at http://www.fz-juelich.de/nic/cs/software. cannot go in the other direction and obtain a full hierContact: a.kraskov@fz-juelich.de archy from a single PC. Because of their wide spread

Abstract

1

Introduction

of applications, there are a large variety of di?erent clustering methods in use [Jain and Dubes, 1988]. The crucial point of all clustering algorithms is the choice of a proximity measure. This is obtained from the characteristic vectors and can be either an indicator for similarity (i.e. large for similar and small for dissimilar objects), or dissimilarity. In the latter case it is convenient but not obligatory if it satis?es 1

Classi?cation or organizing of data is very important in all scienti?c disciplines. It is one of the most fundamental mechanism of understanding and learning [Jain and Dubes, 1988]. Depending on the problem, classi?cation can be exclusive or overlapping, super-

the standard axioms of a metric (positivity, symmetry, and triangle inequality). A matrix of all pairwise proximities is called proximity matrix. Among HC methods one should distinguish between those where one uses the characteristic vectors only at the ?rst level of the hierarchy and derives the proximities between clusters from the proximities of their constituents, and methods where the proximities are calculated each time from their characteristic vectors. The latter strategy (which is used also in the present paper) allows of course for more ?exibility but might also be computationally more costly. Quite generally, the “objects” to be clustered can be either single (?nite) patterns (e.g. DNA sequences) or random variables, i.e. probability distributions. In the latter case the data are usually supplied in form of a statistical sample, and one of the simplest and most widely used similarity measures is the linear (Pearson) correlation coe?cient. But this is not sensitive to nonlinear dependencies which do not manifest themselves in the covariance and can thus miss important features. This is in contrast to mutual information (MI) which is also singled out by its information theoretic background [Cover and Thomas, 1991]. Indeed, MI is zero only if the two random variables are strictly independent. Another important feature of MI is that it has also an “algorithmic” cousin, de?ned within algorithmic (Kolmogorov) information theory [Li and Vitanyi, 1997] which measures the similarity between individual objects. For a thorough discussion of distance measures based on algorithmic MI and for their application to clustering, see [Li et al., 2001, Li et al., 2002]. Another feature of MI which is essential for the present application is its grouping property: The MI between three objects (distributions) X, Y, and Z is equal to the sum of the MI between X and Y , plus the MI between Z and the combined object (joint distribution) (XY ), I(X, Y, Z) = I(X, Y ) + I((X, Y ), Z). (1)

terms [Li and Vitanyi, 1997]. Since X, Y, and Z can be themselves composite, Eq.(1) can be used recursively for a cluster decomposition of MI. This motivates the main idea of our clustering method: instead of using e.g. centers of masses in order to treat clusters like individual objects in an approximative way, we treat them exactly like individual objects when using MI as proximity measure. More precisely, we propose the following scheme for clustering n objects with MIC: (1) Compute a proximity matrix based on pairwise mutual informations; assign n clusters such that each cluster contains exactly one object; (2) ?nd the two closest clusters i and j; (3) create a new cluster (ij) by combining i and j; (4) delete the lines/columns with indices i and j from the proximity matrix, and add one line/column containing the proximities between cluster (ij) and all other clusters; (5) if the number of clusters is still > 2, goto (2); else join the two clusters and stop. In the next section we shall review the pertinent properties of MI, both in the Shannon and in the algorithmic version. This is applied in Sec. 3 to construct a phylogenetic tree using mitochondrial DNA and in Sec. 4 to cluster the output channels of an independent component analysis (ICA) of an electrocardiogram (ECG) of a pregnant woman, and to reconstruct from this the maternal and fetal ECGs. We ?nish with our conclusions in Sec. 6.

2

2.1

Mutual Information

Shannon Theory

Within Shannon information theory this is an exact theorem (see below), while it is true in the algorithmic version up to the usual logarithmic correction 2

Assume that one has two random variables X and Y . If they are discrete, we write pi (X) = prob(X = xi ), pi (Y ) = prob(Y = xi ), and pij = prob(X = xi , Y = yi ) for the marginal and joint distribution. Otherwise (and if they have ?nite densities) we denote the densities by ?X (x), ?Y (y), and ?(x, y). Entropies are de?ned for the discrete case as usual by H(X) = ? i pi (X) log pi (X), H(Y ) = ? i pi (Y ) log pi (Y ), and H(X, Y ) = ? i,j pij log pij . Conditional entropies are de?ned as H(X|Y ) = H(X, Y ) ? H(Y ) =

?XY (x, y) , ?X (x)?Y (y) (6) = (2) is non-negative and invariant under x → φ(x) and y → ψ(y). It is (the limit of) a true information, It can be shown to be non-negative, and is zero only when X and Y are strictly independent. For n ran- I(X, Y ) = lim [Hbin (X) + Hbin (Y ) ? Hbin (X, Y )]. ?→0 dom variables X1 , X2 . . . Xn , the MI is de?ned as (7) n In applications, one usually has the data availI(X1 , . . . , Xn ) = H(Xk ) ? H(X1 , . . . , Xn ). (3) able in form of a statistical sample. To estik=1 mate I(X, Y ) one starts from N bivariate measureThis quantity is often referred to as (generalized) re- ments (xi , yi ), i = 1, . . . N which are assumed to dundancy, in order to distinguish it from di?erent be iid (independent identically distributed) realiza“mutual informations” which are constructed analo- tions. There exist numerous algorithms to estigously to higher order cumulants, but we shall not mate I(X, Y ) and entropies. We shall use in the follow this usage. Eq.(1) can be checked easily, to- following the MI estimators proposed recently in gether with its generalization to arbitrary groupings. Ref. [Kraskov et al., 2003], and we refer to this paIt means that MI can be decomposed into hierar- per for a review of alternative methods. chical levels. By iterating it, one can decompose I(X1 . . . Xn ) for any n > 2 and for any partitioning 2.2 Algorithmic Information Theory of the set (X1 . . . Xn ) into the MIs between elements within one cluster and MIs between clusters. In contrast to Shannon theory where the basic objects For continuous variables one ?rst introduces some are random variables and entropies are average inforbinning (‘coarse-graining’), and applies the above to mations, algorithmic information theory deals with the binned variables. If x is a vector with dimen- individual symbol strings and with the actual inforsion m and each bin has Lebesgue measure ?, then mation needed to specify them. To “specify” a sepi (X) ≈ ?X (x)?m with x chosen suitably in bin i, quence X means here to give the necessary input to and 1 a universal computer U , such that U prints X on its ? Hbin (X) ≈ H(X) ? m log ? (4) output and stops. The analogon to entropy, called here usually the complexity K(X) of X, is the miniwhere the di?erential entropy is given by mal length of an input which leads to the output X, ? H(X) = ? dx ?X (x) log ?X (x). (5) for ?xed U . It depends on U , but it can be shown that this dependence is weak and can be neglected in Notice that Hbin (X) is a true (average) information the limit when K(X) is large [Li and Vitanyi, 1997]. Let us denote the concatenation of two strings X ? and is thus non-negative, but H(X) is not an in? formation and can be negative. Also, H(X) is not and Y as XY . Its complexity is K(XY ). It is intuitively clear that K(XY ) should be larger than K(X) invariant under homeomorphisms x → φ(x). but cannot be larger than the sum K(X) + K(Y ). 1 Notice that we have here assumed that densities really exists. If not e.g. if X lives on a fractal set), then m is to be Finally, one expects that K(X|Y ), de?ned as the minimal length of a program printing X when Y is replaced by the Hausdor? dimension of the measure ?. I(X, Y ) = H(X) + H(Y ) ? H(X, Y ) pij pij log . pi (X)pi (Y ) i,j I(X, Y ) = dxdy ?XY (x, y) log 3

? i,j pij log pi|j . The base of the logarithm determines the units in which information is measured. In particular, taking base two leads to information measured in bits. In the following, we always will use natural logarithms. The MI between X and Y is ?nally de?ned as

Joint entropies, conditional entropies, and MI are de?ned as above, with sums replaced by integrals. ? Like H(X), joint and conditional entropies are neither positive (semi-)de?nite nor invariant. But MI, de?ned as

But d(X, Y ) is not appropriate for our purposes. furnished as auxiliary input, is related to K(XY ) ? K(Y ). Indeed, one can show [Li and Vitanyi, 1997] Since we want to compare the proximity between (again within correction terms which become irrele- two single objects and that between two clusters containing maybe many objects, we would like the vant asymptotically) that distance measure to be unbiased by the sizes of 0 ≤ K(X|Y ) ? K(XY ) ? K(Y ) ≤ K(X). (8) the clusters. As argued forcefully in [Li et al., 2001, Li et al., 2002], this is not true for Ialg (X, Y ), and for Notice the close similarity with Shannon entropy. the same reasons it is not true for I(X, Y ) or d(X, Y ) The algorithmic information in Y about X is ?- either: A mutual information of thousand bits should nally de?ned as be considered as large, if X and Y themselves are just thousand bits long, but it should be considered Ialg (X, Y ) = K(X)?K(X|Y ) ? K(X)+K(Y )?K(XY ). as very small, if X and Y would each be huge, say (9) one million bits. Within the same additive correction terms, one shows As shown in [Li et al., 2001, Li et al., 2002] within that it is symmetric, Ialg (X, Y ) = Ialg (Y, X), and can the algorithmic framework, one can form two di?erthus serve as an analogon to mutual information. ent distances which measure relative distance, i.e. From the halting theorem it follows that K(X) which are normalized by dividing by a total enis in general not computable. But one can easily tropy. We sketch here only the theorems and proofs give upper bounds. Indeed, the length of any infor the Shannon version, they are indeed very simput which produces X (e.g. by spelling it out verilar to their algorithmic analoga in [Li et al., 2001, batim) is an upper bound. Improved upper bounds Li et al., 2002]. are provided by any ?le compression algorithm such Theorem 1: The quantity as gnuzip or UNIX “compress”. Good compression algorithms will give good approximations to K(X), I(X, Y ) d(X, Y ) D(X, Y ) = 1 ? = (12) and algorithms whose performance does not depend H(X, Y ) H(X, Y ) on the input ?le length (in particular since they do not segment the ?le during compression) will be cru- is a metric, with D(X, X) = 0 and D(X, Y ) ≤ 1 for all pairs (X, Y ). cial for the following. Proof: Symmetry, positivity and boundedness are obvious. Since D(X, Y ) can be written as 2.3 MI-Based Distance Measures H(Y |X) H(X|Y ) Mutual information itself is a similarity measure in + , (13) D(X, Y ) = H(X, Y ) H(Y, X) the sense that small values imply large “distances” in a loose sense. But it would be useful to modit is su?cient for the proof of the triangle inequality ify it such that the resulting quantity is a metric in to show that each of the two terms on the r.h.s. is the strict sense, i.e. satis?es the triangle inequalbounded by an analogous inequality, i.e. ity. Indeed, the ?rst such metric is well known [Cover and Thomas, 1991]: The quantity H(X|Z) H(Z|Y ) H(X|Y ) ≤ + (14) H(X, Y ) H(X, Z) H(Z, Y ) d(X, Y ) = H(X|Y ) + H(Y |X) = H(X, Y ) ? I(X, Y ) (10) and similarly for the second term. Eq.(14) is proven satis?es the triangle inequality, in addition to be- straightforwardly, using Eq.(11) and the basic ining non-negative and symmetric and to satisfying equalities H(X) ≥ 0, H(X, Y ) ≤ H(X, Y, Z) and d(X, X) = 0. The proof proceeds by ?rst showing H(X|Z) ≥ 0: that for any Z H(X|Y ) H(X|Y ) = H(X|Y ) ≤ H(X, Z|Y ) ≤ H(X|Z) + H(Z|Y ). (11) H(X, Y ) H(Y ) + H(X|Y ) 4

≤ = ≤ =

H(X|Z) + H(Z|Y ) H(Y ) + H(X|Z) + H(Z|Y ) H(X|Z) + H(Z|Y ) H(X|Z) + H(Y, Z) H(Z|Y ) H(X|Z) + H(X|Z) + H(Z) H(Y, Z) H(Z|Y ) H(X|Z) + . (15) H(X, Z) H(Z, Y )

=

H(Z|Y ) + H(Z|X) , H(Z)

(19)

Theorem 2: The quantity D′ (X, Y ) = = 1? I(X, Y ) max{H(X), H(Y )} max{H(X|Y ), H(Y |X)} (16) max{H(X), H(Y )}

is also a metric, also with D′ (X, X) = 0 and D′ (X, Y ) ≤ 1 for all pairs (X, Y ). It is sharper than D, i.e. D′ (X, Y ) ≤ D(X, Y ). Proof: Again we have only to prove the triangle inequality, the other parts are trivial. For this we have to distinguish di?erent cases [Li et al., 2002]. Case 1: max{H(Z), H(Y )} ≤ H(X). Using Eq.(11) we obtain D′ (X, Y ) = = H(X|Z) H(Z|Y ) H(X|Y ) ≤ + H(X) H(X) H(Y ) ′ ′ D (X, Z) + D (Z, Y ). (17)

Case 2: max{H(Z), H(X)} ≤ H(Y ). This is completely analogous. Case 3: H(X) ≤ H(Y ) < H(Z). We now have to show that D′ (X, Y ) = ≤ =

?

H(Y |X) H(Y |Z) + H(Z|X) ≤ H(Y ) H(Y ) D′ (X, Z) + D′ (Z, Y ) H(Z|X) H(Z|Y ) + . H(Z) H(Z)

(18)

Indeed, if the r.h.s. of the ?rst line is less than 1, then H(Y |X) H(Y ) ≤ ≤ H(Y |Z) + H(Z|X) H(Y ) H(Y |Z) + H(Z|X) + H(Z) ? H(Y ) H(Z) 5

and Eq.(18) holds. If it is larger than 1, then also (H(Z|Y ) + H(Z|X))/H(Z) ≥ 1. Eq.(18) must now also hold, since H(Y |X)/H(Y ) ≤ 1. Case 4: H(Y ) ≤ H(X) < H(Z). This is completely analogous to case 3. Apart from scaling correctly with the total information, in contrast to d(X, Y ), the algorithmic analog to D(X, Y ) and D′ (X, Y ) are also universal [Li et al., 2002]. Essentially this means that if X ≈ Y according to any non-trivial distance measure, then X ≈ Y also according to D, and even more so (by factor up to 2) according to D′ . In contrast to the other properties of D and D′ , this is not easy to carry over from algorithmic to Shannon theory. The proof in Ref. [Li et al., 2002] depends on X and Y being discrete, which is obviously not true for probability distributions. Based on the universality argument, it was argued in [Li et al., 2002] that D′ should be superior to D, but the numerical studies shown in that reference did not show a clear di?erence between them. In the following we shall therefore use primarily D for simplicity, but we checked that using D′ did not give systematically better results. A major di?culty appears in the Shannon framework, if we deal with continuous random variables. As we mentioned above, Shannon informations are only ?nite for coarse-grained variables, while they diverge if the resolution tends to zero. This means that dividing MI by the entropy as in the de?nitions of D and D′ becomes problematic. One has essentially two alternative possibilities. The ?rst is to actually introduce some coarse-graining, although it would have been necessary for the de?nition of I(X, Y ), and divide by the coarse-grained entropies. This introduces an arbitrariness, since the scale ? is completely ad hoc, unless it can be ?xed by some independent arguments. We have found no such arguments, and thus we propose the second alternative. There we take ? → 0. In this case H(X) ? mx log ?, with mx being the dimension of X. In this limit D and D′ would tend to 1. But using similarity measures S(X, Y ) = (1 ? D(X, Y )) log(1/?), S ′ (X, Y ) = (1 ? D′ (X, Y )) log(1/?) (20) (21)

1

instead of D and D′ gives exactly the same results in MIC, and I(X, Y ) . max{mx , my } (22) Thus, when dealing with continuous variables, we divide the MI either by the sum or by the maximum of the dimensions. When starting with scalar variables and when X is a cluster variable obtained by joining m elementary variables, then its dimension is just mx = m. S(X, Y ) = I(X, Y ) , mx + my S ′ (X, Y ) =

0.9 0.8

D(X,Y)

0.7 0.6 0.5

3

A Phylogenetic Mammals

Tree

for

horse donkey indian rhino white rhino harbor seal grey seal cat dog pig hippo sheep cow fin whale blue whale rabbit aardvark squirrel dormouse rat mouse armadillo wallaroo oppossum guinea pig bat platypus elephant baboon orangutan gibbon gorilla human chimp pigmy chimp 0.4

As a ?rst application, we study the mitochondrial DNA of a group of 34 mammals (see Fig. 1). Exactly the same data [Gen, a] had previously been analyzed in [Li et al., 2001, Reyes et al., 2000]. This group includes among others2 some rodents3 , ferungulates4 , and primates5 . It had been chosen in [Li et al., 2001] because of doubts about the relative closeness among these three groups [Cao et al., 1998, Reyes et al., 2000]. Obviously, we are here dealing with the algorithmic version of information theory, and informations are estimated by lossless data compression. For constructing the proximity matrix between individual

2 opossum (Didelphis virginiana), wallaroo (Macropus robustus), and platypus (Ornithorhyncus anatinus) 3 rabbit (Oryctolagus cuniculus), guinea pig (Cavia porcellus), fat dormouse (Glis glis), rat (Rattus norvegicus), squirrel (Scuirus vulgaris), and mouse (Mus musculus) 4 horse (Equu caballus), donkey (Equus asinus), Indian rhinoceros (Rhinoceros unicornis), white rhinoceros (Ceratotherium simum), harbor seal (Phoca vitulina), grey seal (Halichoerus grypus), cat (Felis catus), dog (Canis familiaris), ?n whale (Balenoptera physalus), blue whale (Balenoptera musculus), cow (Bos taurus), sheep (Ovis aries), pig (Sus scrofa), hippopotamus (Hippopotamus amphibius), neotropical fruit bat (Artibeus jamaicensis), African elephant (Loxodonta africana), aardvark (Orycteropus afer ), and armadillo (Dasypus novemcintus) 5 human (Homo sapiens), common chimpanzee (Pan troglodytes), pigmy chimpanzee (Pan paniscus), gorilla (Gorilla gorilla), orangutan (Pongo pygmaeus), gibbon (Hylobates lar ), and baboon (Papio hamadryas)

Figure 1: Phylogenetic tree for 34 mammals (31 eutherians plus 3 non-placenta mammals). In contrast to Fig. 4, the heights of nodes are equal to the distances between the joining daughter clusters. taxa, we proceed essentially a in Ref. [Li et al., 2001]. But in addition to using the special compression program GenCompress [Gen, b], we also tested several general purpose compression programs such as BWTzip [BWT, ] and the UNIX tool bzip2. In Ref. [Li et al., 2001], this proximity matrix was then used as the input to a standard HC algorithm (neighbour-joining and hypercleaning) to produce an evolutionary tree. It is here where our treatment deviates crucially. We used the MIC algorithm described in Sec. 1, with distance D(X, Y ). The joining of two clusters (the third step in the MIC algorithm) is obtained by simply concatenating the DNA sequences. There is of course an arbitrariness in the order of concatenation sequences: XY and Y X give in general compressed sequences of di?erent lengths. But we found this to have negligible e?ect on the evolutionary tree. The resulting evolutionary tree obtained with Gencompress is shown in Fig. 1. As shown in Fig. 1 the overall structure of this tree closely resembles the one shown in 6

Ref. [Reyes et al., 2000]. All primates are correctly clustered and also the relative order of the ferungulates is in accordance with Ref. [Reyes et al., 2000]. On the other hand, there are a number of connections which obviously do not re?ect the true evolutionary tree, see for example the guinea pig with bat and elephant with platypus. But the latter two, inspite of being joined together, have a very large distance from each other, thus their clustering just re?ects the fact that neither the platypus nor the elephant have other close relatives in the sample. All in all, however, already the results shown in Fig. 1 capture surprisingly well the overall structure shown in Ref. [Reyes et al., 2000]. Dividing MI by the total information is essential for this success. If we had used the non-normalized Ialg (X, Y ) itself, the clustering algorithm used in [Li et al., 2001] would not change much, since all 34 DNA sequences have roughly the same length. But our MIC algorithm would be completely screwed up: After the ?rst cluster formation, we have DNA sequences of very di?erent lengths, and longer sequences tend also to have larger MI, even if they are not closely related. A heuristic reasoning for the use of MIC for the reconstruction of an evolutionary tree might be given as follows: Suppose that a proximity matrix has been calculated for a set of DNA sequences and the smallest distance is found for the pair (X, Y ). Ideally, one would remove the sequences X and Y , replace them by the sequence of the common ancestor (say Z) of the two species, update the proximity matrix to ?nd the smallest entry in the reduced set of species, and so on. But the DNA sequence of the common ancestor is not available. One solution might be that one tries to reconstruct it by making some compromise between the sequences X and Y . Instead, we essentially propose to concatenate the sequences X and Y . This will of course not lead to a plausible sequence of the common ancestor, but it will optimally represent the information about the common ancestor. During the evolution since the time of the ancestor Z, some parts of its genome might have changed both in X and in Y . These parts are of little use in constructing any phylogenetic tree. Other parts might not have changed in either. They are recognized anyhow by any sensible algorithm. Finally, some parts of its 7

genome will have mutated signi?cantly in X but not in Y , and vice versa. This information is essential to ?nd the correct way through higher hierarchy levels of the evolutionary tree, and it is preserved in concatenating.

4

Clustering of Minimally Dependent Components in an Electrocardiogram

As our second application we choose a case where Shannon theory is the proper setting. We show in Fig. 2 an ECG recorded from the abdomen and thorax of a pregnant woman [De Moor, 1997] (8 channels, sampling rate 500 Hz, 5 s total). It is already seen from this graph that there are at least two important components in this ECG: the heartbeat of the mother, with a frequency of ≈ 3 beat/s, and the heartbeat of the fetus with roughly twice this frequency. Both are not synchronized. In addition there is noise from various sources (muscle activity, measurement noise, etc.). While it is easy to detect anomalies in the mother’s ECG from such a recording, it would be di?cult to detect them in the fetal ECG. As a ?rst approximation we can assume that the total ECG is a linear superposition of several independent sources (mother, child, noise1 , noise2 ,...). A standard method to disentangle such superpositions is independent component analysis (ICA) [Hyv¨rinen et al., 2001]. In the simplest case one a has n independent sources si (t), i = 1 . . . n and n measured channels xi (t) obtained by instantaneous superpositions with a time independent non-singular matrix A,

n

xi (t) =

j=1

Aij sj (t) .

(23)

In this case the sources can be reconstructed by applying the inverse transformation W = A?1 which is obtained by minimizing the (estimated) mutual informations between the transformed components yi (t) = n j=1 Wij xj (t). If some of the sources are Gaussian, this leads to ambiguities [Hyv¨rinen et al., 2001], but a

1

2

3

4

5

6

7

8 0 1 2 seconds 3 4 5

Figure 2: ECG of a pregnant woman.

it gives a unique solution if the sources have more structure. In reality things are not so simple. For instance, the sources might not be independent, the number of sources (including noise sources!) might be different from the number of channels, and the mixing might involve delays. For the present case this implies that the heartbeat of the mother is seen in several reconstructed components yi , and that the supposedly “independent” components are not independent at all. In particular, all components yi which have large contributions from the mother form a cluster with large intra-cluster MIs and small inter-cluster MIs. The same is true for the fetal ECG, albeit less 0 1 2 seconds 3 4 pronounced. It is thus our aim to 1) optimally decompose the signals into least depenFigure 3: Least dependent components of the ECG dent components; 2) cluster these components hierarchically such that shown in Fig. 2, after increasing the number of channels by delay embedding. the most dependent ones are grouped together; 3) decide on an optimal level of the hierarchy, such that the clusters make most sense physiologically; 4) project onto these clusters and apply the inverse transformations to obtain cleaned signals for the sources of interest. Technically we proceeded as follows [St¨gbauer et al., 2004]: o Since we expect di?erent delays in the di?erent channels, we ?rst used Takens delay embedding 8

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

[Takens, 1980] with time delay 0.002 s and embedding dimension 3, resulting in 24 channels. We then formed 24 linear combinations yi (t) and determined the de-mixing coe?cients Wij by minimizing the overall mutual information between them, using the MI estimator proposed in [Kraskov et al., 2003]. There, two classes of estimators were introduced, one with square and the other with rectangular neighbourhoods. Within each class, one can use the number of neighbours, called k in the following, on which the estimate is based. Small values of k lead to a small bias but to large statistical errors, while the opposite is true for large k. But even for very large k the bias is zero when the true MI is zero, and it is systematically such that absolute values of the MI are underestimated. Therefore this bias does not a?ect the determination of the optimal de-mixing matrix. But it depends on the dimension of the random variables, therefore large values of k are not suitable for the clustering. We thus proceeded as follows: We ?rst used k = 100 and square neighbourhoods to obtain the least dependent components yi (t), and then used k = 3 with rectangular neighbourhoods for the clustering. The resulting least dependent components are shown in Fig. 3. They are sorted such that the ?rst components (1 - 5) are dominated by the mother’s ECG, while the next three contain large contributions from the fetus. The rest contains mostly noise, although some seem to be still mixed. These results obtained by visual inspection are fully supported by the cluster analysis. The dendrogram is shown in Fig. 4. In constructing it we used S(X, Y ) (Eq.(22)) as similarity measure to ?nd the correct topology. Again we would have obtained much worse results if we had not normalized it by dividing MI by mX + mY . In plotting the actual dendrogram, however, we used the MI of the cluster to determine the height at which the two daughters join. The MI of the ?rst ?ve channels, e.g., is ≈ 1.43, while that of channels 6 to 8 is ≈ 0.34. For any two clusters (tuples) X = X1 . . . Xn and Y = Y1 . . . Ym one has I(X, Y ) ≥ I(X) + I(Y ). This guarantees, if the MI is estimated correctly, that the tree is drawn properly. The two slight glitches (when clusters (1– 14) and (15–18) join, and when (21–22) is joined with 23) result from small errors in estimating MI. They 9

2 mother 1.5 1 child 0.5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 channel nr.

Figure 4: Dendrogram for least dependent components. The height where the two branches of a cluster join corresponds to the MI of the cluster. do in no way e?ect our conclusions. In Fig. 4 one can clearly see two big clusters corresponding to the mother and to the child. There are also some small clusters which should be considered as noise. For reconstructing the mother and child contributions to Fig. 2, we have to decide on one speci?c clustering from the entire hierarchy. We decided to make the cut such that mother and child are separated. The resulting clusters are indicated in Fig. 4 and were already anticipated in sorting the channels. Reconstructing the original ECG from the child components only, we obtain Fig. 5.

5

mutual information

Conclusions

We have shown that MI can not only be used as a proximity measure in clustering, but that it also suggests a conceptually very simple and natural hierarchical clustering algorithm. We do not claim that this algorithm, called mutual information clustering (MIC), is always superior to other algorithms. Indeed, MI is in general not easy to estimate. Obviously, when only crude estimates are possible, also MIC will not give optimal results. But as MI estimates are becoming better, also the results of MIC

1 2 3 4 5 6 7 8 0 1 2 seconds 3 4 5

Figure 5: Original ECG where all contributions except those of the child cluster have been removed. should improve. The present paper was partly triggered by the development of a new class of MI estimators for continuous random variables which have very small bias and also rather small variances [Kraskov et al., 2003]. We have illustrated our method with two applications, one from genetics and one from cardiology. For neither application MIC might give the very best clustering, but it seems promising that one common method gives decent results for both, although they are very di?erent. The results of MIC should improve, if more data become available. This is trivial, if we mean by that longer time sequences in the application to ECG, and longer parts of the genome in the application of Sec.3. It is less trivial that we expect MIC to make fewer mistakes in a phylogenetic tree, when more species are included. The reason is that close-by species will be correctly joined anyhow, and families – which now are represented only by single species and thus are poorly characterized – will be much better described by the concatenated genomes if more species are included. There are two versions of information theory, algorithmic and probabilistic, and therefore there are also two variants of MI and of MIC. We discussed in detail one application of each, and showed that indeed common concepts were involved in both. In

particular it was crucial to normalize MI properly, so that it is essentially the relative MI which is used as proximity measure. For conventional clustering algorithms using algorithmic MI as proximity measure this had already been stressed in [Li et al., 2001, Li et al., 2002], but it is even more important in MIC, both in the algorithmic and in the probabilistic versions. In the probabilistic version, one studies the clustering of probability distributions. But usually distributions are not provided as such, but are given implicitly by ?nite random samples drawn (more or less) independently from them. On the other hand, the full power of algorithmic information theory is only reached for in?nitely long sequences, and in this limit any individual sequence de?nes a sequence of probability measures on ?nite subsequences. Thus the strict distinction between the two theories is somewhat blurred in practice. Nevertheless, one should not confuse the similarity between two sequences (two English books, say) and that between their subsequence statistics. Two sequences are maximally different if they are completely random, but their statistics for short subsequences is then identical (all subsequences appear in both with equal probabilities). Thus one should always be aware of what similarities or independencies one is looking for. The fact that MI can be used in similar ways for all these problems is not trivial. We would like to thank Arndt von Haesseler, Walter Nadler and Volker Roth for many useful discussions.

References

[Gen, a] http://www.ncbi.nlm.nih.gov. [Gen, b] http://www.cs.ucsb.edu/ mli/Bioinf/software. [BWT, ] http://stl.caltech.edu/bwtzip.shtml. [Cao et al., 1998] Cao, Y., Janke, A., Waddell, P., Westerman, M., Takenaka, O., Murata, S., Okada, N., P¨abo, S., and Hasegawa, M. (1998). Con?ict a¨ among individual mitochondrial proteins in resolv-

10

ing the phylogeny of eutherian orders. J. Molec. Evol., 47(3):307–322.

[Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information [Tishby et al., 1999] Tishby, N., Pereira, F., and Bialek, W. (1999). The information bottleneck Theory. John Wiley and Sons, New York. method. In 37-th Annual Allerton Conference [De Moor, 1997] De Moor, B. L. R. (1997). Daisy: on Communication, Control and Computing, page Database for the identi?cation of systems. 368. www.esat.kuleuven.ac.be/sista/daisy. [Hyv¨rinen et al., 2001] Hyv¨rinen, A., Karhunen, a a J., and Oja, E. (2001). Independent component analysis. John Wiley and Sons, New York. [Jain and Dubes, 1988] Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall, Englewood Cli?s, NJ. [Kraskov et al., 2003] Kraskov, A., St¨gbauer, H., o and Grassberger, P. (2003). Estimating mutual information. E-print, arXiv.org/cond-mat/0305641. [Li et al., 2001] Li, M., Badger, J. H., Chen, X., Kwong, S., Kearney, P., and Zhang, H. (2001). An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149–154. [Li et al., 2002] Li, M., Chen, X., Li, X., Ma, B., and Vitanyi, P. (2002). The similarity metric. E-print, arxiv.org/cs.CC/0111054. [Li and Vitanyi, 1997] Li, M. and Vitanyi, P. (1997). An introduction to Kolmogorov complexity and its applications. Springer, New York. [Reyes et al., 2000] Reyes, A., Gissi, C., Pesole, G., Catze?is, F. M., and Saccone, C. (2000). Where do rodents ?t? Evidence from the complete mitochondrial genome of sciurus vulgaris. Mol. Biol. Evol., 17:979–983. [St¨gbauer et al., 2004] St¨gbauer, H., Kraskov, A., o o and Grassberger, P. (2004). Potential of mutual information in application to ICA. [Takens, 1980] Takens, F. (1980). Detecting strange attractors in turbulence. In Rand, D. A. and 11

Young, L. S., editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, page 366. Springer-Verlag, Berlin.

更多相关文档:

A *HIERARCHICAL* *CLUSTERING* *BASED* *ON* *MUTUAL* *INFORMATION* MAXIMIZATION_专业资料。*Mutual* *information* has been used in many *clustering* algorithms for measuring general ...

A *HIERARCHICAL* *CLUSTERING* *BASED* *ON* *MUTUAL* *INFORMATION* MAXIMIZATION M. Aghagolzadeha, H. Soltanian-Zadeha, c, B. Araabia, A. Aghagolzadehb a Control ...

MIC: *Mutual* *Information* *based* *hierarchical* *Clustering* Alexander Kraskov and Peter Grassberger Abstract *Clustering* is a concept used in a huge variety of ...

matrix, *based* *on* which the agglomerative *hierarchical* *clustering* was used to ...*clustering*, e.g., the Normalized *Mutual* *Information* [1], Category Utility ...

K-ANMI A *Mutual* *Information* *Based* *Clustering* Algorithm for Categorical Data_...A *HIERARCHICAL* CLUS... 4页 免费 *Mutual* *information* bas... 11页 1...

Home Search Collections Journals About Contact us My IOPscience *Hierarchical* *clustering* using *mutual* *information* This content has been downloaded from IOPscience...

The use of *mutual* *information* as a similarity measure in agglomerative ... the *hierarchical* *clustering* *based* *on* the log Bayes factor outperformed off-...

Sciences - 2007 Visualizing Attribute Interdependencies Using *Mutual* *Information*, *Hierarchical* *Clustering*, Multidimensional Scaling, and Self-organizing Maps Derek ...

1 *Information*-Maximization *Clustering* *based* *on* Squared-Loss *Mutual* *Information* Masashi Sugiyama1, Gang Niu1, Makoto Yamada2, Manabu Kimura1, and Hirotaka...

(2016), pp.169-180 http://dx.doi.org/10.14257/ijdta.2016.9.5.17 Topic Discovery Algorithm *Based* *on* *Mutual* *Information* and Label *Clustering* under ...

Comments *on* Supervised Feature Selection by *Clustering* Using Conditional *Mutual* *Information*-*based* Distances Nguyen X. Vinha,, James Baileyb, aThe ...

For our application *hierarchical* *clustering* should ... then constructing models *based* *on* leaving each ...(the *mutual* *information* of A to B will be ...

Medical Image Analysis 10 (2006) 432439 www.elsevier.com/locate/media Normalized *mutual* *information* *based* registration using k-means *clustering* and shading...

更多相关标签:

相关文档

- Feature selection based on mutual information criteria of max-dependency, max-relevance, an
- A HIERARCHICAL CLUSTERING BASED ON MUTUAL INFORMATION MAXIMIZATION
- Fuzzy clustering ensemble based on mutual information
- Feature selection based on joint mutual information
- General Multimodal Elastic Registration Based on Mutual Information
- Halton Sampling for Image Registration Based on Mutual Information
- textural mutual information based on cluster trees for multimodal deformable registration
- A similarity measure based on causal neighbours and mutual information
- MISEP-- Linear and Nonlinear ICA Based on Mutual Information
- Medical Image Segmentation Based on Mutual Information Maximization_MICCAI04

文档资料共享网 nexoncn.com
copyright ©right 2010-2020。

文档资料共享网内容来自网络，如有侵犯请联系客服。email:zhit325@126.com