
Information Processing and Management 000 (2001) 000–000

www.elsevier.com/locate/infoproman

Text classification using ESC-based stochastic decision lists

Hang Li *, Kenji Yamanishi

5F, Beijing Sigma Center, No. 49, Zhichun Road, Hai Dian District, 100080 Beijing, China

Received 20 September 2000; accepted 17 May 2001

Abstract

We propose a new method of text classification using stochastic decision lists. A stochastic decision list is an ordered sequence of IF-THEN rules, and our method can be viewed as a rule-based method for text classification having the advantages of readability and refinability of acquired knowledge. Our method is unique in that decision lists are automatically constructed on the basis of the principle of minimizing extended stochastic complexity (ESC), and with it we are able to construct decision lists that have fewer errors in classification. The accuracy of classification achieved with our method appears better than or comparable to those of existing rule-based methods. We have empirically demonstrated that rule-based methods like ours result in high classification accuracy when the categories to which texts are to be assigned are relatively specific ones and when the texts tend to be short. We have also empirically verified the advantages of rule-based methods over non-rule-based ones. © 2001 Published by Elsevier Science Ltd.

Keywords: Text classification; Statistical learning; Stochastic decision list; Rule-based method; Extended stochastic complexity

* Corresponding author. Tel.: +81-44-8562143; fax: +81-44-8562239.
E-mail addresses: lihang@ccm.cl.nec.co.jp (H. Li), yamanisi@ccm.cl.nec.co.jp (K. Yamanishi).

0306-4573/01/$ - see front matter © 2001 Published by Elsevier Science Ltd.
PII: S0306-4573(01)00038-3

1. Introduction

We consider here the problem of text classification based on automatically acquired rules. More precisely, we begin with a number of categories, each already containing a number of texts. We are to automatically acquire some sort of rules from the categorized texts and classify new texts on the basis of the acquired rules. The rule-based approach to text classification, which was first proposed in Apte, Damerau, and Weiss (1994) (Swap-1) and later in Cohen and Singer (1999) (Ripper), has the advantages of readability and refinability of the acquired knowledge when compared to non-rule-based ones.

In this paper, we propose a new rule-based method for text classification that represents rules in the form of stochastic decision lists. A stochastic decision list is an ordered sequence of IF-THEN rules, each containing a condition, a classification decision, and a probability value. When used in text classification, a condition generally refers to the presence or absence of certain words in a text, a classification decision denotes a category to which a text is to be assigned, and a probability value refers to the likelihood of such an assignment.

We should note that the performance of a stochastic decision list (or a set of rules) heavily depends on the criterion (or principle) used for model selection in learning. Our method is unique in that decision lists are constructed on the basis of the principle of minimizing a quantity called extended stochastic complexity (ESC) (Yamanishi, 1998). A decision list is basically constructed in two steps: growing and pruning. In growing, the method sequentially adds, on the basis of the principle of minimizing ESC, new rules to the decision list to be constructed. In pruning, it recursively deletes rules, starting from the last rule of the list, until reaching a rule for which it appears, on the basis of the principle of minimizing ESC, that the pruning should be stopped.

Our method has certain advantages which are not shared by other rule-based methods (we will later discuss the relationship between our method and existing methods like Ripper). First, the learning algorithm is much simpler than those employed in other rule-based methods. Second, it is theoretically guaranteed that our method will achieve high classification accuracy. Actually, the principle of minimizing ESC has the lowest expected classification error of any model selection criterion yet proposed (cf. Yamanishi, 1998).

Experimental results indicate that our method is quite effective. It achieves 82.0% classification accuracy in terms of break-even point for the Reuters-21578 data, which is better than or comparable to those of existing rule-based methods. We have empirically demonstrated that rule-based methods like ours result in high classification accuracy when the categories to which texts are to be assigned are relatively specific ones and when the texts tend to be short. For example, our method outperforms a method using support vector machines (Joachims, 1998), one which is known to be very effective, when classifying the titles of texts of the Reuters data and when classifying into specific categories the titles and abstracts of technical papers in a text database.

We will also show that rule-based methods like ours have a significant advantage over non-rule-based ones. Namely, it is easy to manually modify rules after learning in order to further increase the recall of classification for future data. Actually, we have been able to increase the recall of some of the decision lists by 10% without decreasing precision for some categories of the Reuters data.

2. Text classification using stochastic decision lists

2.1. Stochastic decision list

In our method, a text is viewed as a Boolean vector (in general, a vector with multiple discrete values). In the training phase, the method takes as input Boolean vector/label pairs, in which each Boolean vector represents a text and each label represents the category that the text falls into. A mathematical representation of the input is given by

(d_1, c_1), (d_2, c_2), \ldots, (d_m, c_m), \qquad d_i = (w_{i1}, w_{i2}, \ldots, w_{ik}) \quad (i = 1, \ldots, m),    (1)

where d_i denotes a Boolean vector and c_i a label (i = 1, ..., m), and w_{ij} takes 1 or 0 as its value, denoting, respectively, the presence or absence of the word w_j in the ith text example (j = 1, ..., k). Each w_j is referred to as a feature, and a conjunction of features is referred to as a term.

We learn a stochastic decision list like that shown in Fig. 1 from input represented in the form of (1). A stochastic decision list here represents an ordered sequence of IF-THEN rules for text classification. Each rule indicates a category to which a text is to be assigned (in this case, ⟨trade⟩ or ⟨not-trade⟩), a condition for that assignment (each condition takes the form of a term, e.g., tariff = 1 & trade = 1, which indicates the simultaneous presence of the words `tariff' and `trade'), and a probability value for the assignment (e.g., 0.8). The condition of the final rule is always true.

In the classification phase, the input text is examined with respect to the rules in the decision list, from beginning to end. Once the condition of one of the rules is met, the text is assigned to the corresponding category with its associated probability.¹ The final rule always represents a default classification decision.

Let 𝒳 denote a feature space, 𝒴 a set of labels, X a random variable on 𝒳, and Y a random variable on 𝒴. The IF-THEN rules in a stochastic decision list induce a partitioning of 𝒳. For each cell of the partitioning, the stochastic decision list retains a probability distribution over 𝒴. Thus, a stochastic decision list defines a conditional probability distribution of Y given X, which can generally be represented as Pr(Y|X) (Yamanishi, 1992). A stochastic decision list can be viewed as a stochastic version of the deterministic decision list proposed in Rivest (1987).

Fig. 1. An example stochastic decision list.

¹ This probability can be used in the ranking of texts after classification. Specifically, the classified texts can be sorted in descending order of the associated probabilities. In this paper, however, we confine ourselves to classification and do not utilize the probability for that purpose.
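To make the representation concrete, the following is a minimal sketch, written for this presentation rather than taken from the paper, of how such a stochastic decision list can be stored and applied to a new text. The rule contents echo the ⟨trade⟩ example above; the probabilities and all helper names are illustrative assumptions.

```python
# Minimal sketch of a stochastic decision list and its use in classification.
# Rule contents follow the <trade> example in the text; all names are illustrative.
from dataclasses import dataclass
from typing import Dict, List, Tuple

Term = List[Tuple[str, int]]  # conjunction of (word, required value) literals

@dataclass
class Rule:
    term: Term       # condition; an empty term is always true (default rule)
    category: str    # classification decision
    prob: float      # probability attached to the decision

def satisfies(text: Dict[str, int], term: Term) -> bool:
    """A term is true iff every literal holds; a word absent from the text counts as 0."""
    return all(text.get(word, 0) == value for word, value in term)

def classify(decision_list: List[Rule], text: Dict[str, int]) -> Tuple[str, float]:
    """Scan the rules from beginning to end and fire the first rule whose condition is met."""
    for rule in decision_list:
        if satisfies(text, rule.term):
            return rule.category, rule.prob
    raise ValueError("a well-formed list ends with a default rule whose condition is always true")

# Example list in the spirit of Fig. 1 (the probabilities are made up for illustration).
trade_list = [
    Rule(term=[("tariff", 1), ("trade", 1)], category="trade", prob=0.8),
    Rule(term=[("trade", 1)], category="trade", prob=0.6),
    Rule(term=[], category="not-trade", prob=0.9),  # default rule
]

print(classify(trade_list, {"tariff": 1, "trade": 1}))  # ('trade', 0.8)
```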

2.2. Advantages

The use of stochastic decision lists, and of rule-based methods in general, in text classification has many advantages. Among them, readability and refinability of the acquired knowledge are particularly important: one can easily understand the classification process by checking the rules used; one can also modify the rules by hand; furthermore, one can discover the underlying knowledge contained in the data.

For example, readability of knowledge is required in certain classification applications like the automatic assignment of e-mail messages into folders; humans need to understand how texts have been classified in order to retrieve them. Currently some mailer programs like `Microsoft Outlook Express' can automatically assign messages into folders by using IF-THEN rules defined by humans. A rule-based machine learning method can easily be employed here to avoid the need for human definitions. In practice, a text classifier would be very useful if it could be easily modified after learning. A text classifier is usually trained from a relatively small amount of training data, but it will need to be applied to the real world, which has a nearly infinite amount of data to be classified.

We should also note that the readability and the refinability of a decision list mainly result from the fact that it is based on a partitioning of the feature space. Because examples falling into the same cell of a partitioning are treated uniformly (i.e., classified into the same category), the representation is easy to understand and modify.

2.3. Effectiveness

The decision list method (and the rule-based approach in general) will result in high accuracy in text classification when categories are relatively specific ones and when texts tend to be short. For example, for the specific category ⟨wheat⟩, it would be possible to accurately determine whether or not a new text should be classified into it by considering only the simultaneous presence of a few words `wheat', `ton', and `agriculture'. When the feature space is small (i.e., the number of word types is small), it is easy to precisely partition the feature space and achieve a high classification accuracy using a decision list. This is usually the case when texts are short. On the other hand, when the feature space is large, precise partitioning of the feature space would make the process of learning a decision list very computationally expensive and thus intractable. One has to perform a looser partitioning and thus sacrifice the accuracy to some degree.

3. Learning stochastic decision lists

The goal of learning a stochastic decision list is to create one that makes the fewest errors in the classification of new examples. The most important tasks are rule selection and rule arrangement. All of these tasks are related to the criterion (or principle) used for model selection. In this paper, we employ as a criterion the principle of minimizing ESC. We call this principle the Minimum ESC principle.

Ideally, we could construct all possible decision lists and select the best one from among them on the basis of a given criterion (in our case, the Minimum ESC principle). However, such a process would be very computationally demanding. We propose here an efficient algorithm for learning a stochastic decision list of a restricted form, which we refer to as a k-stochastic decision list, in which the number of features in each term is at most k.

3.1. DL-ESC

We refer to our algorithm for learning a decision list on the basis of the Minimum ESC principle as DL-ESC; it consists of four steps: feature selection, term creation, growing, and pruning. In the feature selection step, it chooses, on the basis of stochastic complexity (SC), features which are particularly conspicuous for their frequent occurrence or non-occurrence in a category. The purpose of this step is to make the subsequent steps computationally more efficient. In the term creation step, it constructs all of the possible terms as conditions of rules. In the growing step, DL-ESC sequentially adds new rules to the decision list to be created, each rule being selected on the basis of the Minimum ESC principle. The decision list obtained after the growing step may overfit the specifics of the training examples, and thus may result in poor classification of new examples. To resolve this problem, in the pruning step DL-ESC recursively deletes rules, starting from the last rule on the list, until reaching a rule for which it appears, on the basis of the Minimum ESC principle, that the pruning should be stopped.

Note that in the growing step the Minimum ESC principle is used locally, while in the pruning step it is used globally. There is, however, no guarantee that the algorithm will be able to find the globally optimal decision list in terms of the Minimum ESC principle.

In this paper, we describe our method only for the case in which there are two categories: positive and negative. It is relatively straightforward, however, to extend the method to the case in which there are more than two categories.

3.2. SC and ESC

SC (Rissanen, 1996) is a measure of the amount of information included in a given data sequence relative to a fixed probability model.² The minimum description length (MDL) principle is a model selection criterion which asserts that, for a given data sequence x^m, the lower a model's SC(x^m) value is, the greater its likelihood of being a model which would actually generate x^m. SC(x^m) is actually defined as the least code-length (also referred to as description length) required to encode x^m with the help of the model (Rissanen, 1996). In terms of statistical decision theory, SC is defined under the assumption that the model takes the form of a probability distribution and that the logarithmic loss function is used as a loss function to measure the distortion resulting from predicting the data using the model.

Yamanishi has proposed a more general measure of the amount of information in a data sequence. This measure, which he refers to as ESC (Yamanishi, 1998), is an extension of SC in the sense that the model can be any parametric class of real-valued functions and any loss function can be used as a distortion measure. The Minimum ESC principle asserts that for a given data sequence x^m, the model with the least value ESC(x^m) is the best model for predicting future examples with respect to a loss function.

The MDL principle works well as a criterion in statistical estimation. In the case of classification, however, the goal is not to perform good estimation of the `true' probability distribution with respect to the logarithmic loss function, but to minimize classification errors with respect to a specific loss function (e.g., the discrete (0–1) loss function). That is why the Minimum ESC principle is well suited to classification. According to Yamanishi (1998), in fact, the use of the Minimum ESC principle can, as sample size increases, make the expected loss (expected classification error) converge to zero at the fastest rate (for general loss functions) yet seen in the area of machine learning.

² A model here refers to a class of probability distributions with a fixed number of parameters but unspecified parameter values.

3.3. Feature selection

For a fixed category, we take a word w as a feature if the presence or absence of w gives more information. We measure the gain in information here in terms of SC rather than ESC, because feature selection should be considered statistical estimation rather than classification.

Let a data sequence (d_1, c_1), (d_2, c_2), ..., (d_m, c_m) be given, where d_i denotes the ith vector (text) and c_i the ith label (category) (i = 1, 2, ..., m). Let c^m = c_1 c_2 ... c_m and d^m = d_1 d_2 ... d_m. It is assumed here that c_i ∈ {0, 1}, where c_i = 1 means that d_i falls into the category of interest, while c_i = 0 means that it falls into the converse (negative) category. Then, as in Rissanen (1996), the SC value of c^m is calculated as

SC(c^m) = m H\left(\frac{m^+}{m}\right) + \frac{1}{2} \log \frac{m}{2\pi} + \log \pi,

where m^+ denotes the number of labels in c^m whose value is 1. Here, log denotes the natural logarithm, π the circular constant, and H(z) = −z log z − (1 − z) log(1 − z) when 0 < z < 1, while H(z) = 0 when z = 0 or z = 1.

Let c^{m_w} be the sequence of all c_i's (c_i ∈ c^m) such that the corresponding text d_i includes the word w. Here m_w denotes the number of labels in c^{m_w}. The SC value of c^{m_w} is then calculated as

SC(c^{m_w}) = m_w H\left(\frac{m_w^+}{m_w}\right) + \frac{1}{2} \log \frac{m_w}{2\pi} + \log \pi,

where m_w^+ denotes the number of labels in c^{m_w} whose value is 1. Let c^{m_{\bar{w}}} be the sequence of all c_i's (c_i ∈ c^m) such that the corresponding text d_i does not include the word w. Here m_{\bar{w}} denotes the number of labels in c^{m_{\bar{w}}}. The SC value of c^{m_{\bar{w}}} is then calculated as

SC(c^{m_{\bar{w}}}) = m_{\bar{w}} H\left(\frac{m_{\bar{w}}^+}{m_{\bar{w}}}\right) + \frac{1}{2} \log \frac{m_{\bar{w}}}{2\pi} + \log \pi,

where m_{\bar{w}}^+ denotes the number of labels in c^{m_{\bar{w}}} whose value is 1.

The gain in information (per example) when considering the presence or absence of w may be measured by

\Delta SC(w) = \frac{1}{m} \left[ SC(c^m) - \left( SC(c^{m_w}) + SC(c^{m_{\bar{w}}}) \right) \right]
            = \left\{ H\left(\frac{m^+}{m}\right) - \frac{m_w}{m} H\left(\frac{m_w^+}{m_w}\right) - \frac{m_{\bar{w}}}{m} H\left(\frac{m_{\bar{w}}^+}{m_{\bar{w}}}\right) \right\} + \left\{ \frac{1}{2m} \log \frac{2m}{\pi m_w m_{\bar{w}}} \right\}.    (2)

Words for which the values of ΔSC are large can be thought of as those which occur, or do not occur, with significant frequency in the category. Hence, for a given positive constant s, we take w as a feature if and only if ΔSC(w) exceeds s.

Note that the quantity within the first {···} in (2) is called information gain, which is known to be an effective measure for feature selection in information retrieval (cf. Yang & Pedersen, 1997). It is well known empirically that when the sample size is small, information gain values tend to be large and some apparently irrelevant features tend to be selected. The quantity within the second {···} can help avoid this undesirable tendency, because it becomes large when the data size is small. Hence ΔSC can be thought of as an improved variant of information gain.
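As an illustration of the computation just described, here is a small sketch — our own reading of the formulas above, not the authors' code — of scoring candidate words by ΔSC and keeping those whose score exceeds the threshold s mentioned in the text. The toy data and helper names are assumptions.

```python
# Sketch of SC-based feature selection (Section 3.3); not the authors' implementation.
import math

def H(z: float) -> float:
    """Binary entropy in nats; H(0) = H(1) = 0 by convention."""
    if z <= 0.0 or z >= 1.0:
        return 0.0
    return -z * math.log(z) - (1.0 - z) * math.log(1.0 - z)

def sc(m: int, m_pos: int) -> float:
    """SC of a binary label sequence of length m containing m_pos ones (Section 3.3)."""
    if m == 0:
        return 0.0
    return m * H(m_pos / m) + 0.5 * math.log(m / (2 * math.pi)) + math.log(math.pi)

def delta_sc(texts, labels, word) -> float:
    """Per-example gain from splitting the labels on presence/absence of `word` (Eq. (2))."""
    m = len(labels)
    with_w = [c for d, c in zip(texts, labels) if word in d]
    without_w = [c for d, c in zip(texts, labels) if word not in d]
    return (sc(m, sum(labels))
            - sc(len(with_w), sum(with_w))
            - sc(len(without_w), sum(without_w))) / m

# Toy usage: texts are word sets, labels are 1 for the category of interest.
texts = [{"wheat", "ton"}, {"wheat", "export"}, {"oil", "price"}, {"stock", "price"}]
labels = [1, 1, 0, 0]
features = [w for w in {"wheat", "price", "ton"} if delta_sc(texts, labels, w) > 0.001]
print(sorted(features))
```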

3.4. Term creation

In term creation, those i-terms (1 ≤ i ≤ k) which are created on the basis of the selected features and which are made true by at least one text in the training data are collected into a set T. (Note that a feature within a term can take 0 as its value, i.e., the absence of a word is also taken into account here.) Specifically, i-terms (2 ≤ i ≤ k) are efficiently created by recursively using (i − 1)-terms. Fig. 2 shows the algorithm.

Fig. 2. Algorithm: term creation.
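The following is a small sketch, written by us for illustration, of the kind of recursive construction just described: i-terms are built by extending (i − 1)-terms with one more feature literal, and only terms satisfied by at least one training text are kept. It is a plausible reading of Fig. 2, not the figure's algorithm itself.

```python
# Illustrative sketch of i-term creation from (i-1)-terms (cf. Fig. 2);
# not the algorithm of the paper itself, just a plausible reading of it.
from itertools import product

def term_is_true(text: set, term: tuple) -> bool:
    """A term is a conjunction of (word, value) literals; value 0 means the word is absent."""
    return all((word in text) == bool(value) for word, value in term)

def create_terms(texts, features, k):
    """Collect all i-terms (1 <= i <= k) over `features` made true by at least one text."""
    literals = [(w, v) for w, v in product(sorted(features), (0, 1))]
    terms_by_size = {1: [(lit,) for lit in literals
                         if any(term_is_true(d, (lit,)) for d in texts)]}
    for i in range(2, k + 1):
        new_terms = []
        for term in terms_by_size[i - 1]:
            used = {w for w, _ in term}
            for w, v in literals:
                if w in used or w < term[-1][0]:
                    continue  # extend in a fixed word order to avoid duplicate conjunctions
                candidate = term + ((w, v),)
                if any(term_is_true(d, candidate) for d in texts):
                    new_terms.append(candidate)
        terms_by_size[i] = new_terms
    return [t for ts in terms_by_size.values() for t in ts]

texts = [{"tariff", "trade"}, {"trade"}, {"wheat", "ton"}]
print(len(create_terms(texts, {"tariff", "trade", "wheat"}, k=2)))
```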

3.5. Growing

Fig. 3 shows our growing algorithm. In growing, the individual terms which reduce ESC most are selected sequentially. Let us consider how to measure the reduction of ESC.

Let a data sequence (d_1, c_1), (d_2, c_2), ..., (d_m, c_m) be given, where d_i is the ith vector (text) and c_i the ith label (category) (i = 1, 2, ..., m). Let c^m = c_1 c_2 ... c_m and d^m = d_1 d_2 ... d_m. It is assumed that c_i ∈ {0, 1}. Then, according to Yamanishi (1998), the ESC of c^m can be approximated as follows:

ESC(c^m) = Loss(c^m) + \lambda \sqrt{m \log m},    (3)

where λ is a positive constant and Loss(c^m) denotes the number of labels whose value is different from that of a default classification decision. The default classification can be, for example, one assuming that all labels are 0.

For a given term t ∈ T, let c^{m_t} be the sequence of all c_i's (c_i ∈ c^m) such that the corresponding text d_i makes t true. Here m_t denotes the number of labels in c^{m_t}. Let Loss(c^{m_t}) be the number of labels in c^{m_t} whose value is different from that of the classification decision with t. Let c^{m_{\bar{t}}} be the sequence of all c_i's (c_i ∈ c^m) such that the corresponding text d_i makes t false, where \bar{t} denotes the negation of t. Here m_{\bar{t}} denotes the number of labels in c^{m_{\bar{t}}}. Let Loss(c^{m_{\bar{t}}}) be the number of labels in c^{m_{\bar{t}}} whose value is different from that of the classification decision with \bar{t}.

The ESC value of c^{m_t} and that of c^{m_{\bar{t}}} are, respectively, calculated as follows:

ESC(c^{m_t}) = Loss(c^{m_t}) + \lambda \sqrt{m_t \log m_t},    (4)

ESC(c^{m_{\bar{t}}}) = Loss(c^{m_{\bar{t}}}) + \lambda \sqrt{m_{\bar{t}} \log m_{\bar{t}}}.    (5)

Fig. 3. Algorithm: growing.

The following quantity then represents how much the ESC value is reduced by adding the rule with the term t as a condition:

\Delta ESC(t) = ESC(c^m) - \left( ESC(c^{m_t}) + ESC(c^{m_{\bar{t}}}) \right)
             = \left\{ Loss(c^m) - Loss(c^{m_t}) - Loss(c^{m_{\bar{t}}}) \right\} + \lambda \left\{ \sqrt{m \log m} - \sqrt{m_t \log m_t} - \sqrt{m_{\bar{t}} \log m_{\bar{t}}} \right\}.    (6)

The growing algorithm selects a term t* such that the ΔESC(t) value is largest at t = t*. Note that ΔESC in (6) contains two factors: the quantity within the first {···} represents the reduction in classification errors, and the quantity within the second {···}, which depends on the sample size, can help avoid over-fitting the specifics of the training data, especially when the sample size is small. Note that those terms satisfying t ≺ t* will be discarded after adding the rule containing t*, where x ≺ y denotes that term y subsumes term x, e.g., (tariff = 1 & trade = 1 & import = 1) ≺ (tariff = 1 & trade = 1).
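For concreteness, here is a small sketch of the ΔESC computation in (6) and the greedy selection of the best term. It is our own illustration under the approximation (3), not the authors' code; in particular, fixing the per-subsequence classification decision by majority label is our assumption, and λ appears as the parameter `lam`.

```python
# Sketch of one growing step: choose the term t* that maximizes Delta-ESC (Eq. (6)).
# Our own illustration; the majority-label decisions are an assumption of this sketch.
import math

def term_is_true(text: set, term: tuple) -> bool:
    return all((word in text) == bool(value) for word, value in term)

def penalty(m: int) -> float:
    """The sqrt(m log m) term of Eq. (3); taken as 0 when m <= 1."""
    return math.sqrt(m * math.log(m)) if m > 1 else 0.0

def loss(labels, decision: int) -> int:
    """Number of labels that differ from the classification decision."""
    return sum(1 for c in labels if c != decision)

def majority(labels) -> int:
    return int(sum(labels) * 2 >= len(labels)) if labels else 0

def delta_esc(texts, labels, term, lam: float) -> float:
    covered = [c for d, c in zip(texts, labels) if term_is_true(d, term)]
    rest = [c for d, c in zip(texts, labels) if not term_is_true(d, term)]
    error_part = (loss(labels, majority(labels))
                  - loss(covered, majority(covered))
                  - loss(rest, majority(rest)))
    size_part = lam * (penalty(len(labels)) - penalty(len(covered)) - penalty(len(rest)))
    return error_part + size_part

def best_term(texts, labels, terms, lam: float):
    """Greedy selection used in growing: the term with the largest Delta-ESC."""
    return max(terms, key=lambda t: delta_esc(texts, labels, t, lam))

texts = [{"tariff", "trade"}, {"trade"}, {"wheat"}, {"oil"}]
labels = [1, 1, 0, 0]
print(best_term(texts, labels, [(("trade", 1),), (("wheat", 1),)], lam=0.1))
```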

3.6. Pruning

Let us consider the pruning algorithm. This algorithm receives as input the decision list that has been output by the growing algorithm, and then sequentially prunes individual rules from the decision list, starting at the end, until it obtains the decision list with the least total ESC value among all pruned decision lists.

With regard to calculating the total ESC value in pruning, for a sequence c^m we define the ESC of c^m relative to a decision list A by

ESC(c^m \mid A) = \sum_{t} ESC(c^{m_t}),

where the summation is taken over all terms which appear in A, and ESC(c^{m_t}) is calculated as in (4) for each term t. We further define the total ESC for c^m and A by

ESC(c^m : A) = ESC(c^m \mid A) + \lambda' L(A) = \sum_{t} Loss(c^{m_t}) + \lambda \sum_{t} \sqrt{m_t \log m_t} + \lambda' L(A),    (7)

where λ′ is another positive constant, one which might be different from λ in (3)–(5), and L(A) denotes the code-length required for encoding the rules in A. More precisely, this is calculated as L(A) = log|T| + log(|T| − 1) + ··· + log(|T| − i + 1), where |T| is the number of possible terms and i the number of rules in A. This code-length is usually referred to as the model description length (Yamanishi, 1992).

Notice here that there is a trade-off between the first term on the right-hand side of (7) and the remaining ones, in the sense that as the former decreases (i.e., the decision list fits the training examples better), the latter increase (i.e., the decision list becomes more complex). Hence the principle of selecting a list minimizing the total ESC in (7) helps to prevent the list from overfitting the specifics of the training examples.

Letting A be the decision list before pruning and A′ the one after pruning, we prune until the condition ESC(c^m : A) ≤ ESC(c^m : A′) is satisfied, or equivalently, ESC(c^m | A′) − ESC(c^m | A) ≥ λ′(L(A) − L(A′)). The pruning algorithm is summarized in Fig. 4.

Fig. 4. Algorithm: pruning.
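The following sketch, ours rather than the algorithm of Fig. 4, shows one way to realize the stopping rule just stated: repeatedly drop the last rule as long as the total ESC of the shortened list does not exceed that of the current one. How examples are attributed to rules (first matching rule), how uncovered examples are charged to a default decision, and the λ, λ′ parameters (`lam`, `lam2`) are assumptions of this illustration.

```python
# Sketch of pruning by total ESC (Eq. (7)); our illustration, not the algorithm of Fig. 4.
# A decision list here is a list of (term, decision) rules; the default handling is assumed.
import math

def term_is_true(text: set, term: tuple) -> bool:
    return all((word in text) == bool(value) for word, value in term)

def esc_of_subsequence(labels, decision: int, lam: float) -> float:
    """Eq. (4): classification errors plus lam * sqrt(m log m) for the covered labels."""
    m = len(labels)
    errors = sum(1 for c in labels if c != decision)
    return errors + (lam * math.sqrt(m * math.log(m)) if m > 1 else 0.0)

def total_esc(rules, texts, labels, lam, lam2, n_terms, default=0):
    """Eq. (7): per-rule ESC summed over the rules of A, the uncovered remainder
    charged to the default decision, plus lam2 * L(A) with L(A) = sum_j log(|T| - j)."""
    remaining = list(zip(texts, labels))
    value = 0.0
    for term, decision in rules:
        covered = [(d, c) for d, c in remaining if term_is_true(d, term)]
        remaining = [(d, c) for d, c in remaining if not term_is_true(d, term)]
        value += esc_of_subsequence([c for _, c in covered], decision, lam)
    value += esc_of_subsequence([c for _, c in remaining], default, lam)
    value += lam2 * sum(math.log(n_terms - j) for j in range(len(rules)))
    return value

def prune(rules, texts, labels, lam, lam2, n_terms):
    """Drop rules from the end of the list while doing so does not increase the total ESC."""
    best = list(rules)
    while best:
        shorter = best[:-1]
        if total_esc(shorter, texts, labels, lam, lam2, n_terms) <= \
           total_esc(best, texts, labels, lam, lam2, n_terms):
            best = shorter
        else:
            break
    return best

rules = [((("wheat", 1),), 1), ((("trade", 1),), 1)]
texts = [{"wheat"}, {"trade"}, {"oil"}, {"price"}]
labels = [1, 1, 0, 0]
print(prune(rules, texts, labels, lam=0.1, lam2=0.1, n_terms=8))
```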


3.7. DL-SC

We can also design a learning algorithm by using the MDL principle (i.e., the minimum SC principle) for growing and pruning rather than the Minimum ESC principle. We refer to this algorithm as DL-SC. (An algorithm for learning stochastic decision lists on the basis of SC (equivalently, MDL) was first proposed in Yamanishi, 1992.)

4. Related work

Many methods have been proposed for text classification (e.g., Rocchio, 1971; Robertson & Sparck-Jones, 1976; Masand, Linoff, & Waltz, 1992; Hull, 1994; Lewis & Gale, 1994; Lewis & Ringuette, 1994; Guthrie, Walker, & Guthrie, 1994; Yang & Chute, 1994; Schutze, Hull, & Pedersen, 1995; Wiener, Pedersen, & Weigend, 1995; Lewis, Schapire, Callan, & Papka, 1996; Dagan, Karov, & Roth, 1997; Koller & Sahami, 1997; Li & Yamanishi, 1997; Ng, Goh, & Low, 1997; Schapire, Singer, & Singhal, 1998; Lam & Ho, 1998; Weiss et al., 1999; Yang & Liu, 1999; Moens & Dumortier, 2000; Nigam, McCallum, Thrun, & Mitchell, 2000; Schapire & Singer, 2000). We describe here two typical non-rule-based methods and two typical rule-based methods, and we experimentally compare these methods with our own.

4.1. Naive Bayes

In this method (e.g., Kar & White, 1978), it is assumed that each category retains a multinomial distribution over a set of words, and that any text belonging to the category is generated independently according to the distribution. For a given new text d, we use Bayes' rule to calculate the posterior probabilities P(c|d) and P(c̄|d), and classify the text into category c if the difference between P(c|d) and P(c̄|d) is larger than a certain threshold value θ.
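As an illustration of this decision rule, the sketch below scores a text under two multinomial word distributions (category c and its complement) and applies the threshold θ. The add-one smoothing and the equal priors are simplifying assumptions of this sketch, not details given in the paper.

```python
# Illustrative Naive Bayes decision rule with multinomial word distributions;
# add-one smoothing and equal priors are simplifying assumptions of this sketch.
import math
from collections import Counter

def log_multinomial(text_words, counts: Counter, total: int, vocab_size: int) -> float:
    """log P(d | category) under a smoothed multinomial over words."""
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in text_words)

def classify(text_words, pos_counts, neg_counts, vocab, theta: float) -> bool:
    """Assign the text to the category iff P(c|d) - P(c_bar|d) exceeds theta."""
    lp = log_multinomial(text_words, pos_counts, sum(pos_counts.values()), len(vocab))
    ln = log_multinomial(text_words, neg_counts, sum(neg_counts.values()), len(vocab))
    # With equal priors, the posteriors are the normalized likelihoods.
    z = math.exp(lp) + math.exp(ln)
    return (math.exp(lp) - math.exp(ln)) / z > theta

vocab = {"wheat", "ton", "oil", "price"}
pos = Counter({"wheat": 5, "ton": 3})    # word counts from texts in the category
neg = Counter({"oil": 4, "price": 4})    # word counts from texts outside it
print(classify(["wheat", "ton"], pos, neg, vocab, theta=0.2))
```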

4.2. SVM

The support vector machine (SVM) (Vapnik, 1995) has recently been applied to text classification and found to be very effective for the task (Joachims, 1998; Dumais, Platt, Heckerman, & Sahami, 1998). We consider here its simplest form, namely the linear SVM. A linear SVM learns a hyperplane which separates, with maximum margin, the positive examples and negative examples located in a feature space (defined as a Euclidean space). Such a hyperplane can be found by using quadratic optimization techniques and can then be used in the classification of new examples. Since a linear SVM actually classifies new texts with a linear threshold function over the feature space, this method falls into the category of non-rule-based text classification.

4.3. C4.5

C4.5 (Quinlan, 1993) is a well-known program for learning a decision tree, which is a rooted tree representing classification processes. Each internal node represents a test of a feature, and its branches correspond to test results. Each leaf node represents a classification decision (in text classification, an assignment of a text to a category). Since any decision tree can be transformed into a decision list (rules), a decision tree method for text classification might also be viewed as a rule-based one. C4.5 mainly consists of two steps: growing and pruning. In the growing step, it recursively adds nodes to the decision tree to be constructed, each node being selected on the basis of information gain. In the pruning step, it recursively deletes, on the basis of heuristics, subtrees that may overfit the specifics of training examples.

4.4. Ripper

The knowledge representation used in Ripper is roughly equivalent to a stochastic decision list, but the algorithm itself is different from DL-ESC. First, in Ripper, feature selection is conducted at each step in growing, while in DL-ESC it is done before growing. Second, in Ripper, rule selection in growing and pruning is conducted on the basis of the MDL principle (equivalently, the minimum SC principle), while in DL-ESC it is done on the basis of the Minimum ESC principle. Third, Ripper has an additional optimality-check step, which DL-ESC does not have. DL-ESC is significantly simpler than Ripper. It should also be noted that for text classification, minimizing ESC is a more appropriate model selection criterion than MDL, since the goal here is not to estimate the true probability distribution with respect to logarithmic loss but to reduce classification errors with respect to 0–1 loss.

5. Reuters data experiments

5.1. Data sets and evaluation measures

We used the Reuters-21578 data³ in our experiment and adopted the `Apte split' of the corpus to obtain training and test data. We conducted stemming using the Oxford Learner's Dictionary,⁴ and removed stop words on the basis of a list we had prepared. We further created four data sets and thus made four different settings.

For data set 1, we viewed the body of each article as a text and adopted as categories the 90 topic categories which have articles in both the training and test data of the split. This type of data set is the most widely used in similar studies. For data set 2, we utilized as categories eight meta-categories (e.g., `economic indicator') of the 90 basic categories, also provided in the Reuters data. We again used the bodies as texts. For data set 3, we viewed only the title of each article as a pseudo text, and used the 90 basic categories. For data set 4, we utilized the eight meta-categories and titles. Table 1 gives details of the data sets.

³ Available at http://www.research.att.com/~lewis/.
⁴ Available at ftp://sable.ox.ac.uk.

Table 1
Details of four data sets

                                      Data 1    Data 2    Data 3    Data 4
Content                               Body      Body      Title     Title
Number of categories                  90        8         90        8
Number of training data               9603      9603      9603      9603
Number of test data                   3299      3299      3299      3299
Number of word types                  21 995    21 995    8611      8611
Average number of words per text      70.2      70.2      6.4       6.4

We conducted evaluations in terms of precision and recall by means of micro-averaging (cf. Lewis & Ringuette, 1994). Here precision is defined as the ratio of the number of categories correctly assigned to the total number of categories assigned. Recall is defined as the ratio of the number of categories correctly assigned to the total number of categories that should be assigned. We further adopted as a score for comparison the break-even point, which is widely used for evaluating a text classification method. The break-even point is defined as the point at which precision equals recall; a higher score for the break-even point indicates a better performance.
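The following is a small sketch of the evaluation just described: micro-averaged precision and recall over all category assignments, and a break-even point estimated from a list of (precision, recall) points obtained at different parameter settings. The linear interpolation between neighbouring points is our own simplification, not a procedure stated in the paper.

```python
# Sketch of micro-averaged precision/recall and a break-even point estimate;
# the linear interpolation between neighbouring runs is our simplification.
def micro_precision_recall(assigned, correct):
    """`assigned` and `correct` are sets of (document, category) pairs."""
    tp = len(assigned & correct)
    precision = tp / len(assigned) if assigned else 0.0
    recall = tp / len(correct) if correct else 0.0
    return precision, recall

def break_even(points):
    """Given (precision, recall) pairs from runs with different parameter values,
    return the value where precision and recall cross (linear interpolation)."""
    points = sorted(points, key=lambda pr: pr[1])  # order by recall
    for (p1, r1), (p2, r2) in zip(points, points[1:]):
        d1, d2 = p1 - r1, p2 - r2
        if d1 == 0:
            return p1
        if d1 * d2 < 0:  # the precision-recall difference changes sign between the two runs
            w = d1 / (d1 - d2)
            return p1 + w * (p2 - p1)
    return None  # no crossing among the sampled runs

assigned = {("d1", "wheat"), ("d2", "trade"), ("d3", "wheat")}
correct = {("d1", "wheat"), ("d2", "trade"), ("d4", "grain")}
print(micro_precision_recall(assigned, correct))          # (0.666..., 0.666...)
print(break_even([(0.9, 0.5), (0.8, 0.7), (0.7, 0.8)]))   # crossing between the last two runs
```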

5.2. Experimental procedures

We compared the performance of DL-ESC, DL-SC, C4.5, Naive Bayes, and SVM with respect to the four data sets described above.

For DL-ESC and DL-SC, we set the number k of the k-decision list to 3, because the growing algorithm becomes computationally very demanding when k is larger than this number. We set the threshold s for feature selection to 0.001, because we found that features selected with this threshold largely agree with human intuition. For each of the four data sets the same set of features was selected for DL-ESC and DL-SC. For several categories, a large number of 3-terms were created, and growing turned out to be very time-consuming. We then discarded some 3-terms composed of features with smaller ΔSC values for these categories, until growing could be conducted relatively efficiently. An ESC-based decision list was constructed for each category with each data set, using the growing algorithm and the pruning algorithm, where λ = λ′ was assumed and λ was set to a designated value. In the classification phase, a text was classified into a category when the condition of one of the rules in the corresponding decision list was met. As a result, there were texts classified into different categories, and texts not classified into any of the categories. Using decision lists constructed with different λ values in text classification, we obtained a precision and recall curve for DL-ESC for each of the four data sets. A number of precision and recall curves for DL-SC were similarly obtained.

Before applying C4.5, we employed the same feature selection method as in DL-ESC, and we also set the feature selection threshold s to the same value as in DL-ESC. A precision recall curve was obtained with different pruning parameters in C4.5 for each data set. For Naive Bayes, all of the words in a data set were used as features because we found that this can help Naive Bayes achieve better performance. (The same tendency has also been observed in the experiments in McCallum & Nigam, 1998.) A precision recall curve was obtained for Naive Bayes with different classification threshold θ values for each data set. For SVM, we used a tool developed by Joachims called `SVM light'⁵ to construct linear SVMs, and used them in text classification.

We next calculated the break-even point for each data set and for each method. Table 2 shows the results. We see that for all data sets DL-ESC outperforms DL-SC. For data sets 1 and 3, DL-ESC outperforms Naive Bayes, while for data sets 2 and 4, Naive Bayes performs better than DL-ESC. For data sets 1, 2 and 4, SVM performs best, while for data set 3, DL-ESC performs best.

⁵ See, for example, http://svm.first.gmd.de/.

Table 2
Break-even points (%) for the methods

          Data 1    Data 2    Data 3    Data 4
DL-ESC    82.0      84.3      78.5      85.8
DL-SC     78.3      82.0      78.0      84.3
C4.5      79.9      83.2      77.8      81.7
Bayes     77.3      89.2      73.6      85.9
SVM       84.1      89.3      77.3      87.3

5.3. Discussion

5.3.1. DL-ESC versus DL-SC
Our experimental results indicate that it is better to employ ESC-based decision lists than SC-based ones in text classification. When we inspected the decision lists used in the experiments, we found that minimizing ESC is more suitable than minimizing SC for the purpose of reducing classification errors. Table 3 shows the first rules of the ESC-based decision lists and those of the SC-based decision lists for some categories of data set 1. More precise rules (i.e., rules whose terms have larger numbers of features) in the ESC-based decision lists seem to give DL-ESC a non-negligible gain in text classification accuracy.

Table 3
First rules of decision lists

DL-ESC                                                DL-SC
Acquire = 1 & cts = 0 & rate = 0 → ⟨acq⟩              Acquire = 1 → ⟨acq⟩
Wheat = 1 & lt = 0 → ⟨grain⟩                          Wheat = 1 → ⟨grain⟩
Beef = 1 & lt = 0 → ⟨carcass⟩                         Beef = 1 → ⟨carcass⟩
Money = 1 & supply = 1 & lt = 0 → ⟨money-supply⟩      Money = 1 & supply = 1 → ⟨money-supply⟩
Wheat = 1 & ton = 1 → ⟨wheat⟩                         Wheat = 1 → ⟨wheat⟩

5.3.2. Comparison with other rule-based methods
Table 4 shows the break-even points for the Reuters-21578 data (in our case, data set 1) for five rule-based methods. Since the pre-processing methods in the respective studies were different, a direct comparison of the break-even points is not necessarily fair, but the table does indicate certain tendencies. We see that DL-ESC works as well as Ripper and outperforms the other rule-based methods. We conclude therefore that DL-ESC performs better than or as well as other existing rule-based methods, and that it improves upon Ripper in terms of simplicity.

Table 4
Break-even points (%) for rule-based methods

Method          Break-even point    Reference
DL-ESC          82.0                This paper
DL-SC           78.3                This paper
C4.5            80.1                This paper
Ripper          82.0                Cohen and Singer (1999)
Bayesian-net    80.0                Dumais et al. (1998)

5.3.3. DL-ESC versus Naive Bayes
Naive Bayes is a non-rule-based method. It employs representations directly based on the feature space. In contrast, DL-ESC is a rule-based method which employs a representation based on a partitioning of the feature space, and this largely contributes to the readability and refinability of the representation.

We see that a rule-based method like DL-ESC results in a higher accuracy in specific-category/short-text cases, while a non-rule-based method like Naive Bayes is more accurate in general-category/long-text cases. These results verify the correctness of our observation that a rule-based method is more suitable to specific-category/short-text cases.

5.3.4. DL-ESC versus SVM
It has been reported (Joachims, 1998; Dumais et al., 1998) that SVM can achieve high accuracy in text classification (about 87.0% break-even point for the Reuters data, i.e., data set 1). Our experimental results show the same tendency, although we only tested the simplest, linear form of SVM. Our results also indicate that when categories are specific and when texts are short, DL-ESC can perform better than SVM. We should note, however, that SVM is a non-rule-based method, and thus does not have the advantage of readability and refinability of knowledge, which DL-ESC does have.

Table 5
Break-even points (%) for other methods

Method                 Break-even point    Reference
k-nearest neighbor     82.0                Weiss et al. (1999)
Rocchio                77.6                Cohen and Singer (1999)
Adaptive resampling    87.8                Weiss et al. (1999)

5.3.5. Comparison with other methods
Table 5 shows the break-even points for data set 1 for other methods. Among them, the adaptive resampling method (Weiss et al., 1999), which classifies texts by taking votes from a pool of decision trees, achieved the highest performance (87.8%) ever reported in terms of break-even point. Although each of the trees in a pool in adaptive resampling may be interpreted and modified, it would be difficult, however, to know how a rule (i.e., a path) in a decision tree will contribute to classification, or to predict how the modification of a rule will affect classification results. We find it difficult, therefore, to view adaptive resampling as a method offering sufficient readability and refinability. Table 5 also shows results achieved by Rocchio and k-nearest neighbor.

5.3.6. Rule-based methods versus non-rule-based methods
From Table 2, we see that rule-based methods (DL-ESC, DL-SC, and C4.5) perform better in specific-category/short-text cases (cf. the results with respect to data set 3), and non-rule-based methods (Bayes and SVM) perform better in general-category/long-text cases (cf. the results with respect to data set 2).

5.4. Manual modifications of decision lists

As we previously noted, the decision list method has merits not shared by non-rule-based methods, including the refinability of acquired knowledge. We have been able to verify this merit in an additional experiment. We found that for some decision lists it would be better to generalize their features in order to increase recall. We manually modified several decision lists learned from data set 1 with λ = 0.1. By generalization here we mean the replacement of a feature with a feature-set containing the feature's synonyms, antonyms, or closely related words (e.g., replacing `February' with `January, February, ..., December'). A feature-set becomes 1 when any of its members takes the value 1. Table 6 shows some examples of feature-sets.

We next used the modified decision lists to classify the test data in data set 1. Table 7 shows precision and recall before and after the modifications. We see that by modifying decision lists we can significantly increase recall (by about 10%) without decreasing precision. For a non-rule-based method like SVM, however, such modifications would appear to be difficult to perform. Our results demonstrate, therefore, that rule-based methods like DL-ESC indeed have the advantage of refinability.
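A minimal sketch of the generalization step just described, using some of the feature-sets of Table 6: a feature in a rule's term is replaced by a set of related words, and the condition on that feature is considered satisfied when any word in the set is present. The rule and the document below are illustrative assumptions, not examples from the paper.

```python
# Sketch of generalizing a rule feature into a feature-set (Section 5.4, Table 6);
# the rule and document here are made up for illustration.
FEATURE_SETS = {
    "aluminium": {"aluminium", "aluminum"},
    "stg": {"stg", "dollar", "yen"},
    "import": {"import", "export"},
}

def generalized_term_is_true(text_words: set, term) -> bool:
    """A (feature, value) literal now tests the presence of ANY word in the feature-set."""
    for feature, value in term:
        words = FEATURE_SETS.get(feature, {feature})
        present = any(w in text_words for w in words)
        if present != bool(value):
            return False
    return True

rule_term = [("import", 1), ("deficit", 1)]
print(generalized_term_is_true({"export", "deficit", "trade"}, rule_term))  # True after generalization
```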

6. JICST data experiment

We further compared the performances of the five methods, DL-ESC, DL-SC, C4.5, Bayes, and SVM, using JICST data.⁶ JICST is the largest text database of titles and abstracts of technical papers written in Japanese. We selected from the database roughly 8000 texts (titles and abstracts) appearing within a certain period, and had the texts classified into 14 categories by a technical staff member at NEC. The categories include ⟨natural language processing⟩, ⟨information retrieval⟩, and ⟨artificial intelligence⟩. We used about 2/3 of the data for training and what remained for testing. Table 8 gives details regarding the data. Since the texts used were titles and abstracts, and the categories were subject areas in computer science, our experiment can be considered to reflect the situation for specific-category/short-text cases.

⁶ Commercially available at http://www.jst.go.jp.

Table 6
Example feature sets

Category    Feature      Feature set
Alum        Aluminium    {aluminium, aluminum}
cpi         February     {January, ..., December}
Money-fx    stg          {stg, dollar, yen}
Trade       Deficit      {surplus, deficit}
Trade       Import       {import, export}

Table 7
Results before and after modifications

            Before modification       After modification
Category    Recall    Precision       Recall    Precision
Alum        47.8      100.0           65.2      100.0
cpi         28.6      72.7            42.9      70.6
Money-fx    55.9      62.1            58.1      60.8
Trade       74.4      53.1            80.3      52.7
Average     51.7      71.9            61.6      71.1

Table 8
Details of the data

Number of categories                  14
Number of training data               5648
Number of test data                   2695
Number of word types                  20 470
Average number of words per text      41.2

Table 9
Break-even points (%) for the five methods

Method    Break-even point
DL-ESC    74.0
DL-SC     72.7
C4.5      71.8
Bayes     61.7
SVM       64.4

We conducted text classification with the above five methods. The procedure for each method was the same as that in the Reuters data experiment. We again evaluated the result of each of the methods in terms of the break-even point, given in Table 9. We see that DL-ESC and the other rule-based methods significantly outperform SVM and the other non-rule-based ones. These results again verify the correctness of our observation that rule-based methods are more effective for specific-category/short-text cases.

Fig. 5 gives a translated example decision list obtained by DL-ESC for the category of `natural language processing', denoted here as ⟨nlp⟩. It seems that the category ⟨nlp⟩ can be very well represented in terms of the presence of a small number of technical terms (e.g., `corpus' and `parsing'), which are highly characteristic of discussions of natural language processing. That is to say, for a rule-based method it would be relatively easy to decide whether or not a text should be classified into the ⟨nlp⟩ category.

Fig. 5. An example decision list.

7. Conclusions

We have proposed a new rule-based method for text classification using stochastic decision lists. Our method employs the principle of minimizing ESC as a criterion for model selection in learning. Advantages of our method include the simplicity and the theoretical soundness of the learning algorithm. Experimental results indicate that our method performs better than or comparably to existing rule-based methods. We have empirically verified with the Reuters-21578 and JICST data that rule-based methods like ours are especially effective in specific-category/short-text cases. We have also empirically verified that rule-based methods have a significant advantage over non-rule-based ones: namely, it is easy to manually modify rules after learning in order to further increase the recall of classification for future data.

8. Uncited references

Li and Yamanishi (1999).

Acknowledgements

We are grateful to Dr. Doi of NEC for his encouragement and guidance. We thank Dr. Joachims of the University of Dortmund for providing the text classification tool SVM light. We thank Mr. Takemoto and others at NEC for preparing the JICST data.

References

Apte, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233–251.
Cohen, W. W., & Singer, Y. (1999). Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2), 141–173.
Dagan, I., Karov, Y., & Roth, D. (1997). Mistake-driven learning in text categorization. In Proceedings of the 2nd conference on empirical methods in natural language processing.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on information and knowledge management (pp. 148–155).
Guthrie, L., Walker, E., & Guthrie, J. (1994). Document classification by machine: theory and practice. In Proceedings of the 15th international conference on computational linguistics (pp. 1059–1063).
Hull, D. (1994). Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th ACM SIGIR international conference on research and development in information retrieval (pp. 282–289).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European conference on machine learning.
Kar, G., & White, L. J. (1978). A distance measure for automatic document classification by sequential analysis. Information Processing and Management, 14, 57–69.
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. In Proceedings of the 14th international conference on machine learning (pp. 170–178).
Lam, W., & Ho, C. (1998). Using a generalized instance set for automatic text categorization. In Proceedings of the 21st ACM SIGIR international conference on research and development in information retrieval (pp. 81–89).
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Proceedings of the 17th ACM SIGIR international conference on research and development in information retrieval (pp. 3–12).
Lewis, D. D., & Ringuette, M. (1994). A comparison of two learning algorithms for text categorization. In Proceedings of the 3rd annual symposium on document analysis and information retrieval (pp. 81–93).
Lewis, D. D., Schapire, R. E., Callan, J. P., & Papka, R. (1996). Training algorithms for linear text classifiers. In Proceedings of the 19th ACM SIGIR international conference on research and development in information retrieval (pp. 3–12).
Li, H., & Yamanishi, K. (1997). Document classification using a finite mixture model. In Proceedings of the 35th annual meeting of the association for computational linguistics (pp. 39–47).
Li, H., & Yamanishi, K. (1999). Text classification using ESC-based stochastic decision lists. In Proceedings of the 8th ACM international conference on information and knowledge management (pp. 122–130).
Masand, B., Linoff, G., & Waltz, D. (1992). Classifying news stories using memory based reasoning. In Proceedings of the 15th ACM SIGIR international conference on research and development in information retrieval (pp. 59–65).
McCallum, A., & Nigam, K. (1998). A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI/ICML-98 workshop on learning for text categorization.
Moens, M., & Dumortier, J. (2000). Text categorization: The assignment of subject descriptors to magazine articles. Information Processing & Management, 36, 841–861.
Ng, H., Goh, W., & Low, K. (1997). Feature selection, perceptron learning and a usability case study for text categorization. In Proceedings of the 20th ACM SIGIR international conference on research and development in information retrieval (pp. 67–73).
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103–134.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Los Altos, CA: Morgan Kaufmann.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1), 40–47.
Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2, 229–246.
Robertson, S. E., & Sparck-Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27, 129–146.


Rocchio, J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval system – Experiments in automatic document processing (pp. 313–323). Englewood Cliffs, NJ: Prentice-Hall.
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3), 135–168.
Schapire, R. E., Singer, Y., & Singhal, A. (1998). Boosting and Rocchio applied to text filtering. In Proceedings of the 21st ACM SIGIR international conference on research and development in information retrieval (pp. 215–223).
Schutze, H., Hull, D. A., & Pedersen, J. O. (1995). A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th ACM SIGIR international conference on research and development in information retrieval (pp. 229–237).
Vapnik, V. N. (Ed.). (1995). The nature of statistical learning theory. New York: Springer.
Weiss, S. M., Apte, C., Damerau, F., Johnson, D. E., Oles, F. J., Goetz, T., & Hampp, T. (1999). Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), 63–69.
Wiener, E., Pedersen, J. O., & Weigend, A. S. (1995). A neural network approach to topic spotting. In Proceedings of the fourth annual symposium on document analysis and information retrieval (pp. 317–332).
Yamanishi, K. (1992). A learning criterion for stochastic rules. Machine Learning, 9, 165–203.
Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Transactions on Information Theory, 44(4), 1424–1439.
Yang, Y., & Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3), 252–277.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd ACM SIGIR international conference on research and development in information retrieval (pp. 42–49).
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th international conference on machine learning (pp. 412–420).


