


Proc. Fifth Australasian Data Mining Conference (AusDM2006)

A Study of Local and Global Thresholding Techniques in Text Categorization
Nayer M. Wanas, Dina A. Said, Nadia H. Hegazy, and Nevin M. Darwish

Pattern Recognition and Information Systems Group, Informatics Department, Electronics Research Institute, Cairo, Egypt
Department of Computer Engineering, Faculty of Engineering, Cairo University, Cairo, Egypt

nwanas@ieee.org, dsaid@ieee.org, nhegazy@mcit.gov.eg, ndarwish@ieee.org

Abstract

Feature filtering is an approach that is widely used for dimensionality reduction in text categorization. In this approach, feature scoring methods are used to evaluate features prior to selection. Thresholding is then applied to select the highest scoring features, either locally or globally. In this paper, we investigate several local and global feature selection methods. The usage of Standard Deviation (STD) and Maximum Deviation (MD) as globalization schemes is suggested. This work provides a comparative study among fourteen thresholding techniques, using different scoring methods and benchmark datasets of diverse nature. This includes investigating the normalization of feature scores before combining them in the global pool. The results suggest that normalized MD outperforms other methods in thresholding Document Frequency (DF) scores on evenly distributed and moderately diverse data-sets. Furthermore, the results indicate that normalizing feature scores improves the performance on rare categories and balances the bias of some techniques towards frequent categories.

Keywords: Text Categorization, Thresholding Techniques, Dimensionality Reduction, Feature Filtering, Support Vector Machine

1 Introduction

Text Categorization (TC) is the process of assigning a given text to one or more categories. This process is considered a supervised classification technique, since a set of labeled (pre-classified) documents is provided as a training set. The goal of TC is to assign a label to a new, unseen document (Yang 1995). TC can play an important role in a wide variety of areas such as information retrieval (Freitas-Junior, Ribeiro-Neto, Vale, Laender & Lima 2006), news recommendation (Antonellis, Bouras & Poulopoulos 2006), word sense disambiguation (Montoyo, Suarez, Rigau & Palomar 2005), topic detection and tracking (Tsay, Shih & Wu 2005), web page classification (Yang, Slattery & Ghani 2002, Shen, Sun, Yang & Chen 2006), as well as any application requiring document organization. Text categorization can be divided into four major stages:

1. Document pre-processing, where words are extracted from the documents and presented in a form of bag of words (BOW) (Salton, Wong & Yang 1975). Song et al. showed that performing stop word removal and stemming during the pre-processing step may harm the performance slightly, but has a great effect on reducing the feature space significantly (Song, Liu & Yang 2005).

2. Dimensionality reduction (DR), performed using either feature extraction or feature selection approaches (Sebastiani 2002). In feature extraction, new features are generated based on complex methods such as Latent Semantic Indexing (LSI) (Wiener, Pedersen & Weigend 1995), Independent Component Analysis (Kolenda, Hansen & Sigurdsson 2000), and word clustering (Cohen & Singer 1999). On the other hand, the feature selection approach is based on selecting high-relevance features by either filtering irrelevant features (Yang & Pedersen 1997) or wrapping the features around the classifier used (Kohavi & John 1997).

3. Feature weighting, where the selected features are weighted, commonly using the term frequency inverse document frequency (tf·idf) technique (Sebastiani 2002), in which tf measures the importance of the term within the document and idf measures the general importance of the term (Salton & Buckley 1988). In order to handle the diversity in document lengths, normalization of the tf·idf measure can be performed, which has been shown to lead to a significant improvement in performance (Song et al. 2005).

4. Classification, where the weighted feature set is used to learn how to distinguish among the different document categories. Several supervised classifiers have been used in TC, among them decision trees (Johnson, Oles, Zhang & Goetz 2002), K-nearest neighbors (Bang, Yang & Yang 2006), Naive Bayes (Peng, Schuurmans & Wang 2004), neural networks (Chen, Lee & Hwang 2005), Maximum Entropy (Kazama & Tsujii 2005), and the Support Vector Machine (SVM) (Díaz, Ranilla, Montañés, Fernández & Combarro 2004). SVM has been shown to be among the best performing classifiers in TC applications (Joachims 1998, Dumais, Platt, Heckerman & Sahami 1998, Yang 1999, Yang, Zhang & Kisiel 2003, Li & Yang 2003, Debole & Sebastiani 2005).

DR is one of the most important challenges in TC. This reduction is necessary to avoid overfitting, as well as to decrease the computational resources, storage, and memory required to manage the features (Sebastiani 2002, Guyon & Elisseeff 2003). This study concentrates on the filter approach to DR due to its simplicity compared to the other
Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the Australasian Data Mining Conference (AusDM 2006), Sydney, December 2006. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 61. Peter Christen, Paul Kennedy, Jiuyong Li, Simeon Simoff and Graham Williams, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.



approach (Blum & Langley 1997, Combarro, Montanes, Diaz, Ranilla & Mones 2005). Feature filtering is performed first by applying feature scoring methods locally to each category in the training set in order to evaluate features. This is followed by thresholding, which is performed either locally or globally to select the features that have the highest scores (Debole & Sebastiani 2005). This work proposes new techniques for thresholding feature scores in order to address some shortcomings of the existing techniques. In addition, few studies have been conducted to compare the performance of different thresholding techniques. In order to evaluate the proposed thresholding techniques, a comparative study is conducted among them and the existing state-of-the-art thresholding techniques. Since the performance of thresholding techniques is highly affected by the feature scoring method used, the comparative study is performed using four high-performing feature scoring methods and different threshold values. These feature scoring methods are Document Frequency (DF) (Yang & Pedersen 1997), Information Gain (IG) (Sebastiani 2002), Mutual Information (MI) (Yang & Pedersen 1997), and Correlation Coefficient (CC) (Ng, Goh & Low 1997). These methods have been widely used, and have been shown to be among the top performing methods (Yang & Pedersen 1997, Sebastiani 2002). It is worth noting that CC is a modification of χ2 and has been shown to outperform it in the literature (Ng et al. 1997, Ruiz & Srinivasan 2002). The following sections elaborate the rationale behind the new thresholding methods as well as their performance evaluation using different benchmark data-sets.

2 Thresholding Policies for Feature Filtering

Figure 1: Local and Global Thresholding ((a) fixed local thresholding; (b) global thresholding)

Thresholding is an important issue in performing feature selection using the filter approach. After applying feature scoring methods to each category, thresholding is performed to select the final set of features that represents the training set. Selection is done according to either (a) a local policy or (b) a global policy. In the local policy, thresholding is applied locally on each category to compose the final representative feature set (Díaz et al. 2004). In the global policy, a globalization scheme is performed to extract a single global score for each feature; features with the highest global scores are then selected (Yang & Pedersen 1997). Figure 1 illustrates the difference between local and global thresholding policies.

2.1 The local policy

Several researchers have suggested the usage of a local policy, where a different set of features is chosen from each category independently of the other categories. The local policy tends to optimize the classification process for each category by selecting the most relevant features in that category. A local selection policy was used by (Apté, Damerau & Weiss 1994), where a local dictionary of the most important words in each topic was built, and from each category the words matching the category dictionary were selected. However, using local dictionaries suffers from being a domain-dependent and language-dependent approach. Alternatively, Lewis and Ringuette selected the top IG features from each category (Lewis & Ringuette 1994). An implementation of local feature extraction was proposed by (Wiener et al. 1995), where LSI was applied to each category separately. An extension to the usage of the local strategy was presented by (Ng et al. 1997), where this policy was applied to three feature scoring methods, namely DF, CC, and χ2.

The previous approaches select the same number of features from each category. In this work we refer to this policy as the fixed local approach (FLocal). On the other hand, Soucy and Mineau (Soucy & Mineau 2003) proposed selecting features in proportion to the category distribution, or what can be considered a weighted local approach (WLocal). The rationale behind this approach is that in highly-skewed datasets the ratio between the words of frequent and infrequent categories may be very large; hence, selecting the same number of features from both distributions may degrade the performance. In (Forman 2004), Forman proposed an idea similar to fixed local thresholding, which he called round-robin. In this approach, features are selected from each class in a round-robin manner. Additionally, Forman proposed another local thresholding idea called rand-robin, in which the next class to select features from is determined randomly by a process controlled by the probability of the class distribution. This seems very similar to the idea of WLocal. However, one drawback of rand-robin is that in highly-skewed datasets it might not select any features from rare categories, due to the random nature of its selection, which depends on the category probability.

2.2 The global policy

As an alternative to the local policy, the global policy aims at providing a global view of the training set by extracting a global score from the local feature scores. Thresholding is then applied to these global scores,


where features with the highest global score are retained. Yang and Pedersen (Yang & Pedersen 1997) used Maximization (Max) and Weighted Averaging (WAvg) for extracting global scores from χ2 and MI, and used Averaging (Avg) for DF and IG. Calvo and Ceccatto proposed the usage of the Weighted Maximum (WMax), where features are weighted by the category probability (Calvo & Ceccatto 2000). Equations 1, 2, 3, and 4 provide the mathematical definitions of Avg, Max, WAvg, and WMax respectively:

$$F_{Avg}(w_k) = \frac{1}{M} \sum_{i=1}^{M} f(w_k, c_i) \qquad (1)$$

$$F_{Max}(w_k) = \max_{i=1}^{M} \{ f(w_k, c_i) \} \qquad (2)$$

where f(w_k, c_i) is the score of the word w_k with respect to the category c_i, and M is the number of categories in the training set.

$$F_{WAvg}(w_k) = \sum_{i=1}^{M} p(c_i) f(w_k, c_i) \qquad (3)$$

$$F_{WMax}(w_k) = \max_{i=1}^{M} \{ p(c_i) f(w_k, c_i) \} \qquad (4)$$

2.3 Proposed Thresholding Techniques

Feature scoring methods such as DF have an inherent bias towards frequent categories. Similarly, global techniques such as the Max or Avg tend to select features that are biased towards frequent categories. Moreover, weighting the feature scores by the category probability increases this bias and is expected to degrade the performance, mainly in the classification of rare categories. These rare categories may be highly important, and their number may be large, especially in highly skewed datasets. Alternatively, we propose normalizing feature scores before applying the globalization scheme, in order to enhance the classification of rare categories and balance the bias of feature scoring methods such as DF. Accordingly, the Normalized Average (NAvg) and Normalized Maximization (NMax) are defined as shown in equations 5 and 6 respectively:

$$F_{NAvg}(w_k) = \frac{1}{M} \sum_{i=1}^{M} \frac{f(w_k, c_i)}{p(c_i)} \qquad (5)$$

$$F_{NMax}(w_k) = \max_{i=1}^{M} \left\{ \frac{f(w_k, c_i)}{p(c_i)} \right\} \qquad (6)$$

However, an interesting question that arises is how to define a good feature. It is usually assumed to be the feature that has the maximum, or the maximum average, score in the training set. When using a feature scoring method such as DF, however, this definition is rendered inappropriate: based on it, features that exist in all categories with the same DF would be considered good features. In fact, a good feature is one whose score in one category is substantially different from its score in all other categories. In order to identify such features, we propose the usage of the Standard Deviation (STD) as a globalization scheme. The STD gives an estimate of how diverse the data is from the mean. Although intuitively the STD should capture good features, as opposed to the Max and Avg, it has its shortcomings. To illustrate this, suppose that a certain feature w1 occurs in only one category with DF = x, while another feature w2 occurs in two categories with the same DF = x. According to the definition of a good feature, w1 should be considered better than w2. When using DF as a scoring method, however, the mean of w2 will be higher than the mean of w1, and accordingly the STD of w2 will be greater than the STD of w1, since the scores of w2 are more diverse from its mean. In order to overcome this pitfall, we propose the Maximum Deviation (MD) as a globalization scheme. Contrary to the STD, the MD estimates how diverse the data is from the Max, which brings it closer to realizing the definition of a good feature. Equations 7 and 8 show the mathematical definitions of the STD and MD respectively:

$$F_{STD}(w_k) = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left[ f(w_k, c_i) - F_{Avg}(w_k) \right]^2} \qquad (7)$$

$$F_{MD}(w_k) = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left[ f(w_k, c_i) - F_{Max}(w_k) \right]^2} \qquad (8)$$

Accordingly, Weighted STD (WSTD), Normalized STD (NSTD), Weighted MD (WMD), and Normalized MD (NMD) can be defined systematically.
2.4 Comparative Studies of Thresholding Techniques

Table 1 provides a summary of different studies conducted to evaluate the performance of current thresholding techniques. The comparisons between Max and WAvg performed by (Yang & Pedersen 1997, Galavotti et al. 2000) indicated that using the Max is better than WAvg. However, the two measures used in these experiments, MicroF1 and the 11-point average, are mainly affected by the performance of frequent categories, especially in a highly skewed dataset such as Reuters-21578. The study performed by (Calvo & Ceccatto 2000) took MacroF1 into consideration, which is a measure of the performance of rare categories; however, its conclusion contradicted the MicroF1 results of (Yang & Pedersen 1997, Galavotti et al. 2000), despite using the same dataset. While the previous three studies used only the Reuters dataset in their evaluation, Soucy and Mineau (Soucy & Mineau 2003) used four different datasets and four feature scoring methods. However, they reported only the best performance achieved for each dataset, regardless of the feature scoring method and threshold value used. Additionally, the evaluation measure used in this study, MicroF1, is not generally affected by the performance of rare categories. The study of (Forman 2004) was conducted using different small threshold values and nineteen different datasets, but the results reported were only the average performance of the experiments conducted on these datasets. Both studies thus fall short of presenting the complete picture required to determine the most suitable thresholding techniques for different datasets. Díaz et al. (Díaz et al. 2004) presented a comparison between FLocal and Avg. Although this study did not evaluate WLocal or even the traditional Max, it supported the results of (Soucy & Mineau 2003, Forman 2004) that local thresholding may be better than global thresholding.

In order to avoid the pitfalls of these studies, in this work we present a comparative study among the proposed thresholding techniques and the six existing techniques. This study is conducted using different threshold values, different feature scoring methods, and datasets of diverse natures. The results are reported using MicroF1 and MacroF1 in order to evaluate the performance on frequent and rare categories.
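To make the local policies concrete, here is a minimal sketch of per-category DF scoring followed by FLocal/WLocal selection. The function names and data layout are our illustrative assumptions, not code from any of the studies above:

```python
from collections import Counter, defaultdict

def local_df_scores(labeled_docs):
    """Document Frequency of each term, computed locally per category.

    `labeled_docs` is a list of (category, token_list) pairs.
    Returns {category: Counter({term: df})}.
    """
    scores = defaultdict(Counter)
    for category, tokens in labeled_docs:
        for term in set(tokens):        # count each term once per document
            scores[category][term] += 1
    return scores

def select_local(scores, per_category, priors=None):
    """FLocal vs. WLocal selection from per-category scores.

    With `priors=None`, the same number of features is taken from every
    category (fixed local, FLocal).  With `priors` = {category: p(c)},
    each category's quota is proportional to its probability (weighted
    local, WLocal).  A sketch of the policies compared in Table 1.
    """
    selected = set()
    budget = per_category * len(scores)
    for cat, term_scores in scores.items():
        quota = per_category if priors is None else round(budget * priors[cat])
        ranked = sorted(term_scores, key=term_scores.get, reverse=True)
        selected.update(ranked[:quota])
    return selected
```

Note how, on a skewed prior such as {a: 0.25, b: 0.75}, WLocal shifts most of the feature budget to the frequent category, which is exactly the bias discussed in the comparisons above.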



Table 1: A Summary of Comparative Studies among Thresholding Techniques

Study                                                  | Scoring Methods            | Datasets                                   | Conclusions
Yang and Pedersen (Yang & Pedersen 1997)               | IG, χ2                     | Reuters-21578                              | 11-point average: Max > WAvg
Galavotti et al. (Galavotti, Sebastiani & Simi 2000)   | Simplified χ2              | Reuters-21578                              | MicroF1: Max > WAvg
Calvo and Ceccatto (Calvo & Ceccatto 2000)             | χ2                         | Reuters-21578                              | MicroF1: WAvg, WMax > Max; MacroF1: Max, WAvg > WMax
Soucy and Mineau (Soucy & Mineau 2003)                 | IG, DF, χ2, Cross Entropy  | Reuters-21578, lingSpam, DigiTrad, Ohsumed | MicroF1: FLocal > WLocal > Max
Forman (Forman 2004)                                   | IG, χ2                     | 19 datasets                                | IG (MacroF1): FLocal > WLocal > Max > Avg; χ2 (MacroF1): FLocal > Max > Avg > WLocal
Díaz et al. (Díaz et al. 2004)                         | IG, tf, tf·idf             | Reuters-21578, Ohsumed                     | Recall and Precision: FLocal > Avg

3 Experimental Setup

Three different data-sets were used in this study: the 20 Newsgroups Collection (20NG)1 (Joachims 1997), the Ohsumed data-set2 (Joachims 1998, Combarro et al. 2005), and the Reuters-21578 ModApté split3 (Reuters(90)) (Joachims 1998). For the Ohsumed data-set, the first 20,000 documents with abstracts published in 1991 were considered, which resulted in 23 categories; the first 10,000 documents were used as the training set and the rest as the test set. While 20NG is an evenly distributed data-set, Ohsumed can be considered a moderately diverse data-set: among its 23 categories, the largest has about 1,800 documents and the smallest has 65. On the other hand, Reuters(90) can be considered a highly diverse data-set, as it has about 30 categories whose number of documents is below 10.

The F1 measure is used as the effectiveness measure. F1 was first proposed as a measure of effectiveness in TC by Lewis (Lewis & Ringuette 1994). MicroF1 and MacroF1 are two measures based on F1. MacroF1 weights all categories equally, and thus it is influenced by the performance of rare categories. On the other hand, MicroF1 weights all documents equally, and therefore it is affected by the performance of frequent categories (Yang 1999).

The study compares the performance of six feature thresholding techniques, namely Averaging (Avg), Maximization (Max), Fixed Local (FLocal), and Weighted Local (WLocal), in addition to our proposed techniques Standard Deviation (STD) and Maximum Deviation (MD). The globalization techniques are evaluated using the original, weighted, and normalized scores. This amounts to fourteen thresholding methods, evaluated using four feature scoring methods. In this study we apply classification using SVMlight (Joachims 1999)4. In the following, we focus on the evaluation of the thresholding techniques for each data-set separately.

1. The 20 Newsgroups collection and the bydate split can be found at http://people.csail.mit.edu/people/jrennie/20Newsgroups
2. Ohsumed is available at http://trec.nist.gov/data/t9_filtering/.
3. Reuters-21578 is available at http://www.daviddlewis.com/resources/testcollections/reuters21578/.
4. SVMlight is publicly available at http://svmlight.joachims.org.

4 Results

4.1 20NG Data-set

Since the 20NG data-set is evenly distributed, the MicroF1 and MacroF1 values match; therefore, only the MicroF1 results are reported in Figure 2. Additionally, due to the equal distribution of categories, weighting or normalizing feature scores is of no value, since all categories have the same probability. The thresholding techniques evaluated on this data-set are therefore limited to the Avg, Max, STD, MD, and FLocal. The following observations can be made from the results:

- MD shows a limited improvement when using DF as a scoring method, especially noticeable at low threshold values. On the other hand, FLocal has the best performance compared to the remaining methods.

- Generally, the Max and MD exhibit almost identical performance for all feature scoring methods except DF. In fact, the feature sets they produce show a high similarity, generally above 90%. This is because these scoring methods take into account the relevance of the feature both in the category under investigation and in the other categories of the training set; therefore, using either the Max or MD tends to attain nearly the same set of features. This is an expected result, since these scoring methods follow the rationale of a good feature.

- Generally, FLocal shows a performance superior to the Max and MD in thresholding all feature scoring methods except DF. This is due to the diversity in the number of good features from one category to another: globalization schemes such as the Max or MD tend to be biased towards certain categories in accordance with the number of good features they include, whereas FLocal is an unbiased technique that forces the selection of the same number of features from all categories.

- MD is usually slightly better than STD, which shows the potential of MD to capture discriminative features.

- The Avg thresholding technique exhibits the worst performance, since it sums the scores of the features across categories and hence tends to select non-discriminative features.

4.2 Ohsumed Data-set

Tables 2 and 3 show the MicroF1 and MacroF1 performance for the Ohsumed data-set respectively.
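The two measures can be computed from per-category true-positive/false-positive/false-negative counts; the following is a generic sketch of the standard definitions (our own illustration, not code from the study):

```python
def micro_macro_f1(per_category):
    """Compute (MicroF1, MacroF1) from per-category (tp, fp, fn) counts.

    MicroF1 pools the counts over all categories, so it is dominated by
    frequent categories; MacroF1 averages per-category F1, so it is
    sensitive to rare categories.  Illustrative sketch only.
    """
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in per_category) / len(per_category)
    return micro, macro
```

With a frequent category scoring F1 = 0.9 and a rare one scoring 0.1, MicroF1 stays close to the frequent category's value while MacroF1 averages them, which is exactly the contrast the evaluation relies on.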



Figure 2: MicroF1 of the 20NG using Thresholding Techniques (a) CC, (b) DF, (c) IG, and (d) MI. (Each panel plots Micro-F1 against the threshold percentage, 0–10, for the Avg, MD, Max, STD, and FLocal techniques.)



Analyzing these results, we can make the following observations:

- NMD achieves the best MicroF1 and MacroF1 in thresholding DF scores.

- For CC, MI, and IG, FLocal has the best MacroF1 performance. On the other hand, WLocal is better than FLocal considering the MicroF1 at small threshold values. However, starting at a certain threshold, the performance of FLocal almost matches that of WLocal; this threshold differs from one scoring method to another.

- Consistent with the results on the 20NG, the MD and Max show a high degree of similarity in performance when thresholding CC, IG, and MI.

- The poor performance of the Avg in thresholding DF scores is not surprising. Selecting the highest average DF features at small threshold values tends to retain the features that exist in all categories; these features are not useful for discriminating among categories.

- Generally, weighting the scores performs poorly compared to both the original and normalized scores. On the other hand, normalizing scores is better than using the original scores at the macro level, since it adds bias towards rare categories.

4.3 Reuters(90) Data-set

Figures 3 and 4 illustrate the MicroF1 and MacroF1 performance for the Reuters(90) data-set respectively. Due to space limitations, we reduced the figures to include only the best six thresholding techniques (MD, Max, NMD, NMax, FLocal, and WLocal). The analysis of the performance of these techniques yields the following observations:

- The performance of NMD and NMax is very poor, especially at small threshold values. Since Reuters(90) is a highly skewed dataset, normalizing scores will be highly biased towards rare categories at small threshold values.

- At the micro level, MD, Max, FLocal, and WLocal have similar performance, with a limited superiority of WLocal at small threshold values due to the added bias towards frequent categories.

- At the macro level, one can conclude that WLocal is the best method in thresholding IG and MI at threshold values greater than 1%. It is surprising to observe that WLocal is generally better than FLocal on the macro level, despite the fact that FLocal allows the selection of more features from rare categories. Examining the F1 of individual categories shows that FLocal in fact slightly enhances the performance of rare categories; however, WLocal enhances the performance of frequent categories significantly, as it selects more features for them due to the highly skewed nature of Reuters(90). Since MacroF1, as mentioned in Section 3, gives the same weight to all categories, WLocal is generally better than FLocal in thresholding IG and MI.

- On the other hand, FLocal is the best method considering CC and DF for threshold values less than 7.5%; beyond this threshold, WLocal outperforms FLocal. Using WLocal to threshold CC and DF scores at small threshold values leads to the selection of a very small number of features from rare categories. If the scoring method used is not in itself a high-performing scheme, then adding this limited number of selected features will not be enough to discriminate rare categories. This argument is asserted by examining the F1 of individual categories.

5 Conclusions

In this study we propose performing global feature selection using the Standard Deviation (STD) and Maximum Deviation (MD). In order to balance the bias of some scoring methods towards frequent categories, we suggest normalizing feature scores before applying the globalization scheme. We conducted a comparative study of our new techniques and the state-of-the-art thresholding techniques, using four feature scoring methods and datasets of different natures.

Generally, localization techniques are better than globalization methods, which supports the results obtained by (Soucy & Mineau 2003, Díaz et al. 2004, Forman 2004). Additionally, localization techniques are much faster, since thresholding is done on each category individually as opposed to the whole training set. Furthermore, thresholding could be performed on local categories independently in parallel, which may help speed up the process of dimensionality reduction.

With the exception of the DF scoring method, FLocal is the best thresholding method on the evenly distributed data-set and generally on the moderately diverse data-set. On the other hand, WLocal is the best in MicroF1, and in some cases MacroF1, on the highly diverse data-set. Furthermore, WLocal achieves a performance that is either better than or equivalent to that of FLocal in the MicroF1 of the moderately diverse data-set. MD shows an enhancement in the performance of thresholding DF scores on the even data-set. Normalized MD also shows potential for improving the performance on the moderately diverse data-set, and at large threshold values on the highly diverse data-set. However, MD has a performance similar to the Max for all other feature scoring methods. Generally, MD outperforms STD in most cases, which supports the view that MD is more efficient in identifying good features. On the other hand, the Avg shows a poor performance compared to other thresholding techniques; this is consistent with the results of (Yang & Pedersen 1997, Galavotti et al. 2000). Normalizing the feature scores before applying the globalization method also shows an enhancement in the MacroF1 of the moderately diverse data-set. However, when using IG and MI on highly-skewed datasets, the performance does not match that of the original unmodified scores.

As a general conclusion, the results suggest the usage of MD on an evenly distributed data-set and NMD on a moderately diverse data-set for thresholding DF scores. For methods other than DF, the results recommend FLocal for the even data-set. Furthermore, WLocal should be used on a moderately diverse data-set when frequent categories are more important and the number of desired features is small; otherwise, FLocal is recommended. For a highly skewed data-set, WLocal is recommended for IG and MI, as it generally enhances the classification of both frequent and rare categories. Additionally, WLocal is suggested for CC and DF if frequent categories are more important, while FLocal is recommended when rare categories matter more.
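These recommendations can be condensed into a small selection helper. This is our own illustrative paraphrase of the conclusions (the function and the category labels "even"/"moderate"/"high" are hypothetical names, not from the study), and it is not an exhaustive decision procedure:

```python
def recommend_thresholding(distribution, scoring):
    """Map the study's closing recommendations to a technique name.

    `distribution` is the dataset skew: "even", "moderate", or "high";
    `scoring` is one of "DF", "IG", "MI", "CC".  Where the paper makes the
    choice conditional (e.g. on whether frequent categories matter more),
    both options are returned separated by a slash.
    """
    if scoring == "DF":
        return {"even": "MD", "moderate": "NMD",
                "high": "WLocal/FLocal"}[distribution]
    if distribution == "even":
        return "FLocal"
    if distribution == "moderate":
        return "FLocal (WLocal if frequent categories dominate)"
    # highly skewed data-sets
    return "WLocal" if scoring in ("IG", "MI") else "WLocal/FLocal"
```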


Table 2: MicroF1 of the Ohsumed data-set

CC:
 %    Avg   WAvg  NAvg  MD    WMD   NMD   Max   WMax  NMax  STD   WSTD  NSTD  FLocal WLocal
0.5  0.193 0.248 0.150 0.375 0.299 0.366 0.377 0.299 0.365 0.356 0.311 0.347 0.380  0.413
1    0.277 0.289 0.242 0.444 0.356 0.416 0.455 0.354 0.424 0.420 0.407 0.411 0.447  0.457
1.5  0.328 0.326 0.298 0.488 0.420 0.460 0.487 0.406 0.453 0.456 0.445 0.438 0.490  0.505
2.5  0.441 0.392 0.358 0.530 0.472 0.528 0.537 0.472 0.534 0.509 0.488 0.504 0.542  0.546
5    0.511 0.448 0.456 0.571 0.546 0.574 0.573 0.539 0.579 0.559 0.543 0.558 0.584  0.585
7.5  0.552 0.484 0.503 0.587 0.568 0.589 0.589 0.569 0.596 0.581 0.569 0.581 0.601  0.596
10   0.567 0.514 0.527 0.596 0.580 0.600 0.599 0.589 0.603 0.592 0.583 0.595 0.604  0.602

DF:
 %    Avg   WAvg  NAvg  MD    WMD   NMD   Max   WMax  NMax  STD   WSTD  NSTD  FLocal WLocal
0.5  0.000 0.062 0.000 0.349 0.251 0.423 0.277 0.243 0.381 0.298 0.255 0.391 0.385  0.395
1    0.141 0.177 0.089 0.389 0.293 0.486 0.378 0.293 0.444 0.379 0.315 0.458 0.455  0.430
1.5  0.370 0.356 0.384 0.431 0.324 0.507 0.402 0.334 0.491 0.406 0.329 0.498 0.485  0.477
2.5  0.370 0.356 0.384 0.494 0.401 0.548 0.486 0.401 0.526 0.481 0.418 0.529 0.527  0.516
5    0.485 0.471 0.475 0.544 0.513 0.591 0.544 0.515 0.574 0.541 0.510 0.576 0.573  0.557
7.5  0.534 0.520 0.534 0.575 0.541 0.604 0.576 0.541 0.596 0.574 0.538 0.591 0.597  0.576
10   0.553 0.547 0.556 0.592 0.561 0.605 0.589 0.559 0.600 0.585 0.565 0.601 0.601  0.586

IG:
 %    Avg   WAvg  NAvg  MD    WMD   NMD   Max   WMax  NMax  STD   WSTD  NSTD  FLocal WLocal
0.5  0.480 0.445 0.432 0.493 0.445 0.441 0.493 0.445 0.441 0.491 0.444 0.438 0.486  0.505
1    0.520 0.495 0.493 0.537 0.504 0.501 0.535 0.504 0.501 0.530 0.504 0.500 0.532  0.553
1.5  0.545 0.533 0.529 0.562 0.531 0.524 0.562 0.531 0.525 0.558 0.533 0.529 0.567  0.578
2.5  0.580 0.553 0.563 0.598 0.565 0.560 0.596 0.566 0.561 0.603 0.566 0.566 0.605  0.603
5    0.606 0.590 0.595 0.628 0.593 0.592 0.627 0.592 0.608 0.625 0.598 0.601 0.629  0.628
7.5  0.610 0.603 0.600 0.633 0.612 0.615 0.633 0.612 0.617 0.629 0.613 0.619 0.635  0.630
10   0.611 0.606 0.607 0.633 0.620 0.622 0.632 0.620 0.624 0.633 0.622 0.625 0.636  0.632

MI:
 %    Avg   WAvg  NAvg  MD    WMD   NMD   Max   WMax  NMax  STD   WSTD  NSTD  FLocal WLocal
0.5  0.470 0.400 0.274 0.483 0.400 0.435 0.484 0.399 0.436 0.470 0.404 0.431 0.475  0.495
1    0.520 0.459 0.410 0.525 0.476 0.490 0.523 0.478 0.491 0.514 0.474 0.488 0.518  0.535
1.5  0.551 0.489 0.469 0.543 0.504 0.518 0.543 0.503 0.518 0.536 0.498 0.511 0.562  0.563
2.5  0.587 0.536 0.524 0.582 0.537 0.551 0.585 0.539 0.555 0.568 0.531 0.565 0.595  0.592
5    0.616 0.583 0.591 0.610 0.573 0.604 0.610 0.577 0.608 0.602 0.571 0.600 0.620  0.617
7.5  0.624 0.603 0.610 0.617 0.592 0.611 0.616 0.595 0.614 0.611 0.592 0.609 0.621  0.621
10   0.625 0.608 0.610 0.620 0.604 0.614 0.622 0.607 0.616 0.612 0.598 0.610 0.623  0.618
Table 3: Macro-F1 of the Ohsumed data-set
(rows: thresholding technique; columns: percentage of features kept; one block per feature scoring method)

          0.5    1      1.5    2.5    5      7.5    10
CC
  Avg     0.108  0.154  0.202  0.304  0.391  0.439  0.455
  WAvg    0.083  0.099  0.127  0.185  0.246  0.302  0.341
  NAvg    0.145  0.225  0.285  0.331  0.422  0.463  0.480
  MD      0.195  0.289  0.329  0.381  0.429  0.464  0.473
  WMD     0.093  0.162  0.226  0.291  0.375  0.409  0.440
  NMD     0.269  0.326  0.372  0.427  0.482  0.498  0.517
  Max     0.198  0.295  0.333  0.391  0.429  0.467  0.481
  WMax    0.093  0.154  0.202  0.272  0.353  0.415  0.449
  NMax    0.274  0.339  0.369  0.434  0.488  0.506  0.519
  STD     0.176  0.239  0.275  0.333  0.417  0.451  0.479
  WSTD    0.119  0.216  0.246  0.289  0.366  0.414  0.438
  NSTD    0.203  0.265  0.318  0.396  0.470  0.491  0.503
  Flocal  0.273  0.362  0.395  0.444  0.499  0.517  0.521
  WLocal  0.273  0.316  0.365  0.407  0.465  0.482  0.497
DF
  Avg     0.000  0.041  0.109  0.185  0.310  0.374  0.405
  WAvg    0.018  0.049  0.095  0.159  0.276  0.349  0.390
  NAvg    0.000  0.023  0.098  0.208  0.320  0.401  0.442
  MD      0.160  0.205  0.238  0.304  0.379  0.440  0.470
  WMD     0.067  0.098  0.127  0.196  0.336  0.386  0.416
  NMD     0.332  0.393  0.422  0.462  0.511  0.524  0.521
  Max     0.084  0.197  0.214  0.300  0.379  0.441  0.466
  WMax    0.065  0.097  0.135  0.196  0.336  0.386  0.412
  NMax    0.242  0.344  0.387  0.430  0.488  0.510  0.514
  STD     0.112  0.199  0.220  0.295  0.375  0.436  0.462
  WSTD    0.067  0.118  0.130  0.218  0.331  0.368  0.412
  NSTD    0.254  0.353  0.396  0.427  0.485  0.503  0.514
  Flocal  0.272  0.359  0.392  0.440  0.490  0.514  0.518
  WLocal  0.230  0.274  0.332  0.389  0.435  0.460  0.477
IG
  Avg     0.310  0.391  0.413  0.463  0.510  0.517  0.517
  WAvg    0.244  0.283  0.355  0.372  0.444  0.478  0.488
  NAvg    0.342  0.409  0.448  0.492  0.523  0.520  0.523
  MD      0.360  0.421  0.456  0.500  0.554  0.555  0.558
  WMD     0.239  0.286  0.336  0.388  0.429  0.469  0.486
  NMD     0.371  0.448  0.478  0.513  0.537  0.548  0.549
  Max     0.360  0.418  0.454  0.497  0.548  0.555  0.555
  WMax    0.239  0.287  0.336  0.389  0.429  0.469  0.488
  NMax    0.371  0.448  0.478  0.515  0.546  0.549  0.552
  STD     0.340  0.414  0.453  0.504  0.546  0.553  0.553
  WSTD    0.244  0.289  0.352  0.391  0.432  0.478  0.503
  NSTD    0.366  0.446  0.483  0.518  0.540  0.550  0.549
  Flocal  0.405  0.457  0.508  0.546  0.561  0.566  0.562
  WLocal  0.390  0.445  0.476  0.500  0.549  0.552  0.553
MI
  Avg     0.295  0.387  0.425  0.471  0.521  0.533  0.537
  WAvg    0.192  0.242  0.261  0.334  0.410  0.464  0.477
  NAvg    0.281  0.372  0.413  0.473  0.520  0.529  0.530
  MD      0.317  0.396  0.418  0.468  0.520  0.530  0.534
  WMD     0.192  0.266  0.287  0.349  0.406  0.436  0.466
  NMD     0.359  0.422  0.462  0.495  0.534  0.538  0.536
  Max     0.317  0.395  0.419  0.473  0.520  0.530  0.534
  WMax    0.192  0.266  0.286  0.350  0.410  0.440  0.468
  NMax    0.364  0.424  0.462  0.499  0.538  0.538  0.536
  STD     0.303  0.359  0.407  0.445  0.505  0.519  0.521
  WSTD    0.196  0.265  0.285  0.340  0.404  0.440  0.458
  NSTD    0.356  0.412  0.445  0.501  0.528  0.531  0.527
  Flocal  0.397  0.446  0.498  0.528  0.548  0.548  0.545
  WLocal  0.386  0.427  0.448  0.488  0.532  0.540  0.535

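The global columns in the tables above (Avg, MD, Max, STD and their weighted W- and normalized N- variants) combine the per-category feature scores into one global score per feature before the top-scoring percentage of features is kept; the local columns threshold each category's list separately. The sketch below illustrates the global pipeline. The MD and STD formulas here are one plausible reading of the schemes studied in this paper, not copied from it, and the weighted and normalized variants are omitted.

```python
import numpy as np

def global_scores(local_scores, scheme="max"):
    """Combine per-category feature scores into one global score per feature.

    local_scores: array of shape (n_features, n_categories) holding a local
    scoring metric such as DF, IG, CC, or MI computed per category.
    The MD/STD formulations below are illustrative assumptions.
    """
    mean = local_scores.mean(axis=1)
    if scheme == "avg":   # Avg: mean score over categories
        return mean
    if scheme == "max":   # Max: best score over any category
        return local_scores.max(axis=1)
    if scheme == "md":    # MD: largest deviation from the feature's mean score
        return np.abs(local_scores - mean[:, None]).max(axis=1)
    if scheme == "std":   # STD: spread of the scores across categories
        return local_scores.std(axis=1)
    raise ValueError(f"unknown scheme: {scheme}")

def select_top_percent(scores, percent):
    """Return indices of the highest-scoring `percent`% of features."""
    k = max(1, int(len(scores) * percent / 100))
    return np.argsort(scores)[::-1][:k]
```

For example, a feature scoring [0.1, 0.9] over two categories gets a Max score of 0.9 and an MD score of 0.4; `select_top_percent` then keeps the threshold percentages (0.5% through 10%) swept in the tables.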

CRPIT Volume 61

[Figure: four panels, (a)-(d), each plotting Micro-F1 (y-axis) against the threshold percentage (x-axis, 0-10) for the MD, NMD, Max, NMax, FLocal, and WLocal techniques.]
Figure 3: Micro-F1 of the Reuters(90) data-set using the Thresholding Techniques (a) CC, (b) DF, (c) IG, and (d) MI
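The Micro-F1 and Macro-F1 values reported in the tables and figures can be computed from per-category contingency counts. This is the standard formulation; the paper's exact averaging conventions are assumed to match it.

```python
def micro_macro_f1(tp, fp, fn):
    """Micro- and Macro-averaged F1 from per-category counts.

    tp, fp, fn: equal-length lists of true-positive, false-positive, and
    false-negative counts, one entry per category.
    """
    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0
    # Micro: pool the counts over all categories, then compute F1 once
    # (dominated by frequent categories).
    micro = f1(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per category, then average, so rare categories
    # weigh as much as frequent ones -- hence the lower Macro-F1 values.
    macro = sum(f1(t, p, n) for t, p, n in zip(tp, fp, fn)) / len(tp)
    return micro, macro
```

With a frequent category scoring well (tp=8, fp=2, fn=0) and a rare one scoring poorly (tp=1, fp=0, fn=9), Micro-F1 stays high while Macro-F1 drops, mirroring the gap between Figures 3 and 4.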


Proc. Fifth Australasian Data Mining Conference (AusDM2006)

[Figure: four panels, (a)-(d), each plotting Macro-F1 (y-axis) against the threshold percentage (x-axis, 0-10) for the MD, NMD, Max, NMax, FLocal, and WLocal techniques.]

Figure 4: Macro-F1 of the Reuters(90) data-set using Thresholding Techniques (a) CC, (b) DF, (c) IG, and (d) MI



References

Antonellis, I., Bouras, C. & Poulopoulos, V. (2006), Personalized news categorization through scalable text classification, in 'Proc. of the 8th Asia-Pacific Web Conference - Frontiers of WWW Research and Development (APWeb 2006)', Vol. 3841 of Lecture Notes in Computer Science, Springer-Verlag New York, Inc., Harbin, China, pp. 391-401.

Apté, C., Damerau, F. & Weiss, S. (1994), 'Automated learning of decision rules for text categorization', ACM Transactions on Information Systems (TOIS) 12(3), 233-251.

Bang, S., Yang, J. & Yang, H. (2006), 'Hierarchical document categorization with k-NN and concept-based thesauri', Information Processing and Management 42(2), 387-406.

Blum, A. & Langley, P. (1997), 'Selection of relevant features and examples in machine learning', Artificial Intelligence 97(1-2), 245-271.

Calvo, R. & Ceccatto, H. (2000), 'Intelligent document classification', Intelligent Data Analysis 4(5), 411-420.

Chen, C.-M., Lee, H.-M. & Hwang, C.-W. (2005), 'A hierarchical neural network document classifier with linguistic feature selection', Applied Intelligence 23(3), 277-294.

Cohen, W. & Singer, Y. (1999), 'Context-sensitive learning methods for text categorization', ACM Transactions on Information Systems 17(2), 141-173.

Combarro, E., Montañés, E., Díaz, I., Ranilla, J. & Mones, R. (2005), 'Introducing a family of linear measures for feature selection in text categorization', IEEE Transactions on Knowledge and Data Engineering 17(9), 1223-1232.

Debole, F. & Sebastiani, F. (2005), 'An analysis of the relative hardness of Reuters-21578 subsets', Journal of the American Society for Information Science and Technology (JASIST) 56(6), 584-596.

Díaz, I., Ranilla, J., Montañés, E., Fernández, J. & Combarro, E. (2004), 'Improving performance of text categorization by combining filtering and support vector machines', Journal of the American Society for Information Science and Technology (JASIST) 55(7), 579-592.

Dumais, S., Platt, J., Heckerman, D. & Sahami, M. (1998), Inductive learning algorithms and representations for text categorization, in 'Proc. of the 7th ACM International Conference on Information and Knowledge Management (CIKM'98)', ACM Press, New York, United States, Bethesda, United States, pp. 148-155.

Forman, G. (2004), A pitfall and solution in multi-class feature selection for text classification, in 'Proc. of the 21st International Conference on Machine Learning (ICML'04)', Vol. 69 of ACM International Conference Proceeding Series, ACM Press, New York, United States, Banff, Alberta, Canada, p. 38.

Freitas-Junior, H., Ribeiro-Neto, B., Vale, R., Laender, A. & Lima, L. (2006), 'Categorization-driven cross-language retrieval of medical information', Journal of the American Society for Information Science and Technology (JASIST) 57(4), 501-510.

Galavotti, L., Sebastiani, F. & Simi, M. (2000), Experiments on the use of feature selection and negative evidence in automated text categorization, in 'Proc. of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL'00)', Vol. 1923 of Lecture Notes in Computer Science, Springer-Verlag New York, Inc., Lisbon, Portugal, pp. 59-68.

Guyon, I. & Elisseeff, A. (2003), 'An introduction to variable and feature selection', Journal of Machine Learning Research (JMLR) 3, 1157-1182. Special issue on variable and feature selection.

Joachims, T. (1997), A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, in 'Proc. of the 14th International Conference on Machine Learning (ICML'97)', Morgan Kaufmann Publishers, San Francisco, United States, Nashville, United States, pp. 143-151.

Joachims, T. (1998), Text categorization with support vector machines: learning with many relevant features, in 'Proc. of the 10th European Conference on Machine Learning (ECML'98)', Vol. 1398 of Lecture Notes in Computer Science, Springer-Verlag New York, Inc., Chemnitz, Germany, pp. 137-142.

Joachims, T. (1999), Making large-scale support vector machine learning practical, in 'Advances in Kernel Methods - Support Vector Learning', MIT Press, Cambridge, MA, United States, pp. 169-184.

Johnson, D., Oles, F. J., Zhang, T. & Goetz, T. (2002), 'A decision-tree-based symbolic rule induction system for text categorization', IBM Systems Journal 41(3), 428-437.

Kazama, J. & Tsujii, J. (2005), 'Maximum entropy models with inequality constraints: A case study on text categorization', Machine Learning 60(1-3), 159-194.

Kohavi, R. & John, G. H. (1997), 'Wrappers for feature subset selection', Artificial Intelligence 97(1-2), 273-324.

Kolenda, T., Hansen, L. & Sigurdsson, S. (2000), Independent components in text, in M. Girolami, ed., 'Advances in Independent Component Analysis', Springer-Verlag New York, Inc., pp. 229-250.

Lewis, D. & Ringuette, M. (1994), A comparison of two learning algorithms for text categorization, in 'Proc. of the 3rd Symposium on Document Analysis and Information Retrieval (SDAIR'94)', ISRI, University of Nevada, Las Vegas, United States, pp. 81-93.

Li, F. & Yang, Y. (2003), A loss function analysis for classification methods in text categorization, in 'Proc. of the 20th International Conference on Machine Learning (ICML'03)', AAAI Press, Menlo Park, United States, Washington, DC, USA, pp. 472-479.

Montoyo, A., Suarez, A., Rigau, G. & Palomar, M. (2005), 'Combining knowledge- and corpus-based word-sense-disambiguation methods', Journal of Artificial Intelligence Research 23, 299-330.


Ng, H., Goh, W. & Low, K. (1997), Feature selection, perceptron learning, and a usability case study for text categorization, in 'Proc. of the 20th ACM International Conference on Research and Development in Information Retrieval (SIGIR'97)', ACM Press, New York, United States, Philadelphia, United States, pp. 67-73.

Peng, F., Schuurmans, D. & Wang, S. (2004), 'Augmenting Naive Bayes classifiers with statistical language models', Information Retrieval 7(3-4), 317-345.

Ruiz, M. & Srinivasan, P. (2002), 'Hierarchical text classification using neural networks', Journal of Information Retrieval 5(1), 87-118.

Salton, G. & Buckley, C. (1988), 'Term-weighting approaches in automatic text retrieval', Information Processing and Management 24(5), 513-523.

Salton, G., Wong, A. & Yang, C. S. (1975), 'A vector space model for automatic indexing', Communications of the ACM 18(11), 613-620.

Sebastiani, F. (2002), 'Machine learning in automated text categorization', ACM Computing Surveys (CSUR) 34(1), 1-47.

Shen, D., Sun, J.-T., Yang, Q. & Chen, Z. (2006), A comparison of implicit and explicit links for web page classification, in 'Proc. of the 15th International Conference on World Wide Web (WWW'06)', ACM Press, New York, United States, Edinburgh, Scotland, pp. 643-650.

Song, F., Liu, S. & Yang, J. (2005), 'A comparative study on text representation schemes in text categorization', Pattern Analysis & Applications 8(1-2), 199-209.

Soucy, P. & Mineau, G. W. (2003), Feature selection strategies for text categorization, in 'Proc. of the 16th Conference of the Canadian Society for Computational Studies of Intelligence (AI 2003)', Vol. 2671 of Lecture Notes in Computer Science, Springer-Verlag New York, Inc., Halifax, Canada, pp. 505-509.

Tsay, J.-J., Shih, C.-Y. & Wu, B.-L. (2005), Autocrawler: An integrated system for automatic topical crawler, in 'Proc. of the 4th Annual ACIS International Conference on Computer and Information Science (ICIS'05)', IEEE Computer Society, Washington, DC, USA, Jeju Island, South Korea, pp. 462-467.

Wiener, E., Pedersen, J. & Weigend, A. (1995), A neural network approach to topic spotting, in 'Proc. of the 4th Symposium on Document Analysis and Information Retrieval (SDAIR'95)', Las Vegas, United States, pp. 317-332.

Yang, Y. (1995), Noise reduction in a statistical approach to text categorization, in 'Proc. of the 18th ACM International Conference on Research and Development in Information Retrieval (SIGIR'95)', ACM Press, New York, United States, Seattle, Washington, United States, pp. 256-263.

Yang, Y. (1999), 'An evaluation of statistical approaches to text categorization', Journal of Information Retrieval 1(1/2), 69-90.

Yang, Y. & Pedersen, J. (1997), A comparative study on feature selection in text categorization, in 'Proc. of the 14th International Conference on Machine Learning (ICML'97)', Morgan Kaufmann Publishers, San Francisco, United States, Nashville, Tennessee, United States, pp. 412-420.

Yang, Y., Slattery, S. & Ghani, R. (2002), 'A study of approaches to hypertext categorization', Journal of Intelligent Information Systems 18(2/3), 219-241. Special issue on automated text categorization.

Yang, Y., Zhang, J. & Kisiel, B. (2003), A scalability analysis of classifiers in text categorization, in 'Proc. of the 26th ACM International Conference on Research and Development in Information Retrieval (SIGIR'03)', ACM Press, New York, United States, Toronto, Canada, pp. 96-103.


