New Feature Selection Method Improves the Discrimination of Terms for Text Categorization

Category Computer Science

tldr #

A research team from East China Normal University, led by Li Zhang, has proposed a new method, the max-difference maximization criterion (MDMC), for estimating the importance of terms in text categorization. The method introduces a new weight based on class information occupancy and combines it with Accuracy2 (ACC2) to discriminate between terms more effectively. In the team's experiments, MDMC outperformed other filter methods as well as dimensionality reduction methods such as ISCA, PCA, and NMF.

content #

Text categorization requires selecting a set of highly discriminative features (terms) through feature selection. In text feature selection, Accuracy2 (ACC2) assigns the same score to terms that have the same absolute difference in document rates, even when their discriminative power differs, which is unreasonable. Existing improvements on ACC2, namely the normalized difference measure (NDM), max-min ratio (MMR), and trigonometric comparison measure (TCM), may misjudge the importance of rare and sparse terms because their parameters are difficult to choose.
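The ACC2 limitation described above can be illustrated with a small sketch. It assumes the commonly used definition of ACC2 as the absolute difference between a term's document rate in the positive class and in the negative class; the function name, argument names, and document counts are hypothetical and chosen only for illustration.

```python
def acc2(tp, fp, n_pos, n_neg):
    """ACC2 score of a term, assuming the usual definition |tpr - fpr|.

    tp / fp: number of positive / negative documents containing the term.
    n_pos / n_neg: total number of positive / negative documents.
    """
    tpr = tp / n_pos  # document rate in the positive class
    fpr = fp / n_neg  # document rate in the negative class
    return abs(tpr - fpr)


# Hypothetical counts: term A occurs only in the positive class,
# term B is frequent in both classes.
score_a = acc2(tp=25, fp=0, n_pos=100, n_neg=100)
score_b = acc2(tp=75, fp=50, n_pos=100, n_neg=100)

print(score_a, score_b)
```

Both terms receive the same score, although term A never appears in the negative class and is arguably the stronger class indicator. This is the kind of tie that a class-dependent weight is meant to break.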

MDMC is proposed by a research team from East China Normal University

To solve these problems, a research team led by Li Zhang published their new research in Frontiers of Computer Science. The team proposed the max-difference maximization criterion (MDMC), which introduces a new weight based on class information occupancy and combines it with ACC2 to estimate the importance of terms. As a result, MDMC avoids overestimating sparse terms.

In the study, the team analyzes the weight distributions of ACC2, NDM, MMR, TCM, and MDMC and illustrates intuitively how MDMC estimates the importance of terms; these illustrations are provided in the online resources. Experiments demonstrate that MDMC, which requires no parameters, captures more discriminative terms than the other filter methods regardless of the classifier used, and that it outperforms other dimensionality reduction methods: the improved sine cosine algorithm (ISCA), principal component analysis (PCA), and non-negative matrix factorization (NMF).
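To make the "filter method, regardless of classifier" idea concrete, here is a minimal sketch of filter-style selection: terms are scored from class-wise document rates alone and the top k are kept before any classifier is trained. The scorer shown is ACC2 (assumed to be |tpr - fpr|) as a stand-in, not MDMC, whose formula is not given in this article; the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def rank_terms(docs, labels, score, k):
    """Return the k highest-scoring terms; no classifier is consulted."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term at most once per document (document frequency).
        (pos_df if y == 1 else neg_df).update(set(tokens))
    vocab = sorted(set(pos_df) | set(neg_df))
    scored = {t: score(pos_df[t] / n_pos, neg_df[t] / n_neg) for t in vocab}
    # sorted() is stable, so ties keep alphabetical order.
    return sorted(vocab, key=lambda t: scored[t], reverse=True)[:k]


def acc2_score(tpr, fpr):
    """Stand-in scorer: ACC2, assumed to be |tpr - fpr|."""
    return abs(tpr - fpr)


# Toy corpus: label 1 = sports, label 0 = finance (hypothetical data).
docs = [
    ["goal", "match", "team"],
    ["goal", "team"],
    ["stock", "market", "team"],
    ["stock", "price"],
]
labels = [1, 1, 0, 0]

top = rank_terms(docs, labels, acc2_score, k=2)
print(top)
```

Because the scorer is passed in as a function, any other filter criterion that maps the two document rates to a score could be swapped in without touching the ranking code.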

MDMC introduces a newly developed weight based on class information occupancy
