Improving KNN-based e-mail classification into folders generating class-balanced datasets

Pablo Bermejo, Jose A. Gamez, Jose M. Puerta, Roberto Uribe

In this paper we deal with an e-mail classification problem known as e-mail foldering, which consists of classifying incoming mail into the folders previously created by the user. This task has received less attention in the literature than spam filtering and is quite complex because the class variable usually has a large cardinality (number of folders) and is poorly balanced (documents per class). On the other hand, proximity-based algorithms have been used in a wide range of fields for decades. One of the main drawbacks of these classifiers, known as lazy classifiers, is their computational load, since they must compute the distance from a new sample to each point in the vector space to decide which class it belongs to. This is why most of the techniques developed for these classifiers consist of editing and condensing the training set. In this work we address the problem of e-mail classification into folders. We propose a new neighbourhood-based algorithm called Gaussian Balanced K-NN, which neither edits nor condenses the database but samples a whole new training set from the marginal Gaussian distributions of the initial set. This algorithm lets the user choose the computational load of the classifier and also balances the training set, alleviating the same problems that editing and condensation techniques try to solve.
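
The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so the sketch assumes that each feature is modelled per class by an independent (marginal) Gaussian, that the same number of synthetic points is drawn for every class, and that a plain Euclidean k-NN vote is used afterwards. The names sample_balanced_gaussian, knn_predict and n_per_class are hypothetical.

```python
import random
import statistics
from collections import Counter, defaultdict


def sample_balanced_gaussian(X, y, n_per_class, seed=0):
    """Draw n_per_class synthetic points per class from per-feature Gaussians.

    Assumption: features are treated as independent within each class, so
    only the marginal mean and standard deviation of each feature are used.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    X_new, y_new = [], []
    for label, rows in by_class.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [statistics.fmean(col) for col in columns]
        # pstdev is 0.0 for a single-sample class, so its mean is repeated
        stds = [statistics.pstdev(col) for col in columns]
        for _ in range(n_per_class):
            X_new.append([rng.gauss(m, s) for m, s in zip(means, stds)])
            y_new.append(label)
    return X_new, y_new


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    scored = sorted(
        ((sum((a - b) ** 2 for a, b in zip(row, x)), label)
         for row, label in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Under these assumptions, n_per_class fixes the size of the sampled training set (number of classes times n_per_class), which is what bounds the number of distance computations per query, and every folder ends up with the same number of training points regardless of how many e-mails it originally held.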
