PROCEEDINGS IPMU '08
Improving KNN-based e-mail classification into folders generating class-balanced datasets
Pablo Bermejo, Jose A. Gamez, Jose M. Puerta, Roberto Uribe
In this paper we deal with an e-mail
classification problem known as e-mail
foldering, which consists of the
classification of incoming mail into
the different folders previously created
by the user. This task has received
less attention in the literature
than spam filtering and is quite complex
due to the (usually large) cardinality
(number of folders) and lack
of balance (documents per class) of
the class variable. On the other
hand, proximity-based algorithms
have been used in a wide range of
fields for decades. One of the
main drawbacks of these classifiers,
known as lazy classifiers, is their
computational load, since they need
to compute the distance from a new
sample to every training point in the
vector space to decide which class it
belongs to. This is why most of the
techniques developed for these classifiers
consist of the editing and condensation
of the training set. In this work we
address the problem of
e-mail classification into folders. We
propose a new neighbourhood-based
algorithm called Gaussian Balanced
K-NN, which neither edits
nor condenses the database but samples
a whole new training set from
the marginal Gaussian distributions
of the initial set. This algorithm lets
the user choose the computational load of the
classifier and also balances the training
set, alleviating the same problems
that editing and condensation
techniques try to solve.
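
The resampling idea described above can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: each class is modelled with independent per-feature (marginal) Gaussians estimated from its training samples, the same number of synthetic points is drawn for every class, and a plain majority-vote k-NN is run on the synthetic set. The names `sample_balanced_set` and `knn_predict` are hypothetical and do not come from the paper.

```python
import numpy as np


def sample_balanced_set(X, y, n_per_class, rng):
    """Draw n_per_class synthetic points per class.

    Assumption: every feature of every class is modelled by an
    independent Gaussian (mean and standard deviation estimated
    from the original samples of that class).
    """
    X_parts, y_parts = [], []
    for label in np.unique(y):
        X_class = X[y == label]
        mean = X_class.mean(axis=0)
        std = X_class.std(axis=0)  # 0 for a single-sample class: copies of the mean
        X_parts.append(rng.normal(mean, std, size=(n_per_class, X.shape[1])))
        y_parts.append(np.full(n_per_class, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


def knn_predict(X_train, y_train, X_new, k=3):
    """Plain k-NN with squared Euclidean distance and majority vote."""
    dists = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    predictions = []
    for neighbour_labels in y_train[nearest]:
        labels, counts = np.unique(neighbour_labels, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Toy, heavily unbalanced data (hypothetical): 200 points vs. 10 points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(10, 2))])
y = np.array([0] * 200 + [1] * 10)

# n_per_class fixes both the class balance and the size of the set
# that k-NN has to scan, i.e. the computational load.
X_bal, y_bal = sample_balanced_set(X, y, n_per_class=50, rng=rng)
pred = knn_predict(X_bal, y_bal, np.array([[0.0, 0.0], [6.0, 6.0]]), k=5)
```

In this sketch the parameter `n_per_class` plays the role that the abstract attributes to the algorithm: it sets the size of the sampled training set (and hence the number of distance computations per query) while giving every class the same number of points.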