CumInCAD is a Cumulative Index about publications in Computer Aided Architectural Design
supported by the sibling associations ACADIA, CAADRIA, eCAADe, SIGraDi, ASCAAD and CAAD futures

authors Scott, Sam
year 1998
title Feature Engineering for a Symbolic Approach to Text Classification
source University of Ottawa Computer Science
summary Most text classification research to date has used the standard 'bag of words' model for text representation inherited from the word-based indexing techniques used in information retrieval research. There have been a number of past attempts to find better representations, but very few positive results have been found. Most of this previous work, however, has concentrated on retrieval rather than classification tasks, and none has involved symbolic learning algorithms. This thesis investigates a number of feature engineering methods for text classification in the context of a symbolic rule-based learning algorithm. The focus is on changing the standard 'bag of words' representation of text by incorporating some shallow linguistic processing techniques. Several new representations of text are explored in the hopes that they will allow the learner to find points of high information gain that were not present in the original set of words. Representations based on both semantic and syntactic linguistic knowledge are defined and evaluated using the RIPPER rule-learning system. Two major corpora are used for evaluation: a standard, widely-used corpus of news stories, and a new corpus of folk song lyrics. The results of the experiments are mostly negative. Although in some cases the new representations are at least as good as the bag of words, the improvements in quantitative performance that were hoped for do not materialize. However, the results are not entirely discouraging. The syntactically defined representations may enable the learner to produce simpler and more comprehensible hypotheses, and the semantically defined representations do produce some real performance gains on smaller classification tasks that for various reasons fail to scale up to larger tasks. Some ideas are offered as to why the new representations fail to produce better results, and some suggestions are made for continuing the research in future.
series thesis:MSc
last changed 2003/02/12 21:37