The classifiers we have examined so far (SemiL, ISDA, SVMLight, Mallet) put documents into two groups, positive or negative, depending on which side of a decision boundary they fall on. The shape of the boundary is determined by the kernel function, which can be linear, quadratic, user-defined, etc.
The problem is that for document classification, words and expressions are used as features. These are smaller units than the document and can be handled easily. But our task is a bit more subtle, in the sense that we'd like to classify the expressions themselves. So the fundamental question is: are there still smaller units that can be used as features? Obviously, individual letters are not a good choice. Perhaps the words that form the expression? I'm not sure...
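To make the "words as features" idea concrete, here is a minimal sketch of what it could look like. The function name word_features and the whitespace tokenization are my own assumptions for illustration, not something taken from any of the classifiers above.

```python
from collections import Counter


def word_features(expression):
    """Map an expression to counts of its lowercased, whitespace-separated words.

    Hypothetical helper: real tokenization would probably need more care.
    """
    return Counter(expression.lower().split())


# Each distinct word in the expression becomes one feature with its count.
features = word_features("New York Stock Exchange")
```

Whether such features say anything useful about a short expression is exactly the open question; the sketch only shows that they are cheap to compute.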
Looking at the problem from another viewpoint, expressions conform to certain grammar productions. So can we use those productions as our features? I need to consider this further...
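A rough sketch of the production idea, assuming a parse tree for the expression is already available from some parser. The nested-tuple tree format (label, child, child, ...) and the function names are hypothetical choices made only for this illustration.

```python
from collections import Counter


def productions(tree):
    """Yield 'LHS -> RHS' strings for every internal node of a nested-tuple tree.

    Assumed format: (label, child, ...) where a child is either another
    tuple (a subtree) or a plain string (a word at a leaf).
    """
    label, *children = tree
    rhs = [c if isinstance(c, str) else c[0] for c in children]
    yield f"{label} -> {' '.join(rhs)}"
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def production_features(tree):
    """Count how often each production occurs in the tree."""
    return Counter(productions(tree))


# Hypothetical parse of the expression "the classifier".
tree = ("NP", ("DT", "the"), ("NN", "classifier"))
feats = production_features(tree)
```

Each distinct production then plays the role that a word plays in document classification. Dropping the lexical productions (such as "DT -> the") and keeping only the structural ones would be one variation to try.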
Another (maybe minor) problem that we encountered while working with the classifiers is that they treat each test input file as a separate document. So for our purposes, should we put one expression in each file and feed the files as input? Or are there better methods?
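One possible alternative to one file per expression: as far as I understand, SVMLight reads a single file with one example per line, in the form "<target> <index>:<value> ...", with feature indices starting at 1 and listed in increasing order. Below is a sketch that turns labelled expressions into such lines. The helper name to_svmlight_lines and the word-count features are assumptions for illustration only.

```python
def to_svmlight_lines(examples, vocab=None):
    """Turn (label, expression) pairs into SVMLight-style lines.

    label is +1 or -1. vocab maps each word to a feature index and is
    extended as new words are seen. Hypothetical helper, not part of SVMLight.
    """
    vocab = {} if vocab is None else vocab
    lines = []
    for label, expression in examples:
        counts = {}
        for word in expression.lower().split():
            index = vocab.setdefault(word, len(vocab) + 1)  # indices start at 1
            counts[index] = counts.get(index, 0) + 1
        # Feature indices are written in increasing order.
        feats = " ".join(f"{i}:{counts[i]}" for i in sorted(counts))
        lines.append(f"{label:+d} {feats}")
    return lines, vocab


lines, vocab = to_svmlight_lines(
    [(1, "stock exchange"), (-1, "exchange the stock")]
)
```

With this layout every expression is one line in a single training or test file, so no per-expression files are needed.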
Thursday, November 6, 2008