Thursday, November 13, 2008

Difference!

Well, we have already noted the difference between document classification and word classification. For document classification, words (and also expressions) serve as features. For word classification, several probability metrics are defined to act as features; these metrics hinge upon the dependencies between successive words and expressions.

Now we refine and extend our ideas into the more intricate realm of expression classification, which closely resembles our aim. We classify expressions binarily: colloquial or non-colloquial. This problem has a distinct flavor. In word classification we rely on dependency models (e.g., in part-of-speech (POS) tagging we use graphical models such as HMM, MEMM and CRF), but here the same idea does not apply, for two principal reasons. First, the dependencies here are not as strong as in POS tagging; sometimes there are no dependencies at all. Second, we have not been able to find more than two features (presence in a lexicon and non-conformance to certain grammar rules), which makes the problem ill-suited to a statistical classification mechanism. The "novelty" of an expression could have been a third feature, but we observe that many (if not most) novelties result in non-colloquial forms and terminologies rather than colloquialisms.
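To make the feature-scarcity point concrete, here is a minimal sketch of the only two features identified above: presence in a lexicon and non-conformance to a grammar rule. The lexicon contents and the double-negative rule are hypothetical placeholders for illustration, not the actual resources used in this work.

```python
# Hypothetical colloquial lexicon (placeholder entries).
COLLOQUIAL_LEXICON = {"gonna", "wanna", "ain't"}

def violates_grammar(expression):
    # Toy grammar check: flags a double negative as a
    # non-conforming construction (for illustration only).
    words = expression.lower().split()
    negatives = [w for w in words if w in {"not", "ain't", "never", "no"}]
    return len(negatives) >= 2

def extract_features(expression):
    # Only two features are available for each expression,
    # which is too few for a statistical classifier.
    words = expression.lower().split()
    return {
        "in_lexicon": any(w in COLLOQUIAL_LEXICON for w in words),
        "grammar_nonconforming": violates_grammar(expression),
    }

print(extract_features("I ain't never gonna do that"))
# → {'in_lexicon': True, 'grammar_nonconforming': True}
```

With only these two binary signals, the feature vector carries too little information for models like Naive Bayes to separate the classes reliably.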

If classification were to be used, we could have employed algorithms like Naive Bayes or Maximum Entropy. But it gradually becomes apparent that we should instead pursue a rule-based (or grammar-based) approach.
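A rule-based approach of the kind suggested above might be sketched as follows. Rather than training a statistical classifier on two weak features, hand-written rules are applied directly and any firing rule decides the label. The lexicon entries and the double-negative rule are hypothetical stand-ins, not the actual grammar used in this work.

```python
# Hypothetical colloquial lexicon (placeholder entries).
COLLOQUIAL_LEXICON = {"gonna", "wanna", "gotta", "ain't"}

def rule_lexicon(words):
    # Rule 1: the expression contains a known colloquial word.
    return any(w in COLLOQUIAL_LEXICON for w in words)

def rule_double_negative(words):
    # Rule 2: toy grammar rule flagging a double negative.
    return sum(w in {"not", "never", "no", "ain't"} for w in words) >= 2

RULES = [rule_lexicon, rule_double_negative]

def is_colloquial(expression):
    # Binary decision: colloquial if any rule fires.
    words = expression.lower().split()
    return any(rule(words) for rule in RULES)

print(is_colloquial("we gotta leave now"))  # → True
print(is_colloquial("we must leave now"))   # → False
```

The appeal of this design is transparency: each positive decision can be traced to the specific rule that fired, which a trained classifier over so few features could not offer.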
