Our task has two parts: identifying colloquialisms in text, and translating them into formal expressions. For the time being, we'll concentrate on the first part: identification.
There are several methods that could address this problem. Some of them, such as grammar checkers and dictionary (vocabulary) search, have been discussed in earlier posts and comments. We also concluded that naive dictionary search is a brute-force approach, and discarded it in favor of one in which dictionary search is aided by grammar checking.
It seems that there are better methods to accomplish our goal. For example, for a particular setting and type of input text, we can build a trainable pattern classifier. All it has to do is classify the expressions found in a text into two disjoint groups: colloquial and formal. However, we need to answer questions such as: how will the classifier be trained, which text features should it consider, and how can the initial feature space be reduced?
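To make the classifier idea concrete, here is a minimal sketch of one possible design: a Naive Bayes classifier with add-one smoothing, written from scratch. Everything in it is an assumption made for illustration: the choice of Naive Bayes, the `features` function (word tokens plus an apostrophe flag), and the tiny hand-made `TRAINING` list. A real system would need a much larger annotated corpus and a more careful feature set.

```python
import math
from collections import Counter, defaultdict


def features(expression):
    """Hypothetical feature set: lowercase word tokens plus an apostrophe flag."""
    tokens = expression.lower().split()
    feats = list(tokens)
    if any("'" in token for token in tokens):
        feats.append("<has_apostrophe>")
    return feats


class NaiveBayes:
    """Two-class (or multi-class) Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (expression, label) pairs
        for expression, label in examples:
            self.label_counts[label] += 1
            for feat in features(expression):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)

    def classify(self, expression):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in features(expression):
                score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy annotated data, invented for this example only.
TRAINING = [
    ("gonna", "colloquial"),
    ("wanna go", "colloquial"),
    ("ain't", "colloquial"),
    ("kinda cool", "colloquial"),
    ("going to", "formal"),
    ("want to go", "formal"),
    ("is not", "formal"),
    ("rather impressive", "formal"),
]

clf = NaiveBayes()
clf.train(TRAINING)
print(clf.classify("gonna go"))
print(clf.classify("is not going"))
```

The sketch leaves the feature-space question open: here every distinct token is a feature, which is exactly the kind of large initial space that would have to be shrunk for realistic input.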
Raw grammar checking may also be improved with the help of a rule-based system. In a rule-based system, there is a set of facts and a set of rules. Rules act on facts to generate more facts. In this way, a rule-based system works much like a grammar checker, with one difference. Rule-based systems are synthetic in nature, i.e., they synthesize (generate) new expressions from facts and rules, and these expressions are then compared to the input text. Grammar checkers, on the other hand, are analytic systems: they analyze the input text first to see whether it fits any of the available grammar rules (productions).
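The synthetic behaviour described above can be sketched as a small forward-chaining loop. This is only an illustration under stated assumptions: the fact categories (`negation`, `auxiliary`, `pronoun`), the two rules, and the function names are all hypothetical, and the comparison with the input text is a plain substring match rather than proper tokenized matching.

```python
# Basic facts: (category, expression) pairs for simple colloquialisms.
BASE_FACTS = {
    ("negation", "ain't"),
    ("auxiliary", "gonna"),
    ("auxiliary", "wanna"),
    ("pronoun", "y'all"),
}


def rule_negated_auxiliary(facts):
    """negation + auxiliary -> phrase, e.g. "ain't gonna"."""
    return {
        ("phrase", f"{neg} {aux}")
        for kind_n, neg in facts if kind_n == "negation"
        for kind_a, aux in facts if kind_a == "auxiliary"
    }


def rule_pronoun_phrase(facts):
    """pronoun + phrase -> clause, e.g. "y'all ain't gonna"."""
    return {
        ("clause", f"{pron} {phrase}")
        for kind_p, pron in facts if kind_p == "pronoun"
        for kind_f, phrase in facts if kind_f == "phrase"
    }


def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new facts are produced."""
    known = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        new = derived - known
        if not new:
            return known
        known |= new


def find_colloquialisms(text, facts):
    """Compare the generated expressions with the input text, longest first."""
    lowered = text.lower()
    expressions = sorted({expr for _, expr in facts}, key=len, reverse=True)
    return [expr for expr in expressions if expr in lowered]


closure = forward_chain(BASE_FACTS, [rule_negated_auxiliary, rule_pronoun_phrase])
found = find_colloquialisms("Y'all ain't gonna believe this.", closure)
print(found)
```

Each pass of the loop derives one further level: the first pass builds phrases from the basic facts, the second builds clauses from those phrases, and the third finds nothing new and stops.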
In our case, if we go for the trainable classifier approach, we first have to train the system using a set of representative, pre-annotated texts containing colloquialisms. After that, we can test the system using similar non-annotated texts containing colloquialisms. If we go for the grammar checker approach, we'll first have to define the appropriate productions, as mentioned in the previous posts. Finally, if we turn to a rule-based system, we need to fashion the facts and the rules such that facts correspond to basic colloquialisms and rules help us derive more involved ones. Here, rules act incrementally (or inductively), meaning that basic facts lead to next-level facts, next-level facts lead to facts at the level above, and so on.
Tuesday, September 9, 2008