If a grammar checker is to be used, we need to feed it documents, and it should output the non-conforming expressions. But the available grammar checkers only MARK those expressions in the file instead of outputting them as a list, so we'll need to automate the process of extracting the marked expressions and putting them into a single list. We are using two grammar checkers, viz. MS Word and Abiword. Neither is open source, so customizing their code is out of the question. So far I have been extracting the marked expressions manually, but it seems I'll need a program or some code to do this automatically.
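Once each checker's marks have been pulled out (manually for now), merging them is easy to automate. A minimal sketch, assuming the marked expressions from each checker have already been copied into one list per checker:

```python
def merge_marks(mark_lists):
    """Union of all checkers' marked expressions,
    de-duplicated case-insensitively, keeping first-seen order."""
    seen, merged = set(), []
    for marks in mark_lists:
        for expr in marks:
            key = expr.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(expr.strip())
    return merged

# Example: marks from MS Word and Abiword (illustrative data)
master = merge_marks([["I'll", "nano"], ["nano", "making off"]])
```

The extraction step itself (reading the marks out of each checker's file) still has to be solved per checker; this only handles combining the results.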
Another problem comes in the form of differences between the grammar checkers' opinions. We are using only two grammar checkers, and the problem is already showing up; it is likely to get worse if more than two are used. An example: the word "nano" is caught by Abiword but not by MS Word. In this case, we give MS Word our preference. But in some cases Abiword catches colloquial expressions which MS Word fails to catch, and there the judgment goes to Abiword.
The use of lexicons makes this last problem more involved. To see why, note that a lexicon acts as a rudimentary grammar checker, so each lexicon effectively adds its own opinion regarding an expression. We can create a bitmap for each expression, where a "1" represents a colloquialism and a "0" represents otherwise. So for each expression, we get one bit corresponding to each grammar checker or lexicon.
Example:

Expression   MS Word  Abiword  GC3  GC4  ...  GCm  L1  L2  L3  L4  ...  Ln
I'll            1        1      1    0   ...   1    1   1   1   1  ...   1
Nano            0        1      1    0   ...   1    0   0   0   1  ...   0
making off      0        1      0    0   ...   0    0   0   1   0  ...   1
Here we have shown (hypothetical) bitmaps for 3 expressions using m grammar checkers (GC) and n lexicons (L). To this bitmap we can add p further columns for the opinions of p independent human reviewers (R, say). The table will then be m + n + p + 1 columns wide, counting the Expression column. Our second problem is to handle and interpret this bitmap efficiently and intelligibly.
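The table above maps naturally onto a dictionary of bit rows. A minimal sketch, using a reduced set of columns (the reviewer columns R1..Rp would simply be appended the same way):

```python
# Column order mirrors the table above; only a few columns shown here.
columns = ["MS Word", "Abiword", "GC3", "L1"]

# One row of bits per expression, taken from the example table:
# 1 = marked as colloquialism by that column, 0 = not marked.
bitmap = {
    "I'll":       [1, 1, 1, 1],
    "Nano":       [0, 1, 1, 0],
    "making off": [0, 1, 0, 0],
}

def opinion(expr, column):
    """Return the bit a given checker/lexicon assigned to an expression."""
    return bitmap[expr][columns.index(column)]
```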
We're not yet done. The bitmap, as we can see, is sparse if we use blogs, so we can probably build it manually. [We progress column by column, running each GC, L and R separately on each expression. Whenever we encounter a mark, we put a 1; otherwise, we just skip. Later, we can write a program to fill those skips with "0"s.] For chat transcripts, however, the bitmap will be much less sparse, which makes it difficult to build manually. There we'll need to switch the convention: represent colloquialisms as "0"s and everything else as "1"s.
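The "fill the skips" program mentioned in brackets can be sketched as follows. The idea, under the assumption that manual entry records only the 1s (as the names of the columns that marked each expression), is to expand that sparse record into full rows of 1s and 0s:

```python
# Hypothetical column order for the full bitmap.
columns = ["MS Word", "Abiword", "L1", "L2"]

def densify(marked_by):
    """Expand sparse manual entries into full bit rows.

    marked_by: dict mapping each expression to the set of columns
    that marked it. Every column not in the set becomes a 0.
    """
    return {expr: [1 if col in marks else 0 for col in columns]
            for expr, marks in marked_by.items()}
```

For chat transcripts, with the inverted convention, one would instead record the columns that did NOT mark the expression and swap the 1 and 0 in the comprehension.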
Finally, we can remove those bitmaps for which all the entries are "0"s; these represent absolute formalisms. Those bitmaps for which every entry is a "1" represent absolute colloquialisms. The remaining ones pose a challenge for us, for they denote a grey area which is not very small: they are the points of contention. Even here we have a refuge, the concept of majority. If the majority says "1", it's a colloquialism; if the majority says "0", it's a formalism. Now we are almost done. The only perplexing case arises when there is an exact 50-50 split of "1"s and "0"s. These are rare, but for the sake of a complete specification we need to handle them as well. There are two ways: 1) assign weights to the columns, or 2) treat all such cases as colloquialisms. Both have pros and cons.
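The classification rule just described, with ties flagged separately so that either tie-breaking scheme can be applied afterwards, can be sketched as:

```python
def classify(bits):
    """Classify an expression from its row of bits by majority vote."""
    ones = sum(bits)
    zeros = len(bits) - ones
    if ones == len(bits):
        return "absolute colloquialism"   # every column said 1
    if zeros == len(bits):
        return "absolute formalism"       # every column said 0
    if ones > zeros:
        return "colloquialism"
    if zeros > ones:
        return "formalism"
    return "tie"                          # the perplexing 50-50 case
```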
If we assign weights, then there will be a value column: the value for an expression is the weighted sum of its bits. If the value is greater than 50 (taking the weights to sum to 100), it's a colloquialism; otherwise, it's not. This almost eliminates, or at least significantly reduces, the 50-50 cases. But there are two cons: how to assign appropriate weights, and how to resolve any 50-50 cases that remain.
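A minimal sketch of the weighted scheme, with illustrative (not calibrated) weights summing to 100 so the value column reads as a percentage:

```python
def weighted_value(bits, weights):
    """The value column: weighted sum of the expression's bits."""
    return sum(b * w for b, w in zip(bits, weights))

def classify_weighted(bits, weights):
    """value > 50 means colloquialism; an exact 50 can still occur."""
    value = weighted_value(bits, weights)
    if value > 50:
        return "colloquialism"
    if value < 50:
        return "formalism"
    return "tie"   # the remaining 50-50 cases the cons paragraph mentions

# Illustrative weights: trust MS Word most, then Abiword, etc.
weights = [40, 30, 20, 10]
```

Note that an equal 1/0 split no longer forces a tie: [1, 0, 1, 0] scores 60 under these weights and comes out a colloquialism.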
The second method, of treating all the 50-50 cases as colloquialisms, makes our task much simpler; even if we're in error, we're erring on the conservative side. The worst we can expect is that all these expressions are actually formalisms. But since their number should be small (given m + n + p is big enough), we can write these errors off as false positives. However, as is evident from the approach, it is inherently error-prone, and the error grows as the number of such expressions increases.
Monday, December 1, 2008