http://en.wikipedia.org/wiki/Gradient_well-formedness
This problem, I think, is important for our research. It is one of the unsolved problems in linguistics, and I have found a surprising similarity between the internal structure of this problem and our quest.
The problem is basically as follows: if an expression's degree of well-formedness varies, then how can we categorize it as either well-formed or ill-formed? Now replace "well-formed" with "formal" and "ill-formed" with "colloquial", and it becomes very similar to our problem. Even more important is the fact that, until now, I have found no previous work discussing the issue of colloquialism detection whatsoever, while this problem seems to have drawn the attention of linguists for a number of years. This link mentions two papers, each proposing a new way of solving (or attempting to solve) the problem. Both hinge upon the methods and ideals of Optimality Theory (http://en.wikipedia.org/wiki/Optimality_theory) in some way or another.
The first one (http://www.sfb441.uni-tuebingen.de/~sam/papers/DGfS04.handout.pdf) discusses the Decathlon Model, which has two modules: constraint application (blind, cumulative) and output selection (competitive, probabilistic).
The second one (http://www.linguistics.ucla.edu/people/hayes/gradient.pdf) modifies Optimality Theory in a very subtle fashion to produce a working model for dealing with the problem. It deals with the ideas of RANKING, CONSTRAINTS and STRICTNESS in linguistics.
Optimality Theory has three components: a generative set GEN, a constraint set CON and a set of evaluation rules EVAL. This second approach modifies CON and EVAL in subtle ways to model the gradience. However, the problem with both these papers is that they have tested their models on phonological data, not on texts. Also, Optimality Theory was mainly developed in phonology. But the well-formedness problem is much more general, I think, so the solutions should work anyway.
We'll have to test the things from our viewpoint.
Monday, December 8, 2008
Monday, December 1, 2008
New Problems I Face
If a grammar checker is to be used, we need to feed it documents and it should output the non-conforming expressions. But the available grammar checkers only MARK those expressions in the file, instead of outputting them as a list. So we'll need to automate the process of extracting those marked expressions and putting them in a single list. We are using two grammar checkers, viz. MS Word and Abiword. Neither lets us customize its checking code, so that route is out of the question. Until now I have been extracting these marked expressions manually, but it seems that I'll need a program or some code for doing this automatically.
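As a sketch of that extraction step: suppose we save the checked document in a form where each flagged expression is wrapped in delimiters (the [[...]] convention here is purely hypothetical, standing in for whatever markup the checker actually leaves behind). Then collecting the marks into a single list is straightforward:

```python
import re

def extract_marked(text, open_mark="[[", close_mark="]]"):
    """Collect expressions wrapped in marker delimiters into a single list.

    The delimiters are a hypothetical convention: we assume the grammar
    checker's output has been exported so that each flagged expression
    appears wrapped like [[this]].
    """
    pattern = re.escape(open_mark) + r"(.*?)" + re.escape(close_mark)
    return re.findall(pattern, text)

sample = "We'll [[gonna]] meet at the [[nano]] lab."
print(extract_marked(sample))  # ['gonna', 'nano']
```

The real work, of course, is getting each checker to emit such markers in the first place; the regex part is trivial once that is solved.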
Another problem comes in the form of disagreement between the grammar checkers. We are using only two, and the problem is already showing up; it is likely to worsen if more than two are used. An example: the word "nano" is flagged by Abiword but not by MS Word. In this case, we give MS Word our preference. But in some cases Abiword catches colloquial expressions that MS Word fails to catch; there the judgment goes to Abiword.
The use of lexicons makes this problem more involved. To see why, note that lexicons act as rudimentary grammar checkers, so each lexicon effectively adds its own opinion about an expression. We can create a bitmap for each expression, where a "1" represents a colloquialism and a "0" represents otherwise. So for each expression, we get one bit per grammar checker or lexicon.
Example:

Expression   MS Word  Abiword  GC3  GC4  ...  GCm  L1  L2  L3  L4  ...  Ln
I'll         1        1        1    0    ...  1    1   1   1   1   ...  1
Nano         0        1        1    0    ...  1    0   0   0   1   ...  0
making off   0        1        0    0    ...  0    0   0   1   0   ...  1
Here we have shown hypothetical bitmaps for three expressions using m grammar checkers (GC) and n lexicons (L). To this bitmap we can add p columns for the opinions of p independent human reviewers (R, say). The bitmap will then be m + n + p + 1 columns wide. Our second problem is to handle and interpret this bitmap efficiently and intelligibly.
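In code, the bitmap above is just a mapping from each expression to its row of bits. A minimal sketch (the judge names and bit values are the hypothetical ones from the table, here with m = 2, n = 2, p = 1):

```python
# Column order: MS Word, Abiword, L1, L2, R1  (m = 2 checkers,
# n = 2 lexicons, p = 1 reviewer; values are illustrative).
judges = ["MS Word", "Abiword", "L1", "L2", "R1"]

bitmap = {
    "I'll":       [1, 1, 1, 1, 1],
    "Nano":       [0, 1, 0, 1, 0],
    "making off": [0, 1, 0, 1, 1],
}

# Counting the expression itself as a column, each row is
# m + n + p + 1 entries wide.
for expr, bits in bitmap.items():
    assert len(bits) == len(judges)
    print(expr, bits)
```

Keeping the rows as plain lists makes the later steps (filling skips, counting votes, weighting columns) one-liners each.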
We're not yet done. The bitmap, as we can see, is sparse if we use blogs, so we can probably build it manually. [We progress column by column, running each GC, L and R separately over each expression. Whenever we encounter a mark, we put a "1"; otherwise we just skip. Later, we can write a program to fill those skips with "0"s.] For chat transcripts, however, the bitmap will be much less sparse, and hence difficult to build manually. Then we'll need to switch the convention: represent colloquialisms as "0"s and everything else as "1"s.
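The skip-filling program mentioned in the brackets is tiny. A sketch, assuming skipped cells are recorded as None while the bitmap is built by hand:

```python
def fill_skips(row, default=0):
    """Replace skipped (None) entries in a bitmap row with a default bit.

    default=0 suits sparse blog data, where marks ("1"s) are rare and
    everything skipped is a formalism. For dense chat data, where we
    switch convention and mark formalisms instead, call with default=1.
    """
    return [default if bit is None else bit for bit in row]

print(fill_skips([1, None, None, 1, None]))    # [1, 0, 0, 1, 0]
print(fill_skips([None, 0, None], default=1))  # [1, 0, 1]
```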
Finally, we can remove altogether those bitmaps for which all the entries are "0"s: these represent absolute formalisms. Those bitmaps for which every entry is a "1" represent absolute colloquialisms. The remaining ones pose a challenge, for they denote a grey area which is not very small: they are the points of contention. Even here we have a refuge, the concept of majority. If the majority says "1", it's a colloquialism; if the majority says "0", it's a formalism. Now we are almost done. The only perplexing case comes when there is an exact 50-50 split of "1"s and "0"s. These are rare, but for the sake of complete specification we need to handle them as well. There are two ways: 1) assign weights to the columns, or 2) treat all such cases as colloquialisms. Both have pros and cons.
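The whole decision procedure for one row, including both tie-breaking policies, can be sketched as follows (the function name and the tie_policy parameter are my own, for illustration):

```python
def classify(bits, tie_policy="colloquial"):
    """Majority vote over one expression's bitmap row.

    An all-0 row is an absolute formalism and an all-1 row an absolute
    colloquialism; those cases fall out of the majority rule for free.
    An exact 50-50 split is resolved by tie_policy: "colloquial" errs
    on the conservative side, "formal" does the opposite.
    """
    ones = sum(bits)
    zeros = len(bits) - ones
    if ones > zeros:
        return "colloquialism"
    if zeros > ones:
        return "formalism"
    return "colloquialism" if tie_policy == "colloquial" else "formalism"

print(classify([1, 1, 1, 0]))  # colloquialism
print(classify([0, 0, 0, 1]))  # formalism
print(classify([1, 0]))        # colloquialism (50-50, conservative tie-break)
```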
If we assign weights, then there will be a value column: the percentage of the total weight carried by the "1" columns. If the value is greater than 50, the expression is a colloquialism; otherwise it is not. This almost eliminates, or at least significantly reduces, the 50-50 cases. But there are two cons: how to assign appropriate weights, and how to resolve any 50-50 cases that still remain.
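The value column is then a weighted vote. A sketch, where the weights are entirely illustrative (choosing them well is exactly the open problem noted above):

```python
def weighted_value(bits, weights):
    """Percentage of total column weight carried by the "1" entries."""
    total = sum(weights)
    score = sum(w for bit, w in zip(bits, weights) if bit == 1)
    return 100.0 * score / total

# Give MS Word twice the weight of Abiword -- an arbitrary choice,
# reflecting the preference we showed it earlier.
value = weighted_value([0, 1], weights=[2, 1])
print(value > 50)  # False: the weighted vote resolves this former 50-50 case
```

Note that a 50-50 case can still survive weighting (e.g. if the weighted score lands exactly on 50), which is the second con mentioned above.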
The second method, treating all the 50-50 cases as colloquialisms, makes our task much simpler; even if we are in error, we err on the conservative side. The worst we can expect is that all these expressions are actually formalisms. But since their number should be small (given that m + n + p is big enough), we can write these errors off as false positives. However, as is evident, the approach is inherently error-prone, and the error grows as the number of such expressions increases.