Welcome back after a long break. A lot has happened in the interim - some good, some shockingly bad. But anyway, I'm back at work and there seems to be some progress.
We finished our last post with a relatively little-known problem in Linguistics: Gradient Well-formedness. Although my problem still seems inherently linked to this more abstract and harder domain, I needed concrete handles on what I was doing. A grammar checker and a dictionary became those two handles.
Fortunately, I got hold of a whole bunch of blog pages to work with. This dataset, called the Splog Blog Dataset, was collected for an entirely different purpose: it served as a basis for detecting spam blogs, or "splogs". But since it contains a lot of colloquialisms, I thought it would be a good starting point for running my experiments.
To date I have used MS Word, AbiWord, TextPad, PSPad and DocPad as rudimentary grammar checkers. Although MS Word is fairly sophisticated, the others often fall short. I feel the need to design a grammar checker of my own, but have not yet done anything in that direction. Still, the results were not completely unsatisfactory, even with these rudimentary tools. I recorded precision and recall values for each tool, and the initial results on 6 documents show average recall near 0.5 and average precision near 0.2. I need to see whether combining the outputs of these tools increases recall and precision.
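To make the combination idea concrete, here is a minimal sketch of how pooling the error sets flagged by several checkers interacts with precision and recall. The per-tool outputs and the gold annotations below are made-up placeholders, not my actual experiment data; a simple union of flagged errors can only raise recall, while precision may go either way.

```python
# Sketch: pooling flagged-error sets from multiple grammar checkers
# and scoring them against hand-annotated gold errors.

def precision_recall(flagged, gold):
    """Precision/recall of a flagged-error set against a gold-error set."""
    flagged, gold = set(flagged), set(gold)
    tp = len(flagged & gold)                      # true positives
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical per-tool outputs: ids of (sentence, error-span) pairs.
msword  = {1, 2, 3, 7}
abiword = {2, 4, 7}
gold    = {1, 2, 4, 5, 6}

# Union pooling: an error counts if any tool flagged it.
combined = msword | abiword

print(precision_recall(msword, gold))    # (0.5, 0.4) -- one tool alone
print(precision_recall(combined, gold))  # (0.6, 0.6) -- recall improves here
```

With intersection pooling (an error counts only if every tool flags it) the trade-off reverses: precision tends to rise and recall to fall.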
I'll also need to see the effect of a dictionary. The only standard dictionary that can be queried programmatically over all these files seems to be WordNet, which I have downloaded and installed on my machine. I'll investigate whether WordNet is able to catch colloquial expressions, and also the combined effect of WordNet and the grammar checkers.
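The dictionary check itself is simple to sketch: tokenize, look each token up, and flag whatever the dictionary does not recognize as a candidate colloquialism. In a real run the lookup would go against WordNet (e.g. through NLTK's wordnet corpus reader); in this sketch a small hand-made lemma set stands in for it.

```python
# Sketch: flagging out-of-dictionary tokens as candidate colloquialisms.
# DICTIONARY is a toy stand-in for a real lexicon such as WordNet.

import re

DICTIONARY = {"this", "blog", "post", "is", "be", "good"}  # stand-in lemmas

def flag_unknown(text, dictionary):
    """Return tokens not found in the dictionary, in document order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in dictionary]

print(flag_unknown("This blog post is gonna be sooo good", DICTIONARY))
# ['gonna', 'sooo']
```

One caveat this sketch glosses over: a real lexicon lists lemmas, so inflected forms ("posts", "was") would need lemmatization before lookup, or they too would be flagged as unknown.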
Tuesday, March 10, 2009