First, let us take up the trainable classifier approach. We asked some questions in our previous post:
What will be the training approach?
What text features will we have to consider in this classification?
How do we shrink the initial feature space?
In this context, we looked at the Wikipedia article on supervised learning (http://en.wikipedia.org/wiki/Supervised_learning), which identifies five steps:
a) Determine the type of training examples.
b) Gather a suitable training set.
c) Feature selection (and redundant feature elimination).
d) Learning function and learning algorithm identification.
e) Actual learning, optimization (with parameter adjustment) and cross-validation.
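The five steps above can be sketched end to end with a toy example. This is only an illustration: the feature vectors, labels, and the use of a 1-nearest-neighbor learner are all invented here for concreteness, not taken from our actual task.

```python
# (a) + (b) Training examples: (feature vector, label) pairs, where the
# label marks a text as "colloquial" or "formal". Data is hand-made.
training_set = [
    ((3.0, 0.0), "colloquial"),   # many slang terms, no formal markers
    ((2.5, 0.5), "colloquial"),
    ((0.0, 3.0), "formal"),       # no slang, many formal markers
    ((0.5, 2.0), "formal"),
]

# (c) Feature selection: here the two features are fixed by hand
# (slang-term weight, formal-marker weight).

# (d) Learning function: 1-nearest neighbor under Euclidean distance.
def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def classify(x, examples):
    return min(examples, key=lambda ex: distance(x, ex[0]))[1]

# (e) Evaluation by leave-one-out cross-validation: classify each
# example using all the others as the training set.
def loo_accuracy(examples):
    hits = 0
    for i, (x, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        if classify(x, rest) == label:
            hits += 1
    return hits / len(examples)

print(classify((2.8, 0.2), training_set))  # → colloquial
print(loo_accuracy(training_set))
```

Any of the classifiers discussed below (SVMs, Naive Bayes, etc.) can be slotted into step (d) without changing the surrounding pipeline.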
There are other issues to consider, like empirical risk minimization, active learning, etc. Commonly used classifiers include artificial neural networks (ANNs) such as multilayer perceptrons, decision trees, SVMs, k-nearest neighbors (kNN), Gaussian classifiers, Naive Bayes, and RBF networks.
Another important issue to consider is semi-supervised learning and its applicability to our current problem (http://en.wikipedia.org/wiki/Semi-supervised_learning). A useful approach might be transduction (http://en.wikipedia.org/wiki/Transductive_learning), which comes in particularly handy for binary classifications like ours (formal vs. informal).
As an initial step, we have accessed a large number of blogs; these will serve as the training data set. Regarding features, a useful initial approximation is to use colloquial expressions and words as the features, with the tf-idf products of these terms as the feature values. The tf-idf product is a standard measure of a term's importance in a document relative to the whole corpus. This step also yields an occurrence matrix, which helps in eliminating redundant features. The redundancy of a feature might be assessed in terms of its combined tf-idf across all documents (we still need to work this out).
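A minimal sketch of this weighting scheme, computed from scratch: the documents and the candidate "colloquial term" features below are invented for illustration, and the combined-tf-idf redundancy score is exactly the tentative idea mentioned above, not a settled method.

```python
import math

# Toy corpus and hand-picked colloquial-term features.
docs = [
    "gonna grab some grub with the gang",
    "we will prepare the quarterly report",
    "wanna hang out and grab a bite",
]
features = ["gonna", "wanna", "grab", "grub"]

def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    """Inverse document frequency: log(N / document frequency)."""
    df = sum(1 for d in corpus if term in d.split())
    return math.log(len(corpus) / df) if df else 0.0

# Occurrence (document-by-feature) matrix of tf-idf weights.
matrix = [[tf(t, d) * idf(t, docs) for t in features] for d in docs]

# Combined tf-idf of each feature across all documents, as a crude
# redundancy score: a low total suggests a candidate for elimination.
totals = {t: sum(row[i] for row in matrix) for i, t in enumerate(features)}
print(totals)
```

Note how "grab", which occurs in two of the three documents, gets a lower idf (and hence a lower total weight) than terms like "gonna" that are concentrated in a single document.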
Our next two tasks are selecting an appropriate learning model and actually training on the data. Several software packages can help with this, e.g., SVM-Light, Mallet, ISDA and SemiL. While SVM-Light essentially implements an SVM in C, Mallet goes further, offering a richer environment of document classification, sequence tagging, topic modeling and numerical optimization tools. ISDA is a GUI-based SVM tool, and SemiL implements graph-based semi-supervised learning techniques for large-scale problems. MATLAB also features a fairly rich SVM toolbox, available from several sources. There are other implementations as well, like SVMTool, SSVM (and related tools from the University of Wisconsin-Madison), DTREG, SVM-Fold, etc.
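All of these packages expose essentially the same train-then-predict workflow: fit a linear separator on labeled feature vectors, then label new ones. The sketch below uses a simple perceptron as a stand-in (not a true max-margin SVM like SVM-Light trains), with invented data, purely to show the shape of that workflow.

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """data: list of (feature_vector, label) with label in {+1, -1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:            # misclassified: nudge boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 (colloquial) or -1 (formal) for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# +1 = colloquial, -1 = formal; features as in the tf-idf discussion
# (slang weight, formal-marker weight). Data is hand-made.
data = [((1.0, 0.1), 1), ((0.9, 0.2), 1),
        ((0.1, 1.0), -1), ((0.2, 0.9), -1)]
w, b = train_perceptron(data)
print(predict(w, b, (0.8, 0.1)))   # → 1 (colloquial)
```

A real SVM would additionally maximize the margin of the separator; the perceptron here just finds some separating hyperplane, which is enough to illustrate the interface the tools above provide.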
Regarding training approaches, Microsoft Research has described a novel one, Sequential Minimal Optimization (SMO). We'll have to see whether it's applicable to our case. I'm trying to build as solid a foundation as possible before starting the task, since I don't have much background in this area.
During feature selection, we'll need to consult a dictionary of colloquialisms. It's a good idea to start with a standard dictionary; many are available on the Web, e.g., http://www.peevish.co.uk/slang/, http://www.urbandictionary.com/, http://www.geocities.com/informalenglish/dictindex.html, etc. We, however, will probably need a set of files or a database as our dictionary, for a simple reason: these can be accessed much more easily. While extracting (and later eliminating) colloquial expressions, we'll also need to consider grammar rules (to eliminate formal jargon and to generate new, previously unseen colloquialisms), word-sense disambiguation (e.g., what does the word "cabbage" mean in a particular context?) and some other important issues (such as the change of meaning when two words combine to form a new expression). In some cases (we'll have to see which ones), we may have to take recourse to rule-based systems as well. However, it seems to me that semantic analysis is going to have a greater impact on our study than anything else. An important pointer in this direction is here.
There are other learning tools besides SVMs, like ANNs, Naive Bayes, kNN, decision trees, etc. We don't yet know which will have the greatest impact in this case. Some tools in this direction are http://www.mathtools.net/MATLAB/Neural_Networks/index.html, http://www.cs.cofc.edu/~manaris/ai-education-repository/neural-n-tools.html, http://www.makhfi.com/tools.htm, http://www.bestfreewaredownload.com/s-gejpuiti-multilayer-perceptron-c-33-math-scientific-tools-freeware.html, http://www.consulttoday.com/KnnRabbit.aspx, http://mac.softpedia.com/get/Internet-Utilities/POPFile.shtml, etc.
Monday, September 15, 2008
Tuesday, September 9, 2008
How to solve the problem
Let us consider the first portion of our task: to identify colloquialisms in text and to translate them into formal expressions. For the time being, we'll concentrate on the first half: identification of colloquialisms in text.
Now there are several methods that can be implemented to resolve this issue. Some of these have been discussed in earlier posts and comments, like grammar checkers, dictionary search (vocabulary search), etc. We also concluded that a naive dictionary search would be a brute-force approach, and discarded it in favor of one in which dictionary search is aided by grammar checking.
It seems that better methods exist to accomplish our goal. For example, for a particular setting and type of input text, we can build a trainable pattern classifier. All it has to do is classify expressions found in a text into two disjoint groups: colloquial and formal. However, we need to answer questions like these: What will be the training approach? What text features will we have to consider in this classification? How do we shrink the initial feature space?
Raw grammar checking may also be improved with the help of a rule-based system. In a rule-based system, there is a set of facts and a set of rules; rules act on facts to generate more facts. In this way, a rule-based system works much like a grammar checker, with one key difference: rule-based systems are basically synthetic in nature, i.e., they synthesize (generate) new expressions from facts and rules, and these expressions are then compared to the input text. Grammar checkers, on the other hand, are analytic systems: they analyze the input text first to see whether it fits any of the available grammar rules (productions).
In our case, if we go for the trainable classifier approach, we first have to train the system on a set of representative, pre-annotated texts containing colloquialisms. After that, we can test the system on similar, non-annotated texts containing colloquialisms. If we go for the grammar checker approach, we'll first have to define the appropriate productions, as mentioned in the previous posts. Finally, if we turn to a rule-based system, we need to fashion the facts and rules such that facts correspond to basic colloquialisms and rules help us derive more involved ones. Here, rules act incrementally (or inductively), meaning that basic facts lead to next-level facts, which in turn lead to still higher-level facts, and so on.
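The incremental, synthetic behavior described above can be sketched as simple forward chaining: rules act on known facts (base colloquialisms) to generate new facts (derived expressions), which are then matched against the input text. The facts and rules below are invented for illustration; real rules would encode grammatical patterns rather than string gluing.

```python
# Base facts: basic colloquialisms we already know.
facts = {"gonna", "wanna", "kinda"}

# Each rule maps an existing fact to a newly synthesized expression
# (or None if it does not apply). These toy rules just build longer
# expressions from shorter ones.
rules = [
    lambda f: f + " be" if f == "gonna" else None,        # "gonna" -> "gonna be"
    lambda f: "I'm " + f if not f.startswith("I'm") else None,
]

def forward_chain(facts, rules, rounds=2):
    """Apply every rule to every known fact until nothing new appears."""
    known = set(facts)
    for _ in range(rounds):
        new = {r(f) for f in known for r in rules} - {None} - known
        if not new:
            break
        known |= new
    return known

derived = forward_chain(facts, rules)
text = "I'm gonna be late"
# Match the synthesized expressions against the input, as described above.
matches = {f for f in derived if f in text}
print(sorted(matches))
```

Note the inductive chaining at work: "gonna" (a base fact) yields "gonna be" in the first round, which in turn yields "I'm gonna be" in the second.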