Monday, September 15, 2008

Some further considerations

First let us take the trainable classifier approach. We asked some questions in our previous post:

What will be the training approach?
Which text features will we have to consider in this classification?
How do we shrink the initial feature space?

In this context, we looked at the Wikipedia article on supervised learning (http://en.wikipedia.org/wiki/Supervised_learning). It identifies five steps:

a) Determine the type of training examples.
b) Gather a suitable training set.
c) Feature selection (and redundant feature elimination).
d) Learning function and learning algorithm identification.
e) Actual learning, optimization (with parameter adjustment) and cross-validation.
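The five steps above can be sketched end to end in code. This is only a toy illustration, assuming a Python environment with scikit-learn available (not a tool mentioned above); the documents and labels are hypothetical placeholders.

```python
# A minimal sketch of steps (a)-(e), assuming scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# (a), (b) Training examples: documents labeled formal (0) or informal (1).
docs = ["We hereby request your attendance.", "gonna hang out l8r, u in?",
        "Please find the attached report.", "lol that movie was awesome"]
labels = [0, 1, 0, 1]

# (c) Feature selection: tf-idf weights over word unigrams.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# (d) Learning function and algorithm: a linear SVM.
clf = LinearSVC()

# (e) Actual learning with cross-validation (2 folds for this tiny set).
scores = cross_val_score(clf, X, labels, cv=2)
print(len(scores))  # one accuracy score per fold
```

With a real corpus, step (e) would also include parameter adjustment (e.g., the SVM's regularization constant) inside the cross-validation loop.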

There are related topics such as empirical risk minimization, active learning, etc. Well-known classifiers include artificial neural networks (ANNs, including multilayer perceptrons), decision trees, support vector machines (SVMs), k-nearest neighbor (kNN), Gaussian, Naive Bayes and RBF classifiers.

Another important issue to consider is semi-supervised learning and its applicability to our current problem (http://en.wikipedia.org/wiki/Semi-supervised_learning). A useful approach might be to use transduction (http://en.wikipedia.org/wiki/Transductive_learning), because transduction comes in particularly handy when dealing with binary classification problems like ours (formal vs. informal).
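To make the semi-supervised idea concrete, here is a toy self-training loop, one simple semi-supervised scheme (not necessarily the transductive method we would finally use). It assumes scikit-learn; all data is hypothetical.

```python
# Toy self-training: train on labeled data, adopt confident predictions
# on unlabeled data, retrain. Assumes scikit-learn; data is made up.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["please find attached", "kindly review the document",
           "gonna grab some grub", "lol c u l8r"]
y = np.array([0, 0, 1, 1])          # 0 = formal, 1 = informal
unlabeled = ["yo whats up dude", "we acknowledge receipt of your letter"]

vec = CountVectorizer().fit(labeled + unlabeled)
Xl, Xu = vec.transform(labeled).toarray(), vec.transform(unlabeled).toarray()

clf = MultinomialNB().fit(Xl, y)
probs = clf.predict_proba(Xu)

# Adopt unlabeled examples the model is confident about, then retrain.
confident = probs.max(axis=1) > 0.6
if confident.any():
    Xl = np.vstack([Xl, Xu[confident]])
    y = np.concatenate([y, probs.argmax(axis=1)[confident]])
    clf = MultinomialNB().fit(Xl, y)
```

The confidence threshold (0.6 here) is an arbitrary choice for the sketch; in practice it would have to be tuned.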

As an initial step, we have accessed a large number of blogs; these will serve as the training data set. Regarding features, a useful first approximation is to use colloquial expressions and words as the features, with the tf-idf weight of each term as the feature value. This product normally reflects the true importance of a term in a document relative to the corpus. This step also yields an occurrence matrix, which helps in eliminating redundant features. The redundancy of a feature might be assessed in terms of its combined tf-idf across all documents (need to clarify this).
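The tf-idf weighting and the resulting occurrence matrix can be computed from scratch in a few lines. The documents and the two colloquial terms below are hypothetical; a real run would use the blog corpus and the colloquialism dictionary.

```python
# A from-scratch sketch of tf-idf features over hypothetical colloquial terms.
import math

docs = [
    "gonna gonna wanna see it",
    "we wanna thank you kindly",
    "the committee will convene",
]
terms = ["gonna", "wanna"]          # hypothetical colloquialism features

def tf(term, doc):
    """Term frequency: occurrences of term / total words in doc."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    """Inverse document frequency: log(N / document frequency)."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df) if df else 0.0

# Occurrence matrix: one row per document, one tf-idf value per term.
matrix = [[tf(t, d) * idf(t, docs) for t in terms] for d in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

Note how the fully formal third document produces an all-zero row, and how "wanna", appearing in two of the three documents, gets a lower idf than "gonna".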

Our next two tasks involve selecting an appropriate learning model and actually training it on the data. Several software packages are available to help with this, e.g., SVM-Light, Mallet, ISDA and SemiL. While SVM-Light essentially implements an SVM in C, Mallet goes even further with its richer environment of document classification, sequence tagging, topic modeling and numerical optimization tools. ISDA is a GUI-based SVM tool, and SemiL implements graph-based semi-supervised learning techniques for large-scale problems. MATLAB also features a fairly rich SVM toolbox, available from several sources. There are other implementations like SVMTool, SSVM (and other tools by the University of Wisconsin-Madison), and further tools like DTREG, SVM-Fold, etc.
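Whichever SVM package we pick, we will need to feed it our tf-idf vectors. SVM-Light, for instance, reads a plain-text format with one example per line: a label followed by ascending, 1-indexed feature:value pairs. A small writer for that format (with hypothetical vectors) might look like:

```python
# Sketch: emit training examples in SVM-Light's input format,
# "<label> <feature>:<value> ...", features 1-indexed, ascending,
# zero-valued features omitted. Vectors are hypothetical tf-idf rows.
def to_svmlight(label, vector):
    feats = " ".join(f"{i}:{v:.4f}"
                     for i, v in enumerate(vector, start=1) if v)
    return f"{label} {feats}"

rows = [(1, [0.44, 0.08, 0.0]),     # informal example -> positive class
        (-1, [0.0, 0.0, 0.31])]     # formal example -> negative class
for label, vec in rows:
    print(to_svmlight(label, vec))
```

Sparse output like this matters in practice, since a colloquialism feature space over a large blog corpus will be mostly zeros for any single document.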

Regarding training approaches, Microsoft Research has described a novel one, Sequential Minimal Optimization (SMO). We'll have to see whether it's applicable to our case. I'm trying to build as much background as possible before starting the task, since I don't have much experience in this area.

During feature selection, we'll need to consult a dictionary of colloquialisms. It's a good idea to start with a standard dictionary; many such dictionaries are available on the Web, e.g., http://www.peevish.co.uk/slang/, http://www.urbandictionary.com/, http://www.geocities.com/informalenglish/dictindex.html, etc. We, however, will probably need a set of files or a database as our dictionary, for the simple reason that these can be accessed programmatically much more easily. While extracting (and later on, eliminating) colloquial expressions, we'll also need to consider grammar rules (to eliminate formal jargon and to generate newer, previously unseen colloquialisms), word-sense disambiguation (e.g., what does the word "cabbage" mean in a particular context?) and some other important things (such as the change of meaning when two words combine to form a new expression). In some cases (we'll have to see which ones), we may have to take recourse to rule-based systems as well. However, it seems to me that semantic analysis is going to have a greater impact on our study than anything else. An important pointer in this direction is here.
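Once the slang dictionary is available as a local file or database, the basic lookup is simple. The sketch below uses a hypothetical in-memory set of entries, and its last example shows exactly why word-sense disambiguation matters:

```python
# Toy colloquialism lookup against a local slang dictionary.
# The 'slang' set is a hypothetical stand-in for a file/database dump.
import re

slang = {"gonna", "wanna", "lol", "cabbage"}   # e.g., loaded from a file

def colloquial_hits(text):
    """Return the tokens of text that appear in the slang dictionary."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t in slang]

print(colloquial_hits("LOL, I'm gonna buy some cabbage."))
# "cabbage" matches here even though it is used in its literal
# (vegetable) sense -- word-sense disambiguation would be needed
# to filter out such false positives.
```

Multi-word expressions (the "two words combine to form a new expression" case) would need phrase-level matching on top of this token-level lookup.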

There are other learning tools besides SVMs, like ANNs, Naive Bayes, kNN, decision trees, etc. We don't yet know which one will have the greater impact in this case. Some tools in this direction are http://www.mathtools.net/MATLAB/Neural_Networks/index.html, http://www.cs.cofc.edu/~manaris/ai-education-repository/neural-n-tools.html, http://www.makhfi.com/tools.htm, http://www.bestfreewaredownload.com/s-gejpuiti-multilayer-perceptron-c-33-math-scientific-tools-freeware.html, http://www.consulttoday.com/KnnRabbit.aspx, http://mac.softpedia.com/get/Internet-Utilities/POPFile.shtml, etc.
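Of these alternatives, kNN is perhaps the simplest to prototype, which makes it a useful baseline before committing to an SVM. A tiny from-scratch sketch, with hypothetical tf-idf rows tagged formal (0) / informal (1):

```python
# A tiny k-nearest-neighbor classifier as a baseline alternative to SVMs.
# Training vectors and labels are hypothetical two-feature tf-idf rows.
import math

train = [([0.9, 0.1], 1), ([0.8, 0.0], 1),   # informal examples
         ([0.0, 0.7], 0), ([0.1, 0.9], 0)]   # formal examples

def knn(query, k=3):
    """Majority vote among the k training points nearest to query."""
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))
    votes = [label for _, label in nearest[:k]]
    return max(set(votes), key=votes.count)

print(knn([0.85, 0.05]))  # query lies near the informal examples
```

The appeal is that kNN needs no training phase at all, so we could compare it against an SVM on the same feature matrix almost for free.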
