Monday, December 8, 2008

Problem of Gradient Well-formedness

http://en.wikipedia.org/wiki/Gradient_well-formedness

This problem, I think, is important for our research. It's one of the unsolved problems in linguistics, and I found a surprising similarity between the internal structure of this problem and that of our quest.

The problem is basically as follows: if an expression's degree of well-formedness varies, then how can we categorize it as either well-formed or ill-formed? Now replace "well-formed" with "formal" and "ill-formed" with "colloquial", and it becomes very similar to our problem. Even more important is the fact that so far I've found no previous work discussing the issue of colloquialism detection whatsoever, whereas this problem seems to have drawn the attention of linguists for a number of years. The link mentions two papers, each proposing a new way of solving (or attempting to solve) the problem. Both hinge upon the methods and ideals of Optimality Theory (http://en.wikipedia.org/wiki/Optimality_theory) in some way or other.

The first one (http://www.sfb441.uni-tuebingen.de/~sam/papers/DGfS04.handout.pdf) discusses the Decathlon Model, which has two modules: constraint application (blind, cumulative) and output selection (competitive, probabilistic).

The second one (http://www.linguistics.ucla.edu/people/hayes/gradient.pdf) actually modifies Optimality Theory in a very subtle fashion to produce a working model for the problem. It deals with the linguistic notions of RANKING, CONSTRAINTS and STRICTNESS.

Optimality Theory has three components: a generator GEN, a constraint set CON and an evaluator EVAL. This second approach modifies CON and EVAL in subtle ways to model the gradience. However, the catch with both papers is that they have tested their models on phonological data, not on texts. Also, Optimality Theory was mainly developed in phonology. But the well-formedness problem is much more general, I think, so the solutions should carry over anyway.
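For concreteness, here is a toy sketch of the classic (non-gradient) OT evaluation that the second paper modifies. The input, candidates and constraints below are all invented for illustration; real CON sets are far richer.

```python
# Toy sketch of classic Optimality Theory evaluation.

def eval_ot(candidates, ranked_constraints):
    """EVAL: pick the candidate whose violation profile is lexicographically
    best under the constraint ranking (highest-ranked constraint first)."""
    return min(candidates,
               key=lambda c: tuple(con(c) for con in ranked_constraints))

# GEN: hypothetical candidate outputs for some input form "blik".
candidates = ["blik", "bli", "blika"]

# CON: ranked constraints, each returning a count of violations.
no_coda = lambda c: 0 if c[-1] in "aeiou" else 1   # prefer vowel-final forms
faithful = lambda c: abs(len(c) - len("blik"))     # prefer staying close to the input

print(eval_ot(candidates, [no_coda, faithful]))  # -> bli
```

Roughly speaking, the gradient variants replace this strict lexicographic comparison with graded, weighted constraint evaluation, so that candidates receive degrees of well-formedness rather than a win/lose verdict.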

We'll have to test the things from our viewpoint.

Monday, December 1, 2008

New Problems I Face

If a grammar checker is to be used, we need to feed it documents, and it should output the non-conforming expressions. But the available grammar checkers only MARK those expressions in the file instead of outputting them as a list. So we'll need to automate the process of extracting the marked expressions and putting them in a single list. We are using two grammar checkers, viz. MS Word and AbiWord. Customizing their code is out of the question for us (MS Word is closed-source, and modifying AbiWord's checker internals is beyond our present scope). So far I've been extracting the marked expressions manually, but it seems that I'll need a program or some code for doing this automatically.
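As a stopgap before proper automation, one could export the marked-up document into any textual format where flagged spans are delimited, and collect them with a small script. A minimal sketch, assuming a hypothetical `<flag>...</flag>` markup (NOT the actual MS Word or AbiWord format, which would need its own parsing):

```python
import re

# Collect all checker-flagged expressions into a single list. The
# "<flag>...</flag>" delimiters are a hypothetical placeholder for
# whatever markup the exported document actually uses.
def extract_flagged(text):
    return re.findall(r"<flag>(.*?)</flag>", text, flags=re.DOTALL)

doc = "We <flag>gonna</flag> finish this <flag>ASAP</flag>, I promise."
print(extract_flagged(doc))  # -> ['gonna', 'ASAP']
```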

Another problem comes in the form of disagreement between the grammar checkers. We are using only two, and the problem is already showing up; it's likely to get worse if more are used. An example: the word "nano" is caught by AbiWord but not by MS Word. In this case we give MS Word our preference. But in other cases AbiWord catches colloquial expressions which MS Word fails to catch, and there the judgment goes to AbiWord.

The use of lexicons makes the last problem more involved. To see why, note that lexicons act as rudimentary grammar checkers, so each lexicon effectively adds its own opinion about an expression. We can create a bitmap for each expression, where a "1" represents colloquialism and a "0" represents otherwise. So for each expression we get one bit per grammar checker or lexicon.

Example:

Expression    MSWord  Abiword  GC3  GC4  ...  GCm   L1  L2  L3  L4  ...  Ln
I'll             1       1      1    0   ...   1     1   1   1   1  ...   1
Nano             0       1      1    0   ...   1     0   0   0   1  ...   0
making off       0       1      0    0   ...   0     0   0   1   0  ...   1


Here we have shown (hypothetical) bitmaps for 3 expressions using m grammar checkers (GC) and n lexicons (L). To this bitmap we can add p columns for the opinions of p independent human reviewers (R, say). The bitmap will then be m + n + p + 1 columns wide, counting the expression column itself. Our second problem is to handle and interpret this bitmap efficiently and intelligibly.


We're not yet done. The bitmap, as we can see, is sparse if we use blogs, so we can probably build it manually. [We progress column by column, running each GC, L and R separately on each expression. Whenever we encounter a mark we put a 1; otherwise we just skip. Later, a program can fill those skipped cells with "0"s.] For chat transcripts, however, it'll be much less sparse and thus difficult to build manually. Then we'll need to switch convention: represent colloquialisms as "0"s and everything else as "1"s.
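The column-by-column procedure just described - record only the marks, then fill the skipped cells with "0"s - can be sketched as follows, using the hypothetical judges and marks from the table above:

```python
# Judges: grammar checkers (GC), lexicons (L) and, later, human reviewers (R).
columns = ["MSWord", "Abiword", "L1", "R1"]

# Sparse pass: for each expression, record only the judges that marked it.
marks = {
    "I'll":       {"MSWord", "Abiword", "L1", "R1"},
    "nano":       {"Abiword"},
    "making off": {"Abiword", "R1"},
}

# Dense pass: fill every skipped cell with 0 to obtain the full bitmap rows.
bitmap = {expr: [1 if col in judges else 0 for col in columns]
          for expr, judges in marks.items()}

print(bitmap["nano"])  # -> [0, 1, 0, 0]
```

For chat transcripts, under the flipped convention, one would record the (rarer) formalisms instead and fill the skips with "1"s.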


Finally, we can remove those bitmaps altogether for which all the entries are "0"s; these represent absolute formalisms. Those bitmaps for which every entry is a "1" represent absolute colloquialisms. The remaining ones pose a challenge for us, for they denote a grey area, which is not very small - they are the points of contention. Even here we have a refuge - the concept of majority. If the majority says it's a "1", it's a colloquialism; if the majority says it's a "0", it's a formalism. Now we are almost done. The only perplexing case arises when there is an exact 50-50 distribution of "1"s and "0"s. These are rare, but for the sake of complete specification we need to handle them as well. There are two ways: 1> assign weights to the columns, 2> treat all these cases as colloquialisms. There are pros and cons to both.

If we assign weights, there will be a value column: if an expression's weighted score exceeds half the total weight, it's a colloquialism; otherwise it's not. This almost eliminates, or at least significantly reduces, the 50-50 cases. But there are two cons: how to assign appropriate weights, and how to resolve any 50-50 cases that remain.

The second method, considering all the 50-50s as colloquialisms, makes our task much simpler; even if we're in error, we're erring on the conservative side. The worst we can expect is that all these expressions are actually formalisms. But since their number should be small (given m + n + p is big enough), we can write these errors off as false positives. However, as is evident, the approach is inherently error-prone, and the error grows as the number of such expressions grows.
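Both tie-breaking policies above can be captured in one small function (the weights below are hypothetical; in the unweighted case every judge simply gets weight 1):

```python
# Sketch of the two decision policies discussed above.

def classify(bits, weights=None, ties_are_colloquial=True):
    """Return True (colloquialism) if the weighted vote for '1' exceeds
    half the total weight; exact ties go to colloquialism under policy 2."""
    if weights is None:
        weights = [1] * len(bits)          # plain majority vote
    score = sum(w for b, w in zip(bits, weights) if b == 1)
    total = sum(weights)
    if score * 2 == total:                 # exact 50-50 split
        return ties_are_colloquial
    return score * 2 > total

print(classify([1, 1, 0, 0]))                        # tie -> True (policy 2)
print(classify([1, 0, 0, 0]))                        # minority of 1s -> False
print(classify([1, 1, 0, 0], weights=[3, 1, 1, 1]))  # weighting breaks the tie -> True
```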

Thursday, November 13, 2008

Difference!

Well, we already noted the difference between document classification and word classification. For document classification, words (and also expressions) are the features. For word classification, several probability metrics are defined which act as features; these metrics hinge upon the dependencies between successive words and expressions.
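One concrete instance of such a probability metric is the maximum-likelihood bigram probability P(w2 | w1), estimated from word counts over a toy corpus (the corpus here is invented):

```python
from collections import Counter

# Maximum-likelihood bigram probability P(w2 | w1) from a tiny toy corpus.
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# "the" occurs 3 times; 2 of those occurrences are followed by "cat".
print(round(p_bigram("the", "cat"), 3))  # -> 0.667
```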

Now we go on refining and extending our ideas toward the more intricate realm of expression classification, something closely resembling our aim. We are classifying expressions binarily - colloquial or non-colloquial - and this has a distinct flavor. While in word classification we look at dependency models (e.g., in part-of-speech (POS) tagging we consider graphical models like HMM, MEMM and CRF), here we can't apply the same idea. There are two principal reasons. First, the dependencies here are not as strong as in POS tagging; sometimes there are no dependencies at all. Second, we really have not been able to find more than two features (presence in a lexicon and non-conformance to certain grammar rules), which makes the problem a poor fit for a classification mechanism. The "novelty" of an expression could have been a third feature, but there too we see that many (if not most) novelties result in non-colloquial forms and terminologies rather than colloquialisms.

If classification were to be used, we could have employed algorithms like Naive Bayes or Maximum Entropy. But it gradually becomes apparent that we should rather go for a rule-based (or grammar-based) approach.

Four key issues revisited and correlated

1> Identifying colloquialisms
2> Identifying the contexts
3> Translating to non-colloquial forms w.r.t. contexts
4> Extracting (possibly) new information from the non-colloquial forms

Extension:- Conversion from non-colloquial to colloquial.

Tasks 1 and 2 are related in the sense that an expression may turn out to be colloquial or not depending on the context. Tasks 2 and 3 are related because you can't do the translation unless and until you know the context! Similarly, tasks 3 and 4 are related, for "new information" definitely implies something that was not there in the context, or that "enhanced" the context in some way - so task 4 depends on the preceding translation step, and any discrepancy in that step will derail this one.

Thursday, November 6, 2008

Some Papers on Word Classification

Our task seems somewhat related to the problem of word classification. So here are some pointers:

1> http://portal.acm.org/citation.cfm?id=1036123

2> http://www.aifb.uni-karlsruhe.de/~sst/Research/Publications/eacl2003-pekar-staab.pdf

3> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.247

4> http://cat.inist.fr/?aModele=afficheN&cpsidt=2457953

5> http://www2.computer.org/portal/web/csdl/doi/10.1109/ICASSP.1993.319224

6> http://slp.csie.ntnu.edu.tw/ppt%5C20080717_ytlo_MWCE.ppt

7> http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=01227767

8> http://www.springerlink.com/content/e752355351t15137/ (very important)

9> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7661

10> http://arxiv.org/abs/cmp-lg/9405029

11> http://sciencelinks.jp/j-east/article/200013/000020001300A0434831.php

12> http://home.wlv.ac.uk/~in8113/papers/coling04EEDWorkshop_pekar.pdf

13> http://arxiv.org/abs/cmp-lg/9503011

14> http://arxiv.org/abs/cmp-lg/9610004

15> http://www.aclweb.org/anthology-new/C/C69/C69-2201.pdf

16> http://www.ittc.ku.edu/publications/documents/Futrelle1993_rawc.pdf

17> https://eprints.kfupm.edu.sa/68942/

18> http://mail.udgvirtual.udg.mx/biblioteca/bitstream/123456789/1482/1/Word_classification.pdf

19> http://www.scribd.com/doc/100387/A-Classification-Approach-to-Word-Prediction

20> http://www.cse.ust.hk/~dekai/library/WU_Dekai/CarpuatSuWu_Senseval3.pdf

21> http://sig.media.eng.hokudai.ac.jp/~araki/2002/1996-D-8.pdf

22> http://www.loa-cnr.it/Papers/ijcai03.pdf

23> http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-100/Preslav_Nakov.pdf

24> http://www.corpus.bham.ac.uk/PCLC/asmussen_paper.pdf

A Note on Features

The classifiers that we examined so far (SemiL, ISDA, SVMLight, Mallet) put documents into two groups - positive or negative - depending on which side of the decision boundary they fall on. The boundary is defined by kernel functions, which can be linear, quadratic, user-defined, etc.

The problem is that for document classification, words and expressions are used as features; these are smaller units and can be handled easily. But our task is subtler in the sense that we'd like to classify the expressions themselves. So the fundamental question is: are there still smaller units which can be used as features? Obviously, individual letters are not a good choice. Perhaps the words that form the expression? I'm not sure...

If we look at the problem from another viewpoint, expressions conform to certain grammar productions. So, can we use those particular productions as our features? Need to consider this further...

Another (maybe minor) problem that we encountered while dealing with the classifiers is that they treat the test input files as separate documents. So for our purposes, should we encapsulate one expression per file and feed the files as input? Or are there better methods?

Tuesday, November 4, 2008

Papers

1> Rohit J. Kate and Raymond J. Mooney. 2007. Semi-Supervised Learning for Semantic Parsing using Support Vector Machines. This paper shows a semi-supervised SVM approach to semantic parsing. Authors modified the KRISP method of supervised semantic parsing by capping it with a transductive SVM (Chen et al's), and termed the resulting system SEMISUP-KRISP. It may come of use to us if we want to do semantic parsing. It also clarifies the idea of how transduction and semi-supervised learning can be integrated with SVMs.

2> Sugato Basu, Mikhail Bilenko, Raymond J. Mooney. 2004. A Probabilistic Framework for Semi-Supervised Clustering. This paper proposes a probabilistic model for semisupervised clustering based on Hidden Markov Random Fields (HMRFs) which works well in case of prototype-based clustering. Authors devised a K-means style algorithm (which they termed HMRF-KMeans) that gives superior performance for small datasets in higher dimensions.

3> Sugato Basu, Arindam Banerjee, Raymond J. Mooney. 2004. Active Semi-Supervision for Pairwise Constrained Clustering. This paper gives 3 algorithms PCKMeans, Explore and Consolidate, which work in concert to yield a pairwise-constrained clustering of data. Pairwise-constrained means that there are must-link and cannot-link relationships among unlabeled data. For our purpose, we can view this approach as a refinement over pure semi-supervised learning, for here we are supplying, alongwith unlabeled data, the above-mentioned must-link and cannot-link relationships.

4> Arindam Banerjee, Chase Krumpelman, Joydeep Ghosh, Sugato Basu, Raymond J. Mooney. 2005. Model-based Overlapping Clustering. It has been shown that many recent datasets actually exhibit overlapping cluster property in the sense that a single piece of data can be assigned to multiple clusters. This paper generalizes Segal et al's probabilistic relational model (PRM) and additionally offers several algorithmic modifications that improve both the performance and applicability of the model. In our case, this paper is only useful if we really hit upon a dataset which exhibits cluster overlaps.

5> Sugato Basu, Arindam Banerjee, Raymond Mooney. 2002. Semi-supervised Clustering by Seeding. Although a bit old, this paper ushers in the idea of merging KMeans with semi-supervised models as well as EM-style algorithms.

6> Brian Kulis, Sugato Basu, Inderjit Dhillon, Raymond Mooney. 2005. Semi-supervised Graph Clustering: A Kernel Approach. This paper has 6 important contributions enumerated in pages 1-2. Apart from combining Graph clustering and Kernel KMeans and capping it with semi-supervised approach, authors have shown that their method generalizes Spectral Learning, significantly outperforms HMRF-KMeans, and can detect clusters with non-linear boundaries (thanks to the connection between graph-based and vector-based clustering).

7> Hendrik Kück, Peter Carbonetto, Nando de Freitas. A Constrained Semi-Supervised Learning Approach to Data Association. Data association is an important problem in computer vision. Authors show a constrained semi-supervised learning approach for tackling this problem, which outperforms other existing approaches, e.g., EM algorithms and Markov Chain Monte Carlo (MCMC) methods. If we use data association at some point, then this paper will be of use.

8> Rie Kubota Ando, Tong Zhang. A High-Performance Semi-Supervised Learning Method for Text Chunking. This paper discusses a structural learning approach that is very effective in bringing out the semi-supervised aspect of text chunking.

9> David Nadeau, Peter D. Turney. A Supervised Learning Approach to Acronym Identification. Acronym identification is very important for our task, because there are several acronyms, both standard and non-standard, which fall under the realm of colloquialism. Examples are ASAP (as soon as possible), AFAIK (as far as I know), etc. An acronym may be either a colloquialism or a formal expression depending on the context. For example, WS in a particular context may refer either to William Shakespeare (colloquialism) or to Western Samoa (formal expression). However, I can't say how much the idea of acronym identification will be useful to us.

10> Ion Muslea, Steven Minton, Craig A. Knoblock. Active + Semi-Supervised Learning = Robust Multi-View Learning. Multi-view learning is important where features can be grouped into subsets. So we'll reap benefit from this paper if we see subsets in the feature space. Robust multi-view learning is a semi-supervised approach to tackle this problem.

11> Rich Caruana, Nikos Karampatziakis, Ainur Yessenalina. 2008. An Empirical Evaluation of Supervised Learning in High Dimensions. Since high dimensionality in NLP datasets is very common, there must be a suitable evaluation of how supervised learning performs in such cases. This paper presents an empirical study of supervised learning evaluation.

12> Gokhan Tur, Dilek Hakkani-Tür, Robert E. Schapire. 2004. Combining active and semi-supervised learning for spoken language understanding. Although written primarily for implementing a goal-oriented call routing system, this paper moves beyond by providing means for understanding spoken language (which includes colloquialisms) in an active, semi-supervised manner.

13> Nuanwan Soonthornphisaj, Boonserm Kijsirikul. 2005. Combining ILP with Semi-supervised Learning for Web Page Categorization. Although web page categorization is not our task, we can have important insight from this paper regarding how ILP (inductive logic programming) and ICT (iterative-cross training) can be applied to the problem of semi-supervised learning.

14> Steven M. Beitzel, Eric C. Jensen, Ophir Frieder, David D. Lewis, Abdur Chowdhury, Aleksander Kołcz. Improving Automatic Query Classification via Semi-supervised Learning. Automatic query classification is a very important aspect of our problem, because it allows us to differentiate between queries for colloquial and formal terms. This paper gives a semi-supervised method for improving the process.

15> Junhui Wang, Xiaotong Shen. Large Margin Semi-supervised Learning. This paper discusses the theory of semi-supervised learning and how it can be improved to obtain larger margins.

16> Te Ming Huang, Vojislav Kecman. 2005. Performance Comparisons of Semi-Supervised Learning Algorithms. Authors present a detailed and painstaking study of several semi-supervised algorithms, and compare their performances.

17> Ke Chen, Shihai Wang. Regularized Boost for Semi-Supervised Learning. This paper introduces the important idea of boosting algorithms and goes on giving details of a regularized boosting method.

18> J. Andrew Bagnell. 2005. Robust Supervised Learning.

19> Ashish Kapoor, Eric Horvitz, Sumit Basu. Selective Supervision: Guiding Supervised Learning with Decision-Theoretic Active Learning.

20> Yves Grandvalet, Yoshua Bengio. Semi-supervised Learning by Entropy Minimization. This paper gives a new approach to semi-supervised learning.

21> Te Ming Huang, Vojislav Kecman. 2004. Semi-supervised Learning from Unbalanced Labeled Data – An Improvement. This paper discusses an improved semi-supervised learning method which works even when the amount of labeled data is very small.

22> Avleen S. Bijral, Manuel E. Lladser, Gregory Grudic. Semi-supervised Learning of a Markovian Metric. This paper discusses important concepts of distances and metrics, and goes on detailing a semi-supervised learning method on a Markovian metric.

23> Marc'Aurelio Ranzato, Martin Szummer. 2008. Semi-supervised Learning of Compact Document Representations with Deep Networks. Compact document representations are very useful forms of data visualization. This paper discusses a semi-supervised "deep-network" approach for building compact document representations.

24> Gideon S. Mann, Andrew McCallum. 2007. Simple, Robust, Scalable Semi-supervised Learning via Expectation Regularization. This paper gives another approach to semi-supervised learning - expectation regularization.

25> Dina Goren-Bar, Tsvi Kuflik, Dror Lev. Supervised Learning for Automatic Classification of Documents using Self-Organizing Maps. This paper gives yet another approach to semi-supervised learning - self-organizing maps.

26> Irena Spasić, Goran Nenadić, Kostas Manios, Sophia Ananiadou. 2002. Supervised Learning of Term Similarities. This paper brings in the idea that term similarities may also be discovered with the help of supervised learning.

27> Thanh Phong Pham, Hwee Tou Ng, Wee Sun Lee. 2005. Word Sense Disambiguation with Semi-Supervised Learning. Word sense disambiguation is important for our purpose, and this paper gives a semi-supervised approach for tackling this problem.

Tools and Lexicons

1> http://www.comp.nus.edu.sg/~rpnlpir/
2> http://www.cs.cofc.edu/~manaris/ai-education-repository/nlp-tools.html
3> http://nlp.stanford.edu/links/statnlp.html

Monday, October 13, 2008

Some important books

I have studied some books and book drafts relevant to this task. What follows is a rough outline of the important material covered in each. A lot more still needs to be read and understood, and some of the following I have covered only partially; however, I'll give a general idea of the contents.

Book drafts:

1> Introduction to Information Retrieval (Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze). Chapters 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18 should be of use.

2> Introduction to Machine Learning (Nils J. Nilsson). Chapters 1, 3, 4, 5, 6, 9, 11, 12 should be of use.

Books:

1> Information Retrieval: Algorithms and Heuristics (David A. Grossman, Ophir Frieder). Chapters 2, 3, 4, 5 should be of use.

2> Pattern Recognition Principles (Julius T. Tou, Rafael C. Gonzalez). Chapters 2, 3, 4, 5, 6, 7, 8 should be of use.

3> Speech and Language Processing (Daniel Jurafsky, James H. Martin). Chapters 3, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21 are of use.

4> Linguistics: An Introduction to Language and Communication (Adrian Akmajian, Richard A. Demers, Ann K. Farmer, Robert M. Harnish). Chapters 2, 5, 6, 7, 8, 9, 10 should be of use; chapters 11 and 12 are also important.

5> Principles of Artificial Intelligence (Nils J. Nilsson). Chapters 1, 2, 3, 6 are important.

6> Artificial Intelligence: A New Synthesis (Nils J. Nilsson). Chapters 3, 4, 5, 6, 7, 8, 9, 10, 17, 18, 19, 20, 23, 24 are important.

7> Digital Image Processing (Rafael C. Gonzalez, Richard E. Woods). Chapters 9, 10, 11, 12 are important.

8> Computer Vision: A Modern Approach (David A. Forsyth, Jean Ponce). Chapters 14, 15, 16, 22, 23 are important.

Monday, September 15, 2008

Some further considerations

First let us take the trainable classifier approach. We asked some questions in our previous post:

What will be the training approach?
What text features will we have to consider in this classification?
How to shrink the initial feature space?

In this context, we looked at http://en.wikipedia.org/wiki/Supervised_learning to see what they say. They have identified five steps:

a) Determine the type of training examples.
b) Gather a suitable training set.
c) Feature selection (and redundant feature elimination).
d) Learning function and learning algorithm identification.
e) Actual learning, optimization (with parameter adjustment) and cross-validation.

There are other issues like Empirical Risk Minimization, Active Learning, etc. Some good classifiers are artificial neural networks (ANN), multilayer perceptrons, decision trees, SVMs, k-nearest neighbor (kNN), Gaussian, Naive Bayes and RBF classifiers.

Another important issue to consider is semi-supervised learning and its applicability to our current problem (http://en.wikipedia.org/wiki/Semi-supervised_learning). A useful approach might be to use transduction (http://en.wikipedia.org/wiki/Transductive_learning), because transduction comes in particularly handy when dealing with binary classifications like ours (formal vs. informal).

As an initial step, we have gathered a large number of blogs; these will serve as the training data set. Regarding features, a useful first approximation is to use the colloquial expressions and words themselves as features, with the tf-idf products of these terms as feature values. We know that this product normally reflects the true importance of a term in a document. This step also yields an occurrence matrix which helps in eliminating redundant features; the redundancy of a feature may be assessed in terms of its combined tf-idf across all documents (need to clarify this).
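The tf-idf feature values just described can be sketched as follows; the little "blog" collection and the term are invented for the example:

```python
import math

# tf-idf of a term in one document of a (toy) blog collection:
# tf = relative frequency in the document; idf = log(N / document frequency).
docs = [
    "gonna grab some grub later",
    "the results were gonna be published",
    "the experiment was a success",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in docs if term in d.split())
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

# "gonna" appears in 2 of 3 documents, once among the 5 words of docs[0].
print(round(tf_idf("gonna", docs[0], docs), 3))  # -> 0.081
```

A term appearing in every document gets idf = 0, which is one crude way to spot a redundant feature; the "combined tf-idf across all documents" idea would sum this value over the collection.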

Our next two tasks involve selecting an appropriate learning model and actually training on the data. We have some software available to help us with this, e.g., SVM-Light, Mallet, ISDA and SemiL. While SVM-Light essentially implements an SVM in C, Mallet goes even farther with its richer environment of document classification, sequence tagging, topic modeling and numerical optimization tools. ISDA is a GUI-based SVM tool, and SemiL implements graph-based semi-supervised learning techniques for large-scale problems. MATLAB also features a fairly rich SVM toolbox, available from several sources. There are other implementations like SVMTool, SSVM (and other tools by the University of Wisconsin-Madison), as well as DTREG, SVM-Fold, etc.

Regarding training approaches, Microsoft Research has described a novel one - Sequential Minimal Optimization (SMO). We'll have to see whether it's applicable to our case. I'm trying to gather as much footing as possible before starting the task, for I don't have much background in this area.

During feature selection, we'll need to consult a dictionary of colloquialisms. It's a good idea to start with a standard dictionary; many are available on the Web, e.g., http://www.peevish.co.uk/slang/, http://www.urbandictionary.com/, http://www.geocities.com/informalenglish/dictindex.html, etc. We, however, will probably need a set of files or a database as our dictionary, for a very simple reason: these can be accessed much more easily. While extracting (and later on, eliminating) colloquial expressions, we'll also need to consider grammar rules (to eliminate formal jargon and to generate newer, previously unseen colloquialisms), word-sense disambiguation (e.g., what does the word "cabbage" mean in a particular context?) and some other important things (such as the change of meaning when two words combine to form a new expression). In some cases (we'll have to see which ones), we may have to take recourse to rule-based systems as well. However, it seems to me that semantic analysis is going to have a greater impact on our study than anything else. An important pointer in this direction is here.

There are other learning tools besides SVM, like ANN, Naive Bayes, kNN, decision trees, etc. We don't know yet which one is going to impact more in this case. Some tools in this direction are http://www.mathtools.net/MATLAB/Neural_Networks/index.html, http://www.cs.cofc.edu/~manaris/ai-education-repository/neural-n-tools.html, http://www.makhfi.com/tools.htm, http://www.bestfreewaredownload.com/s-gejpuiti-multilayer-perceptron-c-33-math-scientific-tools-freeware.html, http://www.consulttoday.com/KnnRabbit.aspx, http://mac.softpedia.com/get/Internet-Utilities/POPFile.shtml, etc.

Tuesday, September 9, 2008

How to solve the problem

Let us consider the first portion of our task: to identify colloquialisms in text and to translate them into formal expressions. For the time being, we'll concentrate on the first half: identification of colloquialisms in text.

Now there are several methods that can be implemented to resolve this issue. Some of these have been discussed in earlier posts and comments, like grammar checkers, dictionary search (vocab search), etc. We also concluded that naive dictionary search will be a brute-force approach and discarded it in favor of another one in which dictionary search is aided by grammar checking.

It seems that there exist better methods to accomplish our goal. For example, for a particular setting and type of input text, we can build a trainable pattern classifier. All it has to do is classify the expressions found in a text into two disjoint groups: colloquial and formal. However, we need to answer questions like: what will be the training approach, what text features will we have to consider in this classification, how do we shrink the initial feature space, etc.
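To make the trainable-classifier idea concrete, here is a minimal Naive Bayes sketch over the words of an expression. The handful of labelled training examples is invented; a real system would need far more data and better features:

```python
import math
from collections import Counter, defaultdict

# Invented labelled examples standing in for a pre-annotated training set.
train = [
    ("gonna grab some grub", "colloquial"),
    ("wanna hang out", "colloquial"),
    ("please find the report attached", "formal"),
    ("the meeting is scheduled", "formal"),
]

word_counts = defaultdict(Counter)   # per-label word frequencies
label_counts = Counter()             # label priors
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the label maximizing log P(label) + sum log P(word | label),
    with add-one smoothing over the vocabulary."""
    def log_score(label):
        total = sum(word_counts[label].values())
        s = math.log(label_counts[label] / len(train))
        for w in text.split():
            s += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        return s
    return max(label_counts, key=log_score)

print(classify("gonna hang out later"))  # -> colloquial
```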

Raw grammar checking may also be improved with the help of a rule-based system. In a rule-based system, there is a set of facts and a set of rules; rules act on facts to generate more facts. In this way, a rule-based system works much like a grammar checker, with one key difference. Rule-based systems are basically synthetic in nature, i.e., they synthesize (generate) new expressions from facts and rules; these expressions are then compared to the input text. Grammar checkers, on the other hand, are analytic systems, meaning that they analyze the input text first to see whether it fits any of the available grammar rules (productions).

In our case, if we go for a trainable classifier approach, we have to first train the system using a set of representative, pre-annotated texts containing colloquialisms. After that, we can test the system using similar non-annotated texts. If we go for a grammar checker approach, then first we'll have to define the appropriate productions, as mentioned in previous posts. Finally, if we turn to a rule-based system, we need to fashion the facts and the rules such that facts correspond to basic colloquialisms, and rules help us derive more involved ones. Here, rules act incrementally (or inductively), meaning that basic facts lead to next-level facts, those lead to still higher-level facts, and so on.
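The incremental, synthetic behavior of such a rule-based system can be sketched as simple forward chaining. The facts and rules below are invented placeholders, just to show the mechanics:

```python
# Forward chaining: basic facts (colloquialisms) plus rules synthesize
# more involved expressions until nothing new can be derived.
facts = {"gonna", "wanna"}

# Each rule maps an existing fact to a newly synthesized expression (or None).
rules = [
    lambda f: f + " be" if f == "gonna" else None,         # gonna -> "gonna be"
    lambda f: f + " go" if f == "wanna" else None,         # wanna -> "wanna go"
    lambda f: f + " there" if f.endswith(" be") else None, # "gonna be" -> "gonna be there"
]

changed = True
while changed:                        # iterate to a fixed point
    changed = False
    for fact in list(facts):
        for rule in rules:
            derived = rule(fact)
            if derived and derived not in facts:
                facts.add(derived)
                changed = True

print(sorted(facts))  # basic facts plus three derived, next-level facts
```

Note how "gonna be there" only appears on the second pass, after "gonna be" has itself become a fact - the next-level-facts behavior described above.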

Wednesday, August 27, 2008

Applications

Well, the first thing that comes to my mind is extraction of useful information. Now the concept of "usefulness" varies, even among the savants of the same field. But we'll take a broader approach to evade any controversy as much as possible.

For example, talking about Chemistry, as I had pointed out already, we may look into whether the colloquialisms reveal any new or interesting chemical data, or whether they refer to some specific compound or a formula. Implicitly we assume here that "new" will be useful, but that may not be true everywhere.

Another good application that comes to my mind is Emotion Analysis. It has several subfields like Metaphor Mining, Sarcasm Mining, etc. Opinion Mining is another good idea.

A big question: will the general public benefit from these technical achievements?

Answer: Most definitely. To see why, we need to consider the general public itself. People are more likely to express their inner thoughts and opinions in colloquial, spoken English, which often results in terse spoken terms or even slang. The presence of such terms in a written document may court serious consequences, and that's the reason people generally avoid them while writing - they mask their actual ideas and opinions with something more tolerable and readable.

Bottom line: formal documents fail to give us the actual expressions their protagonists had in mind; they only supply approximate or ethically correct ones. If we would like to know what people really thought, we need to look at what they spoke (or, for that matter, what they wrote in blogs, which are much less prone to open public criticism).

Question: Why would we be concerned about what people have in their minds?

Answer: There are reasons. Imagine yourself a big firm-owner. You'd like to reap as much benefit as possible from your employees. This is precisely why you'd like to make them as happy as possible, for making them happy is a crucial step in getting any work done. Now, in general, an employee won't open his mind to you; he'll open it to someone else - probably colleagues or family members. So you can never determine whether he's happy by looking only at formal situations. You may have to meet his colleagues or family members, or at least read his blog and see whatever he might have put up there. And that affects the benefit you'll get from your employee.

There are other reasons. Imagine you are a security officer. You would like to check the records of suspects, convicts and criminals. Those records reflect only superficial facts - not the ones they actually have in their minds, not the big schemes they are probably interested in. They will never talk about those to you, either - don't even expect it. So you can either eavesdrop and listen intently to what they discuss with their mates, or better still, keep a record of their chats (and blogs, if there are any). Now, these people are cunning enough not to utter those things openly in chats or blogs. But you might still get a clue. For example, one or two codewords, expressed in colloquialisms and slang, may tip you off. You can prevent some big attacks this way. A prerequisite, however, is a vocabulary of criminal slang and codewords. So, as we can see, the problem has important relations with Cryptography and Security: we may actually need to break a code (expressed in colloquial or slang form) to learn someone's intentions.

There are other examples, like the parent-child relationship, the teacher-student relationship, etc. Our work will benefit these relationships and help them thrive. In this way, we need to be concerned with Sociology as well, for it provides a very fertile ground for finding such relationships. Whenever there is a relationship with constraints (such that one party may hold back some information from the other), be it personal, professional or formal, there is an application of our work.

Enough about big examples and applications. In a nutshell, whenever some people REALLY want to know something about some other people, there is an application of our work.

Monday, August 25, 2008

Motivations

So long as there has been a problem, there has been an underlying motivation to solve it. In this case, the motivation is primarily to extract useful information by expanding colloquialisms, and to generate easily understandable, pithy expressions by contracting formal jargon. Apropos of Chemistry, our quest boils down to a search for important chemical terms and expressions hidden in the folds of colloquialisms.

So we can see that besides language processing methods and models, we need to be thoroughly involved with IR techniques as well. Language processing will enable us to generate some useful expressions from some other ones. IR will help us to identify the "usefulness" contained in those expressions.

Personal motivations include my passion for learning languages and their intricacies. I am much drawn to multilingual and cross-lingual issues as well. All these factors drove me to choose this particular problem.

Problem Statement

Transform a colloquial expression to its formal counterpart, and transform a formal expression to its colloquial counterpart.

Later scope: We can explore whether these transformations reveal new facts, or just make things easier to understand and communicate.

An issue: we need to be certain about which of the many possible expressions we are going to transform a given expression into. For colloquial -> formal transformations, it is probably the most informative one; for the reverse transformations, it is the expression that is easiest to understand.

Since it is never easy to pin down the notions of "most informative" and "easiest to understand", we may take a detour. For example, we can generate only the expression (or expressions) easiest for a computer to produce. Or we can generate all possible expressions and leave the task of discrimination and sieving to the user (or the software/program/module that will make use of our code). However, this is a naive approach and may not be worth pursuing further.
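The generate-everything fallback can be sketched as follows. This is purely an illustrative assumption on my part - the rule-table format and the function are made up for this sketch, not an existing tool:

```python
from itertools import product

# Hypothetical rule table mapping a colloquial token to its possible
# formal forms; a real table would be far larger and corpus-derived.
RULES = {"it's": ["it is", "it has"], "gonna": ["going to"]}

def all_expansions(sentence, rules=RULES):
    """Generate every expansion the rules permit, leaving the task of
    discrimination and sieving entirely to the caller."""
    options = [rules.get(tok.lower(), [tok]) for tok in sentence.split()]
    return [" ".join(combo) for combo in product(*options)]
```

Calling `all_expansions("it's gonna rain")` yields both "it is going to rain" and the wrong "it has going to rain" - which shows exactly why the sieving step we are deferring is the hard part.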

Thursday, August 21, 2008

What is a "Colloquial Expression"?

We have been using the term "Colloquial Expression" as if its meaning was taken for granted. However, it has concrete definition(s), as the following sites say. One thing should be noted at this point. A Colloquial Expression is the same as a Colloquialism.

1> According to http://www.answers.com/topic/colloquialism, "colloquialism, the use of informal expressions appropriate to everyday speech rather than to the formality of writing, and differing in pronunciation, vocabulary, or grammar."

This, as we can see, is a more-or-less generalized and holistic definition. It takes into account almost every aspect of what a colloquial expression is about.

2> According to http://disted.tamu.edu/classes/telecom98s/eva/terms.htm, "1. A colloquial expression; that is, an expression that is characteristic or appropriate to ordinary or familiar conversation rather than formal speech or writing. In standard American English, He hasn't got any is colloquial, whereas He has none is formal. 2. Colloquial style or usage. Note: Colloquialisms are often viewed upon with disapproval, as if they indicate "vulgar, bad, or incorrect" usage. However, they are merely part of a familiar style used in speaking rather than in writing."

An important point: a colloquial expression should not be equated with "vulgar, bad or incorrect" usage. There may be vulgar, bad or incorrect terms interspersed within colloquial speech, but that is by no means its sole content or purpose.

3> According to http://en.wikipedia.org/wiki/Colloquial_expression, "A colloquialism is an expression not used in formal speech, writing or paralinguistics.
...
Colloquialisms denote a manner of speaking or writing that is characteristic of familiar "common" conversation; informal colloquialisms can include words (such as "y'all" or "gonna" or "wanna"), phrases (such as "ain't nothin'", "dressed for bear" and "dead as a doornail"), or sometimes even an entire aphorism ("There's more than one way to skin a cat")."

It gives examples of several colloquial expressions and goes on giving details about types of colloquialisms and their instances.

4> According to http://encyclopedia.farlex.com/Colloquial+expression, "Informal word or phrase appropriate to familiar, everyday conversation. Colloquialisms are more acceptable than slang in a wider social context."

A pithy and somewhat controversial definition. Pithy, because it summarizes colloquialism in the first sentence; somewhat controversial, because it gives slang a separate status in the second. But slang is definitely within the realm of colloquialism, although it is more restricted and, in some contexts, less appropriate.

5> According to http://en.wiktionary.org/wiki/colloquial_expression, "A phrase that appears more often in spoken than in written language. Colloqiual expressions are similar to slang, but tend to be more universal, whereas slang can often be limited to a particular social group;"

The second sentence tells us that "colloquial expressions are similar to slang". But as we know, slang is not merely similar - it is actually a special form of colloquialism. So instead, we should rather say, "Colloquial expressions are a superset of slang".

Some really cool examples of colloquial expressions are here:

http://www.englishforums.com/search/Colloquial+expressions.htm

Another important thing: we have already discussed "expansion" and "contraction". We must remember, however, that contraction does not always imply a reduction in the number of words in a phrase or clause, as shown here:

http://mbm.dotnet11.hostbasket.com/iis/testy/test13.asp

Here, all contractions have actually increased the number of words.

But these are formal definitions. To me, a colloquial expression is something that is not appropriate when writing or speaking officially, formally or courteously, and something that is fully appropriate when writing or speaking informally (as with friends or close partners or colleagues). This includes slang as well.

Tuesday, August 19, 2008

Real Examples

1> Example not directly related to important chemistry facts

From http://cultureofchemistry.blogspot.com/2008/07/protecting-groups.html

"The boys had the tent next door to ours. I came back from dinner one night to find a very happy squirrel just making off with a chip container from the kids tent. At which point I remembered the dried fruit I'd left in my pack after the morning hike. Whew...it was still there. The rodents had been attracted to the far more tasty snack leavings next door. The boys tent is serving as (a chemist would say) a protecting group."

There are a lot of things to consider. First, look at the boldfaced expressions. These are either colloquial expressions, incorrect word-forms probably produced in haste, or pithy clauses like the last one.

"Boys" and "kids" should really be "boys'" and "kids'". "Whew" is an expression suggesting relief or amazement on the author's part. "A chemist would say" alludes to the fact that a protecting group behaves much the same way the boys' tent was behaving towards the squirrels.

But probably these expressions are not bringing out any important chemistry fact.

2> Example directly related to an important chemistry fact

Let's look at another para from
http://cultureofchemistry.blogspot.com/2008/06/weird-words-of-science-isotope.html

"Each atom of an element has a characteristic number of protons - positively charged particles - in their nucleus. An atom with five protons is boron. One with 82? Lead."

Here, "One with 82?" is most definitely a colloquial expression and not by any means a complete sentence. It is a question: "What is the atom that has 82 protons in it?" The answer, "Lead", is also not a full sentence; it stands for "The atom that has 82 protons in it is Lead." So these colloquial expressions are actually referring to important chemical details.

Wednesday, August 13, 2008

Examples of expansion and contraction

Simple expansions and contractions:

It's = It is/has (is/has ambiguity to be resolved by checking for the presence of participial forms after It's)
There's = There is/has (is/has ambiguity to be resolved by checking for the presence of participial forms after There's)
I'd = I would/had (would/had ambiguity to be resolved by checking for the presence of participial forms after I'd)

These are all syntactic expansions/contractions.

More involved expansions and contractions:

1> Expansion Ambiguity

A: Tell me something about CO2.
B: Oh...it's the carbon thing again!

Here, A and B are having a conversation, and "the carbon thing" is most certainly a colloquial expression. This, when interpreted (we can say "expanded semantically") with contextual help, boils down to CO2, our familiar formula in chemistry! But again, "carbon thing" may actually imply something totally different. For example, if B is bored with A's rantings, he will probably say the same thing! Here, the ambiguity might be resolved the following way: if it is "the carbon thing", we can reasonably guess that B is bored and is referring to anything and everything related to carbon. On the other hand, if it were "that carbon thing", we could be fairly sure that B was actually talking about CO2.
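The the/that distinction can be written down as a toy rule. Both the function and its string outputs are illustrative assumptions, not a tested disambiguation method:

```python
def resolve_carbon_thing(utterance, topic="CO2"):
    """Toy heuristic: "that carbon thing" is read as a definite
    back-reference to the topic under discussion, while "the carbon
    thing" is read as a vague generality from a possibly bored speaker."""
    text = utterance.lower()
    if "that carbon thing" in text:   # definite reference: expand to topic
        return topic
    if "the carbon thing" in text:    # vague generality
        return "anything related to carbon (speaker may be bored)"
    return None
```

A single determiner is, of course, far too thin a signal on its own; a workable system would weigh it against the surrounding dialogue.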

There are other examples:

2> Placeholders and Figuratives

A: D'you think we are going to estimate the solubility of this mixture?
B: Well, I think we do.

Here also, what B says is basically "We are going to estimate the solubility of this mixture.", and "well" is merely a placeholder, or we can say, a figurative. So in expanding B's words, we have to discard "well", and then we have to interpret the rest of the sentence in conjunction with whatever A had uttered previously.

So there are 2 tasks:

1> Identify the placeholders and figuratives, and remove all of them.
2> Interpret the rest with contextual help. The context may either precede the text, or succeed it. [In the above two cases, context preceded text]
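Task 1 above admits a simple first-cut sketch. The placeholder list is assumed for illustration; a real inventory would be learned from conversational corpora, and Task 2 is only stubbed out here:

```python
# Assumed list of single-word fillers, made up for this sketch.
PLACEHOLDERS = {"well", "hmmm", "hmm", "oh", "uh", "er", "um"}

def strip_placeholders(utterance):
    """Task 1: remove placeholder/figurative words from an utterance."""
    kept = [w for w in utterance.split()
            if w.lower().strip(".,!?") not in PLACEHOLDERS]
    return " ".join(kept)

def prepare_for_interpretation(utterance, context):
    """Task 2 (stub): pair the cleaned utterance with its context -
    whether that context precedes or succeeds the text - so a later
    interpretation step can resolve the ellipses against it."""
    return {"utterance": strip_placeholders(utterance), "context": context}
```

So `strip_placeholders("Well, I think we do.")` gives "I think we do.", ready to be interpreted against A's preceding question.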

One example where text precedes context:

3> Context succeeds text

A: Hmmm, great job!
B: The isolation part?

B's question is in colloquial form. When expanded, it'll be "Are you talking about the isolation part?" A's remark "great job" (colloq) tentatively (because we know nothing about the great job, and B's answer to the remark is merely another question!) refers to the isolation part, and "hmmm" is nothing but a placeholder. But that may not be true everywhere. For example,

4> Placeholders, or something else?

A: CO2 reacts with water to yield H2CO3. Take it for granted.
B: Hmmm.

Here, "hmmm" is more than a placeholder. It actually means that B is reserving his judgement about A's remark. So, when expanded, it'll be something like "Yes, I take it for granted that CO2 reacts with water to yield H2CO3.", or "No, I do not take it for granted that CO2 reacts with water to yield H2CO3.", or even "It may be that CO2 reacts with water to yield H2CO3, of which I am not sure yet."

Like expansion, contraction is also an important and difficult issue. Not only should we be concerned about the context, but at the same time we should have at our disposal all possible colloquial forms that are appropriate. The most appropriate (and probably the pithiest) one will replace our formal, expansive jargon. For example,

5> A formal dialog

A: CO2 reacts with water to yield H2CO3. Take it for granted.
B: Yes, I take it for granted that CO2 reacts with water to yield H2CO3.

Converting this example to the previous one would require us to know all possible colloquial expressions that COULD HAVE replaced the second sentence, and then to select "hmmm" (it might be "Yup!" as well) as the fittest among them. Here, some information about B's mental state may be extremely valuable, for, as we understand, colloquial forms vary considerably with changes in a person's mental state. As an example, the following could be a perfect contraction:

6> Colloquial expressions vary with mental state

A: CO2 reacts with water to yield H2CO3. Take it for granted.
B: Bingo!

Definitely, there are fine distinctions between these two cases. When B says "hmmm", we understand that he is attentive and heedful. On the other hand, a "bingo!" tells us that B has been thoroughly involved and is overjoyed with this new finding. The important thing is that one "bingo!" from B tells us both that he was overjoyed and that the finding was NEW to him!

Finally, it must be admitted that colloquial forms are not very easy to obtain in general (digitized) literature. They are obtained either from chats or blogs (like this one... [:-D]), or from memos chemists use to exchange ideas and data. These sources will form our corpora.

More about corpora in the next post.

Tuesday, August 12, 2008

1

This week: I'll gather information about informal communication, colloquialisms and their impact on processing Chemistry data. I'll also explore the multilingual issues that seem pertinent in this regard.

Next week: This week first, then next week.

Issues: forms of colloquialism, variations in colloquialism across languages, methods of extraction and expansion, and possible applications. A very important thing is CONTRACTION, i.e., the generation of colloquial forms from expanded (and maybe somewhat formal and rigmarolic) ideas and discussions. So basically we need to explore BOTH directions - how to expand, and then how to contract (appropriately).

Achievements:

http://en.wikipedia.org/wiki/Wiki

http://www.cs.umass.edu/~culotta/pubs/culotta04dependency.pdf

Studying kernels, dependency trees, etc. Need to learn more about the methods...