Well, the first thing that comes to mind is the extraction of useful information. The concept of "usefulness" varies, even among experts in the same field, but we'll take a broad view to avoid controversy as much as possible.
For example, in Chemistry, as I have already pointed out, we may look into whether the colloquialisms reveal any new or interesting chemical data, or whether they refer to some specific compound or formula. We implicitly assume here that "new" means useful, but that may not always be true.
Another good application that comes to my mind is Emotion Analysis. It has several subfields like Metaphor Mining, Sarcasm Mining, etc. Opinion Mining is another good idea.
A big question: will the general public benefit from these technical achievements?
Answer: Most definitely. To see why, we need to consider how people actually communicate. People are more likely to express their inner thoughts and opinions in colloquial, spoken English, which often results in terse expressions or even slang. Such terms can have serious consequences when they appear in a written document, and that's the reason people generally avoid them while writing - they mask their actual ideas and opinions with something more tolerable and readable.
Bottom line: Formal documents fail to give us the expressions their authors actually had in mind; they supply only the approximate or socially acceptable ones. If we would like to know what people really thought, we need to look at what they said (or, for that matter, what they wrote in blogs, which are much less exposed to open public criticism).
Question: Why would we be concerned about what people have in their minds?
Answer: There are reasons. Imagine you own a big firm. You'd like to get as much as possible out of your employees, so you'd like to keep them as happy as possible, for keeping them happy is a crucial step in getting any work done. Now, in general, your employees won't open up to you; they'll open up to someone else - probably colleagues or family members. So you can never determine whether an employee is happy just by looking at formal situations. You may have to talk to their colleagues or family members, or at least read their blog and see what they have put up there. What you learn affects how much you ultimately get out of that employee.
There are other reasons. Imagine you are a security officer. You'd like to check the records of suspects, convicts and criminals. Those records reflect only superficial facts - not what these people actually have in mind, not the big schemes they may be planning. They won't ever talk about those to you, either. Don't even expect it. So you can either eavesdrop on what they discuss with their associates or, better still, keep a record of their chats (and blogs, if there are any). These people are often cunning enough not to state such things outright in chats or blogs, but you might still get a clue. For example, one or two codewords, expressed as colloquialisms or slang, may tip you off, and some big attacks might be prevented this way. A prerequisite, however, is a vocabulary of criminal slang and codewords. So, as we can see, the problem has important connections with Cryptography and Security: we may actually need to break a code (expressed in colloquial or slang form) to learn their intentions.
There are other examples, like the parent-child relationship, the teacher-student relationship, etc. Our work can help such relationships thrive. We therefore also need to pay some attention to Sociology, for it provides very fertile ground for studying relationships. Whenever there is a relationship with some constraints (such that one party may hold back some information from the other), be it personal, professional or formal, there is an application of our work.
Enough about big examples and applications. In a nutshell, whenever some people REALLY want to know something about other people, there is an application of our work.
Wednesday, August 27, 2008
Monday, August 25, 2008
Motivations
So long as there has been a problem, there has been an underlying motivation to solve that problem. In this case, the motivation is primarily to extract useful information by expanding colloquialisms, and to generate easily understandable, pithy expressions by contracting formal jargon. Apropos of Chemistry, our quest boils down to a search for important chemical terms and expressions hidden in the folds of colloquialisms.
So we can see that besides language processing methods and models, we need to be thoroughly involved with IR techniques as well. Language processing will enable us to generate some useful expressions from some other ones. IR will help us to identify the "usefulness" contained in those expressions.
Personal motivations include my passion for learning languages and their intricacies. I am much drawn to multilingual and cross-lingual issues as well. All these factors drove me to choose this particular problem.
Problem Statement
Transform a colloquial expression to its formal counterpart, and transform a formal expression to its colloquial counterpart.
Later scope: We can explore whether these transformations reveal new facts, or just make things easier to understand and communicate.
An issue: We need to decide which of the many possible expressions a given expression should be transformed into. For colloquial -> formal transformations, it is probably the most informative one, and for the reverse transformations, it is the expression easiest to understand.
Since it is never easy to pin down the notions of "most informative" and "easiest to understand", we may take a detour. For example, we can generate only the expression (or expressions) easiest for a computer to produce. Or we can generate all possible expressions and leave the task of discrimination and sieving to the user (or to the software/program/module that will make use of our code). However, generating everything is a naive approach and may not be pursued further.
Thursday, August 21, 2008
What is a "Colloquial Expression"?
We have been using the term "Colloquial Expression" as if its meaning was taken for granted. However, it has concrete definition(s), as the following sites say. One thing should be noted at this point. A Colloquial Expression is the same as a Colloquialism.
1> According to http://www.answers.com/topic/colloquialism, "colloquialism, the use of informal expressions appropriate to everyday speech rather than to the formality of writing, and differing in pronunciation, vocabulary, or grammar."
This, as we can see, is a more-or-less generalized and holistic definition. It takes into account almost every aspect of what a colloquial expression is about.
2> According to http://disted.tamu.edu/classes/telecom98s/eva/terms.htm, "1. A colloquial expression; that is, an expression that is characteristic or appropriate to ordinary or familiar conversation rather than formal speech or writing. In standard American English, He hasn't got any is colloquial, whereas He has none is formal. 2. Colloquial style or usage. Note: Colloquialisms are often viewed upon with disapproval, as if they indicate "vulgar, bad, or incorrect" usage. However, they are merely part of a familiar style used in speaking rather than in writing."
An important point: Colloquial Expression should not be equated with "vulgar, bad or incorrect". Well, there might be vulgar, bad or incorrect terms interspersed within it, but that's definitely not the sole thing or the sole purpose.
3> According to http://en.wikipedia.org/wiki/Colloquial_expression, "A colloquialism is an expression not used in formal speech, writing or paralinguistics.
...
Colloquialisms denote a manner of speaking or writing that is characteristic of familiar "common" conversation; informal colloquialisms can include words (such as "y'all" or "gonna" or "wanna"), phrases (such as "ain't nothin'", "dressed for bear" and "dead as a doornail"), or sometimes even an entire aphorism ("There's more than one way to skin a cat")."
It gives examples of several colloquial expressions and goes on giving details about types of colloquialisms and their instances.
4> According to http://encyclopedia.farlex.com/Colloquial+expression, "Informal word or phrase appropriate to familiar, everyday conversation. Colloquialisms are more acceptable than slang in a wider social context."
A pithy and somewhat controversial definition. Pithy, because it summarizes colloquialism in the first sentence, and somewhat controversial, because it tries to give slang a separate status in the second sentence. But slang definitely falls within the realm of colloquialism, although it is more restricted and, in some contexts, less appropriate.
5> According to http://en.wiktionary.org/wiki/colloquial_expression, "A phrase that appears more often in spoken than in written language. Colloquial expressions are similar to slang, but tend to be more universal, whereas slang can often be limited to a particular social group;"
The second sentence tells us "Colloquial expressions are similar to slang". But as we know, they are not merely similar: slang is actually a special form of colloquialism. So instead of saying the above, we should rather say, "Colloquial expressions are a superset of slang".
Some really cool examples of colloquial expressions are here:
http://www.englishforums.com/search/Colloquial+expressions.htm
Another important thing. We have already discussed "expansion" and "contraction". We must remember, however, that contraction does not necessarily reduce the number of words in a phrase or clause, as illustrated here:
http://mbm.dotnet11.hostbasket.com/iis/testy/test13.asp
Here, all contractions have actually increased the number of words.
But these are formal definitions. To me, a colloquial expression is something that is not appropriate while writing or speaking officially, formally or courteously, but is fully appropriate while writing or speaking informally (as if with friends, close partners or colleagues). This includes slang as well.
Tuesday, August 19, 2008
Real Examples
1> Example not directly related to important chemistry facts
From http://cultureofchemistry.blogspot.com/2008/07/protecting-groups.html
"The boys had the tent next door to ours. I came back from dinner one night to find a very happy squirrel just making off with a chip container from the kids tent. At which point I remembered the dried fruit I'd left in my pack after the morning hike. Whew...it was still there. The rodents had been attracted to the far more tasty snack leavings next door. The boys tent is serving as (a chemist would say) a protecting group."
There are a lot of things to consider. First, look at the boldfaced expressions. These are either colloquial expressions, or incorrect word-forms probably written in haste, or a pithy clause like the last one.
"Boys" and "kids" should really be "boys'" and "kids'". "Whew" is an interjection expressing relief on the author's part. "A chemist would say" alludes to the fact that a protecting group behaves in nearly the same way as the boys' tent did with respect to the squirrels.
But these expressions probably do not bring out any important chemistry fact.
2> Example directly related to an important chemistry fact
Let's look at another para from
http://cultureofchemistry.blogspot.com/2008/06/weird-words-of-science-isotope.html
"Each atom of an element has a characteristic number of protons - positively charged particles - in their nucleus. An atom with five protons is boron. One with 82? Lead."
Here, "One with 82?" is most definitely a colloquial expression and is by no means a complete sentence. It is a question: "What is the atom that has 82 protons in it?" The answer, "Lead", is also not a full sentence, but it stands for the sentence "The atom that has 82 protons in it is lead." So these colloquial expressions are actually referring to important chemical details.
Wednesday, August 13, 2008
Examples of expansion and contraction
Simple expansions and contractions:
It's = It is/has (is/has ambiguity to be resolved by checking for the presence of participial forms after It's)
There's = There is/has (is/has ambiguity to be resolved by checking for the presence of participial forms after There's)
I'd = I would/had (would/had ambiguity to be resolved by checking for the presence of participial forms after I'd)
These are all syntactic expansions/contractions.
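The participle check described above can be sketched in a few lines of Python. This is only a minimal illustration, not a finished method: the `RULES` table, the tiny `IRREGULAR_PARTICIPLES` list and the function names are assumptions made up for this example, and the "ends in -ed" test is a crude stand-in for a real part-of-speech tagger. It will mis-expand passives such as "It's done" and adjectives such as "It's red", so the heuristic needs more than this in practice.

```python
# Hypothetical, deliberately tiny list of irregular past participles.
IRREGULAR_PARTICIPLES = {"been", "gone", "done", "seen", "taken",
                         "got", "had", "made", "known", "given", "left"}

# contraction -> (expansion when a participle follows, expansion otherwise)
RULES = {
    "it's": ("it has", "it is"),
    "there's": ("there has", "there is"),
    "i'd": ("I had", "I would"),
}


def looks_like_past_participle(word):
    """Crude guess: known irregular form, or a regular '-ed' ending."""
    w = word.lower()
    return w in IRREGULAR_PARTICIPLES or w.endswith("ed")


def expand(sentence):
    """Expand It's / There's / I'd by looking at the word that follows."""
    tokens = sentence.split()
    out = []
    for i, tok in enumerate(tokens):
        key = tok.lower()
        if key in RULES:
            # Word right after the contraction, without trailing punctuation.
            nxt = tokens[i + 1].strip(".,!?") if i + 1 < len(tokens) else ""
            with_participle, default = RULES[key]
            expansion = with_participle if looks_like_past_participle(nxt) else default
            if tok[0].isupper():
                # Keep the capitalisation of the original token.
                expansion = expansion[0].upper() + expansion[1:]
            out.append(expansion)
        else:
            out.append(tok)
    return " ".join(out)


print(expand("It's been a long day."))
print(expand("I'd like some water."))
```

Only the single word after the contraction is inspected, so an intervening adverb ("It's already been...") defeats the check; a tagger or parser would be the natural next step.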
More involved expansions and contractions:
1> Expansion Ambiguity
A: Tell me something about CO2.
B: Oh...it's the carbon thing again!
Here, A and B are having a conversation, and "carbon thing" is most certainly a colloquial expression. When interpreted (we can say "expanded semantically") with contextual help, it boils down to CO2, our familiar term in chemistry! But then again, "carbon thing" may actually imply something totally different. For example, if B is bored with A's ramblings, he'll probably say the same thing! Here, the ambiguity might be resolved in the following way. If it's "the carbon thing", then we can reasonably guess that B is bored, and he's referring to anything and everything related to carbon. On the other hand, if it were "that carbon thing", we could be fairly sure that B was actually talking about CO2.
There are other examples:
2> Placeholders and Figuratives
A: D'you think we are going to estimate the solubility of this mixture?
B: Well, I think we do.
Here also, what B says is basically "We are going to estimate the solubility of this mixture.", and "well" is merely a placeholder, or we can say, a figurative. So in expanding B's words, we have to discard "well", and then we have to interpret the rest of the sentence in conjunction with whatever A had uttered previously.
So there are 2 tasks:
1> Identify the placeholders and figuratives, and remove all of them.
2> Interpret the rest with contextual help. The context may either precede the text, or succeed it. [In the above two cases, context preceded text]
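The first of these two tasks can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `PLACEHOLDERS` set is a hypothetical hand-made lexicon (a real one would have to come from the corpus), and the function name is invented for this example. The second task, interpreting what remains with the help of context, is not attempted here.

```python
import re

# Hypothetical placeholder lexicon; purely illustrative.
PLACEHOLDERS = {"well", "oh", "hmmm", "hmm", "um", "uh"}

# A word followed by "...", a comma, an exclamation mark, or whitespace.
LEADING_WORD = re.compile(r"([A-Za-z']+)\s*(?:\.\.\.|[,!]|\s)\s*")


def strip_placeholders(utterance):
    """Remove leading placeholder words; return (cleaned_text, removed_words)."""
    removed = []
    rest = utterance.strip()
    while True:
        m = LEADING_WORD.match(rest)
        if not m or m.group(1).lower() not in PLACEHOLDERS:
            break
        removed.append(m.group(1))
        rest = rest[m.end():]
    if rest:
        # Re-capitalise the new first word.
        rest = rest[0].upper() + rest[1:]
    return rest, removed


print(strip_placeholders("Well, I think we do."))
print(strip_placeholders("Oh...it's the carbon thing again!"))
```

An utterance that consists of nothing but a placeholder (a bare "Hmmm.") is returned unchanged, since there is nothing left to interpret once it is removed and the word itself may carry the whole message.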
One example where text precedes context:
3> Context succeeds text
A: Hmmm, great job!
B: The isolation part?
B's question is in colloquial form. When expanded, it'll be "Are you talking about the isolation part?" A's colloquial remark "great job" tentatively refers to the isolation part (tentatively, because we know nothing else about the great job, and B's reply to the remark is merely another question!), and "hmmm" is nothing but a placeholder. But that may not be true everywhere. For example,
4> Placeholders, or something else?
A: CO2 reacts with water to yield H2CO3. Take it for granted.
B: Hmmm.
Here, "hmmm" is more than a placeholder. It actually means that B is reserving his judgement about A's remark. So, when expanded, it'll be something like "Yes, I take it for granted that CO2 reacts with water to yield H2CO3.", or "No, I do not take it for granted that CO2 reacts with water to yield H2CO3.", or even "It may be that CO2 reacts with water to yield H2CO3, of which I am not sure yet."
Like expansion, contraction is also an important and difficult issue. Not only should we be concerned about the context, but at the same time we should have at our disposal all possible colloquial forms that are appropriate. The most appropriate (and probably the pithiest) one will replace our formal, expansive jargon. For example,
5> A formal dialog
A: CO2 reacts with water to yield H2CO3. Take it for granted.
B: Yes, I take it for granted that CO2 reacts with water to yield H2CO3.
Converting this example to the previous one would require us to know all the colloquial expressions that COULD HAVE replaced the second sentence, and then to select "hmmm" (it might be "Yup!" as well) as the fittest among them. Here, some information about B's mental state may be extremely valuable, for, as we understand, colloquial forms vary considerably with a person's mental state. As an example, the following could be a perfect contraction:
6> Colloquial expressions vary with mental state
A: CO2 reacts with water to yield H2CO3. Take it for granted.
B: Bingo!
There are definitely fine distinctions between these two cases. When B says "hmmm", we understand that he is attentive and heeding. A "bingo!", on the other hand, tells us that B had been thoroughly involved and was overjoyed with this finding, and, importantly, that the finding was NEW to him!
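To make the idea of contraction concrete, here is a toy Python sketch of selecting a colloquial form given a mental state. Everything in it is an assumption for illustration: the state labels ("attentive", "overjoyed"), the lookup table and the function name are invented, and "shortest candidate wins" is only a stand-in for whatever notion of "fittest" we eventually settle on. The candidate replies themselves are the ones from the dialogs above.

```python
# Hypothetical table: mental state -> colloquial forms that can replace
# a formal acknowledgement. The state labels are invented for this sketch.
CONTRACTIONS_BY_STATE = {
    "attentive": ["Hmmm.", "Yup!"],
    "overjoyed": ["Bingo!"],
}


def contract(formal_reply, mental_state):
    """Pick the pithiest known colloquial form for the given mental state.

    Falls back to the formal reply when no candidate is known.
    """
    candidates = CONTRACTIONS_BY_STATE.get(mental_state, [])
    if not candidates:
        return formal_reply
    # "Pithiest" is approximated by the shortest string.
    return min(candidates, key=len)


formal = "Yes, I take it for granted that CO2 reacts with water to yield H2CO3."
print(contract(formal, "overjoyed"))
print(contract(formal, "puzzled"))
```

A real system would also have to check that the formal reply is the kind of sentence these forms can stand in for, and it would need some way of estimating the speaker's mental state in the first place.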
Finally, it must be admitted that colloquial forms are not very easy to obtain in general (digitized) literature. They are obtained either from chats or blogs (like this one... [:-D]), or from memos chemists use to exchange ideas and data. These sources will form our corpora.
More about corpora in the next post.
Tuesday, August 12, 2008
1
This week: I'll gather information about informal communication, colloquialisms and their impact on processing Chemistry data. I'll also explore the multilingual issues that seem to be pertinent in this regard.
Next week: This week first, then next week.
Issues: Forms of colloquialism, variations in colloquialism across several languages, methods of extraction and expansion, possible applications. A very important thing is CONTRACTION, i.e., generation of colloquial forms from expanded (and maybe somewhat formal and long-winded) ideas and discussions. So basically we need to explore BOTH directions - how to expand and then how to contract (appropriately).
Achievements:
http://en.wikipedia.org/wiki/Wiki
http://www.cs.umass.edu/~culotta/pubs/culotta04dependency.pdf
Studying kernels, dependency trees, etc. Need to learn about methods...