ResearchAndYou

Computational Linguistics

What, Why & How?



Computational Linguistics (CL) is the study of language using computers and language-using computers. The word "language" here refers to human or natural languages, as opposed to artificial or programming languages. CL was born in the 1950s, when computer scientists first attempted to translate text between languages automatically. Today, it is one of the major interdisciplinary subfields of computer science, of great importance to both industry and society.

One of the primary goals of CL is to design computers that can comprehend and converse in human languages in a manner similar to native speakers. If this goal is realized, we will be able to build machines that pass the Turing Test, which gives a sense of how hard the task is. Indeed, writing programs that understand or converse in natural languages is very challenging for several reasons. First, natural languages are inherently ambiguous. When one says "I shot an elephant in my pajamas", was it the elephant who was wearing the pajamas, or the person who shot it? Or was the elephant sitting on the speaker's pajamas when it was shot? Ambiguities also arise from multiple meanings associated with the same word. For instance, the sentence "I made her duck" can be interpreted in several different ways depending on the meanings of "made" (cooked, repaired or converted) and "duck" (a waterfowl, or to lower the head). Apart from linguistic rules, resolving such ambiguities requires world knowledge, common sense and knowledge of the context. This is the second major reason that makes CL hard: representing and gathering world knowledge and common sense in a computer are challenging problems, let alone reasoning with them.
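
The structural side of this ambiguity can be made concrete with a small sketch. The grammar below is a hypothetical toy grammar written only for this one sentence, and `count_parses` is an illustrative name; it uses the standard CKY chart algorithm to count how many distinct parse trees the grammar assigns. It captures only the two attachment readings (the prepositional phrase modifies either the elephant or the shooting); the "sitting on the pajamas" reading depends on the meaning of "in" and is beyond a purely syntactic toy like this.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption made for this sketch).
LEXICON = {
    "I": ["NP"], "shot": ["V"], "an": ["Det"], "elephant": ["N"],
    "in": ["P"], "my": ["Det"], "pajamas": ["N"],
}
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("PP", "P", "NP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "an elephant [in my pajamas]"
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "[shot an elephant] [in my pajamas]"
]

def count_parses(words, root="S"):
    """Count the parse trees rooted in `root` that cover the whole sentence."""
    n = len(words)
    # chart[(i, j)][X] = number of trees with label X spanning words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1)][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # every split point of the span
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]

sentence = "I shot an elephant in my pajamas".split()
print(count_parses(sentence))  # 2: one tree per attachment of "in my pajamas"
```

A realistic grammar with thousands of rules typically licenses far more analyses per sentence, which is why disambiguation needs information beyond the grammar itself.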

On the other hand, if computers possessed certain basic language skills, they could make our lives much simpler and better. For instance, we could issue commands to machines, or program them, just by speaking our intentions in our native tongues. If there were machines that could translate a document written in English into Hindi, then we could translate all the English content on the web into Hindi, so that a Hindi speaker could have access to potentially all the information present on the web. This is especially useful because the amount of information available in English on the web is far greater than that in any other human language, whereas only a small fraction of the world's population can read and understand English. Machine translation systems can thus potentially bridge this access-to-information gap between English and non-English speakers. Even simple tools such as spellcheckers, grammar checkers, summarization systems and text readers are of great help to us. Some of these technologies have been available for quite some time, whereas others are still in the making.

Natural Language Processing (NLP), which is studied under Artificial Intelligence, is closely related to CL. However, while the aim of NLP is to build systems that can process (analyze or generate) human languages, CL has a broader scope covering all aspects of human languages studied with the help of computers or computational models. In other words, NLP is more about applications, engineering and system building, while CL includes theoretical and cross-disciplinary studies as well. Nonetheless, these terms are often used interchangeably.

Rule-based vs. Statistical NLP

In earlier days, NLP applications were primarily rule-based. For instance, to build a machine translation system, linguists had to write rules for the translation of different sentence constructions. This approach, though sound in principle, requires thousands of rules to be designed meticulously to ensure coverage of every possible form of the language. Rule-based systems are also not portable across languages and scale poorly.
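
To give a feel for what such hand-written rules look like, here is a deliberately tiny sketch. The English-to-French pair, the four-word lexicon and the `translate` function are all assumptions chosen for illustration; a real rule-based system would need many more rules, for instance for gender agreement and article choice, which this toy hard-codes.

```python
# Hand-written bilingual lexicon: English word -> (French word, part of speech).
# Hypothetical and tiny on purpose; "the" is hard-coded as feminine "la".
LEXICON = {
    "the": ("la", "DET"),
    "red": ("rouge", "ADJ"),
    "car": ("voiture", "NOUN"),
    "door": ("porte", "NOUN"),
}

def translate(sentence):
    """Translate word by word, then apply one reordering rule."""
    # A word missing from the lexicon raises KeyError: the coverage problem.
    entries = [LEXICON[word] for word in sentence.lower().split()]
    # Rule: English ADJ NOUN becomes French NOUN ADJ.
    i = 0
    while i < len(entries) - 1:
        if entries[i][1] == "ADJ" and entries[i + 1][1] == "NOUN":
            entries[i], entries[i + 1] = entries[i + 1], entries[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in entries)

print(translate("the red car"))  # la voiture rouge
```

Every new word, construction or language pair means more entries and more rules written by hand, which is the scalability problem described above.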

In the 1990s, a new approach to NLP was introduced, based on machine learning techniques. This approach is also known as data-driven or statistical NLP, and is by far the most popular one to date. In this approach, the researcher designs generic algorithms for solving certain NLP tasks. The algorithms take a large amount of linguistic data as input and produce systems that can solve the task. The accuracy of the resulting system typically depends on the quality and quantity of the data. For instance, to build a machine translation (MT) system between languages A and B, the researcher might design a statistical MT algorithm that takes as input millions of sentence pairs in A and B that are translations of each other, and then automatically produces a system that can translate between A and B. Such systems are scalable and portable across languages. Thus, to develop an MT system between C and D, it is sufficient to rerun the generic algorithm on a new set of sentence pairs in C and D that are translations of each other.
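
As a hedged illustration of the data-driven idea, the sketch below trains IBM Model 1, one classic word-translation model, with the EM algorithm on three sentence pairs. The text above does not name a specific algorithm, so this choice is an assumption, as are the function names `train_ibm_model1` and `best_translation`, the toy English-German corpus, and the simplification of leaving out the NULL source word.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=20):
    """Estimate word-translation probabilities t(target | source) with EM."""
    source_vocab = {s for src, _ in pairs for s in src}
    target_vocab = {t for _, tgt in pairs for t in tgt}
    # Start uniform: every target word is equally likely for every source word.
    t_prob = {(t, s): 1.0 / len(target_vocab)
              for s in source_vocab for t in target_vocab}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of s being translated as t
        total = defaultdict(float)  # expected count of s being translated at all
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    fraction = t_prob[(t, s)] / norm
                    count[(t, s)] += fraction
                    total[s] += fraction
        # M-step: re-estimate the probabilities from the expected counts.
        t_prob = {(t, s): count[(t, s)] / total[s] for (t, s) in t_prob}
    return t_prob

def best_translation(word, t_prob):
    """Return the most probable target word for a known source word."""
    candidates = {t: p for (t, s), p in t_prob.items() if s == word}
    return max(candidates, key=candidates.get)

corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
t_prob = train_ibm_model1(corpus)
print([best_translation(word, t_prob) for word in "a house".split()])
```

Nothing in `train_ibm_model1` is specific to English or German. Pointing the same function at sentence pairs in languages C and D yields a word-translation table for that pair, which is the portability the paragraph describes. Note that the pair "a house" never appears in the corpus; the model pieces it together from what it learned about the individual words.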

Statistical NLP systems are based on the principles of machine learning. Depending on the kind of input data they require, machine learning algorithms can be broadly classified as supervised, semi-supervised or unsupervised. A supervised algorithm learns from training examples (e.g., sentences) together with their correct labels (e.g., the translations), whereas an unsupervised algorithm learns from training examples without labels. Semi-supervised algorithms use both kinds of data.
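
The difference in input data can be sketched as follows. Both learners are simple stand-ins chosen for illustration rather than methods named in the text: the supervised one learns each word's most frequent part-of-speech tag from labeled pairs, while the unsupervised one only counts adjacent word pairs in raw text. The tag names and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict

def train_supervised_tagger(tagged_sentences):
    """Learn each word's most frequent tag from (word, tag) training pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def count_bigrams(raw_sentences):
    """Learn which words follow which, using unlabeled text only."""
    bigrams = Counter()
    for sentence in raw_sentences:
        bigrams.update(zip(sentence, sentence[1:]))
    return bigrams

# Labeled data: every word comes with a hand-assigned tag.
labeled = [
    [("the", "DET"), ("duck", "NOUN"), ("swims", "VERB")],
    [("the", "DET"), ("duck", "NOUN"), ("sleeps", "VERB")],
    [("they", "PRON"), ("duck", "VERB")],
]
# Unlabeled data: the same sentences with the tags thrown away.
unlabeled = [[word for word, _ in sentence] for sentence in labeled]

tagger = train_supervised_tagger(labeled)
print(tagger["duck"])                             # NOUN (2 NOUN vs. 1 VERB)
print(count_bigrams(unlabeled)[("the", "duck")])  # 2
```

A semi-supervised method would combine the two, for example by using statistics gathered from a large amount of raw text to help a tagger trained on only a few labeled sentences.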

Developing high-quality statistical NLP systems requires not only smart machine learning algorithms but also huge amounts of data. Since labeled examples are harder and costlier to procure, unsupervised and semi-supervised techniques are becoming more popular. Another line of research looks at leveraging crowd-sourcing and online games to procure large numbers of labeled examples at much lower cost and in less time.


Problems in CL

Today CL has grown into an independent area with various sub-fields and problems of interest. Broadly, these problems can be classified either by the end-user application they serve or by the linguistic problem they solve. The latter division arose because, in order to solve a bigger problem, it often helps to subdivide it into smaller ones. For instance, to translate a sentence, it might be useful to look first at individual words and their suffixes and prefixes to understand their meaning and function, then at phrases, and finally at the structure of the whole sentence.
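
The first of those smaller steps, looking at a word's prefixes and suffixes, can be sketched naively as below. The affix lists, the minimum stem length and the function name `split_affixes` are assumptions for illustration; real morphological analyzers rely on much richer lexicons and rules.

```python
# Hypothetical, deliberately short affix lists for English.
PREFIXES = ("un", "re")
SUFFIXES = ("ing", "ed", "s")
MIN_STEM = 3  # refuse to strip an affix if fewer letters than this remain

def split_affixes(word):
    """Split a word into (prefix, stem, suffix) by naive affix stripping."""
    prefix = suffix = ""
    stem = word.lower()
    for candidate in PREFIXES:
        if stem.startswith(candidate) and len(stem) - len(candidate) >= MIN_STEM:
            prefix, stem = candidate, stem[len(candidate):]
            break
    for candidate in SUFFIXES:
        if stem.endswith(candidate) and len(stem) - len(candidate) >= MIN_STEM:
            suffix, stem = candidate, stem[:-len(candidate)]
            break
    return prefix, stem, suffix

print(split_affixes("unlocked"))   # ('un', 'lock', 'ed')
print(split_affixes("replaying"))  # ('re', 'play', 'ing')
print(split_affixes("under"))      # ('un', 'der', ''), a wrong analysis
```

The last line shows why this sub-problem deserves study in its own right: pure string matching cannot tell a real prefix from a coincidental one.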

Here is a partial list of the applications and problems that are of interest in recent times.
  • End-user Application driven classification
    • Machine translation
    • Transliteration
    • Question Answering
    • Summarization
    • Opinion Mining and Sentiment classification
    • Speech Recognition and Generation
  • Task driven classification
    • Computational Phonology (how to pronounce a word properly)
    • Computational Morphology (understanding the parts of a word)
    • Word Segmentation
    • Part-of-speech tagging
    • Entity recognition
    • Chunking
    • Parsing
    • Language modeling
    • Language generation
    • Word sense disambiguation
    • Semantic role labeling
    • Paraphrasing and entailment
    • Anaphora resolution
    • Creation of lexical resources
    • Corpus annotation
    • Resource Creation - Standardization, data collection through crowd-sourcing and games
Some other interesting applications of CL include:
  • Language teaching
  • Detection and correction of errors in language use, both for general users and for second-language learners
  • NLP for rehabilitation engineering
  • NLP for mining biological texts
  • Reconstruction of ancient languages and characterization of relatedness between language pairs
  • Computational models of language change and evolution
  • Computational human language learning and psycholinguistics
  • Analysis of noisy text such as SMS, chat or email data
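
As one small illustration of the error detection and correction item above, the sketch below suggests the vocabulary word closest to a misspelling under Levenshtein edit distance. The three-word vocabulary and the function names are assumptions; practical spellcheckers also take word frequency and context into account.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def suggest(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to `word`."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))

vocabulary = ["language", "luggage", "linguist"]
print(edit_distance("kitten", "sitting"))  # 3
print(suggest("langauge", vocabulary))     # language
```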

Bibliography and Resources

Text Books:
  • Speech and Language Processing by D. Jurafsky & J. H. Martin, Pearson Education Asia, 2002 (2000)
  • Foundations of Statistical Natural Language Processing by C. D. Manning & H. Schütze, MIT Press, 1999
  • Natural Language Understanding by J. Allen, Pearson Education, 2003 (1995)

Journals and Conferences

You can look out for recent developments or publish your latest results in the following journals:
  • Computational Linguistics, MIT Press
  • ACM Trans. on Speech and Language Processing
  • ACM Trans. on Asian Language Information Processing
  • Journal of Language Resources and Evaluation
  • Computer Speech and Language, Elsevier
  • Journal of Machine Translation

The Association for Computational Linguistics (ACL) and its special interest groups (SIGs) organize several conferences and area-specific workshops, which are the best places to follow the latest developments in the field. Some of the premier international conferences in CL are:
  • Annual Meeting of the ACL (referred to as ACL)
  • COLING
  • North American ACL (NAACL HLT), European ACL (EACL), IJCNLP

Most of these conferences also have special tracks for presenting student papers. All ACL conference and workshop papers are available online from the ACL Anthology.
In India, the International Conference on NLP (ICON) is an annual event that attracts papers from within the country as well as from abroad.

Related Areas

CL is closely related to Speech Processing, Information Retrieval and Machine Learning. Speech Processing deals with analysis, recognition and synthesis of human speech, and applications of such systems. Information retrieval deals with searching for the right information from a huge set of documents, such as the Web. Machine learning techniques are heavily 