Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.
Terminology
Corpus
A corpus is a large collection of texts: a body of written or spoken material upon which a linguistic analysis is based. The plural form of corpus is corpora. Some popular corpora are the British National Corpus (BNC), the COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus. Monolingual corpora represent only one language, while bilingual corpora represent two.
A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving all the occurrences of a particular word or structure for inspection, or randomly selected samples. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
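As a quick illustration, NLTK (introduced below) ships with several classic corpora behind a uniform reader interface. A minimal sketch, assuming the Brown corpus has been downloaded via nltk.download:

```python
import nltk
nltk.download('brown')  # fetch the Brown corpus if not already present

from nltk.corpus import brown

print(brown.words()[:10])      # first ten word tokens of the corpus
print(len(brown.words()))      # total number of word tokens
print(brown.categories()[:5])  # the Brown corpus is organised by genre
```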
Tokens
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
What is a token? There is a technical definition in NLP, but we can think about them as data that represent meaningful units of text:
- Words
- Phrases
- Punctuation
- Numbers
- Bi-grams
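For instance, here is a minimal tokenization sketch using NLTK's word_tokenize and sent_tokenize (it assumes the punkt tokenizer models have been downloaded):

```python
import nltk
nltk.download('punkt')  # sentence/word tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLTK is a leading platform. It works with human language data!"

print(sent_tokenize(text))
# ['NLTK is a leading platform.', 'It works with human language data!']
print(word_tokenize(text))
# ['NLTK', 'is', 'a', 'leading', 'platform', '.', 'It', 'works', ...]
```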
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form (generally a written word form).
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
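A minimal sketch using NLTK's PorterStemmer shows this behaviour (note that exact outputs depend on the stemming algorithm used):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms collapse to a common stem,
# e.g. 'cats' -> 'cat', 'fishing' -> 'fish', 'arguing' -> 'argu'
for word in ["cats", "fishing", "fished", "argue", "argued", "arguing", "argument"]:
    print(word, "->", stemmer.stem(word))
```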
Stop Words
Sometimes, extremely common words that appear to be of little value in retrieving useful information about documents are excluded from the vocabulary entirely. These words are called stop words. Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.
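A minimal stop-word filtering sketch with NLTK's built-in English list (it assumes the stopwords corpus has been downloaded):

```python
import nltk
nltk.download('stopwords')  # NLTK's built-in stop-word lists
nltk.download('punkt')      # tokenizer models

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

text = "This is an example showing how stop words are filtered out of a sentence."
tokens = word_tokenize(text)

# Keep only tokens that do not appear in the stop-word list
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
# e.g. ['example', 'showing', 'stop', 'words', 'filtered', 'sentence', '.']
```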
NLTK
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike.
Language processing task | NLTK modules | Functionality
---|---|---
Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons
String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers
Collocation discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information
Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT
Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking | nltk.chunk | Regular expression, n-gram, named entity
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking
Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients
Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions
Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format
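To give a feel for one of these modules, here is a minimal part-of-speech tagging sketch using nltk.tag through the nltk.pos_tag convenience function (it assumes the averaged_perceptron_tagger model has been downloaded):

```python
import nltk
nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # default English POS-tagging model

from nltk.tokenize import word_tokenize

tokens = word_tokenize("NLTK makes part-of-speech tagging straightforward.")
print(nltk.pos_tag(tokens))
# e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]
```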
NLTK was designed with four primary goals in mind:
- Simplicity: to provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data
- Consistency: to provide a uniform framework with consistent interfaces and data structures, and easily guessable method names
- Extensibility: to provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task
- Modularity: to provide components that can be used independently without needing to understand the rest of the toolkit
Let's see how to work with the NLTK libraries. NLTK is available with the default installation of Anaconda. We import the nltk package, download any required resources, and read a text file into a variable. Let us explore the tokenizer, stop-word, and stemming features together.
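A minimal end-to-end sketch combining these three steps (the file name sample.txt is a placeholder for your own text file):

```python
import nltk
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop-word lists

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Read the text file into a variable ('sample.txt' is a placeholder path)
with open('sample.txt', encoding='utf-8') as f:
    text = f.read()

# Tokenize, drop stop words and non-alphabetic tokens, then stem
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]

print(stems[:20])  # first twenty stemmed tokens
```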