Natural Language Processing with Python | NLTK


Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

Terminology
Corpus
A corpus is a large collection of texts: a body of written or spoken material upon which a linguistic analysis is based. The plural form of corpus is corpora. Some popular corpora are the British National Corpus (BNC), the COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus. Monolingual corpora represent only one language, while bilingual corpora represent two languages.
A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving all the occurrences of a particular word or structure for inspection, or randomly selected samples. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
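NLTK bundles a number of corpora behind a uniform interface. As a minimal sketch (assuming nltk is installed and using the bundled Gutenberg corpus), a corpus can be loaded and queried like this:

    import nltk
    nltk.download('gutenberg')                   # fetch the corpus if not already present
    from nltk.corpus import gutenberg

    print(gutenberg.fileids())                   # the texts that make up the corpus
    words = gutenberg.words('austen-emma.txt')   # one text as a list of word tokens
    print(len(words))                            # total number of tokens in that text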
Tokens
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
What is a token? There is a technical definition in NLP, but we can think of tokens as data that represent meaningful units of text (a short example follows this list):
  • Words
  • Phrases
  • Punctuation
  • Numbers
  • Bi-grams
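As a minimal sketch of tokenization in NLTK (assuming the 'punkt' tokenizer models have been downloaded), both sentence and word tokenizers are available:

    import nltk
    nltk.download('punkt')                       # models used by the tokenizers below
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "NLTK is a leading platform. It works with human language data!"
    print(sent_tokenize(text))                   # splits the text into sentences
    print(word_tokenize(text))                   # splits the text into words and punctuation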
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
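The Porter stemmer bundled with NLTK illustrates this behaviour; a minimal sketch run over some of the words above:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["cats", "stemming", "stemmed", "fishing", "fished", "argue", "argued"]:
        print(word, "->", stemmer.stem(word))
    # cats -> cat, stemming -> stem, fishing -> fish, argued -> argu, ...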

Stop Words
Sometimes, some extremely common words that appear to be of little value in extracting useful information about documents are excluded from the vocabulary entirely. These words are called stop words. Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.
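NLTK ships one such list per language in its stopwords corpus; a minimal sketch of filtering with it (assuming the corpus has been downloaded):

    import nltk
    nltk.download('stopwords')                   # fetch the stop word lists if needed
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    tokens = ["this", "is", "a", "sample", "sentence", "about", "corpora"]
    print([t for t in tokens if t not in stop_words])   # ['sample', 'sentence', 'corpora']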
NLTK
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike.


Language processing task | NLTK modules | Functionality
Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons
String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers
Collocation discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information
Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT
Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking | nltk.chunk | Regular expression, n-gram, named entity
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking
Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients
Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions
Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format

NLTK was designed with four primary goals in mind:
Simplicity: To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data
Consistency: To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names
Extensibility: To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task
Modularity: To provide components that can be used independently without needing to understand the rest of the toolkit

Let's see how to work with the NLTK libraries. NLTK is available with the default installation of Anaconda. We import the nltk package and, if required, download the necessary resources; here we read a text file into a variable.
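A minimal sketch of that setup (the file name sample.txt is only a placeholder for your own text file):

    import nltk
    # nltk.download()                  # opens the downloader to fetch resources interactively
    nltk.download('punkt')             # or download specific resources directly
    nltk.download('stopwords')

    with open('sample.txt', encoding='utf-8') as f:   # hypothetical input file
        text = f.read()                               # read the whole file into a variable
    print(text[:100])                                 # peek at the first 100 characters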


Let us try to explore the tokenizer, stop word, and stemming features.
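A sketch tying the three together, continuing from the text variable read above:

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    tokens = word_tokenize(text)                      # raw text -> tokens
    stop_words = set(stopwords.words('english'))
    filtered = [t for t in tokens
                if t.isalpha() and t.lower() not in stop_words]   # drop punctuation and stop words
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in filtered]       # reduce each remaining word to its stem
    print(stems[:20])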
