Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
- Tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- Counting the occurrences of tokens in each document.
- Normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
This post covers the following representations:
- Bag of words
- Tf–idf
- Binary representation
The Bag of Words representation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.
This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
CountVectorizer implements both tokenization and occurrence counting in a single class.
In general (not only with bag of words) there are two steps to produce a vector representation for each document:
- Learn the vocabulary of the corpus: this is done using the fit method.
- Use that vocabulary to produce the vector representation for each document: this is done using the transform method.
Scikit-learn also provides the fit_transform method to perform both steps at once.
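Here is a minimal sketch of these two steps with CountVectorizer. The corpus is an illustrative one (the post's own example corpus is not shown, so this small four-document collection is assumed here):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus (assumed for this sketch).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()

# Step 1: learn the vocabulary of the corpus.
vectorizer.fit(corpus)

# Step 2: use that vocabulary to build one count vector per document.
X = vectorizer.transform(corpus)

# Or perform both steps at once:
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (4, 9): 4 documents, 9 distinct tokens
```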
Sparsity
As most documents will typically use only a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary on the order of 100,000 unique words in total, while each document will use 100 to 1,000 unique words individually. In order to be able to store such a matrix in memory and to speed up algebraic matrix/vector operations, implementations typically use a sparse representation such as those available in the scipy.sparse package.
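Continuing the sketch above, the matrix returned by fit_transform is a scipy.sparse matrix, and its density can be inspected directly:

```python
# X is a scipy.sparse matrix (CSR format): only non-zero counts are stored.
print(type(X))

# Fraction of non-zero entries; for large corpora this is typically well below 1%.
print(X.nnz / (X.shape[0] * X.shape[1]))

# A dense view is only practical for tiny corpora like this one.
print(X.toarray())
```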
The mapping from feature name (token) to column index is stored in the vocabulary_ attribute of the vectorizer. Once fitted, the vectorizer has been "trained" on the vocabulary of the corpus; hence, words that were not seen in the training corpus will be completely ignored in future calls to the transform method.
Note that documents 1 and 4 have the same feature representation, but we lose the information that the last document is in interrogative form (illustrated in the sketch below).
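A short sketch of both points, still using the illustrative corpus and vectorizer assumed above:

```python
# vocabulary_ maps each token to its column index in X.
print(vectorizer.vocabulary_)
# e.g. {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, ...}

# Tokens that were not seen during fit are simply ignored at transform time:
print(vectorizer.transform(["something completely unseen"]).toarray())
# [[0 0 0 0 0 0 0 0 0]]

# Documents 1 and 4 contain the same words, so bag of words cannot
# distinguish the statement from the question:
import numpy as np
print(np.array_equal(X[0].toarray(), X[3].toarray()))  # True
```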
Tf–idf term weighting
In a large text corpus, some words will be very frequent (e.g. “the”, “a”, “is” in English) while carrying very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms. In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to apply the tf–idf transform.
tf–idf means term-frequency times inverse document-frequency.
Tf–idf is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps adjust for the fact that some words appear more frequently in general.
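A minimal sketch with TfidfVectorizer, which combines CountVectorizer with the tf–idf transform (reusing the illustrative corpus assumed above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# With scikit-learn's defaults (smooth_idf=True, norm='l2'):
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
# where n is the number of documents and df(t) the number of documents
# containing term t; each row vector is then L2-normalized.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# A word present in every document, like "the", gets the lowest idf weight.
print(dict(zip(tfidf.get_feature_names_out(), tfidf.idf_)))
```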
Binary representation
While the tf–idf normalization is often very useful, there are cases where binary occurrence markers offer better features. This can be achieved by using the binary parameter of CountVectorizer.
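A minimal sketch, again with the illustrative corpus assumed above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True caps every count at 1: a token either occurs in a document or not.
binary_vectorizer = CountVectorizer(binary=True)
X_bin = binary_vectorizer.fit_transform(corpus)

# "second" appears twice in document 2, but its marker is still 1.
print(X_bin.toarray()[1])  # [0 1 0 1 0 1 1 0 1]
```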