Comprehensive Guide on Text Vectorization
Colab Notebook
You can run all the code snippets throughout this guide in my Colab Notebook
What is text vectorization?
Text vectorization involves transforming non-numeric data such as text into numerical vectors. There are two reasons for doing so:
machine learning models require numerical input and therefore we need to transform non-numeric data (e.g. text and categories) into vectors first.
we can visualize words (refer to the section on Visualizing words in our Comprehensive Guide on PCA).
There are many ways to vectorize text:
One-hot encoding
Dummy encoding
Bag-of-words
TF-IDF (Term Frequency-Inverse Document Frequency)
Word embedding
In this guide, we will go through the theory as well as the Python implementation of these techniques. As always, feel free to hop onto our Discord to ask questions or leave feedback - much appreciated!
One-hot encoding
One-hot encoding is perhaps the easiest way to vectorize text. Consider the following quick example:
You say goodbye and I say hello
There are 6 unique words in our tiny corpus. We assign a unique incremental ID to each token:
ID | Token |
---|---|
1 | You |
2 | say |
3 | goodbye |
4 | and |
5 | I |
6 | hello |
Here, the vocabulary size is 6, that is, there are 6 unique terms in our corpus. This allows us to form a one-hot vector representation of each word that appears in the corpus. For instance, the one-hot vector for the word You is:

[1, 0, 0, 0, 0, 0]

The one-hot vector for the word goodbye is:

[0, 0, 1, 0, 0, 0]
The approach of one-hot vectorization is naive in the sense that the semantics of the input are completely discarded, that is, the "meaning" of the tokens is not encoded. This is because, mathematically, each unique token is simply assigned its own dimension, and all one-hot vectors are mutually orthogonal - no pair of tokens is any "closer" than any other pair.
The application of one-hot vectors is not unique to text - categorical data (e.g. economy class, business class, first class) are often converted to one-hot vectors as well. Moreover, since one-hot encoding creates a new dimension for every unique token, the dimensionality of the one-hot vectors becomes extremely large for a large corpus - thereby resulting in the curse of dimensionality.
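As a concrete illustration, here is a minimal sketch (my own, not from the original guide) that builds the one-hot vectors above by hand; the token-to-ID mapping follows the table:

import numpy as np

corpus = "You say goodbye and I say hello"
tokens = corpus.split()

# Assign an incremental index to each unique token, in order of first appearance
token_to_id = {}
for token in tokens:
    if token not in token_to_id:
        token_to_id[token] = len(token_to_id)

vocab_size = len(token_to_id)  # 6 unique tokens

def one_hot(token):
    # Vector of zeros with a single 1 at the token's index
    vector = np.zeros(vocab_size, dtype=int)
    vector[token_to_id[token]] = 1
    return vector

print(one_hot("You"))      # [1 0 0 0 0 0]
print(one_hot("goodbye"))  # [0 0 1 0 0 0]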
Implementing one-hot vectors using Python
We can easily perform one-hot encoding using the get_dummies(~) method in Python's Pandas library. Please refer to our in-depth documentation about this method here.
As an example, consider the data:
import pandas as pd
df = pd.DataFrame({'name': ['alex','bob','cathy','doge'],
                   'nationality': ['korean','canadian','french','canadian']})
df

    name nationality
0   alex      korean
1    bob    canadian
2  cathy      french
3   doge    canadian
To encode the nationality feature as one-hot vectors:
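The call that produces this output is not shown above; it is presumably something along these lines (recent Pandas versions display the dummy columns as True/False rather than 0/1):

pd.get_dummies(df, columns=['nationality'])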
    name  nationality_canadian  nationality_french  nationality_korean
0   alex                     0                   0                   1
1    bob                     1                   0                   0
2  cathy                     0                   1                   0
3   doge                     1                   0                   0
Here, notice how we now have 3 new columns - one for each category.
Dummy encoding
Dummy encoding is just like one-hot encoding except we use one less column to perform the encoding.
Consider the same dataset again:
import pandas as pd
df = pd.DataFrame({'name': ['alex','bob','cathy','doge'],
                   'nationality': ['korean','canadian','french','canadian']})
df

    name nationality
0   alex      korean
1    bob    canadian
2  cathy      french
3   doge    canadian
To perform dummy encoding, use Pandas' get_dummies(~) again but with the argument drop_first=True:
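Again, the exact snippet is not shown above; a call along these lines reproduces the output below:

pd.get_dummies(df, columns=['nationality'], drop_first=True)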
    name  nationality_french  nationality_korean
0   alex                   0                   1
1    bob                   0                   0
2  cathy                   1                   0
3   doge                   0                   0
Notice how we only have 2 columns instead of 3 as in the one-hot encoding case. Bob's nationality columns are both 0, which means that he is neither French nor Korean - he is Canadian by default. We say that Canadian is the reference category. The key here is that even with one less column, we are not losing any information at all!
The advantage of dummy encoding is that we can reduce the number of features by one without information loss!
Bag-of-words
Bag-of-words, or BoW, is another numerical representation of textual data that captures the number of times a word occurs within a document.
Just like with one-hot vectors, each unique token is represented by a position within the vector. Consider the following corpus, which now consists of two documents:
You say goodbye and I say hello
hello world
Once again, here is the table of unique words and their ID:
ID | Token |
---|---|
0 | You |
1 | say |
2 | goodbye |
3 | and |
4 | I |
5 | hello |
6 | world |
Since the vocabulary size is 7, each document will be represented by a vector of size 7.
The bag-of-words representation of each document is as follows:
You say goodbye and I say hello: [1, 2, 1, 1, 1, 1, 0]
hello world: [0, 0, 0, 0, 0, 1, 1]
Here, we have 2 for the second element of the first data item because the word say occurs twice in this data item.
The term "bag" in "bag-of-words" means that the order of words are discarded. For instance, consider the following two text:
hello world
world hello
The two vector representations would be equivalent, that is, the information about the ordering of the words is not encoded:

hello world: [0, 0, 0, 0, 0, 1, 1]
world hello: [0, 0, 0, 0, 0, 1, 1]
Implementing bag-of-words using Python's Scikit-learn
In Python, we can obtain the bag-of-words representation by using scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
text = ["You say goodbye and I say hello", "hello world"]vectorizer = CountVectorizer()vectorizer.fit(text)vector = vectorizer.transform(text)
The vocabulary is represented by a simple map where the key is the token, and the value is the index of the token in the vector:
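The printing step is omitted above; the fitted vocabulary lives in the vocabulary_ attribute:

vectorizer.vocabulary_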
{'you': 5, 'say': 3, 'goodbye': 1, 'and': 0, 'hello': 2, 'world': 4}
Since we have two data items, and the vocabulary table above lists 7 tokens, you might expect the transformed matrix to have shape (2, 7). Let's check:
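The original snippet is omitted; inspecting the shape attribute reproduces the output below:

vector.shape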
(2, 6)
The reason why the second dimension is 6 instead of 7 is that CountVectorizer, by default, only treats words containing at least 2 alphanumeric characters as tokens. Therefore, the word "I" is not counted as a token.
We can view the actual bag-of-words representation like so:
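Presumably by converting the sparse matrix to a dense array with toarray():

print(vector.toarray())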
[[1 1 1 2 0 1]
 [0 0 1 0 1 0]]
Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF, or term frequency-inverse document frequency, is a statistical measure of how relevant a word is with respect to a document in a collection of documents. As the name suggests, the TF-IDF of a word consists of two components:
Term frequency (TF) - how frequently the word occurs in the document, relative to the document's length
Inverse document frequency (IDF) - a weight that makes rare words more prominent and downplays commonly occurring terms (e.g. stop words such as "a" and "is")
Mathematically, TF-IDF is the product of TF and IDF:

$$\mathrm{tfidf}(t,d,D)=\mathrm{tf}(t,d)\times\mathrm{idf}(t,D)$$

Where $\mathrm{tf}(t,d)$ computes the term frequency:

$$\mathrm{tf}(t,d)=\frac{\text{number of times term } t \text{ occurs in document } d}{\text{total number of terms in document } d}$$

And $\mathrm{idf}(t,D)$ computes the inverse document frequency:

$$\mathrm{idf}(t,D)=\log\frac{N}{|\{d\in D:\,t\in d\}|}$$

where $N$ is the total number of documents in the collection $D$ and the denominator is the number of documents containing the term $t$.
Let's now go through a simple example together to intuitively understand TF-IDF at a deeper level.
Simple example of computing TF-IDF
Consider the following collection of documents $D$:
1: "I say hello and you say goodbye"2: "hello world hello"3: "hello goodbye"
I'll show the calculation of TF, IDF and TF-IDF using the table below:
Term | TF (Doc 1) | TF (Doc 2) | TF (Doc 3) | IDF | TF-IDF (Doc 1) | TF-IDF (Doc 2) | TF-IDF (Doc 3) |
---|---|---|---|---|---|---|---|
I | 1/7 | 0 | 0 | log(3/1)=0.477 | 0.068 | 0 | 0 |
say | 2/7 | 0 | 0 | log(3/1)=0.477 | 0.136 | 0 | 0 |
hello | 1/7 | 2/3 | 1/2 | log(3/3)=0 | 0 | 0 | 0 |
and | 1/7 | 0 | 0 | log(3/1)=0.477 | 0.068 | 0 | 0 |
you | 1/7 | 0 | 0 | log(3/1)=0.477 | 0.068 | 0 | 0 |
goodbye | 1/7 | 0 | 1/2 | log(3/2)=0.176 | 0.025 | 0 | 0.088 |
world | 0 | 1/3 | 0 | log(3/1)=0.477 | 0 | 0.159 | 0 |
Let's understand how the numbers in the table are computed, and in particular how we can interpret them. To compute the term frequency (TF) of the term say for document one:

$$\mathrm{tf}(\text{say},d_1)=\frac{2}{7}\approx 0.286$$

since say occurs twice among the 7 terms of document one.
From the formula of $\mathrm{tf}$, we can see that a high TF value of a term is obtained if:
the term occurs more frequently in the document
the number of terms in the document is small
For example, suppose an article mentions the word cars many times. If the article is short (high TF value), then it is reasonable to assume that the article is about cars. However, if the article is long (low TF value), then perhaps the article merely mentions the term cars while its topic or theme is actually about something else entirely. A high TF value of a term therefore suggests that the theme of the document revolves around this term. Obviously, words that do not appear at all in the document (e.g. the term world in document one) are not related to the theme of the document, and hence receive a TF value of 0.
Now, let's move on to IDF. To compute the IDF of the term say:

$$\mathrm{idf}(\text{say},D)=\log\frac{3}{1}\approx 0.477$$

since there are 3 documents in total and only one of them (document one) contains the term say.
From the formula, we can see that a term will receive a high IDF value if:
the number of documents is large
the number of documents containing the term is low
IDF is extremely useful in penalizing stop words (e.g. a, is). For example, suppose we had a collection of articles. You would expect the term frequency of commonly occurring words such as a and is to be extremely high. Even if an article is about cars, the term frequency of these stop words will most likely be much higher than that of keywords such as cars. Does this mean that a is more important and relevant than cars in guessing what the article is about? Obviously not - so here is where IDF comes into play.
You would expect stop words such as a and is to appear in just about any article regardless of its theme. This means that the denominator of IDF (i.e. the number of documents containing the term a or is in our list of articles) is extremely high. In fact, you would expect these stop words to appear in every single article, which essentially means that:

$$\mathrm{idf}(\text{a},D)=\log\frac{N}{N}=\log(1)=0$$
Since TF-IDF is a product of TF and IDF, the TF-IDF of these stop words would also end up being 0! However, if an article is exclusively about cars, then the term cars would most likely appear only in this article and not in the others. In such cases, the number of documents (articles) containing the term cars is low, and therefore the IDF value of cars would be relatively high.
The key take-away from this simple example is that a term's TF-IDF in a document measures how relevant or important that term is in capturing the theme of the document.
Finally, with the TF-IDF computed for each term of each document, we can now convert each term into a vector. For instance, the term goodbye can be represented by the following 3-dimensional vector (one component per document, taken from the TF-IDF columns of the table):

[0.025, 0, 0.088]
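To tie the numbers together, here is a small sketch (my own, not from the original guide) that reproduces the table above using the plain formulas; note that it uses log base 10, which matches values such as log(3/1)=0.477:

import math

documents = [
    "I say hello and you say goodbye",
    "hello world hello",
    "hello goodbye",
]
tokenized = [doc.lower().split() for doc in documents]

def tf(term, doc_tokens):
    # term frequency: count of the term divided by the document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # inverse document frequency with log base 10 (to match the table above)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n_containing)

for term in ["say", "goodbye", "world"]:
    scores = [round(tf(term, doc) * idf(term, tokenized), 3) for doc in tokenized]
    print(term, scores)
# say [0.136, 0.0, 0.0]
# goodbye [0.025, 0.0, 0.088]
# world [0.0, 0.159, 0.0]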
Implementing TF-IDF using Python's Scikit-learn
Scikit-learn's implementation of TF-IDF makes two slight modifications to the official formula (source). Firstly, IDF is calculated like so:

$$\mathrm{idf}(t)=\log\frac{1+N}{1+\mathrm{df}(t)}$$

where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing the term $t$. Here, you can see that we're just adding 1 to the numerator and the denominator. This is a common technique to avoid the computational problem of dividing by 0; the 1 added to the numerator balances the effect of the 1 added to the denominator.
The second modification made is as follows:

$$\mathrm{idf}(t)=\log\frac{1+N}{1+\mathrm{df}(t)}+1$$

Adding one to IDF ensures that a zero value of IDF (for a term appearing in every document) will not suppress its TF-IDF completely.
The good news is that, despite these minor adjustments, the interpretation of TF-IDF remains exactly the same.
Consider the following corpus (collection of documents $D$):
documents = [
    'You say hello and I say goodbye',
    'I think you are right',
    'I love cars but you dont right'
]
We can vectorize each term via TF-IDF with the TfidfVectorizer module in sklearn:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(documents)
tf_idf
<3x12 sparse matrix of type '<class 'numpy.float64'>' with 15 stored elements in Compressed Sparse Row format>
The tf_idf here is a SciPy sparse matrix rather than a dense NumPy array. This is because tf_idf usually contains many zeros, so a sparse representation is memory-efficient. To pretty-print our results, we can convert this sparse matrix into a Pandas DataFrame like so:
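The conversion snippet itself is not shown, so here is my reconstruction (get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names()):

import pandas as pd
pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names_out()).round(2)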
    and   are   but  cars  dont  goodbye  hello  love  right   say  think   you
0  0.37  0.00  0.00  0.00  0.00     0.37   0.37  0.00   0.00  0.74   0.00  0.22
1  0.00  0.58  0.00  0.00  0.00     0.00   0.00  0.00   0.44  0.00   0.58  0.35
2  0.00  0.00  0.45  0.45  0.45     0.00   0.00  0.45   0.34  0.00   0.00  0.27
Here, notice the following:
some terms such as I are missing. This is because, by default, TfidfVectorizer ignores tokens consisting of a single character (e.g. a).
all the tokens have been lower-cased by default.
The TfidfVectorizer has over 10 parameters that you can tinker with - please check out the official docs here.
Sklearn also has a related module called TfidfTransformer (as opposed to TfidfVectorizer), but the transformer version requires you to feed in the fitted output of the CountVectorizer explained above. In almost all cases, you can directly use TfidfVectorizer.
Word embedding
The problem with one-hot vectorization is that the semantics of the token are not encoded in its vector representation. For instance, the words "hello" and "hi" share a similar meaning, but one-hot encoding does not take this into account - it ignores the meaning of the words entirely.
In contrast, word embedding vectorizes tokens such that tokens with a similar meaning end up close to each other in the vector space. For instance, the tokens "mobile" and "smartphone" would share a similar vector representation. This means that word embeddings encode the meaning of the token in the vector.
Implementing word embedding using Python's Gensim library
In order to obtain word embeddings, we must first train a neural network on a large corpus such as Wikipedia. Fortunately, Python's Gensim library provides a number of pre-trained word embeddings that you can use directly without any training. For the list of available word embeddings, please visit here.
For our demonstration, we will be using the word embedding trained on Wikipedia (as of 2014) and Gigaword, a large corpus of English news documents. The embedding was trained on a corpus of over 6 billion tokens.
Firstly, let's load our word embedding:
import gensim.downloader as api
model = api.load('glove-wiki-gigaword-50')
model
Here, note the following:
even though the size of the corpus on which the neural network was trained is huge, the word embedding itself is only 64MB
running this code for the first time will download the word embedding onto your local machine. Running this code again from the second time onwards would use the downloaded word embedding instead.
the -50 means that each word is represented by a vector of length 50. Gensim also offers -100, -200 and -300 variants.
We can obtain the vector representation of a word like so:
model['hello']
array([-0.38497 , 0.80092 , 0.064106, -0.28355 , -0.026759, -0.34532 , -0.64253 , -0.11729 , -0.33257 , 0.55243 , -0.087813, 0.9035 , 0.47102 , 0.56657 , 0.6985 , -0.35229 , -0.86542 , 0.90573 , 0.03576 , -0.071705, -0.12327 , 0.54923 , 0.47005 , 0.35572 , 1.2611 , -0.67581 , -0.94983 , 0.68666 , 0.3871 , -1.3492 , 0.63512 , 0.46416 , -0.48814 , 0.83827 , -0.9246 , -0.33722 , 0.53741 , -1.0616 , -0.081403, -0.67111 , 0.30923 , -0.3923 , -0.55002 , -0.68827 , 0.58049 , -0.11626 , 0.013139, -0.57654 , 0.048833, 0.67204 ], dtype=float32)
Note the following:
the word 'hello' is represented by a vector of size 50.
this numerical vector captures the semantics of the word 'hello'.
With these word embeddings, we can perform some interesting NLP tasks. For instance, to find the top 5 similar words to the word 'hello':
model.most_similar('hello', topn=5)
[('goodbye', 0.8537959456443787), ('hey', 0.8074296116828918), ('!', 0.7951388359069824), ('kiss', 0.7892292737960815), ('wow', 0.7641353011131287)]
Note the following:
the numbers represent the similarity score (cosine similarity; the closer to 1, the more similar)
'goodbye' is the antonym of 'hello', but it is still considered the most similar word. This is because similarity in these embeddings is measured by the context in which a word appears. The reason 'goodbye' is similar to 'hello' is that the words surrounding 'goodbye' and 'hello' are alike.
To obtain the similarity score of two words:
model.similarity(w1='hello', w2='hey')
0.8074296
If two words rarely appear in the same context, they would have a low similarity score:
model.similarity(w1='hello', w2='car')
0.22838566
To obtain the word that is least similar to the other words in a list:
model.doesnt_match(['hi','hello','car','goodbye'])
'car'
Word embeddings capture the semantics of words such that the vector representations of similar words (e.g. 'hi' and 'hello') are close to each other, while those of unrelated words (e.g. 'hi' and 'car') are far apart. For a visual demonstration, please refer to the section on Visualizing words in our Comprehensive Guide on PCA.
Arithmetic with word embeddings
Consider the following classic example:
(king - man) + woman = queen
We can intuitively understand that the above arithmetic makes sense, and what's remarkable is that it holds (approximately) true when performed on word embeddings!
To demonstrate this, we will be using the word embedding trained on the Google News dataset. The embedding can be downloaded from this official Google Drive. The file that contains the embeddings is 1.5GB in size, so you might just want to follow along here without trying on your own machine.
After downloading this data, we can initialize our model with Gensim again like so:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
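The line producing the output below is omitted in the original; checking the shape of any word vector does the trick (the choice of the word 'king' here is mine):

model['king'].shape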
(300,)
Here, we can see that each word is represented by a vector of size 300.
To perform the classic (king - man) + woman:
model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.7118192911148071), ('monarch', 0.6189674735069275), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.5181134343147278), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411403656006)]
We can see that the best result of the arithmetic is indeed queen! This result is astonishing because the vector representations of the words look completely random to us humans, yet they somehow manage to capture the essence and the semantics of the words!
Connection with one-hot vector
Word embedding as a matrix
Word embeddings are typically structured in the form of a matrix like so:
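The original figure is not reproduced here; schematically, for a vocabulary of $V$ tokens and 50-dimensional vectors, the embedding matrix looks something like this:

$$W=\begin{pmatrix}w_{1,1} & w_{1,2} & \cdots & w_{1,50}\\ w_{2,1} & w_{2,2} & \cdots & w_{2,50}\\ \vdots & \vdots & \ddots & \vdots\\ w_{V,1} & w_{V,2} & \cdots & w_{V,50}\end{pmatrix}$$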
Here, note the following:
each row is a vector representation of a word
in this case, each word is represented by a 50-dimensional vector - just like in our above example
the number of rows is equal to the number of tokens in the vocabulary. In our Gensim example, the training corpus contained 6 billion tokens, but the matrix has one row per unique token in the vocabulary (400,000 words for the GloVe embedding we loaded).
Product of one-hot vector and word embedding
Consider the following text once again:
You say goodbye and I say hello
Recall that the one-hot vector representation involves assigning an incremental unique ID to each unique token:
ID | Token |
---|---|
1 | You |
2 | say |
3 | goodbye |
4 | and |
5 | I |
6 | hello |
For instance, the one-hot vector for the word say is:

[0, 1, 0, 0, 0, 0]
The word embedding matrix for this example might look like the following:
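The original matrix is not reproduced here; as an illustration, with one row per token (row 1 for 'You', row 2 for 'say', and so on) and symbolic entries, the $6\times3$ embedding matrix $W$ could be written as:

$$W=\begin{pmatrix}w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ w_{31} & w_{32} & w_{33}\\ w_{41} & w_{42} & w_{43}\\ w_{51} & w_{52} & w_{53}\\ w_{61} & w_{62} & w_{63}\end{pmatrix}$$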
Here, we are assuming that each word is represented by a vector in $\mathbb{R}^3$. The first row contains the word embedding for the first word 'You', the second row contains the word embedding for the second word 'say', and so on.
The product between the one-hot vector and the word embedding matrix results in simply knocking out one row of the weight matrix. For instance, consider the product between the one-hot vector of 'say' (the second word) and the word embedding matrix:
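Using the same symbolic notation as above (my notation, not the original figure), multiplying the one-hot vector of 'say' by $W$ simply extracts the second row:

$$\begin{pmatrix}0 & 1 & 0 & 0 & 0 & 0\end{pmatrix}\begin{pmatrix}w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ \vdots & \vdots & \vdots\\ w_{61} & w_{62} & w_{63}\end{pmatrix}=\begin{pmatrix}w_{21} & w_{22} & w_{23}\end{pmatrix}$$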
You may be wondering how we can obtain these word embeddings in the first place. They are actually obtained by training a neural network with one hidden layer with the objective of either:
using surrounding words - the context - to predict the target word (CBOW approach)
using the target word to predict the context (Skip-Gram approach)
I will write another comprehensive guide on these approaches, and how exactly we can obtain these word embeddings that somehow magically capture the semantics of the words. To be notified when I publish this guide, please either register an account or join our Discord.