Introduction to text mining with word2vec and tf-idf, along with an example of matching YouTube videos to an input query.
In the previous article, we collected more than 400,000 rows of video data from the YouTube API. We have also introduced how the soft cosine similarity score works when matching vectors (see here). In this article, we will first introduce some text mining techniques, including word2vec and tf-idf. Finally, we will show how to build a matching model that uses the soft cosine similarity score to find the best matching videos for a given string query. Below is an example of using the model.
= "ukraine war, conflict, clash, confrontation, russia"
q = match_videos(q)
result
result
# channel_id video_id title score
# 225877 UCfdNM3NAhaBOXCafH7krzrA 6Awx8MvT7qk What Will Ukraine Do After The ... 0.613843
# 253822 UCeY0bbntWzzVIaj2z3QigXg qtA-zamctYA One year of #Russia's ... 0.613843
# 266182 UCeY0bbntWzzVIaj2z3QigXg kV90vvVmAfY What Is The Background To Conflic ... 0.603664
# 225744 UCfdNM3NAhaBOXCafH7krzrA WXzdTxYrOSo What Russia Will Do After the War ... 0.600225
# 89191 UCef1-8eOpJgud7szVPlZQAQ HAb1pFRsh-4 Russia Vs Ukraine War Update LIVE ... 0.585112
# 225448 UCfdNM3NAhaBOXCafH7krzrA TXTS2R04Uu8 How Putin Almost Won th ... 0.576693
# 105578 UCef1-8eOpJgud7szVPlZQAQ Hb-MjE8DtQ4 Russia Ukraine War | Russia Attac ... 0.575848
# 97898 UCef1-8eOpJgud7szVPlZQAQ 1E7vTp7tezg Russia Ukraine War | Russian Parl ... 0.569067
# 95272 UCef1-8eOpJgud7szVPlZQAQ N5fJwSp7N6Y Russia Ukraine War | Wagner Will ... 0.557974
# 225341 UCfdNM3NAhaBOXCafH7krzrA GVLGPXuWoGs Russia's Invasion of Ukraine is a ... 0.556573
Before analysing textual data, a series of preprocessing steps is required to convert the text into a more structured format. These steps include sentence segmentation, word tokenisation, lowercasing, stop word removal, stemming or lemmatisation, etc.
Segmentation is the process of breaking text up into its sentences, while tokenisation further breaks the sentences up into words.
Stop words are words that do not carry much meaning in a sentence but are commonly used, for example, in English, “the”, “a”, “and”, etc. We will remove these words along with punctuation.
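As a rough illustration, below is a minimal sketch of these preprocessing steps using NLTK (it assumes the punkt and stopwords resources have already been downloaded with nltk.download; the sample text is made up, and the exact output depends on the stop word list used).

from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string

text = "The war in Ukraine continues. Russia attacks again."
stop_words = set(stopwords.words("english"))

# sentence segmentation
sentences = sent_tokenize(text)
# word tokenisation + lowercasing
tokens = [word_tokenize(s.lower()) for s in sentences]
# stop word and punctuation removal
cleaned = [[w for w in sent if w not in stop_words and w not in string.punctuation]
           for sent in tokens]
cleaned
# e.g. [['war', 'ukraine', 'continues'], ['russia', 'attacks']]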
Stemming is a technique that truncates a word to its word stem. The algorithm relies on commonly used prefixes and suffixes to cut off the beginning or the end of a word. Below is an example of stemming.
from gensim.parsing.porter import PorterStemmer

p = PorterStemmer()

example_words = ["writter", "writting", "written", "wrote", "writtings"]
[p.stem(x) for x in example_words]
# ['writter', 'writ', 'written', 'wrote', 'writ']
Lemmatisation is another method of rooting a word. It does not simply truncate a word like stemming does; instead, it standardises all inflections of a given word into its base dictionary form. Below is the same example as above, but using lemmatisation.
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

example_words = ["writter", "writting", "written", "wrote", "writtings"]
[wnl.lemmatize(x, pos="v") for x in example_words]
# ['writter', 'writting', 'write', 'write', 'writtings']
Since computers process (binary) numbers, it is very desirable to be able to convert words into numbers. Word embedding is an approach that represents a word, and hence a document, as a real-valued vector.
Word2Vec is a method for constructing these embedding vectors. There are two similar models: Continuous Bag Of Words (CBOW) and Skip-gram. Both models originate from the same idea: we first choose a window size (\(\omega\)), and our goal is to learn the meaning of a word from the previous \(\omega\) words and the following \(\omega\) words.
Suppose we want to learn the embeddings of the following words:
a quantitative methodological approach to understanding meaning in observed language
We first assign a number, say 2, as the window size. Then, to learn the word methodological, we look at the 2 words before and after it. In this case, they are [a, quantitative, approach, to]. We call the word methodological the target word and the 4 surrounding words the context words.
The CBOW model trains a neural network with 1 hidden layer that takes the 4 context words as input and predicts the target word. Along the way, the hidden layer is expected to contain the embedded information of the target word. The goal of the model is to extract this hidden layer as the word embedding vectors.
The skip-gram model uses a similar approach to find the embedding vectors. However, instead of predicting the target word from the context words, the skip-gram model uses the target word to predict the context words.
Let \(D\) (the dictionary) be the set of all available words with cardinality \(|D| = d\), and let \(\mathbf{p}_t \in D\) be a word in the dictionary encoded using 1-of-\(D\) (one-hot) coding. Also let \(s\) be the number of dimensions of the word embeddings.
Suppose we want to find the word embedding for \(\mathbf{p}_t\). The CBOW model takes the words surrounding \(\mathbf{p}_t\) within a window of size \(\omega\) and trains a neural network with one hidden layer and an output layer. The output layer is the vector of \(\mathbf{p}_t\), meaning the neural network aims to predict the word \(\mathbf{p}_t\) from its surrounding words \(\mathbf{p}_{t - \omega}, \dots, \mathbf{p}_{t - 1}, \mathbf{p}_{t + 1}, \dots, \mathbf{p}_{t + \omega}\). The hidden layer must then contain some meaningful information about the word \(\mathbf{p}_t\) in the form of a vector. We take this hidden layer as the word embedding vector \(\mathbf{e}_t\) of \(\mathbf{p}_t\). Hence, the hidden layer is an \(s\)-dimensional vector.
Notice that the weight matrix \(W\) between the input layer and the hidden layer is an \(s \times d\) matrix and
\[\mathbf{e}_t = \frac{1}{2\omega} \sum_{i = 1}^\omega W(\mathbf{p}_{t - i} + \mathbf{p}_{t + i}),\]
which is the average value of the projected vectors from the surrounding words.
Finally, when the model is trained, the word embedding vector \(\mathbf{e}_j\) for \(\mathbf{p}_j\) is given by
\[\mathbf{e}_j = W\mathbf{p}_j,\]
or the \(j\)-th column of \(W\) since \(\mathbf{p}_j\)’s entries are all \(0\) except \(1\) at the \(j\)-th entry.
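To make this last identity concrete, here is a tiny numpy sketch; the numbers in \(W\) are arbitrary and purely for illustration.

import numpy as np

s, d = 3, 5                            # embedding size and dictionary size
W = np.arange(s * d).reshape(s, d)     # a made-up s x d weight matrix

p_j = np.zeros(d)
p_j[2] = 1                             # one-hot (1-of-D) encoding of the 3rd word

e_j = W @ p_j                          # multiplying by a one-hot vector...
print(e_j)        # [ 2.  7. 12.]
print(W[:, 2])    # [ 2  7 12]         # ...simply picks out the j-th column of W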
The skip-gram model is similar to the CBOW model, but instead of predicting the target word \(\mathbf{p}_t\), it takes the target word and predicts the context words \(\mathbf{p}_{t - \omega}, \dots, \mathbf{p}_{t - 1}, \mathbf{p}_{t + 1}, \dots, \mathbf{p}_{t + \omega}\). The first weight matrix \(W\) is an \(s \times d\) matrix transforming the target word \(\mathbf{p}_t\) into the embedding vector \(\mathbf{e}_t\). The second weight matrix \(W'\) is a \(d \times s\) matrix transforming the embedding vector into one of the context words \(\mathbf{p}_{t + i}\), where \(i \in \{-\omega, \dots, -1, 1, \dots, \omega\}\). All context words within the window share the same weight matrix \(W'\), meaning that the model has to run through the output layer \(2\omega\) times for each target word.
When the training is over, the word embedding vector for each word is given by
\[\mathbf{e}_j = W\mathbf{p}_j,\]
or the \(j\)-th column of the weight matrix \(W\).
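In gensim, switching between the two architectures only requires the sg flag of Word2Vec (sg=0 for CBOW, the default, and sg=1 for skip-gram). Below is a minimal sketch; the toy corpus and hyperparameters are made up purely for illustration.

from gensim.models import Word2Vec

toy_corpus = [
    ["ukraine", "war", "russia"],
    ["russia", "invades", "ukraine"],
    ["football", "match", "highlights"],
]

# CBOW: predict the target word from its context words
cbow_model = Word2Vec(toy_corpus, vector_size=10, window=2, min_count=1, sg=0)
# skip-gram: predict the context words from the target word
skipgram_model = Word2Vec(toy_corpus, vector_size=10, window=2, min_count=1, sg=1)

# each word now maps to a 10-dimensional embedding vector
cbow_model.wv["ukraine"].shape        # (10,)
skipgram_model.wv["ukraine"].shape    # (10,)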
Term Frequency–Inverse Document Frequency (TF-IDF) measures the importance of a token (or word) to the document it belongs to, relative to the importance of the same token across the whole corpus.
In general, the value is the product of two terms: the term frequency \(\text{tf}(t, d)\) and the inverse document frequency \(\text{idf}(t, D)\). That is,
\[\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D),\]
where \(t\) is a particular token (or word, or term), \(d\) is a document and \(D\) is the corpus, i.e. the collection of documents.
There are different ways to calculate \(\text{tf}(t, d)\) and \(\text{idf}(t, D)\). The table below shows some common variants.
| \(\text{tf}\) variant | formula | \(\text{idf}\) variant | formula |
|---|---|---|---|
| raw count | \(f_{t, d}\) | inverse document frequency | \(\log_b \dfrac{N}{n_t}\) |
| term frequency | \(f_{t, d}\left/ \sum\limits_{t' \in d} f_{t', d} \right.\) | inverse document frequency smooth | \(\log_b \left( \dfrac{N}{1 + n_t} \right) + 1\) |
| log normalisation | \(\log_b(1 + f_{t, d})\) | probabilistic inverse document frequency | \(\log_b \dfrac{N - n_t}{n_t}\) |
where \(f_{t, d}\) is the frequency of token \(t\) in document \(d\), \(b\) is the base of the logarithm, \(N = |D|\) is the total number of documents, and \(n_t = |\{d \in D \mid t \in d\}|\) is the number of documents containing token \(t\).
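To make the definitions concrete, here is a small sketch that uses the raw count variant for \(\text{tf}\) and the plain inverse document frequency with the natural logarithm; the toy corpus is made up purely for illustration.

import math

# toy corpus of 4 "documents", purely for illustration
docs = [
    ["ukraine", "war", "russia", "ukraine", "ukraine"],
    ["football", "match", "highlights"],
    ["russia", "ukraine", "talks"],
    ["cooking", "recipe"],
]

def tfidf(token, doc, docs, b=math.e):
    tf = doc.count(token)                        # raw count variant of tf
    n_t = sum(1 for d in docs if token in d)     # number of documents containing the token
    idf = math.log(len(docs) / n_t, b)           # plain inverse document frequency
    return tf * idf

tfidf("ukraine", docs[0], docs)
# 3 * ln(4 / 2) ≈ 2.079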
In the previous article, we discussed the soft cosine similarity. It modifies the generic cosine similarity by inserting a correlation matrix into the dot product, which defines a new inner product and hence a new metric for measuring the difference or similarity between two vectors. This calculation no longer assumes that the dimensions are completely independent of each other, but allows them to be correlated.
For any two \(n\)-dimensional vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^n\), the soft cosine similarity between these two vectors with respect to the correlation matrix \(S\) is given by
\[ \text{soft-cosine} = \frac{\mathbf{a}^T S\, \mathbf{b}}{\sqrt{\mathbf{a}^T S \, \mathbf{a}} \cdot \sqrt{\mathbf{b}^T S \, \mathbf{b}}}. \]
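As a sanity check of the formula, the short numpy sketch below computes the soft cosine similarity with respect to a hypothetical correlation matrix \(S\); all numbers are made up for illustration.

import numpy as np

def soft_cosine(a, b, S):
    # soft cosine similarity with respect to the correlation matrix S
    num = a @ S @ b
    den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return num / den

S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])      # hypothetical word-to-word similarities
a = np.array([1.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0])

soft_cosine(a, b, S)    # ≈ 0.68, versus a plain cosine similarity of 0.5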
In this section, we will first preprocess the text from the titles and descriptions of the videos. Then we will train word2vec word embeddings on this corpus. With these word embeddings, we can calculate the similarity matrix between each pair of words. After that, we will calculate the importance of each word in a document using the tf-idf scores. Given any two word-importance vectors and the word similarity matrix, we can then calculate the soft cosine similarity between the two vectors.
In the previous article, we showed how to download YouTube video data. In this article, we make use of the data collected with this method to match similar videos based on their titles.
We have downloaded data for more than 400,000 videos from the API. Let us first have a look at the dataset.
import pandas as pd

df = pd.read_csv("./data.csv")
df
# channel_id video_id title
# 0 UCSZbXT5TLLW_i-5W8FZpFsg DFZnYR-5vnk HIGHLIGHTS: Colorado Rapids vs. Seattle Sounde...
# 1 UC9k-yiEpRHMNVOnOi_aQK8w tilUNQJ5_WE True Story Behind Festival Featured in 'Midsom...
# 2 UCSHLoG-bXj1aVA2T5y8t84A _zZdcckm1vk Ankahee - Official Full Song - Lootera
# 3 UCvQECJukTDE2i6aCoMnS-Vg jn70KsBit3c Sylvester James Gates, Jr.: Scientific Literacy
# 4 UC8-Th83bH_thdKZDJCrn88g 7f_0UpRijm0 Queen Latifah's Dad Is One Tough Dude (Late Ni...
# ... ... ... ...
# 457036 UCQD3awTLw9i8Xzh85FKsuJA mCWQ90YdtLo Random 'Dead by Daylight' Bullshittery
# 457037 UCeY0bbntWzzVIaj2z3QigXg zCNZgK7OXPY Wisconsin Senate Debate: Johnson, Barnes Are A...
# 457038 UCvQECJukTDE2i6aCoMnS-Vg J-jhR84F_9o Mohammad's Message Dalia Mogahed | Big Think
# 457039 UC9k-yiEpRHMNVOnOi_aQK8w mj1Qs4z_AzA Farm Boy Has Trouble Closing Gates #shorts
# 457040 UCwiTOchWeKjrJZw7S1H__1g nojaKlA7MoU Johnnie's Iconic Italian Beef Is A Delicious M...
import re
from nltk.corpus import stopwords
from urllib.request import urlopen
from gensim.parsing.porter import PorterStemmer

# combine NLTK's stop word list with a fuller list from GitHub
stop_words = stopwords.words('english')
stopwords_url = "https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt"
full_stopwords = urlopen(stopwords_url).read().decode("utf-8").split()

stop_words = stop_words + full_stopwords
stop_words = list(set([x.replace("'", "") for x in stop_words] + stop_words))

p = PorterStemmer()

def remove_stopwords(x):
    # stem the sentence, then drop stop words after lowercasing
    x = p.stem_sentence(x)
    return " ".join([w for w in x.lower().split() if w not in stop_words])

# remove stop words and punctuation, then tokenise the titles
df["title_processed"] = df["title"].apply(remove_stopwords)
df["title_processed"] = df["title_processed"]\
    .str.replace("[^\\w\\s]", "", regex=True)\
    .str.replace('\\s+', ' ', regex=True)\
    .apply(lambda x: x.split())

# repeat the same steps for the descriptions
df["description_processed"] = df["description"].apply(remove_stopwords)
df["description_processed"] = df["description_processed"]\
    .str.replace("[^\\w\\s]", "", regex=True)\
    .str.replace('\\s+', ' ', regex=True)\
    .apply(lambda x: x.split())

df[["title_processed", "description_processed"]]
# title_processed description_processed
# 0 [highlights, colorado, rapid, vs, seattl, soun... [colorado, rapid, continu, hot, streak, thirds...
# 1 [true, stori, festiv, featur, midsommar, film] [film, scari, twist, swedish, tradition, midso...
# 2 [ankahe, offici, song, lootera] [here, romant, track, ankahee, upcom, movi, lo...
# 3 [sylvest, jame, gates, jr, scientif, literaci] [watch, video, think, httpsbigthinknewvideo, e...
# 4 [queen, latifah, dad, tough, dude, late, night... [queen, latifah, jimmi, live, ha, hurt, winter...
# ... ... ...
# 457036 [random, dead, daylight, bullshitteri] [random, moment, bullshit, fuck, clan, stream,...
# 457037 [wisconsin, senat, debate, johnson, barn, oppo... [conclus, wisconsin, us, senat, debate, ron, j...
# 457038 [mohammad, messag, dalia, mogah] [mohammad, messag, video, daily, httpsbigthink...
# 457039 [farm, boi, ha, troubl, close, gate, short] [5yearold, boi, kentucki, troubl, close, gate,...
# 457040 [johnnie, icon, italian, beef, delici, mess, b... [ubiquit, italian, beef, sandwich, chicago, it...
from gensim.models import Word2Vec

# use both the processed titles and descriptions to train the embeddings
documents = list(df["title_processed"])
model_documents = documents + list(df["description_processed"])
model_documents = pd.Series([" ".join(x) for x in model_documents])
model_documents = model_documents.drop_duplicates().apply(lambda x: x.split())
model_documents = list(model_documents)
# train word2vec embeddings on the deduplicated corpus
model = Word2Vec(
    sentences=model_documents,
    vector_size=200,
    window=5,
    min_count=3,
    workers=4,
    epochs=3
)
# Or we can simply download some pretrained models
# import gensim.downloader as api
# model = api.load('word2vec-google-news-300')
"./youtube.model")
model.save(= Word2Vec.load("youtube.model") model
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

# build the dictionary, bag-of-words vectors and tf-idf model from the processed titles
dictionary = Dictionary(documents)
df["bows"] = [dictionary.doc2bow(x) for x in documents]
tfidf = TfidfModel(list(df["bows"]))
Next, we construct the term similarity (correlation) matrix from the word2vec model, with rows and columns ordered by the tf-idf dictionary index.
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex

termsim_index = WordEmbeddingSimilarityIndex(model.wv)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)
def process_q(q: str):
    # apply the same preprocessing to the query as to the titles
    processed_q = remove_stopwords(q)
    processed_q = re.sub("[^\\w\\s]", "", processed_q)
    processed_q = re.sub("\\s+", " ", processed_q).split()
    return processed_q

def match_videos(q):
    q_list = process_q(q)
    q_bow = dictionary.doc2bow(q_list)

    # soft cosine similarity between the query and every video title
    result = pd.Series([termsim_matrix.inner_product(
        tfidf[q_bow],
        tfidf[x],
        normalized=(True, True)
    ) for x in df["bows"]])

    # keep the 50 most similar videos
    result_top50 = result.sort_values(ascending=False)[:50].copy()
    idx = result_top50.index

    output = df.iloc[df.index[idx], :].copy()
    output["score"] = result_top50

    return output[["channel_id", "video_id", "title", "score"]]
= "ukraine war, conflict, clash, confrontation, russia"
q = match_videos(q)
result
result
# channel_id video_id title score
# 225877 UCfdNM3NAhaBOXCafH7krzrA 6Awx8MvT7qk What Will Ukraine Do After The War With Russia 0.613843
# 253822 UCeY0bbntWzzVIaj2z3QigXg qtA-zamctYA One year of #Russia's war in #Ukraine 0.613843
# 266182 UCeY0bbntWzzVIaj2z3QigXg kV90vvVmAfY What Is The Background To Conflict Between Rus... 0.603664
# 225744 UCfdNM3NAhaBOXCafH7krzrA WXzdTxYrOSo What Russia Will Do After the War in Ukraine E... 0.600225
# 89191 UCef1-8eOpJgud7szVPlZQAQ HAb1pFRsh-4 Russia Vs Ukraine War Update LIVE | Is Russia ... 0.585112
# 225448 UCfdNM3NAhaBOXCafH7krzrA TXTS2R04Uu8 How Putin Almost Won the War in Ukraine 0.576693
# 105578 UCef1-8eOpJgud7szVPlZQAQ Hb-MjE8DtQ4 Russia Ukraine War | Russia Attacks Ukraine Ye... 0.575848
# 97898 UCef1-8eOpJgud7szVPlZQAQ 1E7vTp7tezg Russia Ukraine War | Russian Parliament Backs ... 0.569067
# 95272 UCef1-8eOpJgud7szVPlZQAQ N5fJwSp7N6Y Russia Ukraine War | Wagner Will Not Fight In ... 0.557974
# 225341 UCfdNM3NAhaBOXCafH7krzrA GVLGPXuWoGs Russia's Invasion of Ukraine is a Disaster for... 0.556573