Vectorizing Text

Machines only understand numbers. In this section, we will explore how to convert text data into a numerical representation.

Bag-of-Words (BOW)

Bag-of-words (BOW) converts text data into numbers. It builds a vocabulary from all the words in all the documents, then records the occurrences of each vocabulary word in each document.

  • BOW creates a vocabulary from the words in all documents.
  • BOW calculates the occurrences of words in one of three ways:
    • Binary – Indicates whether the word is present (1) or absent (0).
    • Word counts – Counts how many times the word appears in the text.
    • Frequencies – Gives the count of each word, normalized by the total number of words in the document.
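
| Document | a | cat | dog | is | it | my | not | old | wolf |
|---|---|---|---|---|---|---|---|---|---|
| “It is a dog.” | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| “my cat is old” | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| “It is not a dog, it is a wolf.” | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |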
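
The two BOW steps (build a vocabulary, then record occurrences) can be sketched in plain Python. This is a minimal illustration, not a production vectorizer: the `tokenize` helper, its lowercase, letters-only rule, and the alphabetical ordering of the vocabulary are assumptions chosen to match the three example documents.

```python
import re
from collections import Counter

docs = [
    "It is a dog.",
    "my cat is old",
    "It is not a dog, it is a wolf.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

# Step 1: vocabulary = every distinct word across all documents, sorted.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

def bow_counts(text):
    # Step 2 (word counts): how many times each vocabulary word appears.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def bow_binary(text):
    # Step 2 (binary): 1 if the word is present, 0 otherwise.
    return [1 if count > 0 else 0 for count in bow_counts(text)]

binary_matrix = [bow_binary(doc) for doc in docs]
count_matrix = [bow_counts(doc) for doc in docs]

print(vocabulary)
for doc, row in zip(docs, binary_matrix):
    print(row, doc)
```

The binary and count encodings differ only for the third document, where "a", "is", and "it" each appear twice.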


One challenge of BOW is how to handle new words that were not part of the original vocabulary. This issue is known as out-of-vocabulary (OOV), and you will come back to it later in this section.
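
One common way to handle OOV words is to reserve an extra vocabulary slot for anything unseen. The sketch below illustrates that idea only; it is an assumption for illustration and not necessarily the approach developed later in this section. The `<unk>` token name and the `vectorize_with_unk` helper are hypothetical.

```python
import re

# Vocabulary built from the three example documents in this section.
vocabulary = ["a", "cat", "dog", "is", "it", "my", "not", "old", "wolf"]

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words
index = {word: i for i, word in enumerate(vocabulary + [UNK])}

def vectorize_with_unk(text):
    # Count known words in their own slot; route unseen words to the UNK slot.
    vector = [0] * len(index)
    for word in re.findall(r"[a-z]+", text.lower()):
        vector[index.get(word, index[UNK])] += 1
    return vector

# "fox" was never seen when the vocabulary was built.
new_vector = vectorize_with_unk("It is a fox")
print(new_vector)
```

The simpler alternative is to drop unseen words entirely, which keeps the vector length unchanged but discards the information that an unknown word occurred.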


Term Frequency (TF)

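Term frequency (TF): Increases the weight of words that occur often within a document. Each word's count is divided by the total number of terms in that document.

$$\mathrm{tf}(\mathit{term}, \mathit{doc}) = \frac{\text{number of times the term occurs in the doc}}{\text{total number of terms in the doc}}$$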

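| Document | a | cat | dog | is | it | my | not | old | wolf |
|---|---|---|---|---|---|---|---|---|---|
| “It is a dog.” | 0.25 | 0 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 |
| “my cat is old” | 0 | 0.25 | 0 | 0.25 | 0 | 0.25 | 0 | 0.25 | 0 |
| “It is not a dog, it is a wolf.” | 0.22 | 0 | 0.11 | 0.22 | 0.22 | 0 | 0.11 | 0 | 0.11 |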
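
As a rough check of the TF formula, the following sketch computes the term frequencies for the third example document. The `term_frequencies` helper and its letters-only tokenization are assumptions made for illustration; values are rounded to two decimals to match the table.

```python
import re
from collections import Counter

vocabulary = ["a", "cat", "dog", "is", "it", "my", "not", "old", "wolf"]

def term_frequencies(text):
    # Assumption: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)  # total number of terms in the document
    return [round(counts[word] / total, 2) for word in vocabulary]

tf_row = term_frequencies("It is not a dog, it is a wolf.")
print(tf_row)
```

The document has nine terms, so a word that appears twice scores 2/9 and a word that appears once scores 1/9.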


Inverse Document Frequency (IDF)

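Inverse document frequency (IDF): Decreases the weight of words that appear in many documents, and increases the weight of rare words that are in the vocabulary. The worked values below use the base-10 logarithm.

$$\mathrm{idf}(\mathit{term}) = \log_{10}\left(\frac{n_{\text{documents}}}{n_{\text{documents containing the term}} + 1}\right) + 1$$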

| Term | IDF Calculation |
|---|---|
| a | log(3/3) + 1 = 1 |
| cat | log(3/2) + 1 = 1.18 |
| dog | log(3/3) + 1 = 1 |
| is | log(3/4) + 1 = 0.88 |
| it | log(3/3) + 1 = 1 |
| my | log(3/2) + 1 = 1.18 |
| not | log(3/2) + 1 = 1.18 |
| old | log(3/2) + 1 = 1.18 |
| wolf | log(3/2) + 1 = 1.18 |
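
The IDF calculation can be sketched as follows. The base-10 logarithm is inferred from the worked values (log(3/2) + 1 = 1.18), and the `idf` helper and its tokenization are assumptions made for illustration.

```python
import math
import re

docs = [
    "It is a dog.",
    "my cat is old",
    "It is not a dog, it is a wolf.",
]

# One set of distinct words per document, so each document counts a term once.
tokenized = [set(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
vocabulary = sorted(set().union(*tokenized))

def idf(term):
    # Number of documents that contain the term at least once.
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    # The +1 in the denominator follows the formula in this section.
    return math.log10(len(docs) / (doc_freq + 1)) + 1

for term in vocabulary:
    print(term, round(idf(term), 2))
```

With this formula, a term that appears in every document (here, "is") ends up with an IDF below 1, because the denominator exceeds the number of documents.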


TF-IDF

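Term frequency-inverse document frequency (TF-IDF): Combines term frequency and inverse document frequency by multiplying them, so a term scores highest when it is frequent in a document but rare across the collection.

$$\mathrm{tf\_idf}(\mathit{term}, \mathit{doc}) = \mathrm{tf}(\mathit{term}, \mathit{doc}) \times \mathrm{idf}(\mathit{term})$$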

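| Document | a | cat | dog | is | it | my | not | old | wolf |
|---|---|---|---|---|---|---|---|---|---|
| “It is a dog.” | 0.25 | 0 | 0.25 | 0.22 | 0.25 | 0 | 0 | 0 | 0 |
| “my cat is old” | 0 | 0.30 | 0 | 0.22 | 0 | 0.30 | 0 | 0.30 | 0 |
| “It is not a dog, it is a wolf.” | 0.22 | 0 | 0.11 | 0.19 | 0.22 | 0 | 0.13 | 0 | 0.13 |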
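
Putting the two pieces together, a minimal end-to-end sketch might look like this. The helper names (`tokenize`, `tf`, `idf`, `tf_idf`) and the base-10 logarithm are assumptions chosen to follow the formulas in this section. Because the code multiplies unrounded values, some entries can differ from the table in the last digit (for example, about 0.29 rather than 0.3 for "cat"), since the table multiplies already-rounded TF and IDF values.

```python
import math
import re
from collections import Counter

docs = [
    "It is a dog.",
    "my cat is old",
    "It is not a dog, it is a wolf.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def tf(term, tokens):
    # Share of the document's terms that are this term.
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    # Documents containing the term, with +1 in the denominator as in the formula.
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log10(len(tokenized) / (doc_freq + 1)) + 1

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

tfidf_matrix = [[tf_idf(term, tokens) for term in vocabulary] for tokens in tokenized]

for doc, row in zip(docs, tfidf_matrix):
    print([round(value, 2) for value in row], doc)
```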


To add: N-grams, Word2Vec, BERT embeddings