Introduction

Machine learning is built entirely on numerical values and mathematical calculations, so text data must be converted into numerical values before machine learning models can understand and learn from it.

Techniques for conversion

1. Bag of Words

2. Term Frequency and Inverse Document Frequency (TF-IDF)

3. Word2Vec techniques

Bag of Words

Bag of Words is based on which words (for example, positive or negative words) are present in a sentence and how many times each word is repeated. It converts every document or sentence into a numerical vector of the same fixed size, irrespective of the size of the document or sentence.

Steps:

Step 1: Building Dictionary

Before computing numerical values from text, we need to collect every word (or every required word) from the whole data and store it for later use. This dictionary represents the 'bag' that contains the words.

Step 2: Converting text to numerical

Count the number of times each word is repeated in a sentence or document. Let's compute the numerical values for an example sentence: "The quick brown fox jumps over the lazy dog". In this sentence each word appears only once, except "the", which appears twice.

Then the frequency table is:

Word       the  quick  brown  fox  jumps  over  lazy  dog
Frequency   2     1      1     1     1      1     1     1
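The frequency table above can be reproduced with a few lines of Python; this is just an illustrative sketch using the standard library:

```python
from collections import Counter

# Count word frequencies in the example sentence
# (lowercased so that "The" and "the" are counted together)
sentence = "The quick brown fox jumps over the lazy dog"
freq = Counter(sentence.lower().split())

print(freq["the"])    # 2
print(freq["quick"])  # 1
```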

Conversion of above sentence into numerical values: ( Taking example dictionary )

Words in Dictionary   Present in Sentence   Word frequency
...                   ...                   ...
hello                 no                    0
how                   no                    0
the                   yes                   2
cow                   no                    0
quick                 yes                   1
rapid                 no                    0
race                  no                    0
brown                 yes                   1
lazy                  yes                   1
jumps                 yes                   1
crown                 no                    0
fox                   yes                   1
over                  yes                   1
...                   ...                   ...

The values in the word-frequency column are the numerical representation of the above sentence.

"The quick brown fox jumps over the lazy dog" ⇔ [ ..... 0 0 2 0 1 0 0 1 1 1 0 1 1 .........]

If a word is present in the sentence but not in the word dictionary, simply ignore that word and continue.
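The lookup described above can be sketched from scratch; the word list below is just the illustrative dictionary from the example table, not a real vocabulary:

```python
from collections import Counter

# Illustrative dictionary taken from the example table above
dictionary = ["hello", "how", "the", "cow", "quick", "rapid", "race",
              "brown", "lazy", "jumps", "crown", "fox", "over"]

def bag_of_words(sentence, dictionary):
    # Count each word, then read the counts off in dictionary order;
    # words not present in the dictionary are simply ignored
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in dictionary]

vector = bag_of_words("The quick brown fox jumps over the lazy dog", dictionary)
print(vector)  # [0, 0, 2, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
```

Note that "dog" is absent from this example dictionary, so it is silently ignored, exactly as described above.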

Code for implementing Bag of Words:

from sklearn.feature_extraction.text import CountVectorizer

# text_data is a list of documents, e.g. ["The quick brown fox ...", ...]
bow = CountVectorizer()
bow_data = bow.fit_transform(text_data)  # sparse matrix of word counts

code reference: CountVectorizer reference

For further reference: Bag of words on Wikipedia

TF-IDF

TF-IDF is short for Term Frequency-Inverse Document Frequency. It is designed to reflect how important a word is to a document or sentence. TF-IDF is widely used in information retrieval, text mining, etc.

Steps:

Step 1: Building Dictionary

Before computing numerical values from text, we need to collect every word (or every required word) from the whole data and store it for later use.

Step 2: Calculating IDF values

Calculate the IDF value for each word in the dictionary. As an example, take the word 'internet' and consider the following example data.

Sentence or Document in whole data                                                                   Is word present?
I prefer going to the cinema because it's more interesting.                                          no
Thank you for using State Bank of India internet Banking                                             yes
Sometimes the most difficult questions have the simplest solutions!                                  no
The quick brown fox jumps over the lazy dog                                                          no
The internet has brought communities across the globe close together through instant communication   yes

The word 'internet' appears in two documents of the above data. The IDF value of 'internet' is calculated by the formula:

$$ IDF = \log(\frac{N}{E})$$

N: total number of sentences or documents in the data
E: number of sentences or documents in which the word is present

So the IDF value of 'internet' is log(5/2) ≈ 0.398 (using the base-10 logarithm). If 'internet' had appeared in only one document, its IDF value (log(5/1) ≈ 0.699) would be much higher. This shows that the rarer a word is, the larger its value, eventually giving more importance to rare words.

The logarithm is used because the raw N/E ratio can sometimes be very large. For example, if the data contains 1000 sentences and a word is present in only 10 of them, the N/E value is 100, whereas log(100) is just 2.
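The arithmetic above can be checked directly; this is a small sketch using the base-10 logarithm, as in the example:

```python
import math

# IDF of 'internet' in the 5-document example: it occurs in 2 documents
N = 5  # total number of documents
E = 2  # documents containing the word
idf = math.log10(N / E)
print(round(idf, 3))  # 0.398

# If the word occurred in only one document, IDF would be much higher
print(round(math.log10(N / 1), 3))  # 0.699
```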

Step 3: Calculating TF-IDF values

TF (Term Frequency) is the number of times a word is present in a sentence or document. Let's compute TF values for the example sentence "The quick brown fox jumps over the lazy dog". In this sentence each word appears only once, except "the", which appears twice.

Then the frequency table is:

Word       the  quick  brown  fox  jumps  over  lazy  dog
TF value    2     1      1     1     1      1     1     1

The TF-IDF value of a word in a sentence is the TF value of that word with respect to the sentence multiplied by the IDF value of the same word with respect to the whole data:

$$ TFIDF(w, d, D) = TF(w, d) \times IDF(w, D) $$

Conversion of the above sentence (taking an example dictionary):

Words in Dictionary   IDF value   Word frequency (TF value)   TF-IDF value
...                   ...         ...                         ...
hello                 0.21        0                           0
internet              0.34        0                           0
the                   0.01        2                           0.02
cow                   0.74        0                           0
quick                 0.45        1                           0.45
rapid                 0.54        0                           0
race                  0.49        0                           0
brown                 0.49        1                           0.49
lazy                  0.51        1                           0.51
jumps                 0.34        1                           0.34
crown                 0.69        0                           0
fox                   0.58        1                           0.58
over                  0.21        1                           0.21
...                   ...         ...                         ...

The values in the TF-IDF column are the numerical representation of the above sentence.

"The quick brown fox jumps over the lazy dog" ⇔ [ ..... 0 0 0.02 0 0.45 0 0 0.49 0.51 0.34 0 0.58 0.21 .........]

If a word is present in the sentence but not in the word dictionary, simply ignore that word and continue.
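Putting the three steps together, here is a minimal from-scratch sketch. The corpus, and therefore the resulting IDF values, are illustrative and differ from the hand-picked numbers in the table above:

```python
import math
from collections import Counter

# A tiny illustrative corpus (Step 1 builds the dictionary from it)
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the internet has brought communities close together",
    "thank you for using internet banking",
]
dictionary = sorted({word for doc in corpus for word in doc.split()})

# Step 2: IDF(w) = log10(N / E), where E = number of documents containing w
N = len(corpus)
idf = {w: math.log10(N / sum(w in doc.split() for doc in corpus))
       for w in dictionary}

# Step 3: TF-IDF vector of a sentence = TF(w) * IDF(w) for each dictionary word
def tfidf_vector(sentence):
    tf = Counter(sentence.split())
    return [tf[w] * idf[w] for w in dictionary]

vec = tfidf_vector("the quick brown fox jumps over the lazy dog")
```

Note that a word appearing in every document gets an IDF of 0 under this formula; scikit-learn's TfidfVectorizer avoids this by smoothing the IDF term.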

Code for implementing TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

# text_data is a list of documents; the result is a sparse matrix of
# TF-IDF weights (note: scikit-learn uses a smoothed, natural-log IDF variant)
tfidf = TfidfVectorizer()
tfidf_data = tfidf.fit_transform(text_data)

code reference: TfidfVectorizer reference

For further reference: Tf-IDF on Wikipedia