Big Data: set similarity : TF/IDF scores and Feature vectors to devaluate terms common in other documents

Term Frequency, Inverse Document Frequency and Feature Vectors. Another great concept which is easy when you start doing it.

Which two of these strings are most likely about the same real world entity?

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

I believe you see: first and third. Why do you think so? Because of Apple there? Don’t you see the Corporation, Corp and USA? You somehow understand these are ancillary words, less meaningful.

How to teach computer to recognize that context? If we look at string level, Edit distance or Jaro-winkler, Affine Gap or Smith-Waterman, noone will solve it, they all will consider these are more similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation

Well, Jaccard distance would spot the answer we are looking for but it is nice to learn another sophisticated method with added value – recognizing frequently used terms.

Today’s topic: two sets are similar if they share many frequent terms unless these terms are common in other strings.

TF score will show us how frequently is this term used.

IDF score will who us is this terms common in other strings.

Feature vectors will reveal the most similar strings. A feature vector is a vector that contains information describing an object’s important characteristics – in our case these characteristics (measurable properties) will be TF and IDF scores.

Let’s start with some data cleaning.

1) remove commas and brackets (I might also remove a, the, and, semicolons etc)

  • Apple Corporation USA
  • IBM USA Corporation
  • Apple Corp

2) convert all to lowercase

3) tokenize (split strings into words) and apply Smith-Waterman and/or Jaro-Winkler to calculate that Corp is similar to Corporation – I’ll assume that these words are equal (I might have set Corporation in my example but I wanted to make this example more challenging to show that data cleaning and applying rules is a normal step)

4) convert ‘corp’ to ‘corporation’ for me not to explain each time that I have a function which recognizes similarity of these terms

Let’s tokenize our collection into a bag of terms and call them documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {apple,corporation}

Term set T is {apple,corporation,usa,ibm}

Frequency (TF)

is the number of times each term appears in each document.

TF(apple,x) = 1 (see, document x contains one time term 'apple')
TF(apple,y) = 0 (this document does not contain 'apple')
TF(apple,z) = 1
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 1 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0

From this we can also read that if this document contains this term, the score is >0, otherwise 0. TF score can’t be negative.

Inverse Document Frequency (IDF)

is the number of documents in collection (N) divided by documents containing this term (Nd)


IDF(apple) = 3/2 (two of three contain apple) = 1.5
IDF(corporation) = 3/3 (all three contain corporation) = 1
IDF(usa) = 3/2 (two of three contain usa) = 1.5
IDF(ibm) = 3/2 (two of three contain ibm) = 1.5

Well, now we have numbers. And?

Feature vectors

Do you remember vectors from school? It was that something like an arrow. Plane flying to North with speed 850km/h is like one vector and the wind blowing North-West is like another vector. If wind’s velocity is higher the plane from the ground might seem to be slipping sideways a little.

Vectors are used in computing very, very often. Example: v = [R; G; B]; is a feature vector containing color components of a pixel or an object.

In our case

  • documents x,y,z will be the vectors
  • Term set T {apple,corporation,usa,ibm} represent the dimensions (yeah, if thousand of terms than we have thousand-dimensional vector, uhh!)
  • features will be calculated from TF and IDF scores.
  • we will find if vectors are blowing the same direction (their angles are close). If yes – we will consider these documents similar

Formula to calculate features for each of documents and each of terms is:


Now let’s calculate features to draw a table (matrix). Note that all our features are greater or equal to 0 (because negatives are not possible).

F(apple, x) = TF(apple, x) * IDF(apple) = 1 * 1.5 = 1.5
F(apple,y) =  0 * 1.5 = 0
F(apple,z) = 1 * 1.5 = 1.5
F(corporation,x) = 1 * 1 = 1
F(corporation,y) = 1 * 1 = 1
F(corporation,z) = 1 * 1 = 1
F(usa,x) = 1 * 1.5 = 1.5
F(usa,y) = 1 * 1.5 = 1.5
F(usa,z) = 0 * 1.5 = 0
F(ibm,x) = 0 * 1.5 = 0
F(ibm,x) = 1 * 1.5 = 1.5
F(ibm,x) = 0 * 1.5 = 0


These are coordinates of  our vector in four-dimensional space (as we have four terms).

X (1.5,1,1.5,0)

Our vectors can have features only greater or equal to 0 (because we can’t have negative term frequencies in a document) and it means they will never have angle greater than 90 and cosine will never be less than 0. See my example visualisation for a three three-dimensional vectors (


We have to find our three documens x,y,z feature vector pair which makes the closest angle. You remember trigonometry, don’t you? The more narrow the angle is, the larger is its cosine. Thus we have to calculate the cosine of the angle between vectors and pick the largest one.

The official formula – uh, looks impressive, doesnt it? (I wrote it using

Cosine_formula.PNGIt is not as complex: sum of scalar products divided by sum of vectors length (magnitude).

  • vector x = {apple,corporation,usa} coordinates (1.5,1,1.5,0)
  • vector y = {ibm,usa,corporation} (0,1,1.5,1.5)
  • vector z = {apple,corporation} (1.5,1,0,0)
  • Term set T is {apple,corporation,usa,ibm}

vector pair x,y scalar product and magnitude


Cosine of angle between vectors x,y is


vector pair xz


vector pair y,z


The larger cosine, the closer angle. And winner most similart pair, having the closest angle is…

  • Apple Corporation, USA
  • Corp. Apple


P.S. You might play also using this Feature formula:



This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Mans viedoklis:

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: