Archive for the ‘Big Data’ Category

Big Data studies: having fun rolling virtual dice in R


I devoted a day to learning and playing with RStudio to generate random values – just a handy thing to know for later, when more serious probabilities will be analysed.

I wrote my own functions for rolling dice and also drew simple charts with the results.

Rolling 6 sided dice

This is how it looks when simply run:

dice.roll <- function(n){sample(1:n, size = 1)}
dice.roll(6)
for (i in c(1:10)){
print(dice.roll(6))
}

[1] 4
[1] 1
[1] 5
[1] 4
[1] 1
[1] 2
[1] 6
[1] 2
[1] 3
[1] 1

and this is how it looks when drawn as a chart. Remember to vectorize the function – note that in the version below the argument of dice.roll is only used as an index (the body always samples 1:6), so the vectorized call dice.roll(1:100) simply produces 100 independent rolls.

dice.roll <- function(n){sample(1:6, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="o", col="blue", xlab="Times rolled", ylab="Value rolled")
title(main="Rolling dice", col.main="red", font.main=4)

Diceroll
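
As a side note, the same 100 rolls could also be produced without the Vectorize trick, by asking sample() for all the values at once – a minimal sketch of my own, not the code used for the chart above:

# roll a fair n-sided die m times in a single call
dice.rolls <- function(n, m){sample(1:n, size = m, replace = TRUE)}
plot(dice.rolls(6, 100), type="o", col="blue", xlab="Times rolled", ylab="Value rolled")
title(main="Rolling dice (one sample() call)", col.main="red", font.main=4)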

Rolling 12 sided dice

> for (i in c(1:10)){
print(dice.roll(12))
}

[1] 8
[1] 1
[1] 12
[1] 3
[1] 2
[1] 10
[1] 9
[1] 10
[1] 4
[1] 1

Chart

dice.roll <- function(n){sample(1:12, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="o", pch=23, lty=2, col="green", xlab="Times rolled", ylab="Value rolled")
title(main="Rolling 12-sided dice", col.main="brown", font.main=2)

Dice-12-sided

2-sided – like tossing a coin

> for (i in c(1:10)){
print(dice.roll(2))
}

[1] 1
[1] 2
[1] 1
[1] 1
[1] 1
[1] 2
[1] 1
[1] 2
[1] 1
[1] 1

Chart

dice.roll <- function(n){sample(1:2, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="l", pch=20, lty=1, col="violet", xlab="Times tossed", ylab="Value tossed")
title(main="Tossing a coin", col.main="dark blue", font.main=2)

Coin.png

Let’s walk through some very simple probability examples.

Q1: What is the probability of rolling equal numbers when rolling 2 dice once?

To find that, I am going to enumerate the space of favourable rolls; the probability is then favourable versus all.

c <- expand.grid(x=1:6, y=1:6)
cat("total possible combinations for two dice are ", nrow(c))
cat("I can have equal numbers rolled in ", length(which(c[1] == c[2])), " ways")
pA <- length(which(c[1] == c[2]))/nrow(c)
cat("Probability p(A) to roll equal numbers is: ", pA)

total possible combinations for two dice are 36
I can have equal numbers rolled in 6 ways
Probability p(A) to roll equal numbers is: 0.1666667

Q2: What is the probability of rolling a sum between 7 and 10 when rolling 2 dice?

cat("combinations to have sum between 7 and 10 are ",length(which(rowSums(c) >= 7 & rowSums(c) <= 10)))
pB <- length(which(rowSums(c) >= 7 & rowSums(c) <= 10))/nrow(c)
cat("Probability p(B) to roll sum between 7 and 10, is: ", pB)

combinations to have sum between 7 and 10 are 18
Probability p(B) to roll sum between 7 and 10, is: 0.5

Q3: What is the probability of rolling a sum of 2, 7 or 8 when rolling 2 dice?

cat("sum is 2 or 7 or 8 combinations are ",length(which(rowSums(c) == 2 | rowSums(c) == 7 | rowSums(c) == 8)))
pC <- length(which(rowSums(c) == 2 | rowSums(c) == 7 | rowSums(c) == 8)) / nrow(c)
cat("Probability p(C) to get sum 2 or 7 or 8, is: ", pC)

sum is 2 or 7 or 8 combinations are 12
Probability p(C) to get sum 2 or 7 or 8, is: 0.3333333
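
As a quick sanity check of p(C) – my own sketch, not part of the original calculation – the same value can be approximated by simulation (the set.seed call is just an assumed choice for reproducibility):

# Monte Carlo check: roll two dice many times and count sums of 2, 7 or 8
set.seed(1)
rolls <- matrix(sample(1:6, 2*100000, replace=TRUE), ncol=2)
mean(rowSums(rolls) %in% c(2,7,8))   # should be close to 1/3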

Drawing the probabilities of sequential heads for a coin, or sequential sixes for a die

coins <- function(x)
{
 1/2^(x-1)
}

dice <- function(x)
{
 1/6^(x-1)
}

x<-2:15

plot(x,coins(x),type="l",ylim=c(0,1),col="red",
 lwd=3,lty=3,main="Probability of sequential Head for coin or six for dice", ylab="probability",
 xlab="rolls count", log="x", axes=F)
axis(2,at=seq(0,1,0.1),labels=T)
axis(1,at=seq(0,max(x),1),labels=T)
lines(x,dice(x),type="l",col="blue",lwd=2,lty=2)

legend(2, 1, legend=c("Coins", "Dice"),
 col=c("red", "blue"), lty=3:2, cex=0.9)

CoinsDices.png

Let’s roll three dice ten times

dice.roll <- function(x,n){sample(1:x, size = n, replace=TRUE)}
dice.roll(6,3)

for (i in c(1:10)){
 print(dice.roll(6,3))
}

[1] 3 3 3
[1] 2 2 5
[1] 6 5 1
[1] 6 4 6
[1] 4 1 5
[1] 2 1 1
[1] 6 3 6
[1] 5 6 1
[1] 5 6 4
[1] 3 6 1

Q4: When three dice are rolled, what is the probability of getting at least one “1”, given that three different numbers are rolled?

Conditional probability is easy: P(A|B) = P(A ∩ B) / P(B)

  • A is “at least one 1”
  • B is “all three numbers are different” – this does not always happen, so its probability also has to be found

For now I’ll do it like engineers do – just count sample sizes :) without formulas.

c <- expand.grid(x=1:6, y=1:6, z=1:6)
cat("P(A ∩ B) Combination count when at least one is 1 and different numbers rolled",length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]&(c[1] == 1 | c[2] == 1 | c[3]==1))),
 " and total count of combinations is ", nrow(c))
pAB <- length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]&(c[1] == 1 | c[2] == 1 | c[3]==1)))/nrow(c)
cat("P(B) = combinations count when different numbers rolled ",length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3])),
 " and total count of combinations is", nrow(c))
pB <- length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]))/nrow(c)
cat("P(A|B) = P(A ∩ B) / P(B) = ", pAB / pB)

P(A ∩ B) Combination count when at least one is 1 and different numbers rolled 60 and total count of combinations is 216
P(B) = combinations count when different numbers rolled 120 and total count of combinations is 216
P(A|B) = P(A ∩ B) / P(B) = 0.5
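
The same conditional probability can also be checked by simulation – again a sketch of my own, with an arbitrary seed:

# simulate three dice, keep only the rolls with three different numbers,
# and count how often at least one "1" appears among them
set.seed(2)
r <- matrix(sample(1:6, 3*100000, replace=TRUE), ncol=3)
allDiff <- apply(r, 1, function(x) length(unique(x)) == 3)
hasOne  <- apply(r, 1, function(x) any(x == 1))
mean(hasOne[allDiff])   # should be close to 0.5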

Next time – some binomials. I wish I had 48 hours in a day.

 


Mathematical Statistics: the more I learn, the more I realize how much I don’t know


There are two courses scheduled in the Big Data analytics module this semester: data warehousing and mathematical statistics.

Warehousing I dare to think I know quite well – however, the formal requirements of the course are much harder than one could expect – presentations, research and a real DWH to be developed – and I love it, as it means I will learn new things, not just reuse existing ones.

I wish I could share the same optimism about statistics theory and formulas. I faced the sad truth:

  1. I have forgotten how to calculate integrals, and I deeply regret that other skills, like the Poisson approximation of binomial probabilities, have also left my memory…
  2. I have no experience with programming in R.

The first lecture started.

The professor told us the plan. Well, this course will be quite tough, I thought.

Imagine my face when I realized this was not a plan of the course. This was a plan for the first intro lecture. Kind of the easy one.

E.g., the topic “binomial distributions” was covered in a few seconds – “Binomial distributions you all obviously know”. Click, next slide – “The Central Limit Theorem, you all know that too”.

Mamma mia, where am I and where are my belongings? Why was there no intro about the emergency exits and the oxygen masks that drop down for those who panic??

Calm down, dude, calm down.

The teacher, prof. Janis Valeinis, is the type of teacher I value the most – he has a sparkle in his eyes. He loves the topic, he lives in the topic, and statistics is his passion. It is actually an honor for me to study here and have this amazing chance to learn from a professor like him. After all, these are master’s degree level studies, and it is only normal that students are presumed to be skilled beforehand.

Look around – you see, every student is sitting with a poker face, and so will I. I WILL!!! I was here 20 years ago as a successful student in love with differential equations and the algebra of sets. I will domesticate integrals and Poisson again, and, even more, I will put a bridle on R too.

So my mission is, should I choose to accept it:

  1. remember a lot of basics, from combinatorics to integrals
  2. be able to follow the study topics, which presume all students know the concepts
  3. accelerate my R skills from 0 to 100
  4. do the homework, pass the quizzes and the exam
  5. ah, yes, and keep working full-time as a DWH analyst-programmer.

I will not mention my other roles, like servant of our cats. No discounts anyway.

Here comes my plan. NB: I do not know yet whether it will work :)

Step 1. Install RStudio (free, open source)

rst

Step 2. Buy 1,001 Statistics Practice Problems For Dummies and set a goal to solve 60 questions per day – I love learning by doing.

stat

Example of question:

dumm_example

And answer explained:

dumm_answer

(I like the idea, and I keep ‘R For Dummies’ and ‘Statistical Analysis with R For Dummies’ on my waiting list.)

Step 3. Subscribe on Coursera to Introduction to Probability and Data. If you choose “audit mode”, it is free: you can listen, but you will not get a certificate of completion.

coursera

  • Install the Coursera app – I can access the course better via the app. In the browser I see that a week is locked till day x, but via the app I can listen to any topic.
  • Download the topics for offline usage. I listen to them with headphones on my way to work, at lunch and at any other suitable moment.

ausis

Step 4. Plan and mark in your calendar at least 2 hours per day as learning hours – I use one hour in the early mornings and two after work in the evenings. On weekends I devote about 8 hours a day to studying.

I just love doing it, like others do knitting. Thanks to my family for accepting my forever-learning way of life.

Enjoy the happy moments when the feeling “I did it” rises like a phoenix.

The professor started the course with the so-called birthday paradox.

How many students do you need in a class to bet that at least two of them have their birthday on the same day of the year, assuming that all 365 possible birthdays are equally likely?

365 students? 181 students? 100?

Hah, in a room with 23 people there is a 50:50 chance that two of them share the same birthday.

And we learned the mathematics behind it.

The first person always has a birthday and any day suits, so we can ignore them.

Probability that the second random person

  • has the same birthday as the first is 1/365 (0.3%).
  • does not have the same birthday as the first is 364/365 (99.7%), as they may have any birthday other than the first person’s.

Probability that the third person

  • has the same birthday as the first person is 1/365, and likewise for the second person (and the probability of all three having their birthday on the same day would be 1/365 * 1/365). The number of possible birthday-match combinations keeps growing with each next person, which is why, for an “at least one” task, it is easier to calculate the opposite – the probability that no two people in the room share a birthday – and subtract it from 1, thus getting the probability of at least one matching birthday.
  • does not have the same birthday as the first and the second is 363/365, as the third person may have any of the birthdays not already taken by the first two persons

The probability that the fourth person does not share a birthday with the first three is 362/365, and so on.

Total probability that

  • second person does not have the same birthday as first
  • AND third person does not have the same birthday as first and second
  • AND fourth person does not have the same birthday as first and second and third

is the product of these probabilities.

365/365 * 364/365 * 363/365 * … for as many persons as we want to consider.

Here I will confess that I started using R assuming it was just another C++ and began writing loops like

birthday <- function(n)
 {
 1-(365-n+1)/365
 }


y <- 1;
for (n in 1:10)
 {
 y <- y*birthday(n);
 return(y);
 }

Happily, I did not succeed – because if I had, I might have adopted the wrong mindset. The key to understanding, for me, was this sentence I found in one of the blogs about R:

### In R you need to switch your mentality to thinking in vectors instead of for loops.###

When I realized that concept, I became friends with R (moreover, now I am in love with R). Thanks to God and my university degree, I am quite familiar with vectors.

In R everything is a vector. A single number is a length-one vector. You can choose when to calculate over a whole vector and when over a single number. This little example helped me:

#simple function to multiply given number with the next in sequence
> test <- function(n)
{n*(n+1)}

# calculate when n is 1
> n <- 1 
> test(n)
[1] 2
#when n is 2
> n <- 2
> test(n)
[1] 6
# when n is 3
> n <- 3
> test(n)
[1] 12
# and now one of the powers of R - no looping needed:
> n <- 1:3
> test(n)
[1] 2 6 12

One more example of different usages

> testn <- function(n)
{prod(n*(n+1))}
> testn(1:3)
# now you will have the results multiplied 2 * 6 * 12 = 144
[1] 144
# vectorize it and repeat - completely different result
> testn <- Vectorize(testn)
> testn(1:3)
[1] 2 6 12

After getting this idea I immediately deleted my clumsy loops and – voila! – my new R-life began. Learned by playing.

> prod(3)
[1] 3
> prod(3):1
[1] 3 2 1
> prod(3:1)
[1] 6

And birthdays are now easy – we let the built-in loop run through the set of persons in the class, calculating for 2 persons, for 3, 4, 5, … as many as we want.

I must note here that the pigeonhole principle (the Dirichlet principle) clearly says that with more than 365 persons, e.g. 366, it is guaranteed that at least two persons share a birthday. But that is not today’s topic. Today we’ll see that mathematically we don’t need 366 persons to get a match :)

birthdays <- function(n)
{
 1-prod((365-n+1):365)/365^n
}

This vector stuff is not obvious, especially without the background understanding, so I’ll show more examples. Let’s forget to vectorize the function.

birthdays <- function(n)
{
 1-prod((365-n+1):365)/365^n
}

n <- 1:3
birthdays(n)
[1] 0.0000000 0.9972603 0.9999925
Warning message:
In (365 - n + 1):365 :
 numerical expression has 3 elements: only the first used

What happened? Is it really true that 2 persons in a class have a 99.7% chance of matching birthdays? And three persons, 99.999%?

C’mon, something is definitely wrong here.

We passed too many values to R’s built-in sequence operator.

So what happened, and why, for n = 1 2 3 (which should start the sequences at 365, 364 and 363),

did we get the crazy results 0.0000000 0.9972603 0.9999925?

Because in the first part of the formula we try to build three different sequences up to 365:

365 to 365 = 365

364 to 365 = 364, 365

363 to 365 = 363, 364, 365

It is not possible to get all of them at once with my current code. That is why R warned that only the first value of n (= 1) would be used, so the resulting ‘365’ is taken as the sequence for the first part of the formula for every n.

0.0000000 (this is for n=1) is the result of the formula 1 – 365/365^1 – this one is fine, as only one factor is needed here

0.9972603 (this is for n=2) is the result of the formula 1 – 365/365^2, while we need the result to be 1-(365/365)*(364/365) = 1-(365*364)/365^2 = 0.00273973

0.9999925 (this is for n=3) is the result of the formula 1 – 365/365^3, while we need the result to be 1-(365/365)*(364/365)*(363/365) = 1-(365*364*363)/365^3 = 0.00820417

Let’s take an even crazier example and follow how the values are obtained:

n <- 3:4;n
[1] 3 4
> birthdays(n)
[1] 0.008204166 0.997282751
Warning message:
In (365 - n + 1):365 :
numerical expression has 2 elements: only the first used

Now we have n = 3 4, and after calculating (365-n+1) we get 363 and 362, so by our intention we want the sequences

{363,364,365}

{362,363,364,365}

But again, only one sequence can be built, and R says it will use the first one.

0.008204166 (this is for n=3) is the result of formula 1 – (363*364*365)/365^3

0.997282751 (this is for n=4) is the result of formula 1 – (363*364*365)/365^4

Now we have more motivation to remember to vectorize function :) and

voila! we have The Results!

> n <- 1:4;n
[1] 1 2 3 4
> birthdays <- Vectorize(birthdays)
> birthdays(n)
[1] 0.000000000 0.002739726 0.008204166 0.016355912

Why did it start working? Because now the function is called once per element of n, so each call builds its own product – 365, (364*365), (363*364*365), and so on – instead of the previous attempt to build one sequence for all of them.

Here:

0.000000000 is the probability of a birthday match in a class with one person (1-365/365). Of course it is zero, as there is no other person to match with.

0.002739726 is the probability of a match when there are 2 persons in the class (0.3%, or 1-(364/365)*(365/365))

0.008204166 is for 3 persons in the class (0.8%, or 1-(363/365)*(364/365)*(365/365))

0.016355912 is for 4 persons (1.6%, or 1-(362/365)*(363/365)*(364/365)*(365/365))
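
For comparison, the same four values can be reproduced without Vectorize, for example with sapply – a small sketch of my own:

# one call of the anonymous function per value of k, each building its own product
sapply(1:4, function(k) 1 - prod((365 - k + 1):365)/365^k)
# [1] 0.000000000 0.002739726 0.008204166 0.016355912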

Now let’s draw it as a chart and look for the person count at which we reach the 50:50 probability, 0.5.

birthdays <- function(n)
{
 1-prod((365-n+1):365)/365^n
}
n <- 1:50
birthdays <- Vectorize(birthdays)

plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red",
 lwd=2,lty=3,main="Birthdays", ylab="probability that at least two share a day",
 xlab="person count", axes=F)

axis(2,at=seq(0,1,0.1),labels=T)
axis(1,at=seq(0,max(n),1),labels=T)

# draw a line at probability 0.5
abline(h=0.5,lwd=1,col="brown")

legend(1, 1, legend="Probability curve", col="dark red", lwd=2,lty=3, cex=0.6)

and enjoy the result.

Birthdays

We see that the probability becomes 50:50 at a person count of 23. Let’s do the probability calculations for these two values:

birthdays(22)
[1] 0.4756953
> birthdays(23)
[1] 0.5072972

You might ask to perform the calculation the opposite way:

if we set the desired probability, say 0.5, how do we find the person count at which it occurs?

This kind of calculation is done numerically, because inverting the formula to find n is hard: n determines both how many factors are multiplied together and the power 365 is raised to.

In our situation the solution will, again, be the engineer’s way :)

Option a: do the calculations, store them in an array and retrieve the nearest value by index. Read below how many persons to invite to the class if you want a 50% or 99% probability:

> # calculate values for 1 to 100 persons and store in an array
> answer <- birthdays(1:100); answer
 [1] 0.000000000 0.002739726 0.008204166 0.016355912 0.027135574 0.040462484 0.056235703 0.074335292 0.094623834
 [10] 0.116948178 0.141141378 0.167024789 0.194410275 0.223102512 0.252901320 0.283604005 0.315007665 0.346911418
 [19] 0.379118526 0.411438384 0.443688335 0.475695308 0.507297234 0.538344258 0.568699704 0.598240820 0.626859282
 [28] 0.654461472 0.680968537 0.706316243 0.730454634 0.753347528 0.774971854 0.795316865 0.814383239 0.832182106
 [37] 0.848734008 0.864067821 0.878219664 0.891231810 0.903151611 0.914030472 0.923922856 0.932885369 0.940975899
 [46] 0.948252843 0.954774403 0.960597973 0.965779609 0.970373580 0.974431993 0.978004509 0.981138113 0.983876963
 [55] 0.986262289 0.988332355 0.990122459 0.991664979 0.992989448 0.994122661 0.995088799 0.995909575 0.996604387
 [64] 0.997190479 0.997683107 0.998095705 0.998440043 0.998726391 0.998963666 0.999159576 0.999320753 0.999452881
 [73] 0.999560806 0.999648644 0.999719878 0.999777437 0.999823779 0.999860955 0.999890668 0.999914332 0.999933109
 [82] 0.999947953 0.999959646 0.999968822 0.999975997 0.999981587 0.999985925 0.999989280 0.999991865 0.999993848
 [91] 0.999995365 0.999996521 0.999997398 0.999998061 0.999998560 0.999998935 0.999999215 0.999999424 0.999999578
[100] 0.999999693
> which(answer > as.numeric(readline(prompt="Enter desired probability: ")))[1]
Enter desired probability: 0.5
[1] 23
> which(answer > as.numeric(readline(prompt="Enter desired probability: ")))[1]
Enter desired probability: 0.99
[1] 57

Option b: use the R function uniroot, which searches numerically for the point where a function crosses zero – but I haven’t learned it yet.
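
For completeness, here is a rough sketch of how uniroot might be used for this – my own attempt, not something from the course: I treat the person count as a continuous value and round the answer up.

# find where the birthday probability crosses 0.5, then round up to whole persons
f <- function(n) 1 - prod((365 - n + 1):365)/365^n - 0.5   # n treated as continuous here
ceiling(uniroot(f, interval = c(2, 100))$root)             # 23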

Now let’s have some fun: let’s generate random birthdays and see what happens.

I assigned the numbers 1 to 365 to the days of the year; then I generate two random sample sets, take the intersecting values and convert them to dates of the year 1970 (just for fun).

Let’s start with probability 50:50 or 23 persons:

smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
23

sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))

And the results of 10 random draws for 23 people (the probability of getting a match in a single trial is 50%):

  • no match: 1 time
  • one common birthday: 4 times
  • two common birthdays: 3 times
  • three common birthdays: 2 times
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 23
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "1970-09-18" "1970-11-25"
[1] "Date of length 0" #no match
[1] "1970-03-11" "1970-04-03" "1970-07-25"
[1] "1970-01-07" "1970-12-29"
[1] "1970-01-25"
[1] "1970-01-26"
[1] "1970-04-01"
[1] "1970-01-23" "1970-06-06" "1970-08-27"
[1] "1970-09-30"
[1] "1970-08-16" "1970-08-27"

Let’s do 10 random draws for 10 people in the class (the probability of getting a match in a single trial is 12%):

  • no match: 7 times
  • 1 common birthday:  2 times
  • 2 common birthdays: 1 time
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 10
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-12-19"
[1] "Date of length 0"
[1] "1970-07-04"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-02-06" "1970-11-09"

Let’s do 10 random draws for 5 people in the class (the probability of getting a match in a single trial is 3%):

  • no match: 9 times
  • 1 match: 1 time
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 5
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-09-07"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"

Let’s do 10 random draws for 2 people in the class (the probability of getting a match in a single trial is 0.3%):

  • no matches
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 2
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"

And now one more bit of fun: let’s add another chart:

the probability of having the same day of the month in a class of n persons

(assuming for simplicity that every month has 31 days)

birthdays <- function(n)
{
 1-prod((365-n+1):365)/365^n
}

birthdays <- Vectorize(birthdays)

birthdates <- function(n)
{
 1-prod((31-n+1):31)/31^n
}

birthdates <- Vectorize(birthdates)

n <- 1:50

plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red",
 lwd=2,lty=3,main="Matching days two at least", ylab="probability",
 xlab="person count", axes=F)

axis(2,at=seq(0,1,0.1),labels=T)
axis(1,at=seq(0,max(n),1),labels=T)

abline(v=23,lwd=1,col=5)
abline(h=0.5,lwd=1,col="brown")

lines(n,birthdates(n),type="l",col="blue",lwd=2,lty=3)
abline(v=6.8,lwd=1,col="violet")

legend(1, 1, legend=c("Days of year", "Days of month"), col=c("dark red", "blue"), lwd=2,lty=3, cex=0.8)

And we see that the 50:50 probability for the same day of the month is reached at 7 persons.

dates

The next blog post is expected to be about probabilities and the binomial distribution, discussing solutions for tasks like:

# Q1. A coin is tossed 20 times. What is the probability of getting exactly 3 heads?
# Q2. A die is rolled 20 times. What is the probability of getting exactly 10 sixes?

# Q3: 80% of people who purchase pet insurance are women. If 9 pet insurance owners are randomly selected, find the probability that exactly 6 are women.

P.S. Meanwhile – behind the scenes:

Cili2Cill

Big Data: Phonetic Similarity: Soundex – words are similar if they sound the same


I guess you have seen surnames like Meier and Mayer, or Smith, Smyth, Smithe, Smeth, Smeeth. These might just as well be correct surnames as misspelt ones – if you dictated them to me over the phone, who knows what I’d write.

So far I have been blogging about similarity algorithms based on how strings are written. Today let’s discuss finding a match when the words sound the same. The most commonly used phonetic measure is Soundex. It has several extensions, and my topic today is

Soundex for the English language.

Soundex calculates a four-character code from a word based upon its pronunciation and considers two words similar if their codes are equal. The idea is that similar-sounding letters are assigned the same Soundex code. It is widely used in genealogy and archives for searching ancestors, relatives, families and heirs. The rules are as follows (a small R sketch comes right after the list):

  1. The first character of the code is the first letter of the word (a variation called “Reverse Soundex” uses the last letter instead of the first).
  2. Drop all other occurrences of a, e, i, o, u, y, h, w.
  3. Replace the consonants after the first letter with digits as follows:
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
  4. If two or more letters with the same number were adjacent in the original name (before step 2), or adjacent except for any intervening h and w, keep only the first of them.
  5. Return the first four characters, padded with 0 (padding means filling the blanks with 0; e.g. ‘Ahoi’ gets the code A000 and ‘Dude’ gets D300 – the code always has four characters).
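
Since this post is part of my R journey, here is a small base-R sketch of the rules above – my own simplified version, which ignores the finer h/w adjacency detail, so treat it as an illustration rather than a reference implementation:

soundex <- function(word) {
  w <- toupper(gsub("[^A-Za-z]", "", word))
  chars <- strsplit(w, "")[[1]]
  codes <- c(B=1,F=1,P=1,V=1, C=2,G=2,J=2,K=2,Q=2,S=2,X=2,Z=2,
             D=3,T=3, L=4, M=5,N=5, R=6)
  digits <- codes[chars]
  digits[is.na(digits)] <- 0            # a, e, i, o, u, y, h, w get 0
  keep <- c(TRUE, digits[-1] != digits[-length(digits)])
  digits <- digits[keep]                # collapse adjacent letters with the same code
  body <- digits[-1]                    # the first letter is kept as a letter
  body <- body[body != 0]               # drop the zeros (the dropped letters)
  paste0(chars[1], paste(c(body, 0, 0, 0)[1:3], collapse = ""))
}
soundex("Smith"); soundex("Smyth")      # both give S530
soundex("Ahoi"); soundex("Dude")        # A000 and D300, as in the examples above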

Let’s have an example set – surnames. I used Oracle RDBMS this time.

Soundex_table.PNG

Now let’s compare the similarities using three methods: Edit distance, Jaro-Winkler and Soundex.

Soundex_select.PNG

And here are the results. Notice the combinations we get: if I set the similarity threshold for Edit distance or Jaro-Winkler to 50%, then we have several combinations, including false positives and false negatives:

  • all three methods match – like ‘Mirhe’
  • Jaro-Winkler and soundex match, but Edit distance doesn’t – like ‘Meiyar’
  • Jaro-Winkler match but Soundex doesn’t – like ‘Mayes’
  • Edit distance and Jaro-Winkler match but Soundex doesn’t – like ‘Mimre’ or ‘Mirfe’

Soundex_results.PNG

You see, Soundex is not a silver bullet and, as I keep writing, we must try and test, test and try.

I’ll show you one more weakness of Soundex:

Scwarz.PNG

Of the three approaches I used, Soundex is the only one which did not find the similarity :)

Scwarz_res

Some of Soundex variants

  • The New York State Identification and Intelligence System (NYSIIS) algorithm maintains relative vowel positioning, while Soundex does not.
  • Daitch–Mokotoff Soundex (D–M Soundex) is an adaptation for Jewish surnames of Germanic or Slavic origin, sometimes referred to as “Eastern European Soundex”. Results of D-M Soundex are returned in an all-numeric format between 100000 and 999999; the calculation is much more complex than Soundex.
  • Metaphone, Double Metaphone, Metaphone 3: a powerful and customisable rule set, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations.

I googled an online Metaphone calculator; they say it is more accurate than Soundex – I can hardly agree:

  • The metaphone code for Schwarzenegger is SXWRSNKR.
  • The metaphone code for Schvartzeneger is SXFRTSNJR.
  • These surnames do not have the same metaphone code.

Then I tried it on one of my Soundex similarities, and – again:

  • The metaphone code for Meiyar is MYR.
  • The metaphone code for Mire is MR.
  • These surnames do not have the same metaphone code.

I was also searching for a Latvian edition of Soundex – I am quite sure one exists. I found this: http://www.lzp.gov.lv/images/stories/dokumenti/Zin_rezult_2008.pdf

In 2008 a flexible, universal lexicon-system database model was developed and refined, which provides a common infrastructure (shared indexing and feedback mechanisms, etc.) and functionality (advanced entry search, configurable display, etc.) for all dictionaries placed in the database, regardless of their entry schemas. Regarding indexing and search, the Soundex algorithm was adapted to the Latvian language, so that words that are entered imprecisely but are similar in pronunciation can be found. (A. Spektors)

P.S. For those who also read Latvian – here is a small and sweet forum post about someone trying to build a search-suggestion tool (“did you mean XXYZZX?”): https://exs.lv/say/16261/1441002-so-nakti-pavadiju-veidojot

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, and it is not complete study material.

Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words


A few days ago I covered the topic of finding which two of these strings are most likely about the same real-world entity – by recognizing frequently used words and assigning them lower impact; thus the first and third options were found to be the most similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

Let’s add a challenge: misspell the word

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Aple

And, as you might imagine, the TF/IDF measure is not effective anymore, because it cannot recognize that Apple is similar to Aple – the classic implementation looks for equality, not similarity. Today, much as we did with generalised Jaccard, we will replace the requirement for words to be equal with a requirement for them to be similar.

As I did before with TF/IDF, let’s remove commas, brackets and tokenize our collection into a bag of terms – documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {aple,corp}

Full term set T = {apple,corporation,usa,ibm,aple,corp}

Now we will use any string similarity measure of our choice (and business need). Today I have chosen Needleman-Wunsch and Jaro-Winkler to illustrate the differences, e.g. s(apple,aple) = 0.8 by Needleman-Wunsch and s(apple,aple) = 0.93 by Jaro-Winkler. I used the calculator https://asecuritysite.com/forensics/simstring

Let’s choose similarity thresholds (this is my choice; it could be any other value in 0..1):

  • threshold k = 0.55 for Needleman-Wunsch measure
  • threshold k = 0.45 for Jaro-Winkler measure

soft_idf_scores.PNG

Now we will identify the terms which have a similar term in the other document.

Soft TF/IDF based on Needleman-Wunsch

The initial steps are the same as for classic TF/IDF. Calculate:

Term Frequency (TF)

is the number of times each term appears in each document.

Full term set T = {apple,corporation,usa,ibm,aple,corp} (as you see we will have six dimension vectors)

TF(apple,x) = 1 (document x contains 'apple' once)
TF(apple,y) = 0
TF(apple,z) = 0
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 0 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
TF(aple,x) = 0
TF(aple,y) = 0
TF(aple,z) = 1
TF(corp,x) = 0
TF(corp,y) = 0
TF(corp,z) = 1

Inverse Document Frequency

IDF(apple) = 3/1 (one of three documents contains apple) = 3
IDF(corporation) = 3/2 = 1.5
IDF(usa) = 3/2 = 1.5
IDF(ibm) = 3/1 = 3
IDF(aple) = 3/1 = 3
IDF(corp) = 3/1 = 3

Feature vectors (Feature = TF*IDF)

softIDF-unnorm.PNG

Let’s normalise these vectors: remember from trigonometry, it means getting the unit vector of length 1: divide the coordinates by the length of the vector.

length vector document x = sqrt((3*3) + (1.5*1.5) + (1.5*1.5)) = 3.67
length y = sqrt((1.5*1.5) + (1.5*1.5) + (3*3)) = 3.67
length z = sqrt((3*3) + (3*3)) = 4.24

Normalised vectors for documents (each coordinate divided by vector’s length):

softIDF-norm.PNG

Now the Needleman-Wunsch scores come into the game.

We compute the close terms.

Close(x,y,0.55) = {apple is not included, because its best candidate in document y is corporation with similarity 0.55, but that corporation already bonds with corporation in x by 1 and thus blocks apple’s match; corporation is included (it has the strongest bond with corporation in document y); usa is included (it has the strongest bond with usa in document y)}

SoftIDFxy.png

  • close(x,y,0.55) = {corporation,usa}
  • close(x,z,0.55) = {apple}
  • close(y,z,0.55) = {} (no pair passed the threshold)

SoftIDFyz.png

In the final step we compute the features, giving a weight to each component of the TF/IDF formula. We are looking for the closest vectors: the narrower the angle between them, the larger its cosine. Thus we calculate the cosine of the angle between the vectors and pick the largest one. As our vectors are normalised, the cosine formula reduces to computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Needleman-Wunsch similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.8 = 0.46 = 46%
  • similarity(y,z) = 0% (remember, no word pair, including corp and corporation, passed our threshold)

Voila! By Needleman-Wunsch the most similar strings are

  • Apple Corporation, USA
  • Corp. Aple

Now let’s recalculate soft TF/IDF using Jaro-Winkler.

Threshold k = 0.45 for Jaro-Winkler (I set it differently from Needleman-Wunsch just for fun, to learn the differences better).

See in the pictures how we get to the closest results by keeping the strongest bonds:

SoftIDFJaroxy.png

SoftIDFJaroxz.jpg

SoftIDFJaroyz.jpg

  • close(x,y,0.45) = {corporation,usa}
  • close(x,z,0.45) = {apple,corporation}
  • close(y,z,0.45) = {corporation}

Now let’s calculate the features from the normalized vectors (the same vectors we calculated before):

softIDF-norm

As these vectors are normalised, I’ll remind you that the cosine formula reduces to computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Jaro-Winkler similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.93 + x corporation 0.41 * z corp 0.71 * 0.87 = 0.79 = 79%
  • similarity(y,z) = y corporation 0.41 * z corp 0.71 * 0.87 = 0.25 = 25%
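
As a small cross-check of the arithmetic in the list above – my own sketch, just re-typing the same numbers in R:

# soft TF/IDF scores with the Jaro-Winkler weights, as plain arithmetic
sim_xy <- 0.41*0.41*1 + 0.41*0.41*1          # corporation + usa
sim_xz <- 0.82*0.71*0.93 + 0.41*0.71*0.87    # apple~aple + corporation~corp
sim_yz <- 0.41*0.71*0.87                     # corporation~corp
round(c(sim_xy, sim_xz, sim_yz), 2)          # 0.34 0.79 0.25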

Voila! By Jaro-Winkler the most similar strings again are

  • Apple Corporation, USA
  • Corp. Aple

Do you feel the power of the idea behind this hybrid (string & set combined) similarity calculation method?

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, and it is not complete study material.

Big Data: combining string and set matching methods. One of the hybrid similarity measures – the generalised Jaccard index


It’s time to start combining string and set matching methods – let’s have a look at one of the hybrid similarity measures.

The classic Jaccard measure considers overlapping tokens (words, q-grams). To count as overlapping, tokens must be identical. The Jaccard index works very well when names are in mixed order, like “Food and drinks”, “Drinks&Food”, “Serving foods, sweets, drinks”. However, pure Jaccard is too restrictive when the text contains errors.

I had an example in my Jaccard blog entry comparing classes by pupils’ names – Martins, Marta, Katrina, Ance etc. What if some of the names were written with errors?

Generalized Jaccard measure helps.

First of all, as usual, we convert the strings being compared into tokens. I’ll reuse the example and put some errors into the names.

Class one pupils x = {Kartina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils y = {Annija,Martins,Matra,Karloina,Ance}

Today we’ll also learn soft overlap – using the best-matching pairs.

First step: compare each pair

To compare we need a similarity measure s which returns values 0..1 (the closer to 1 the more similar).

For more fun, let’s apply two measures to the same pairs – Edit distance (Levenshtein distance) and the Jaro-Winkler measure – see how the results differ? :) I used https://asecuritysite.com/forensics/simstring (sorry, this page has a bug in its Jaro-Winkler algorithm – it is not true that (janis,martins) scores 0 by JW, it should be about 0.67 – but I could not find any other online calculator, and for our experiment this bug is acceptable; let’s also use it as an example of how easy it is to misuse a method when we simply trust somebody’s programmed result without understanding it).

Sim_scores.PNG

Second step: choose a threshold and keep only the pairs that exceed it

I have chosen a threshold of 0.5.

Threshold for Edit Distance

Threshold_edit_distance.PNG

Threshold for Jaro-Winkler

Threshold_Jaro_Winkler.PNG

Third step: keep only the strongest bond

To find it, I first draw all the bonds that are above the threshold.

Edit_distance_bonds

Martins and Martins are of course bonded. That means no other bonds are possible from or to them.

Karlina to Karloina has the next strongest remaining bond. Again, no other bonds from/to them.

We are left with Anna to Annija, because all the other bonds relate to already “engaged” entities. For example, Kartina cannot bond to Karloina with 0.71, because Karlina has already bonded with 0.88.

We then calculate the matching weight by adding up all the match scores (2.55) and dividing by (all the names in class one plus all the names in class two minus the number of matching pairs): 2.55 / (6 + 5 - 3) = 0.319 = 32% similarity when we hybridise Jaccard with Edit distance.

Edit_distance_bond.jpg
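
The “keep only the strongest bond” step can also be written as a small greedy routine in R – a sketch of my own, which takes a ready-made similarity matrix (rows: class one names, columns: class two names) and is not tied to any particular measure; with the edit-distance scores from the table above it should reproduce 2.55 / (6 + 5 - 3) ≈ 0.32:

# generalised Jaccard with a greedy strongest-bond matching
soft_jaccard <- function(S, threshold = 0.5) {
  S[S < threshold] <- 0                 # second step: drop weak bonds
  total <- 0; pairs <- 0
  while (any(S > 0)) {
    best <- which(S == max(S), arr.ind = TRUE)[1, ]   # strongest remaining bond
    total <- total + S[best["row"], best["col"]]
    pairs <- pairs + 1
    S[best["row"], ] <- 0               # both names are now "engaged" ...
    S[, best["col"]] <- 0               # ... so remove their other bonds
  }
  total / (nrow(S) + ncol(S) - pairs)
}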

Now let’s do the same for Jaro-Winkler. First of all, all the bonds above the threshold:

Jaro-bonds.jpg

and keep only the strongest bonds. Again you see that, for example, Kartina cannot bond with Martins at 0.88, because Martins is bonded to Martins with 1. Kartina also cannot bond with Karloina at 0.91, because Karlina is bonded to Karloina with 0.98.

And the formula again – the matching weight divided by the sum of the entities minus the matching pairs – and voila! = 0.707 = 71% similarity when we hybridise Jaccard with the Jaro-Winkler measure.

Jaro_bond

I’ll remind you that in my previous blog entry explaining the Jaccard measure I showed that

  • similarity with correctly spelled names and requirement for name equality was 10%
  • similarity using bigrams was 50%
  • similarity with trigrams was 26%

Today we calculated (on a slightly modified set – with errors added):

  • similarity with hybrid method with Edit distance was 32%
  • similarity with hybrid method with Jaro-Winkler measure was 71%

Isn’t it funny – five different results? There is no silver bullet in string and set matching. Try and test, try and test, try and test… repeat.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, and it is not complete study material.

Big Data: set similarity: TF/IDF scores and feature vectors to devalue terms that are common in other documents


Term Frequency, Inverse Document Frequency and Feature Vectors. Another great concept which is easy when you start doing it.

Which two of these strings are most likely about the same real world entity?

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

I believe you see it: the first and the third. Why do you think so? Because of the Apple there? Don’t you also see the Corporation, Corp and USA? You somehow understand that these are ancillary, less meaningful words.

How do we teach a computer to recognize that context? If we look at the string level – Edit distance or Jaro-Winkler, Affine Gap or Smith-Waterman – none of them will solve it; they will all consider these two more similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation

Well, the Jaccard distance would spot the answer we are looking for, but it is nice to learn another, more sophisticated method with added value – recognizing frequently used terms.

Today’s topic: two sets are similar if they share many frequent terms, unless these terms are also common in the other strings.

The TF score will show us how frequently a term is used.

The IDF score will show us whether the term is common in the other strings.

Feature vectors will reveal the most similar strings. A feature vector is a vector that contains information describing an object’s important characteristics – in our case these characteristics (measurable properties) will be TF and IDF scores.

Let’s start with some data cleaning.

1) remove commas and brackets (I might also remove a, the, and, semicolons etc)

  • Apple Corporation USA
  • IBM USA Corporation
  • Apple Corp

2) convert all to lowercase

3) tokenize (split the strings into words) and apply Smith-Waterman and/or Jaro-Winkler to establish that Corp is similar to Corporation – I’ll assume these words are equal (I could have written Corporation in my example, but I wanted to make it more challenging, to show that data cleaning and applying rules is a normal step)

4) convert ‘corp’ to ‘corporation’, so that I don’t have to explain every time that I have a function which recognizes the similarity of these terms

Let’s tokenize our collection into a bag of terms and call them documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {apple,corporation}

Term set T is {apple,corporation,usa,ibm}

Term Frequency (TF)

is the number of times each term appears in each document.

TF(apple,x) = 1 (see, document x contains the term 'apple' once)
TF(apple,y) = 0 (this document does not contain 'apple')
TF(apple,z) = 1
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 1 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0

From this we can also see that if a document contains a term, the score is >0, otherwise it is 0. A TF score can’t be negative.

Inverse Document Frequency (IDF)

is the number of documents in the collection (N) divided by the number of documents containing the term (Nd)

IDF_formula.PNG

IDF(apple) = 3/2 (two of three contain apple) = 1.5
IDF(corporation) = 3/3 (all three contain corporation) = 1
IDF(usa) = 3/2 (two of three contain usa) = 1.5
IDF(ibm) = 3/1 (only one of three contains ibm) = 3

Well, now we have numbers. And?

Feature vectors

Do you remember vectors from school? That something like an arrow. A plane flying north at 850 km/h is one vector, and the wind blowing north-west is another vector. If the wind’s velocity is high enough, the plane, seen from the ground, might seem to be slipping sideways a little.

Vectors are used in computing very, very often. Example: v = [R; G; B] is a feature vector containing the color components of a pixel or an object.

In our case

  • documents x,y,z will be the vectors
  • the term set T {apple,corporation,usa,ibm} represents the dimensions (yes, with a thousand terms we would have a thousand-dimensional vector, uhh!)
  • features will be calculated from TF and IDF scores.
  • we will check whether the vectors point in the same direction (the angle between them is small). If yes, we will consider these documents similar

Formula to calculate features for each of documents and each of terms is:

Featureformula.PNG

Now let’s calculate the features and draw them as a table (matrix). Note that all our features are greater than or equal to 0 (negative values are not possible).

F(apple, x) = TF(apple, x) * IDF(apple) = 1 * 1.5 = 1.5
F(apple,y) =  0 * 1.5 = 0
F(apple,z) = 1 * 1.5 = 1.5
F(corporation,x) = 1 * 1 = 1
F(corporation,y) = 1 * 1 = 1
F(corporation,z) = 1 * 1 = 1
F(usa,x) = 1 * 1.5 = 1.5
F(usa,y) = 1 * 1.5 = 1.5
F(usa,z) = 0 * 1.5 = 0
F(ibm,x) = 0 * 3 = 0
F(ibm,y) = 1 * 3 = 3
F(ibm,z) = 0 * 3 = 0

Feature_matrix.PNG

These are coordinates of  our vector in four-dimensional space (as we have four terms).

X (1.5,1,1.5,0)
Y(0,1,1.5,3)
Z(1.5,1,0,0)

Our vectors can only have features greater than or equal to 0 (because we can’t have negative term frequencies in a document), which means the angle between them will never be greater than 90° and the cosine will never be less than 0. See my example visualisation of three three-dimensional vectors (https://academo.org/demos/3d-vector-plotter/)

Vectors.png

We have to find the pair of feature vectors, among our three documents x,y,z, with the smallest angle between them. You remember trigonometry, don’t you? The narrower the angle, the larger its cosine. So we calculate the cosine of the angle between the vectors and pick the largest one.

The official formula – uh, looks impressive, doesn’t it? (I wrote it using http://math.typeit.org/)

Cosine_formula.PNG

It is not that complex: the dot (scalar) product of the two vectors divided by the product of their lengths (magnitudes).

  • vector x = {apple,corporation,usa} coordinates (1.5,1,1.5,0)
  • vector y = {ibm,usa,corporation} (0,1,1.5,3)
  • vector z = {apple,corporation} (1.5,1,0,0)
  • Term set T is {apple,corporation,usa,ibm}

vector pair x,y scalar product and magnitude

Vectorxy

Cosine of angle between vectors x,y is

cosalfa.png

vector pair xz

Vectorxz.png

vector pair y,z

Vectorsyz

The larger the cosine, the closer the angle. And the winner – the most similar pair, having the closest angle – is…

  • Apple Corporation, USA
  • Corp. Apple

Voila!
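
For completeness, here is a small R sketch of the whole calculation – my own code, not something from the course material; with the IDF values as computed above, the rounded cosines I get are about 0.40, 0.77 and 0.16, so the (x, z) pair wins:

# TF, IDF, feature vectors and cosine similarity for the three documents
docs <- list(x = c("apple","corporation","usa"),
             y = c("ibm","usa","corporation"),
             z = c("apple","corporation"))
terms <- c("apple","corporation","usa","ibm")
tf  <- sapply(docs, function(d) sapply(terms, function(t) sum(d == t)))
idf <- length(docs) / rowSums(tf > 0)   # N divided by Nd
feat <- tf * idf                        # feature matrix: terms x documents
cosine <- function(a, b) sum(a*b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
round(c(xy = cosine(feat[,"x"], feat[,"y"]),
        xz = cosine(feat[,"x"], feat[,"z"]),
        yz = cosine(feat[,"y"], feat[,"z"])), 2)   # about 0.40 0.77 0.16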

P.S. You might play also using this Feature formula:

logr.PNG

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, and it is not complete study material.

Big Data: set similarity: q-grams, Overlap measure, Jaccard index, Jaccard distance


Today’s topic: is John Smith similar to Smith John? As you can see, Edit distance (10) will measure them as not similar, and so will Needleman-Wunsch (score = -5). Smith-Waterman would just point to the two similar substrings. Hmm, are these strings really not similar?

I asked my kid: how could we find out which class has the names most similar to those in your class? He answered: well, we could take the names one by one, calculate an Edit distance score for each, and then sum the results together?

Let’s take a closer look at some established methods.

Class one pupils {Katrina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils {Annija,Martins,Marta,Karolina,Ance}

First of all, let’s learn to obtain the set of comparable items – tokens.

Let’s start by using names as our tokens. Set similarity operates with tokens, not “words”, because words sound like something meaningful to us. Tokens are… just tokens :) – anything we have agreed on when splitting the set; you’ll see several approaches.

Simple: if we use a word as a token, then the string “David Alexander Smith” consists of three tokens: david, alexander, smith.

A bit more complex: we might agree that during tokenisation we remove common stop words like “of, the, and” or “Mr, Ms, fon, da” etc.

Then string “Mr. David Alexander fon Smith” again consists of three tokens: david, alexander, smith.

q-grams or n-grams

This impressive name just means that we split the string into substrings of length q. I’ll use David and show you the two approaches in use – which approach you choose later depends on your needs and on the further usage of the results.

q=1 (unigram):

  • {D,a,v,i,d}

q=2 (bigram, sometimes called digram):

  • first approach: {#d,da,av,vi,id,d#} (I used the special character # to pad for the length)
  • second approach: {da,av,vi,id}

q=3 (trigram):

  • {##d,#da,dav,avi,vid,id#,d##}
  • {dav,avi,vid}

q=4 (four-gram):

  • {###d,##da,#dav,davi,avid,vid#,id##,d###}
  • {davi,avid}

q=5 (five-gram):

  • {####d,###da,##dav,#davi,david,avid#,vid##,id###,d####}
  • {david}

I used R for some tests – see, it uses the second approach:

R_qgrams
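
For reference, the second (unpadded) approach can be written in a couple of lines of base R – a helper of my own, not the code from the screenshot:

# all substrings of length q, lower-cased
qgrams <- function(s, q) {
  s <- tolower(s)
  n <- nchar(s)
  if (n < q) return(character(0))
  sapply(1:(n - q + 1), function(i) substr(s, i, i + q - 1))
}
qgrams("David", 2)   # "da" "av" "vi" "id"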

Similar strings have many common q-grams.

You might find it interesting that translation software most likely detects the language by comparing q-grams with per-language statistics. This blog entry of mine could be split into bigrams and the result “English” would be returned.

Q-gramming may also be applied at the word level.

The quick brown fox jumps over a lazy dog.

bigram:

  • {# the,the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog,dog #}
  • {the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog}

trigram

  • {# # the, # the quick,the quick brown,quick brown fox,brown fox jumps,fox jumps over,jumps over a,over a lazy,a lazy dog,lazy dog #,dog # #}

etc.

Note that it is the analyst’s (my, your) decision whether the method used requires some data cleaning, such as removing punctuation and whitespace, de-capitalising letters, etc.

The overlap measure

Let’s do some simple examples to learn the idea: string 1 will be Carlo, string 2 Carol.

If I use bigrams for tokenisation.

Let S1 = set of tokens of string 1.

  • {#c,ca,ar,rl,lo,o#}

Let S2 = set of tokens of string 2

  • {#c,ca,ar,ro,ol,l#}

Overlap(S1,S2), O(S1,S2) = number of common tokens.

In our case common tokens are #c,ca,ar

O(S1,S2) = 3

If I use trigrams for tokenisation.

  • {##c,#ca,car,arl,rlo,lo#,o##}
  • {##c,#ca,car,aro,rol,ol#,l##}

O(S1,S2) = 3

The Jaccard index, a.k.a. coefficient, a.k.a. measure

Well, you saw the Overlap measure. What is it? By itself, without context, it actually has no value. We need to make it usable. Let’s have an index that is always in 0..1 – the closer to 1, the higher the set similarity. I’ll also multiply by 100 to get percentages.

Jaccard(S1,S2) = count of common tokens  in strings S1,S2 / count of all unique tokens in both S1,S2

Or rephrasing a bit more scientific:

Jaccard(S1,S2) = intersect of tokens S1,S2 / union of tokens S1,S2

Note: ‘union’, not ‘union all’. Union keeps only the unique values, while union all keeps everything.

Jaccard(S1,S2) = overlap of tokens / total count of unique tokens

Knockout now with The Formula :)

Jaccard

String 1 Carlo, string 2 Carol.

If I use bigrams for tokenisation:

  • S1 = {#c,ca,ar,rl,lo,o#}
  • S2 = {#c,ca,ar,ro,ol,l#}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 9

Jaccard(S1,S2) = 3/9 = 0.33 = 33%

Visualisation from http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html

Jaccard_online

If I use trigrams for tokenisation:

  • S1 = {##c,#ca,car,arl,rlo,lo#,o##}
  • S2 = {##c,#ca,car,aro,rol,ol#,l##}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 11

Jaccard(S1,S2) = 3/11 = 0.27 = 27%

Interpreting the result:

  • Two sets having all tokens as common will be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
  • If they have no common tokens, they are 0% similar.
  • Half – 50% – means that the common tokens make up half of the union of the two sets.

The Jaccard distance

A measure of how different, or dissimilar, two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index/measure/coefficient from 1 (or the percentage from 100%).

For the above example with bigrams, the Jaccard distance is 1 – 33% = 67%.

Trigrams give a Jaccard distance of 100% – 27% = 73%.

So what?

Here you might ask: hey, but you just showed they are not usable at all! Carlo IS similar to Carol!

Let me remind you that I was showing the idea on a very simplified example, and that we are talking about set similarity, not string similarity.

To treat them as strings, let’s use Jaccard for Carlo and Carol with q=1:

  • {c,a,r,l,o}
  • {c,a,r,o,l}

Voila! Jaccard measure = 1, Jaccard distance = 0%.

Because THE SETS OF SYMBOLS are equal!

Jaccard_online2.PNG

Now back to sets and names in classes.

Tokenisation method 1 – names are tokens:

S1={Katrina,Janis,Karlis,Martins,Anna,Karlina}

S2={Annija,Martins,Marta,Karolina,Ance}

O = 1 (Martins)

Total count of unique tokens = 10

Jaccard measure = 1/10 = 0.1 = 10%

Jaccard distance = 90% (remember: this is dissimilarity, not similarity)

Conclusion: by method 1, the names in these classes are not similar.
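
Method 1 is easy to check in R – my own two-liner, using intersect() and union() on the (lower-cased) name sets:

S1 <- tolower(c("Katrina","Janis","Karlis","Martins","Anna","Karlina"))
S2 <- tolower(c("Annija","Martins","Marta","Karolina","Ance"))
length(intersect(S1, S2)) / length(union(S1, S2))   # 0.1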

And now let’s calculate the same using bigrams :)

Tokenisation method 2 – bigrams are tokens:

I used R to split into bigrams

R-trigrams.PNG

and Excel for visualisation

Bigrams.PNG

Jaccard measure = 15/30 = 0.5 = 50%

Jaccard distance = 100% – 50% = 50%

Conclusion: by method 2, the names in these classes are half similar.

Tokenisation method 3 – trigrams are tokens:

Trigrams.PNG

Jaccard measure = 11/43 = 0.26 = 26%

Jaccard distance = 100% – 26% = 74%

Conclusion: by method 3, the names in these classes are only slightly similar.

Have you got the idea? Now you can apply the same method to different classes and have a reliable way to measure similarity.

Which one to choose?

The method which suits YOU best. I can’t predict whether you will consider Marta and Martins similar or dissimilar in YOUR business case.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, and it is not complete study material.
