## Archive for the ‘Mācību lietas’ Category

### Big Data studies: having fun rolling virtual dice in R

I devoted a day to learning and playing with RStudio to generate random values. It is just a cool thing to know for later, when more serious probabilities will be analysed.

I wrote my own functions for rolling dice and also drew simple charts of the results.

### Rolling 6 sided dice

This is how it looks when simply run:

```dice.roll <- function(n){sample(1:n, size = 1)}
dice.roll(6)
for (i in c(1:10)){
print(dice.roll(6))
}```

[1] 4
[1] 1
[1] 5
[1] 4
[1] 1
[1] 2
[1] 6
[1] 2
[1] 3
[1] 1

And this is how it looks when drawn as a chart. Remember to vectorize the function.

```dice.roll <- function(n){sample(1:6, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="o", col="blue", xlab="Times rolled", ylab="Value rolled")
title(main="Rolling dice", col.main="red", font.main=4)```
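As an aside, the Vectorize() trick works here because the argument is only used as a counter. A minimal sketch of a more direct route, assuming we simply want many independent rolls: sample() can draw them all in one call with replace = TRUE. The name `rolls` and the fixed seed are my own additions for illustration.

```r
# Assumption: a fixed seed, only to make the example repeatable
set.seed(42)

# 100 independent rolls of a fair 6-sided die in a single call
rolls <- sample(1:6, size = 100, replace = TRUE)

# How often each face came up
table(rolls)

# Same kind of chart as above
plot(rolls, type = "o", col = "blue",
     xlab = "Times rolled", ylab = "Value rolled",
     main = "Rolling dice")
```

Changing `1:6` to `1:12` or `1:2` gives the 12-sided die and the coin without redefining any function.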

### Rolling 12 sided dice

```> for (i in c(1:10)){
print(dice.roll(12))
}```

[1] 8
[1] 1
[1] 12
[1] 3
[1] 2
[1] 10
[1] 9
[1] 10
[1] 4
[1] 1

Chart

```dice.roll <- function(n){sample(1:12, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="o", pch=23, lty=2, col="green", xlab="Times rolled", ylab="Value rolled")
title(main="Rolling 12-sided dice", col.main="brown", font.main=2)```

### 2-sided – like tossing a coin

```> for (i in c(1:10)){
print(dice.roll(2))
}```

[1] 1
[1] 2
[1] 1
[1] 1
[1] 1
[1] 2
[1] 1
[1] 2
[1] 1
[1] 1

Chart

```dice.roll <- function(n){sample(1:2, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="l", pch=20, lty=1, col="violet", xlab="Times tossed", ylab="Value tossed")
title(main="Tossing a coin", col.main="dark blue", font.main=2)```

## Let’s walk through some very simple probability examples.

### Q1: What is the probability of rolling equal numbers when rolling 2 dice once?

To find that, I am going to count the space of good rolls; the probability is then good vs all.

```c <- expand.grid(x=1:6, y=1:6)
cat("total possible combinations for two dice are ", nrow(c))
cat("I can have equal numbers rolled in ", length(which(c[1] == c[2])), " ways")
pA <- length(which(c[1] == c[2]))/nrow(c)
cat("Probability p(A) to roll equal numbers is: ", pA)```

total possible combinations for two dice are 36
I can have equal numbers rolled in 6 ways
Probability p(A) to roll equal numbers is: 0.1666667

### Q2: What is the probability of rolling a sum between 7 and 10 when rolling 2 dice?

```cat("combinations to have sum between 7 and 10 are ",length(which(rowSums(c) >= 7 & rowSums(c) <= 10)))
pB <- length(which(rowSums(c) >= 7 & rowSums(c) <= 10))/nrow(c)
cat("Probability p(B) to roll sum between 7 and 10, is: ", pB)```

combinations to have sum between 7 and 10 are 18
Probability p(B) to roll sum between 7 and 10, is: 0.5

### Q3: What is the probability of rolling a sum of 2 or 7 or 8 when rolling 2 dice?

```cat("sum is 2 or 7 or 8 combinations are ",length(which(rowSums(c) == 2 | rowSums(c) == 7 | rowSums(c) == 8)))
pC <- length(which(rowSums(c) == 2 | rowSums(c) == 7 | rowSums(c) == 8)) / nrow(c)
cat("Probability p(C) to get sum 2 or 7 or 8, is: ", pC)```

sum is 2 or 7 or 8 combinations are 12
Probability p(C) to get sum 2 or 7 or 8, is: 0.3333333
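The three answers can also be obtained more compactly. A sketch, assuming the same 36-row grid: a comparison gives a logical vector, and mean() of a logical vector is the share of TRUE values, which here is the probability. The names `grid` and `s` are mine, chosen to avoid shadowing R's built-in c().

```r
grid <- expand.grid(x = 1:6, y = 1:6)   # all 36 equally likely outcomes
s <- grid$x + grid$y                     # sum of the two dice

pA <- mean(grid$x == grid$y)             # equal numbers: 6/36
pB <- mean(s >= 7 & s <= 10)             # sum between 7 and 10: 18/36
pC <- mean(s %in% c(2, 7, 8))            # sum is 2, 7 or 8: 12/36

round(c(pA = pA, pB = pB, pC = pC), 7)
```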

### Draw probabilities of sequential Heads for coin or Six for dice

```coins <- function(x)
{
1/2^(x-1)
}

dice <- function(x)
{
1/6^(x-1)
}

x<-2:15

plot(x,coins(x),type="l",ylim=c(0,1),col="red",
lwd=3,lty=3,main="Probability of sequential Head for coin or six for dice", ylab="probability",
xlab="rolls count", log="x", axes=F)
axis(2,at=seq(0,1,0.1),labels=T)
axis(1,at=seq(0,max(x),1),labels=T)
lines(x,dice(x),type="l",col="blue",lwd=2,lty=2)

legend(2, 1, legend=c("Coins", "Dice"),
col=c("red", "blue"), lty=3:2, cex=0.9)```
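A note on the exponent: x heads in a row has probability 1/2^x, while the x-1 used in coins() and dice() matches the probability that all x tosses (or rolls) show the same face, whichever face it is. A small sketch that checks both readings by enumerating every outcome for x = 3; all variable names here are mine.

```r
x <- 3

# All 2^x outcomes of x coin tosses (1 = head, 0 = tail)
tosses <- expand.grid(rep(list(0:1), x))
heads  <- rowSums(tosses)

p_all_heads <- mean(heads == x)            # 1/2^x
p_all_same  <- mean(heads %in% c(0, x))    # 1/2^(x-1)

# All 6^x outcomes of x dice rolls
rolls <- expand.grid(rep(list(1:6), x))
p_all_six  <- mean(rowSums(rolls == 6) == x)                     # 1/6^x
p_same_die <- mean(apply(rolls, 1, function(r) all(r == r[1])))  # 1/6^(x-1)

c(p_all_heads, p_all_same, p_all_six, p_same_die)
```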

### Let’s roll three dice ten times

```dice.roll <- function(x,n){sample(1:x, size = n, replace=TRUE)}
dice.roll(6,3)

for (i in c(1:10)){
print(dice.roll(6,3))
}```

[1] 3 3 3
[1] 2 2 5
[1] 6 5 1
[1] 6 4 6
[1] 4 1 5
[1] 2 1 1
[1] 6 3 6
[1] 5 6 1
[1] 5 6 4
[1] 3 6 1

### Q4: When three dice are rolled, what is the probability of getting at least one “1”, given that all three numbers are different?

Conditional probability is easy: P(A|B) = P(A ∩ B) / P(B)

• A is “at least one 1 is rolled”
• B is “all three numbers are different”; this does not always happen, so its probability has to be found too

I’ll do it the way engineers do – just count the outcomes :) without formulas

```c <- expand.grid(x=1:6, y=1:6, z=1:6)
cat("P(A ∩ B) Combination count when at least one is 1 and different numbers rolled",length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]&(c[1] == 1 | c[2] == 1 | c[3]==1))),
" and total count of combinations is ", nrow(c))
pAB <- length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]&(c[1] == 1 | c[2] == 1 | c[3]==1)))/nrow(c)
cat("P(B) = combinations count when different numbers rolled ",length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3])),
" and total count of combinations is", nrow(c))
pB <- length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]))/nrow(c)
cat("P(A|B) = P(A ∩ B) / P(B) = ", pAB / pB)```

P(A ∩ B) Combination count when at least one is 1 and different numbers rolled 60 and total count of combinations is 216
P(B) = combinations count when different numbers rolled 120 and total count of combinations is 216
P(A|B) = P(A ∩ B) / P(B) = 0.5
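As a cross-check of the counting, a short sketch using combinations: given that all three numbers differ, the rolled set is any 3 of the 6 faces, and the sets without a “1” are any 3 of the remaining 5 faces. choose() is base R; the name `p_no_one` is mine.

```r
# P(at least one 1 | all different) = 1 - P(no 1 | all different)
p_no_one <- choose(5, 3) / choose(6, 3)   # 10 / 20
1 - p_no_one
# [1] 0.5
```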

Next time: some binomials. I wish I had 48 hours a day.

### Mathematical Statistics: the more I learn, the more I realize how much I don’t know

There are two courses scheduled in the Big Data analytics module this semester:

Warehousing I dare to think I know quite well. However, the formal requirements of the course are much harder than one could expect: presentations, research and a real DWH to be developed. I love it, as it means I will learn new things, not just reuse what I already know.

I wish I could share the same optimism about statistics theory and formulas. I faced the sad truth:

1. I have forgotten how to calculate integrals, and I deeply regret that other skills, like the Poisson approximation of binomial probabilities, have left my memory too…
2. I have no experience with programming in R.

### The first lecture started.

The professor told us the plan. Well, this course will be quite tough, I thought.

Imagine my face when I realized this was not a plan of the course. This was a plan for the first intro lecture. Kind of the easy one.

E.g., the topic “binomial distributions” was covered in a few seconds – “Binomial distributions you all obviously know”. Click, next slide – “Central Limit Theorem, you all know that also”.

Mamma mia, where am I and where are my belongings? Why was there no intro about the emergency exits, the life jackets and the oxygen masks dropping down for those who are in panic??

Calm down, dude, calm down.

The teacher, prof. Janis Valeinis, is the type of teacher I value the most – he has a sparkle in his eyes. He loves the topic, he lives in the topic, and statistics is his passion. It actually is an honor for me to study here and have this amazing chance to learn from a professor like him. After all, these are master’s degree level studies, and it is only normal that students are presumed to be skilled beforehand.

Look around: each student is sitting with a poker face, and so will I. I WILL!!! I was here 20 years ago as a successful student in love with differential equations and the algebra of sets. I will domesticate integrals and Poisson again, and even more, I will put a bridle on R too.

So my mission is, should I choose to accept it:

1. remember a lot of basics, from combinatorics to integrals
2. be able to follow study topics that presume all students know the concepts
3. accelerate my R skills from 0 to 100
4. do homework, pass quizzes and the exam
5. ah, yes, and keep working full-time as a DWH analyst-programmer.

I will not mention my other roles like servant of our cats. No discounts anyway.

## Here comes my plan. NB: I do not know yet whether it will work :)

Step 1. Install RStudio (free, open source)

Step 2. Buy 1,001 Statistics Practice Problems For Dummies and set a goal to solve 60 questions per day – I love learning by doing.

Example of question:

(I like the idea and keep ‘R for dummies’ and ‘Statistical Analysis with R For Dummies’ on my waiting list.)

Step 3. Enrol on Coursera in Introduction to Probability and Data. If you choose “audit mode”, it is free: you can listen to the lectures but will not get a certificate of completion etc.

• Install the Coursera app – I can access the course better via the app. In the browser I see a week locked until day x, but via the app I can listen to any topic.
• Download topics for offline use. I listen to them on headphones on my way to work, to lunch and at any suitable moment.

Step 4. Plan and mark in your calendar at least 2 hours per day as learning hours – I use one hour in the early morning and two in the evening after work. On weekends I devote about 8 hours a day to studying.

I just love doing it, like others love knitting. Thanks to my family for accepting my forever-learning way of life.

Enjoy the happy moments when the feeling “I did it” rises like a phoenix.

## The professor started the course with the so-called birthday paradox.

How many students do you need in a class to bet that at least two of them have their birthday on the same day of the year, assuming that all 365 possible birthdays are equally likely?

365 students? 181 students? 100?

Hah, in a room with 23 people there is a 50:50 chance that two will share the same birthday.

And we learned the mathematics behind it.

The first person always has a birthday and any day suits, so we can ignore them.

The probability that the second random person

• has the same birthday as the first is 1/365 (0.3%).
• does not have the same birthday as the first is 364/365 (99.7%), as they may have any birthday other than the birthday of the first person.

The probability that the third person

• has the same birthday as the first person is 1/365, and the same holds for matching the second person (and the probability would be 1/365 * 1/365 if we were looking for all three of them having their birthdays on the same day). The number of possible match combinations keeps growing with each next person in the sequence. That is why in an “at least one” task it is easier to start from the opposite: calculate the probability that no two people in the room share a birthday and subtract it from 1, thus getting the probability of at least one matching birthday.
• does not have the same birthday as the first and the second is 363/365, as the third person may have any of the birthdays not already taken by the first and second persons (assuming those two differ)

The probability that the fourth person does not share a birthday with the first three is 362/365, etc.

Total probability that

• second person does not have the same birthday as first
• AND third person does not have the same birthday as first and second
• AND fourth person does not have the same birthday as first and second and third

is the product of these probabilities:

365/365 * 364/365 * 363/365 * … for as many persons as we want to consider.

Here I will confess that I started using R with the assumption that it is just another C++, and started writing loops like

```birthday <- function(n)
{
1-(365-n+1)/365
}

y <- 1;
for (n in 1:10)
{
y <- y*birthday(n);
return(y);
}```

Happily I did not succeed, because if I had, I might have adopted the wrong mindset. The key to understanding for me was this sentence I found in one of the blogs about R:

### In R you need to switch your mentality to thinking in vectors instead of for loops.

When I realized that concept I became friends with R (moreover, now I am in love with R). Thank God and my university degree, I am quite familiar with vectors.

In R everything is a vector. A single number is a length-one vector. You can choose when to calculate over a whole vector and when over a single number. This little example helped me:

```#simple function to multiply given number with the next in sequence
> test <- function(n)
{n*(n+1)}

# calculate when n is 1
> n <- 1
> test(n)
[1] 2
#when n is 2
> n <- 2
> test(n)
[1] 6
# when n is 3
> n <- 3
> test(n)
[1] 12
# and now one of the powers of R - no looping needed:
> n <- 1:3
> test(n)
[1] 2 6 12```

One more example of different usages

```> testn <- function(n)
{prod(n*(n+1))}
> testn(1:3)
# now you will have the results multiplied 2 * 6 * 12 = 144
[1] 144
# vectorize it and repeat - completely different result
> testn <- Vectorize(testn)
> testn(1:3)
[1] 2 6 12```

After getting this idea I immediately deleted my clumsy loops and voilà, my new R life began. Learned by playing:

```> prod(3)
[1] 3
> prod(3):1
[1] 3 2 1
> prod(3:1)
[1] 6```

And birthdays are now easy – we let R go through the given class sizes, calculating for 2 persons, for 3, 4, 5, … as many as we want.

I must note here that the pigeonhole principle (Dirichlet’s principle, “Dirihlē princips” in Latvian) clearly says that if we have more than 365 persons, e.g. 366 persons, it is guaranteed that at least two persons will have the same birthday. But that is not today’s topic. Today we’ll see that mathematically we don’t need 366 persons to have a match :)

```birthdays <- function(n)
{
1-prod((365-n+1):365)/365^n
}
```

This vector stuff is not obvious, especially if you don’t have the background. I’ll show more examples. Let’s forget to vectorize the function:

```
birthdays <- function(n)
{
1-prod((365-n+1):365)/365^n
}

> n <- 1:3
> birthdays(n)
[1] 0.0000000 0.9972603 0.9999925
Warning message:
In (365 - n + 1):365 :
numerical expression has 3 elements: only the first used
```

What happened? Is it really true that 2 persons in a class have a 99.7% chance of matching birthdays? And three even 99.99%?

### C’mon, something is definitely wrong here.

We passed too many values to our built-in R loop, the `:` range. See this example:

So, what happened, and why for n = 1, 2, 3 (to calculate the needed 365, 364, 363)

did we get the crazy results 0.0000000 0.9972603 0.9999925?

Because in the first part of formula we try to get three elements up to 365:

365 to 365 = 365

364 to 365 = 364, 365

363 to 365 = 363, 364, 365

My current code cannot produce all three ranges at once. That is why R warned that only the first value, n = 1, will be used, so ‘365’ is taken as the result of the first part of the formula for every n.

0.0000000 (for n=1) is the result of the formula 1 – (365)/365^1, which is OK, as there is only one person here

0.9972603 (for n=2) is the result of the formula 1 – (365)/365^2, while we need the result to be 1-(365/365)*(364/365) = 1-(365*364)/365^2 = 0.00273973

0.9999925 (for n=3) is the result of the formula 1 – (365)/365^3, while we need the result to be 1-(365/365)*(364/365)*(363/365) = 1-(365*364*363)/365^3 = 0.00820417

### Let’s take an even crazier example and follow how its values are obtained:

> n <- 3:4;n
[1] 3 4
> birthdays(n)
[1] 0.008204166 0.997282751
Warning message:
In (365 - n + 1):365 :
numerical expression has 2 elements: only the first used

Now we have n = 3, 4, and after calculating (365-n+1) we have 363, 362, so as per our conditions we want the ranges

{363,364,365}

{362,363,364,365}

But again, only one range can be used, and R says it will use the first.

0.008204166 (this is for n=3) is the result of formula 1 – (363*364*365)/365^3

0.997282751 (this is for n=4) is the result of formula 1 – (363*364*365)/365^4

Now we have more motivation to remember to vectorize the function :) and

## voila! we have The Results!

```> n <- 1:4;n
[1] 1 2 3 4
> birthdays <- Vectorize(birthdays)
> birthdays(n)
[1] 0.000000000 0.002739726 0.008204166 0.016355912```

Why did it start working? Because the vectorized function is now called separately for each n, so it works with the products (364*365), (363*364*365) etc. instead of the previous approach of using one range for all n.

Here:

0.000000000 is the probability of a birthday match in a class with one person (1-365/365). Of course, a match is not possible, as there is no other person to match with.

0.002739726 is the probability of a birthday match with 2 persons in the class (0.3%, or 1-(364/365)*(365/365))

0.008204166 is for 3 persons in the class (0.8%, or 1-(363/365)*(364/365)*(365/365))

0.016355912 is for 4 persons (1.6%, or 1-(362/365)*(363/365)*(364/365)*(365/365))
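To make explicit what Vectorize() does here, a minimal sketch, assuming the un-vectorized definition from above (renamed `birthdays_one`, my name): it is roughly equivalent to calling the function once per value with sapply(). A loop-free alternative is cumprod(), which builds all the running products in one pass.

```r
# The plain, un-vectorized function: valid for a single n only
birthdays_one <- function(n) {
  1 - prod((365 - n + 1):365) / 365^n
}

n <- 1:4

# Roughly what Vectorize(birthdays_one)(n) does: one call per element
via_sapply <- sapply(n, birthdays_one)

# Alternative: running product of 365/365, 364/365, 363/365, ...
# (this form assumes n is the sequence 1, 2, 3, ...)
via_cumprod <- 1 - cumprod((365 - n + 1) / 365)

via_sapply
via_cumprod
```

Both vectors should agree with the four probabilities listed above.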

Now let’s draw it as a chart and search for the person count when we reach probability 50:50 or 0.5

```
birthdays <- function(n)
{
1-prod((365-n+1):365)/365^n
}
n <- 1:50
birthdays <- Vectorize(birthdays)

plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red",
lwd=2,lty=3,main="Birthdays", ylab="probability that at least two share a birthday",
xlab="number of people", axes=FALSE)

axis(2,at=seq(0,1,0.1),labels=TRUE)
axis(1,at=seq(0,max(n),1),labels=TRUE)

# draw a line at probability 0.5
abline(h=0.5,lwd=1,col="brown")

legend(1, 1, legend="Probability chart", col="dark red", lwd=2,lty=3, cex=0.6)
```

and enjoy the result.

We see that the probability passes 50:50 at a person count of 23. Let’s calculate the probabilities for 22 and 23 persons:

> birthdays(22)
[1] 0.4756953
> birthdays(23)
[1] 0.5072972

### You might ask to perform the calculations the opposite way:

if we set the desired probability, like 0.5, how do we find the number of persons at which it is reached?

This type of calculation is done by approximation, because the inverse formula for finding n is complex: n is at once the number of factors multiplied, the amount by which they decrease, and the power in the denominator.

In our situation the solution will be the way engineers do it :)

Option a: do the calculations, store them in an array and retrieve the nearest value by index. Read below how many persons to invite to the class if you want 50% or 99% probability:

```> # calculate values for 1 to 100 persons and store in an array
[1] 0.000000000 0.002739726 0.008204166 0.016355912 0.027135574 0.040462484 0.056235703 0.074335292 0.094623834
[10] 0.116948178 0.141141378 0.167024789 0.194410275 0.223102512 0.252901320 0.283604005 0.315007665 0.346911418
[19] 0.379118526 0.411438384 0.443688335 0.475695308 0.507297234 0.538344258 0.568699704 0.598240820 0.626859282
[28] 0.654461472 0.680968537 0.706316243 0.730454634 0.753347528 0.774971854 0.795316865 0.814383239 0.832182106
[37] 0.848734008 0.864067821 0.878219664 0.891231810 0.903151611 0.914030472 0.923922856 0.932885369 0.940975899
[46] 0.948252843 0.954774403 0.960597973 0.965779609 0.970373580 0.974431993 0.978004509 0.981138113 0.983876963
[55] 0.986262289 0.988332355 0.990122459 0.991664979 0.992989448 0.994122661 0.995088799 0.995909575 0.996604387
[64] 0.997190479 0.997683107 0.998095705 0.998440043 0.998726391 0.998963666 0.999159576 0.999320753 0.999452881
[73] 0.999560806 0.999648644 0.999719878 0.999777437 0.999823779 0.999860955 0.999890668 0.999914332 0.999933109
[82] 0.999947953 0.999959646 0.999968822 0.999975997 0.999981587 0.999985925 0.999989280 0.999991865 0.999993848
[91] 0.999995365 0.999996521 0.999997398 0.999998061 0.999998560 0.999998935 0.999999215 0.999999424 0.999999578
[100] 0.999999693
Enter desired probability: 0.5
[1] 23
Enter desired probability: 0.99
[1] 57```
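The code behind this output did not make it into the post, so here is a minimal sketch of how such a lookup could be done, not necessarily the way it was originally written. The names `p` and `persons_needed` are hypothetical, the target is passed as an argument instead of being typed at a prompt, and the sketch takes the first class size that reaches the target.

```r
# Probabilities of at least one shared birthday for 1 to 100 persons
p <- 1 - cumprod((365 - 1:100 + 1) / 365)

# Smallest class size whose probability reaches the target
persons_needed <- function(target) {
  which(p >= target)[1]
}

persons_needed(0.5)
# [1] 23
persons_needed(0.99)
# [1] 57
```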

Option b: use the R function uniroot to do essentially the same – search through values for the hit – but I haven’t learned it yet.
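For the curious, a hedged sketch of what Option b might look like. uniroot() searches a continuous function for a root, so this assumes a continuous version of the birthday formula built with lgamma() (the product 365 * 364 * … equals gamma(366) / gamma(366 - n)). The names `birthdays_cont` and `persons_needed_root` and the search interval 1 to 100 are my own choices.

```r
# Continuous version of the birthday formula, using log-gamma
birthdays_cont <- function(n) {
  1 - exp(lgamma(366) - lgamma(366 - n) - n * log(365))
}

# Hypothetical helper: solve birthdays_cont(n) = target on [1, 100]
persons_needed_root <- function(target) {
  root <- uniroot(function(n) birthdays_cont(n) - target,
                  interval = c(1, 100))$root
  ceiling(root)   # round up to whole persons
}

persons_needed_root(0.5)
# [1] 23
persons_needed_root(0.99)
# [1] 57
```

Base R also ships qbirthday() in the stats package, which answers the same question directly.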

## Now let’s have some fun: let’s generate random birthdays and see what happens.

I assigned the numbers 1 to 365 to the days, then I generated two random sample sets, took the intersecting values and converted them to dates in the year 1970 (just for fun).

```smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
23

sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))```

### And the results of 10 randoms for 23 people (probability to get a match in a single trial is 50%):

• no match: 1 time
• one common birthday: 4 times
• two common birthdays: 3 times
• three common birthdays: 2 times
```> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 23
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "1970-09-18" "1970-11-25"
[1] "Date of length 0" #no match
[1] "1970-03-11" "1970-04-03" "1970-07-25"
[1] "1970-01-07" "1970-12-29"
[1] "1970-01-25"
[1] "1970-01-26"
[1] "1970-04-01"
[1] "1970-01-23" "1970-06-06" "1970-08-27"
[1] "1970-09-30"
[1] "1970-08-16" "1970-08-27"```

### Let’s do 10 randoms for 10 people in class (probability to get a match in a single trial is 12%):

• no match: 7 times
• 1 common birthday:  2 times
• 2 common birthdays: 1 time
```> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 10
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-12-19"
[1] "Date of length 0"
[1] "1970-07-04"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-02-06" "1970-11-09"```

### Let’s do 10 randoms for 5 people in class (probability to get a match in a single trial is 3%):

• no match: 9 times
• 1 match: 1 time
```> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 5
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-09-07"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"```

### Let’s do 10 randoms for 2 people in class (probability to get a match in a single trial is 0.3%):

• no matches
```> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 2
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"```
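One caveat about the experiment above: intersect() of two independent samples finds days shared between two separate groups of that size, which is a slightly different event from two people matching inside one class. A hedged sketch of the within-class version, using duplicated() on a single sample; `has_match`, the seed and the 10,000 repetitions are my choices, not part of the original run.

```r
set.seed(2018)  # assumption: fixed seed for repeatability

# TRUE if at least two of n random birthdays fall on the same day
has_match <- function(n) {
  days <- sample(365, n, replace = TRUE)
  any(duplicated(days))
}

# Share of simulated classes with at least one shared birthday
sizes <- c(2, 5, 10, 23)
sim <- sapply(sizes, function(n) mean(replicate(10000, has_match(n))))
names(sim) <- sizes
sim
```

The estimates should land near the theoretical values of roughly 0.3%, 3%, 12% and 51%.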

And now some more fun: let’s add another chart:

### Probability to have the same day of a month in a class of n persons

(assuming for simplicity that every month has 31 days)

```birthdays <- function(n)
{
1-prod((365-n+1):365)/365^n
}

birthdays <- Vectorize(birthdays)

birthdates <- function(n)
{
1-prod((31-n+1):31)/31^n
}

birthdates <- Vectorize(birthdates)

n <- 1:50

plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red",
lwd=2,lty=3,main="Matching days two at least", ylab="probability",
xlab="person count", axes=F)

axis(2,at=seq(0,1,0.1),labels=T)
axis(1,at=seq(0,max(n),1),labels=T)

abline(v=23,lwd=1,col=5)
abline(h=0.5,lwd=1,col="brown")

lines(n,birthdates(n),type="l",col="blue",lwd=2,lty=3)
abline(v=6.8,lwd=1,col="violet")

legend(1, 1, legend=c("Days of year", "Days of month"), col=c("dark red", "blue"), lwd=2,lty=3, cex=0.8)```

And we see that the 50:50 probability of sharing the same day of the month is reached at 7 persons.
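The crossing points can also be read off numerically instead of from the chart. A small sketch, assuming a helper `match_prob` (my name) that generalises the formula to any number of equally likely days:

```r
# Probability of at least one match among n persons for a given number
# of equally likely days; assumes n is the sequence 1, 2, 3, ...
match_prob <- function(n, days) {
  1 - cumprod((days - n + 1) / days)
}

n <- 1:31
which(match_prob(n, 31) >= 0.5)[1]    # days of month
# [1] 7
which(match_prob(n, 365) >= 0.5)[1]   # days of year
# [1] 23
```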

The next blog post is expected to be about probabilities and the binomial distribution, discussing solutions for tasks like

• Q1. A coin is tossed 20 times. What is the probability of getting exactly 3 heads?
• Q2. A die is rolled 20 times. What is the probability of getting exactly 10 sixes?
• Q3. 80% of people who purchase pet insurance are women. If 9 pet insurance owners are randomly selected, find the probability that exactly 6 are women.

P.S. Meanwhile – behind the scene:

### My Oracle certifications what-ifs and lessons learned: have good skills and you’ll not need luck

My takeaway from the Big Data course “Data processing systems” I was blogging about is passing the course exam and reusing the knowledge for an Oracle certification.

I passed the course exam at the end of December and had a month until the next semester, so I decided to use it for additional learning and to apply for the Oracle certification Oracle Big Data 2017 Certification Implementation Specialist.

I will return to Big Data topic when I have the next Big Data course in following semesters. This semester my challenges are

• Data warehousing
• Statistics.

I hope to start blogging about them soon.

## Meanwhile, in this blog entry let me share some of my lessons learned from Oracle certifications

Oracle has about 200 different certifications in 10 categories. You can browse:

–Certifications and exams required (NB: combinations available)

–Exams and what certification this exam is useful for

I have done four (I know a guy holding 32) and it makes me think I have some things to say about the process. I have

Credentials are granted based on a combination of passing exams, training and performance-based assignments, depending on the level of certification.

#### Oracle has good marketing:

• “Get the job you want – earn an Oracle Certification”
• “86% of hiring managers surveyed say that IT certifications are a priority during the candidate evaluation process” etc.

#### How to choose from the exam set?

It depends on

• your plans towards certification path
• value (e.g. passing the exam 1Z0-071 counts towards 4 certifications!)
• project requirements
• your new job or career change dreams (I did it several years ago, and I know my OCA certificate was the trump card when I, at that time an Informix developer, was confirmed in my current Oracle expert position)
• your current skills (why not go for some ‘done’? – I do that a lot)
• Oracle environment available
• Oracle versions covered in exam
• exam price
• accessibility (proctored (classroom)/non proctored (online))
• exam length and passing score
• for fresh exams there might be fewer training materials available
• your attitude and readiness to learn and challenge yourself (I do that a lot)

If exams are similar, then search for something like ‘1Z0-071 vs 1Z0-061’. I usually choose the hardest one to learn more.

Some time ago SQL was only a tool for me, like a hammer, partly because I was working in Informix and had no feeling that I could do miracles in SQL and Informix SPL (apologies for liking Oracle SQL and PLSQL better).

Little by little I realized I wanted more and decided to study Oracle in parallel to get ready for changing my job profile. I read about the Order of the Wooden Pretzel, inspired by the famous Steven Feuerstein quote “SQL is not a complete language. Some people can perform seeming miracles with straight SQL, but the statements can end up looking like pretzels created by someone who is experimenting with hallucinogens”. To prove that, international SQL challenges were held, and the winner became a Knight of the Order of the Wooden Pretzel.

May the quickest, most entertaining, most educational, most creative, and somewhat readable solution prevail!

In recognition that SQL is not the only language in which enterprising developers can create pretzels on hallucinogens, the challenge is also open to NoSQL solutions. Well, it seems this has been put on hold for a while but maybe I was not googling long enough.

### Formalities

#### Oracle strongly prohibits cheating

You sign an agreement before the exam (example – http://education.oracle.com/education/pdf/ocp_candidate_agreement.pdf). Just be aware:

• you can accidentally stumble into brain dump forums/topics (and you may not even know that is what they are…)
• practice tests you buy might turn out to be attractively wrapped dumps

Good news:

• authorised practice tests are available for most exams (but not for all; for example, the Big Data exam does not have one).

### Some characteristics of the exams in SQL and PLSQL:

Exam day

• In short: pay for the exam and book a time via PearsonVUE at BDA
• Read the confirmation email to see what you need to bring with you
• Note, you must have TWO identity documents with you:
  • passport and driver’s licence, for example
• Arrive ~20 mins early (there is coffee and cookies at BDA)
• You will be photographed AND your face will be compared to photos from previous exams, if any (so several photos might be taken until one matches)
• You will place all belongings in a lockable wardrobe
• They’ll give you
  • a special marker-like pen
  • 1-2 sheets of laminated A4 paper (depends on the exam); you may not erase your notes
  • ear plugs
• You will be guided to a computer room and set up until you see [Start test]
• You will be video-monitored all the time
• You may leave the room to go to the WC, but the time is not stopped

Exam software

Before starting the exam you have the option to view an intro.

Interface changes between versions but the idea stays the same:

• Question content (may be with [Exhibit] or [View] buttons for popups)
• Checkbox [Mark] for later review
• Buttons [Previous] and [Next] – you always can browse
• Button [Review] (sub-options – marked ones or all)
• Button [End exam] (you may press it any time; re-asks before quit)
• Time counter showing how much time left like (00:46:32)
• Question counter like ‘Question 17 of 63’

Do not leave marked questions unanswered – points are not deducted for a wrong answer, so guessing comes for free – because exam time flies very fast.

The exam is in English; no dictionary is available and no extra time is given.

#### Question types

• May contain [Exhibit] or [View] buttons for popups, usually showing table descriptions, rarely the target report template the question asks about
• Radio buttons (one from four options usually)
• Checkboxes, usually 4 to 6
• Choose two (or three)
• Choose all that apply
• May contain options ‘all of above’ and/or ‘none of the above’
• May be combined question statements and options like
1. A-4, B-1, C-3, D-1, E-1, F-3
2. A-3, B-4, C-1, D-1, E-1, F-3
3. A-1, B-2, C-4, D-2, E-2, F-1
4. A-1, B-3, C-3, D-2, E-2, F-2
5. A-1, B-1, C-1, D-2, E-1, F-3

#### After exam

1. You will be given a printout saying that your results will be available online at certview.oracle.com in about 30 min
2. Usually they are there within 10 minutes (if you have an issue with your CertView account, you can call Oracle support some hours later, and they will tell you the results by phone based on your testing ID)

• Earlier they used to send printed certificates and cards, but now they are saving the environment and do not send them

Associate level certification

Professional level certification

### Learning – what?

• I found Gints Plivna’s blog very useful, especially on NULL: https://datubazes.wordpress.com/sql-pamati/
• Stay real and stay calm. Don’t try to memorize REGEXP
• Enjoy the process and have fun
• I subscribed to various ‘Daily SQL challenge’-style mailings via email

### Learning – how?

• Despite experience it takes time, if your target is to add value for yourself
• how long depends on experience. E.g., OCA takes about a month at 1-2 hours daily
• read topic by topic in the manuals
• play a lot in Oracle
• drill practice tests
• Set your mindset to doing exercises and daily tasks correctly the first time
• Consider buying practice tests and drilling daily in portions
• Remember: the passing score is never 100%! You CAN afford to make mistakes
• Helpful: quickly recognising obvious errors to narrow down the options
• Questions containing functions may serve as hints for others (like the syntax of NVL2 or TO_CHAR or SUBSTR)
• The internet is full of incorrect examples and wrong answers, so it is crucial to distinguish right from wrong and test, test, test
• if this is not an Oracle page and there is a question with an answer like ‘Correct: B’ without explanation, it sounds like a stolen dump – be aware
• often accompanied by a note like ‘any fool can see that the correct answer is B’

Many people enjoy writing blogs about their preparation; I was reading them as well.

## WHAT IF

Practice tests – see education.oracle.com

| Exam | Practice test | Access | Price | Questions |
| --- | --- | --- | --- | --- |
| Oracle Database 11g: SQL Fundamentals I 1Z0-051 | https://www.transcender.com/practice-exam/oracle/1z0-051.kap (it was easier than the real exam) | 30 days online access | 99 \$ | 146 questions |
| Oracle Database 11g: Program with PL/SQL 1Z0-144 | https://www.transcender.com/practice-exam/oracle/1z0-144.kap | 30 days online access | 99 \$ | 147 questions |
| Oracle Database 11g: Advanced PL/SQL 1Z0-146 | https://www.transcender.com/practice-exam/oracle/1z0-146.kap (again – it was easier than the exam) | 30 days online access | 99 \$ | 224 questions |

There are plenty of practice tests if you google. Remember: you sign an agreement not to use unauthorised materials

– Practice Exam: Oracle authorized practice exam from Transcender

## Examples of challenges

• There is always a chance that something you do not know will appear
• There might be questions like ‘what is the 3rd parameter of DBMS_RLS.REFRESH_GROUPED_POLICY’
• And, e.g., which of the answers can be produced by a specific built-in package (DBMS_REDEFINITION, DBMS_HPROF, DBMS_LOB, UTL_COLL etc.)
• Don’t panic: your background is good enough to make a good guess. Remember, you can afford some mistakes

And, after all, even failing is not the end of the world. Very many people fail. One of my friends passed the PL/SQL exam on the 4th try.

Good skills and good luck!

### Big Data: Phonetic Similarity : Soundex – words are similar if they sound the same

I guess you have seen surnames like Meier and Mayer or Smith, Smyth, Smithe, Smeth, Smeeth. These might be correct spellings just as well as misspelt surnames: if you dictated them to me over the phone, who knows what I’d write.

So far I have been blogging about similarity algorithms based on how strings are written. Today let’s discuss finding a match when words sound the same. The most commonly used phonetic measure is Soundex. It has several extensions, and today’s topic is

### Soundex for the English language.

Soundex calculates a four-character code from a word based on its pronunciation and considers two words similar if their codes are equal. The idea is that similar-sounding letters are assigned the same Soundex code. It is widely used in genealogy and archives, for searching ancestors, relatives, families and heirs.

1. The first character is the starting letter of the word (a variation called “Reverse Soundex” prefixes the last letter instead of the first).
2. Drop all other occurrences of a, e, i, o, u, y, h, w.
3. Replace consonants after the first letter with digits as follows:
• b, f, p, v → 1
• c, g, j, k, q, s, x, z → 2
• d, t → 3
• l → 4
• m, n → 5
• r → 6
4. If two or more letters with the same number were adjacent in the original name (before step 2), or adjacent except for any intervening h and w, then omit all but the first.
5. Return the first four characters, padded with 0 (padding means replacing blanks with 0: ‘Ahoi’ will have the code A000, ‘Dude’ will have D300, always a four-character code).
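
To make the steps concrete, here is a minimal Python sketch of the rules listed above. It is my own illustration and assumes the rules are applied exactly as written; the names `soundex` and `CODES` are made up for this example, and Oracle's built-in SOUNDEX or other implementations may treat edge cases differently.

```python
# Digit for each consonant group, as listed in step 3.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of an English word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")   # digit of the previous coded letter
    for ch in letters[1:]:
        if ch in "hw":
            continue                   # h and w do not separate equal digits
        digit = CODES.get(ch, "")      # vowels and y map to no digit
        if digit and digit != prev:
            code += digit
        prev = digit                   # a vowel resets prev, so a repeat counts again
    return (code + "000")[:4]          # pad with zeros, keep four characters


for name in ("Ahoi", "Dude", "Meier", "Mayer", "Smith", "Smyth"):
    print(name, soundex(name))
```

With these rules Meier and Mayer both come out as M600, and Smith and Smyth as S530, so each pair counts as similar.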

Let’s have an example set – surnames. I used Oracle RDBMS this time.

Now let’s compare similarities by three methods: Edit distance, Jaro-Winkler, Soundex.

And here are the results. Notice the combinations we have: if I set the similarity threshold for Edit distance and Jaro-Winkler to 50%, we get several combinations, including false positives and false negatives:

• all three methods match – like ‘Mirhe’
• Jaro-Winkler and soundex match, but Edit distance doesn’t – like ‘Meiyar’
• Jaro-Winkler match but Soundex doesn’t – like ‘Mayes’
• Edit distance and Jaro-Winkler match but Soundex doesn’t – like ‘Mimre’ or ‘Mirfe’

You see, Soundex is not a silver bullet and, as I always write, we must try and test, test and try.

I’ll show you one more weakness of Soundex:

Of the three approaches I used, Soundex is the only one which did not find a similarity :)

### Some of Soundex variants

• The New York State Identification and Intelligence System (NYSIIS) algorithm maintains relative vowel positioning, while Soundex does not.
• Daitch–Mokotoff Soundex (D–M Soundex) is an adaptation for Jewish surnames of Germanic or Slavic origin, sometimes referred to as “Eastern European Soundex”. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999; the calculation is much more complex than Soundex.
• Metaphone, Double Metaphone, Metaphone 3. Powerful and customisable rule set, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations.

I googled an online Metaphone calculator. They say it is more accurate than Soundex, but I can hardly agree:

• The metaphone code for Schwarzenegger is SXWRSNKR.
• The metaphone code for Schvartzeneger is SXFRTSNJR.
• These surnames do not have the same metaphone code.

Then I tried it on one of my Soundex matches and, again, no luck:

• The metaphone code for Meiyar is MYR.
• The metaphone code for Mire is MR.
• These surnames do not have the same metaphone code.

I was also searching for a Latvian edition of Soundex; I am quite sure it exists. I found this: http://www.lzp.gov.lv/images/stories/dokumenti/Zin_rezult_2008.pdf

In 2008 a flexible database model for a universal lexicon system was developed and refined. It provides a unified infrastructure (common indexing and feedback mechanisms, etc.) and functionality (advanced entry search, configurable display, etc.) for all dictionaries hosted in the database, regardless of their entry schemas. For indexing and search, the Soundex algorithm was adapted to the Latvian language in order to find words that were entered imprecisely but are similar in pronunciation. (A. Spektors)

P.S. For those who also read Latvian: here is a small and sweet forum post in which a person tries to build a search suggestion tool (“did you mean XXYZZX?”) https://exs.lv/say/16261/1441002-so-nakti-pavadiju-veidojot

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

### Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words

Some days ago I covered how to find which two of these strings are most likely about the same real-world entity by recognizing frequently used words and giving them a lower impact. The first and third options were found to be the most similar:

• Apple Corporation, USA
• IBM (USA) Corporation
• Corp. Apple

Let’s add a challenge: misspell the word

• Apple Corporation, USA
• IBM (USA) Corporation
• Corp. Aple

And, as you might imagine, the TF/IDF measure is not effective anymore: it cannot recognize that Apple is similar to Aple, because the classic implementation looks for equality, not similarity. Today, similarly to what we did with generalised Jaccard, we will replace the requirement for words to be equal with a requirement to be similar.

As I did before with TF/IDF, let’s remove commas, brackets and tokenize our collection into a bag of terms – documents x,y,z.

• x = {apple,corporation,usa}
• y = {ibm,usa,corporation}
• z = {aple,corp}

Full term set T = {apple,corporation,usa,ibm,aple,corp}

Now we will use a string similarity measure of our choice (and business need). Today I have chosen Needleman-Wunsch and Jaro-Winkler to illustrate the differences, e.g., s(apple,aple) = 0.8 by Needleman-Wunsch and s(apple,aple) = 0.93 by Jaro-Winkler. I used the calculator at https://asecuritysite.com/forensics/simstring

Let’s choose similarity thresholds (these are my choice, they could be any other value in 0..1)

• threshold k = 0.55 for Needleman-Wunsch measure
• threshold k = 0.45 for Jaro-Winkler measure

Now we will find the terms which have a similar term in the other document.

## Soft TF/IDF based on Needleman-Wunsch

The initial steps are the same as for classic TF/IDF. Calculate:

### Frequency (TF)

is the number of times each term appears in each document.

Full term set T = {apple,corporation,usa,ibm,aple,corp} (as you see we will have six dimension vectors)

```
TF(apple,x) = 1 (document x contains apple once)
TF(apple,y) = 0
TF(apple,z) = 0
TF(corporation,x) = 1
TF(corporation,y) = 1
TF(corporation,z) = 0
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
TF(aple,x) = 0
TF(aple,y) = 0
TF(aple,z) = 1
TF(corp,x) = 0
TF(corp,y) = 0
TF(corp,z) = 1
```

### Inverse Document Frequency

IDF(apple) = 3/1 (one of three documents contains apple) = 3
IDF(corporation) = 3/2 = 1.5
IDF(usa) = 3/2 = 1.5
IDF(ibm) = 3/1 = 3
IDF(aple) = 3/1 = 3
IDF(corp) = 3/1 = 3

### Feature vectors (Feature = TF*IDF)

Let’s normalise these vectors: remember from trigonometry, it means getting the unit vector of length 1 by dividing the coordinates by the length of the vector.

```
length of vector x = sqrt((3*3) + (1.5*1.5) + (1.5*1.5)) = 3.67
length of vector y = sqrt((1.5*1.5) + (1.5*1.5) + (3*3)) = 3.67
length of vector z = sqrt((3*3) + (3*3)) = 4.24
```

Normalised vectors for documents (each coordinate divided by vector’s length):
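
The TF, IDF and normalisation steps above can be sketched in a few lines of Python. This is only an illustration: the names `docs`, `idf` and `normalised_vector` are mine, IDF is taken as plain N/Nd without a logarithm (as in this post), and only the non-zero coordinates of each vector are stored.

```python
import math

# The three tokenized documents from this post.
docs = {
    "x": ["apple", "corporation", "usa"],
    "y": ["ibm", "usa", "corporation"],
    "z": ["aple", "corp"],
}


def idf(term):
    """Number of documents divided by the number of documents containing the term."""
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return len(docs) / containing


def normalised_vector(tokens):
    """TF*IDF features of one document, scaled to a unit-length vector."""
    features = {t: tokens.count(t) * idf(t) for t in set(tokens)}
    length = math.sqrt(sum(v * v for v in features.values()))
    return {t: v / length for t, v in features.items()}


vectors = {name: normalised_vector(tokens) for name, tokens in docs.items()}
for name, vec in vectors.items():
    print(name, {t: round(v, 2) for t, v in sorted(vec.items())})
```

Rounded to two decimals this gives 0.82 for apple and 0.41 for corporation and usa in x, the same pattern in y (0.82 for ibm), and 0.71 for both terms in z.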

### Now Needleman-Wunsch scores come into play.

We compute the close terms.

For close(x,y,0.55): apple is not included, because its best candidate in document y is corporation with a similarity of 0.55, but corporation in y is bonded with a score of 1 to corporation in x, which blocks apple. Corporation is included, because its strongest bond is with corporation in document y. Usa is included, because its strongest bond is with usa in document y.

• close(x,y,0.55) = {corporation,usa}
• close(x,z,0.55) = {apple}
• close(y,z,0.55) = {} (no pair passed the threshold)

In the final step we compute the similarity as in the TF/IDF formula, but give each component a weight: the string similarity of the two matched terms. We are looking for the closest vectors. The narrower the angle, the larger its cosine. Thus we have to calculate the cosine of the angle between the vectors and pick the largest one. As our vectors are normalised, the cosine formula is simply the dot (scalar) product.

• similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Needleman-Wunsch similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
• similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.8 = 0.46 = 46%
• similarity(y,z) = 0% (remember that no word pair, including corp and corporation, passed our threshold)
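
The weighted dot product above can be checked with a short Python sketch. It assumes the close pairs and their Needleman-Wunsch scores are already known (they are typed in by hand from this post), and the names `features`, `unit` and `soft_similarity` are my own.

```python
import math

# Non-zero TF*IDF features per document, as used in the length calculation above.
features = {
    "x": {"apple": 3.0, "corporation": 1.5, "usa": 1.5},
    "y": {"ibm": 3.0, "usa": 1.5, "corporation": 1.5},
    "z": {"aple": 3.0, "corp": 3.0},
}


def normalise(vec):
    """Scale a sparse vector to unit length."""
    length = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / length for t, v in vec.items()}


unit = {name: normalise(vec) for name, vec in features.items()}


def soft_similarity(a, b, close_pairs):
    """Sum of coordinate products, each weighted by the string similarity.

    close_pairs holds (term in a, term in b, string similarity) for every kept bond.
    """
    return sum(unit[a][ta] * unit[b][tb] * s for ta, tb, s in close_pairs)


sim_xy = soft_similarity("x", "y", [("corporation", "corporation", 1.0),
                                    ("usa", "usa", 1.0)])
sim_xz = soft_similarity("x", "z", [("apple", "aple", 0.8)])
sim_yz = soft_similarity("y", "z", [])   # no pair passed the threshold
print(round(sim_xy, 2), round(sim_xz, 2), round(sim_yz, 2))
```

Without intermediate rounding, x and y come out at 0.33 rather than 0.34 (the hand calculation multiplies already rounded coordinates), and x and z at 0.46, so the ranking is the same.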

Voila! By Needleman-Wunsch the most similar strings are

• Apple Corporation, USA
• Corp. Aple

## Now let’s recalculate soft TF/IDF using Jaro-Winkler.

Threshold k = 0.45 for Jaro-Winkler (I have set it differently from Needleman-Wunsch just for more fun, to learn the differences better).

See in the pictures how we get to the closest results by keeping the strongest bonds:

• close(x,y,0.45) = {corporation,usa}
• close(x,z,0.45) = {apple,corporation}
• close(y,z,0.45) = {corporation}

Now let’s calculate the features from the normalised vectors (the same vectors we calculated before).

As these vectors are normalised, let me remind you that the cosine formula is simply the dot (scalar) product.

• similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Jaro-Winkler similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
• similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.93 + x corporation 0.41 * z corp 0.71 * 0.87 = 0.79 = 79%
• similarity(y,z) = y corporation 0.41 * z corp 0.71 * 0.87 = 0.25 = 25%

Voila! By Jaro-Winkler the most similar strings again are

• Apple Corporation, USA
• Corp. Aple

Do you feel the power of the idea of this hybrid (string & set combined) similarity calculation method?


### Big Data: combining string and set matching methods. One of hybrid similarity measures – generalised Jaccard index

It’s time to start combining string and set matching methods. Let’s have a look at one of the hybrid similarity measures.

The classic Jaccard measure considers overlapping tokens (words, q-grams). To count as overlapping, tokens must be identical. The Jaccard index works very well when names are in mixed order, like “Food and drinks”, “Drinks&Food”, “Serving foods, sweets, drinks”. However, pure Jaccard is too restrictive when the text contains errors.

I had an example in my Jaccard blog entry comparing classes by pupils’ names: Martins, Marta, Katrina, Ance etc. What if some of the names were written with errors?

## Generalized Jaccard measure helps.

First of all, as usual, we convert the strings to be compared into tokens. I’ll reuse the example and put some errors into the names.

Class one pupils x = {Kartina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils y = {Annija,Martins,Matra,Karloina,Ance}

Today we’ll also learn soft overlap, using the best-matching pairs.

### First step. Compare each pair

To compare we need a similarity measure s which returns values 0..1 (the closer to 1, the more similar).

For more fun let’s apply two measures to the same pairs: Edit distance (Levenshtein distance) and Jaro-Winkler. See how the results differ? :) I used https://asecuritysite.com/forensics/simstring (sorry, this page has a bug in its Jaro-Winkler algorithm: it is not true that (janis,martins) scores 0 by JW, it should be 0.67. I could not find any other online calculator, though, and for our experiment this bug is acceptable. Let’s use it as an example of how easy it is to misuse a method when we simply trust somebody’s programmed result without understanding it.)

## Second step.

Choose a threshold and keep only the pairs which exceed it. I have chosen a threshold of 0.5.

### Third step. Keep only the strongest bond.

To find it, I first draw all the bonds above the threshold.

Martins and Martins are of course bonded. It means no other bonds are possible from or to either of them.

Karlina to Karloina is the next strongest remaining bond. Again, no other bonds from or to them.

We are left with Anna to Annija, because all other bonds relate to “engaged” entities. For example, Kartina cannot bond to Karloina with 0.71 because Karlina is already bonded to it with 0.88.

We then calculate the weight of the matching by adding all the match scores (2.55) and dividing by (all names in class x plus all names in class y minus the matching pairs): 2.55 / (6 + 5 - 3) = 0.319 = 32% similarity when we hybridise Jaccard with Edit distance.
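
The three steps can be sketched in Python as follows. This is a rough illustration rather than the calculator I used: I assume the Edit distance similarity is 1 - distance / length of the longer name, bonds are kept greedily from the strongest down, and the function names are made up for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Similarity in 0..1 (assumed normalisation by the longer string)."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def generalized_jaccard(x, y, sim, threshold):
    # Steps 1 and 2: score every pair and keep those above the threshold.
    bonds = sorted(((sim(a, b), a, b) for a in x for b in y
                    if sim(a, b) > threshold), reverse=True)
    # Step 3: walk from the strongest bond down, skipping "engaged" names.
    used_x, used_y, kept = set(), set(), []
    for score, a, b in bonds:
        if a not in used_x and b not in used_y:
            kept.append((a, b, score))
            used_x.add(a)
            used_y.add(b)
    total = sum(score for _, _, score in kept)
    return total / (len(x) + len(y) - len(kept)), kept


x = ["Kartina", "Janis", "Karlis", "Martins", "Anna", "Karlina"]
y = ["Annija", "Martins", "Matra", "Karloina", "Ance"]
score, kept = generalized_jaccard(x, y, edit_similarity, 0.5)
print(kept)
print(round(score, 2))
```

The kept bonds are Martins to Martins, Karlina to Karloina and Anna to Annija, and the score rounds to 0.32, in line with the hand calculation. Some individual pair scores may differ slightly from the online calculator, which can normalise differently.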

Now let’s do the same for Jaro-Winkler. First of all, all the bonds above the threshold:

and keep only the strongest bonds. Again you see, for example, that Kartina cannot bond with Martins at 0.88, because Martins is bonded to Martins with 1. Kartina also cannot bond with Karloina at 0.91, because Karlina is bonded to Karloina with 0.98.

And the formula again: the matching weight divided by the sum of entities minus the matching pairs, and voila! = 0.707 = 71% similarity when we hybridise Jaccard with the Jaro-Winkler measure.

Let me remind you that in my previous blog entry explaining the Jaccard measure I showed that

• similarity with correctly spelled names and requirement for name equality was 10%
• similarity using bigrams was 50%
• similarity with trigrams was 26%

Today we calculated (on a slightly modified set – errors added)

• similarity with hybrid method with Edit distance was 32%
• similarity with hybrid method with Jaro-Winkler measure was 71%

Isn’t it funny – five different results? There is no silver bullet in string and set matching. Try and test, try and test, try and test… repeat.


### Big Data: set similarity : TF/IDF scores and Feature vectors to devaluate terms common in other documents

Term Frequency, Inverse Document Frequency and Feature Vectors. Another great concept which is easy when you start doing it.

Which two of these strings are most likely about the same real world entity?

• Apple Corporation, USA
• IBM (USA) Corporation
• Corp. Apple

I believe you see: first and third. Why do you think so? Because of Apple there? Don’t you see the Corporation, Corp and USA? You somehow understand these are ancillary words, less meaningful.

How to teach a computer to recognize that context? If we look at the string level (Edit distance or Jaro-Winkler, Affine Gap or Smith-Waterman), none of them will solve it; they all will consider these two more similar:

• Apple Corporation, USA
• IBM (USA) Corporation

Well, Jaccard distance would spot the answer we are looking for but it is nice to learn another sophisticated method with added value – recognizing frequently used terms.

### Today’s topic: two sets are similar if they share many frequent terms unless these terms are common in other strings.

The TF score will show us how frequently a term is used.

The IDF score will show us whether the term is common in other strings.

Feature vectors will reveal the most similar strings. A feature vector is a vector that contains information describing an object’s important characteristics – in our case these characteristics (measurable properties) will be TF and IDF scores.

1) remove commas and brackets (I might also remove a, the, and, semicolons etc)

• Apple Corporation USA
• IBM USA Corporation
• Apple Corp

2) convert all to lowercase

3) tokenize (split strings into words) and apply Smith-Waterman and/or Jaro-Winkler to calculate that Corp is similar to Corporation – I’ll assume that these words are equal (I might have set Corporation in my example but I wanted to make this example more challenging to show that data cleaning and applying rules is a normal step)

4) convert ‘corp’ to ‘corporation’ for me not to explain each time that I have a function which recognizes similarity of these terms

Let’s tokenize our collection into a bag of terms and call them documents x,y,z.

• x = {apple,corporation,usa}
• y = {ibm,usa,corporation}
• z = {apple,corporation}

Term set T is {apple,corporation,usa,ibm}

### Frequency (TF)

is the number of times each term appears in each document.

```
TF(apple,x) = 1 (see, document x contains the term 'apple' once)
TF(apple,y) = 0 (this document does not contain 'apple')
TF(apple,z) = 1
TF(corporation,x) = 1
TF(corporation,y) = 1
TF(corporation,z) = 1
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
```

From this we can also read that if this document contains this term, the score is >0, otherwise 0. TF score can’t be negative.

### Inverse Document Frequency (IDF)

is the number of documents in the collection (N) divided by the number of documents containing this term (Nd).

```
IDF(apple) = 3/2 (two of three contain apple) = 1.5
IDF(corporation) = 3/3 (all three contain corporation) = 1
IDF(usa) = 3/2 (two of three contain usa) = 1.5
IDF(ibm) = 3/1 (one of three contains ibm) = 3
```

Well, now we have numbers. And?

## Feature vectors

Do you remember vectors from school? A vector is something like an arrow. A plane flying north at 850 km/h is one vector and the wind blowing north-west is another. If the wind is strong enough, the plane seen from the ground might seem to be slipping sideways a little.

Vectors are used in computing very, very often. Example: v = [R; G; B]; is a feature vector containing the color components of a pixel or an object.

In our case

• documents x,y,z will be the vectors
• Term set T {apple,corporation,usa,ibm} represents the dimensions (yeah, if we have a thousand terms then we have a thousand-dimensional vector, uhh!)
• features will be calculated from TF and IDF scores
• we will find out whether the vectors point in the same direction (the angle between them is small). If yes, we will consider these documents similar

The formula to calculate the feature for each document and each term is: Feature = TF * IDF.

Now let’s calculate the features to draw a table (matrix). Note that all our features are greater than or equal to 0 (because negatives are not possible).

```
F(apple,x) = TF(apple,x) * IDF(apple) = 1 * 1.5 = 1.5
F(apple,y) = 0 * 1.5 = 0
F(apple,z) = 1 * 1.5 = 1.5
F(corporation,x) = 1 * 1 = 1
F(corporation,y) = 1 * 1 = 1
F(corporation,z) = 1 * 1 = 1
F(usa,x) = 1 * 1.5 = 1.5
F(usa,y) = 1 * 1.5 = 1.5
F(usa,z) = 0 * 1.5 = 0
F(ibm,x) = 0 * 3 = 0
F(ibm,y) = 1 * 3 = 3
F(ibm,z) = 0 * 3 = 0
```

These are the coordinates of our vectors in four-dimensional space (as we have four terms).

```
X(1.5,1,1.5,0)
Y(0,1,1.5,3)
Z(1.5,1,0,0)
```

Our vectors can only have features greater than or equal to 0 (because we can’t have negative term frequencies in a document). It means the angle between them will never be greater than 90 degrees and the cosine will never be less than 0. See my example visualisation of three three-dimensional vectors (https://academo.org/demos/3d-vector-plotter/)

We have to find the pair of feature vectors among our three documents x,y,z which makes the smallest angle. You remember trigonometry, don’t you? The narrower the angle, the larger its cosine. Thus we have to calculate the cosine of the angle between the vectors and pick the largest one.

The official formula looks impressive, doesn’t it? (I wrote it using http://math.typeit.org/)

It is not that complex: the scalar (dot) product of the two vectors divided by the product of their lengths (magnitudes), cos(x,y) = (x · y) / (|x| * |y|).

• vector x = {apple,corporation,usa} coordinates (1.5,1,1.5,0)
• vector y = {ibm,usa,corporation} coordinates (0,1,1.5,3)
• vector z = {apple,corporation} coordinates (1.5,1,0,0)
• Term set T is {apple,corporation,usa,ibm}

### vector pair x,y scalar product and magnitude

Cosine of angle between vectors x,y is

### vector pair y,z

The larger the cosine, the smaller the angle. And the winner, the most similar pair with the smallest angle, is…

• Apple Corporation, USA
• Corp. Apple
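
The whole calculation fits into a short Python sketch. Treat it as an illustration under my assumptions: IDF is plain N/Nd without a logarithm, the helper names are made up, and since ibm occurs only in document y the sketch works with IDF(ibm) = 3/1 = 3.

```python
import math

# Tokenized documents after cleaning ('corp' already replaced by 'corporation').
docs = {
    "x": ["apple", "corporation", "usa"],
    "y": ["ibm", "usa", "corporation"],
    "z": ["apple", "corporation"],
}
terms = ["apple", "corporation", "usa", "ibm"]   # the four dimensions


def idf(term):
    """Number of documents divided by the number of documents containing the term."""
    return len(docs) / sum(1 for tokens in docs.values() if term in tokens)


def feature_vector(tokens):
    """One TF*IDF coordinate per term of the term set."""
    return [tokens.count(t) * idf(t) for t in terms]


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


vec = {name: feature_vector(tokens) for name, tokens in docs.items()}
for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
    print(a, b, round(cosine(vec[a], vec[b]), 2))
```

The pair x,z gets by far the largest cosine (about 0.77, against roughly 0.40 for x,y and 0.16 for y,z), which agrees with the winner above.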

## Voila!

P.S. You might play also using this Feature formula:
