Do SQL and PLSQL in your own free database in Oracle Cloud

Hi, my dear followers. May I share my recent discovery which I am using now when training SQL and PLSQL in workshops or bootcamps. Why SQL? Because I am sure these skills are must have when you are in IT. They are like driving license – you can live without it however it is so comforting to have it for lifetime.

It has always been a challenge for students and teachers to have an environment for self learning and training. Lucky us who are working with relational databases and can exploit them also for learning. For others – till my recent discovery – I had been suggesting the to open a SQL session keeping in mind all the data created by you will be lost after you close the session. So the only option we had was to save our scripts locally and copy paste run them every time.

Luckily now we have a progressive and long term solution – our own autonomous Oracle Cloud database which we can connect to with a lot of different tools. I will use SQL Developer as an example here.

Set up Oracle account and starting the database takes some time (well, 15 minutes?) but this is a once off task so we invest a bit of time for a happy future together 24/7 :) This will be your “personal” database – do there any of study projects, bachelor thesis, learning, experimenting, anything. I am using my two Cloud databases when sharing SQL and PLSQL tips and tricks.

  1. Create Oracle Cloud account: qpen and press ‘Sign Up for Free Cloud Tier’

Fill in the data required and press ‘Verify my email’

You will receive an email asking to confirm your Oracle cloud account. By clicking on the link received you will be redirected to registration form. Enter required information (I have set namesurname as Oracle Account name), choose ‘Home Region’ as ‘Germany Central (Frankfurt)’ and press ‘Continue’.

In next page you will be asked to provide address information and also your payment card details.

!!! Don’t worry – Oracle requests this for identification purposes as lots of software providers do. 1 EUR will be reserved or charged from your bank account and returned within a few minutes.

When done press ‘Start my free trial.’ If everything went successfully you will see Oracle Cloud start page similar to one in screenshot below.

2. Create and start our own free transaction database. Press ‘Create an ATP Database‘. You will see quite a long form where you can keep all the defaults and the only must-have is the password for your database ADMIN account. Fill it (and save somewhere nearby because you will soon need this password when setting connection from SQL Developer)

and press ‘Create Autonomous Database

Within some seconds you will have your database up and running. The metrics window will open

So – we have started our own free Oracle Cloud transactional database. Voila!

Now let’s connect to this database with Oracle SQL Developer.

3. Download and start Oracle SQL developer from Choose version appropriate to OS on your computer.

When downloaded, extract the zip file and launch sqldeveloper application file (run sqldeveloper.exe), no need to install anything additional. If everything went successfully you will see program window similar to one in screenshot below.

Let’s configure SQL Developer to connect to our Oracle Cloud database. To link them, we will use Oracle Wallet file.

4. Create a new Wallet file: open the database in the Oracle Cloud page and click ‘DB Connection

NB: here’s a quick recap: Open -> Log in -> click on ‘hamburger’ icon (three horizontal lines in the upper left corner) -> Choose ‘Oracle Databases’ -> Autonomous database -> Click on hyperlink in ‘Display names’ columns, something like DB20211234567

You will have the Details window opened where click please ‘DB Connection

A new Wallet window will open and you should click ‘Download Wallet‘ there:

enter and re-enter a new wallet admin password (this is not the same Admin password which you set when creating the database – however you can reuse it if you want to)

and choose a local folder to download the wallet to

5. Configure SQL Developer connection to our Oracle Cloud database.

Open SQL Developer (click sqldeveloper.exe) and press the green cross in the upper left corner. A new connection setup window will be opened. Choose ‘Cloud Wallet‘ as your Connection Type

and [Browse] the Wallet file you downloaded to your local folder

Fill also ‘Service Name’ (I named the connection ‘My Oracle Cloud DB2’ (because I already have DB1, this is my second free database there in the Cloud)

As ‘Username’ enter ‘ADMIN’ and enter your database admin password (not the Wallet password, but the database ADMIN password). Check also ‘Save Password’ for your comfort not to enter the password again.

Press [Test] and within a second you should have Status: Success

When you see Success, press ‘Save’. If you don’t – you can try to reach me out I will try to help. Albeit usually everyone succeeds.

Right click on the connection and choose ‘Open SQL worksheet’

Congratulations – you have connected to your Cloud database and  are ready to deep dive into the wonderful SQL and PLSQL world there! :) Do anything you want from creating tables to mastering Expert level SQL and PLSQL.

Well, speaking of deep dive –

6. I suggest these online resources for your SQL learning: the blog by Gints Plivna – best of the best ever local source (in Latvian) – did you know you can enjoy Coursera courses for free? When you are ready to click Enroll Course choose instead the small letters below the button ‘Audit course’. Well, you will not have final certificate and some classroom labs but you will access all the learning video and readings as if you were a student there.

Udemy offers only paid resources and price is about 15 EUR per course. (beginner level. I have not seen this course myself) (advanced level. I have bought this to polish Oracle window functions and liked a lot).

You might also enjoy challenging yourself here (in Russian) – a lot of different complexity challenges there. I once welcomed an intern in my project who had on his own achieved high level in this site and I must admit his SQL skills were truly impressive.

7. Expand your skills with PLSQL (Procedural Language for SQL)

When you have at least some SQL basic skills you might decide to start PLSQL studies. This also is a very interesting world, full of loops, functions, cursors, packages, variables. triggers, exception handling, dynamic SQL, built in Oracle functions and many many more.

When programming SQL and PLSQL – Sky is the limit.

P.S. You are welcome to apply for SQL and PLSQL bootcamps – upcoming in July and August ‘2021. Learn more by doing.

Data Management (July)

Data management (August)


  • structure and objects
  • queries and joins
  •  data modelling


  • functions, procedures, packages and triggers
  • variables, cursors, custom data types and arrays
  • control statements and exception handling
  • coding conventions and guidelines

There is chance I will be one of your mentors and teachers there if the project I am assigned to will let me dedicate some time for giving back to industry. See you there!

Big Data studies: having fun rolling virtual dice in R

I devoted a day to learn and play with RStudio to generate random values. Just a cool stuff to know later when more serious probabilities will be analysed.

I wrote my self-made functions for rolling and also draw simple charts with results.

Rolling 6 sided dice

This is how it looks when just run

dice.roll <- function(n){sample(1:n, size = 1)}
for (i in c(1:10)){

[1] 4
[1] 1
[1] 5
[1] 4
[1] 1
[1] 2
[1] 6
[1] 2
[1] 3
[1] 1

and this is how it looks like when chart drawn. Remember to vectorize function.

dice.roll <- function(n){sample(1:6, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="o", col="blue", xlab="Times rolled", ylab="Value rolled")
title(main="Rolling dice", col.main="red", font.main=4)


Rolling 12 sided dice

> for (i in c(1:10)){

[1] 8
[1] 1
[1] 12
[1] 3
[1] 2
[1] 10
[1] 9
[1] 10
[1] 4
[1] 1


dice.roll <- function(n){sample(1:12, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="o", pch=23, lty=2, col="green", xlab="Times rolled", ylab="Value rolled")
title(main="Rolling 12-sided dice", col.main="brown", font.main=2)


2-sided – like tossing a coin

> for (i in c(1:10)){

[1] 1
[1] 2
[1] 1
[1] 1
[1] 1
[1] 2
[1] 1
[1] 2
[1] 1
[1] 1


dice.roll <- function(n){sample(1:2, size = 1)}
dice.roll <- Vectorize(dice.roll)
plot(dice.roll(1:100), type="l", pch=20, lty=1, col="violet", xlab="Times tossed", ylab="Value tossed")
title(main="Tossing a coin", col.main="dark blue", font.main=2)


Let’s walk through some very simple probability examples.

Q1: What is the probability to roll equal numbers rolling 2 dices one time?

To find that I am going to understand the space of good rolls and probability then is good vs all.

c <- expand.grid(x=1:6, y=1:6)
cat("total possible combinations for two dice are ", nrow(c))
cat("I can have equal numbers rolled in ", length(which(c[1] == c[2])), " ways")
pA <- length(which(c[1] == c[2]))/nrow(c)
cat("Probability p(A) to roll equal numbers is: ", pA)

total possible combinations for two dice are 36
I can have equal numbers rolled in 6 ways
Probability p(A) to roll equal numbers is: 0.1666667

Q2: What is the probability to roll sum between 7 and 10 when rolling 2 dices?

cat("combinations to have sum between 7 and 10 are ",length(which(rowSums(c) >= 7 & rowSums(c) <= 10)))
pB <- length(which(rowSums(c) >= 7 & rowSums(c) <= 10))/nrow(c)
cat("Probability p(B) to roll sum between 7 and 10, is: ", pB)

combinations to have sum between 7 and 10 are 18
Probability p(B) to roll sum between 7 and 10, is: 0.5

Q3: What is the probability to roll sum 2 or 7 or 8 when rolling 2 dices?

cat("sum is 2 or 7 or 8 combinations are ",length(which(rowSums(c) == 2 | rowSums(c) == 7 | rowSums(c) == 8)))
pC <- length(which(rowSums(c) == 2 | rowSums(c) == 7 | rowSums(c) == 8)) / nrow(c)
cat("Probability p(C) to get sum 2 or 7 or 8, is: ", pC)

sum is 2 or 7 or 8 combinations are 12
Probability p(C) to get sum 2 or 7 or 8, is: 0.3333333

Draw probabilities of sequential Heads for coin or Six for dice

coins <- function(x)

dice <- function(x)


 lwd=3,lty=3,main="Probability of sequential Head for coin or six for dice", ylab="probability",
 xlab="rolls count", log="x", axes=F)

legend(2, 1, legend=c("Coins", "Dice"),
 col=c("red", "blue"), lty=3:2, cex=0.9)


Let’s roll three dice ten times

dice.roll <- function(x,n){sample(1:x, size = n, replace=TRUE)}

for (i in c(1:10)){

[1] 3 3 3
[1] 2 2 5
[1] 6 5 1
[1] 6 4 6
[1] 4 1 5
[1] 2 1 1
[1] 6 3 6
[1] 5 6 1
[1] 5 6 4
[1] 3 6 1

Q4: When three dice rolled, what is the probability to have at least one “1” at a condition that different numbers are rolled?

Conditional probability is easy: P(A|B) = P(A ∩ B) / P(B)

  • A is “have at least one 1”
  • B is “roll different numbers” – not always happens, so a probability to be found

I’ll do now like engineers do – just calculate sample sizes :) without formulas

c <- expand.grid(x=1:6, y=1:6, z=1:6)
cat("P(A ∩ B) Combination count when at least one is 1 and different numbers rolled",length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]&(c[1] == 1 | c[2] == 1 | c[3]==1))),
 " and total count of combinations is ", nrow(c))
pAB <- length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]&(c[1] == 1 | c[2] == 1 | c[3]==1)))/nrow(c)
cat("P(B) = combinations count when different numbers rolled ",length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3])),
 " and total count of combinations is", nrow(c))
pB <- length(which(c[1] != c[2]&c[2] != c[3]&c[1] != c[3]))/nrow(c)
cat("P(A|B) = P(A ∩ B) / P(B) = ", pAB / pB)

P(A ∩ B) Combination count when at least one is 1 and different numbers rolled 60 and total count of combinations is 216
P(B) = combinations count when different numbers rolled 120 and total count of combinations is 216
P(A|B) = P(A ∩ B) / P(B) = 0.5

Next time – some of binomials. I wish I had 48 hours a day.


Mathematical Statistics: the more I learn, the more I realize how much I don’t know

There are two courses scheduled in the Big Data analytics module this semester:

Warehousing I dare to think I know quite well – however formal requirements of course are much harder one could expect – presentations, researches and a real DWH developed – and I love it as it means I will learn new things, not just reuse existing.

I wish I could share the same optimism about statistics theory and formulas. I faced the sad truth:

  1. I have forgotten how to calculate integrals and I deeply regret also other skills like Poisson Approximation of Binomial Probabilities have left my memories…
  2. I have no experience with programming in R.

The first lecture started.

Professor told us the plan. Well, this course will be quite tough, I thought.

Imagine my face when I realized this was not a plan of the course. This was a plan for the first intro lecture. Kind of the easy one.

EG, the topic “binomial distributions” was covered for some seconds – “Binomial distributions you all obviously know”. Click, next slide – “Central Limit Theorem, you all know that also”.

Mamma Mia, where am I and where are my belongings? Why we had no the intro about the emergency exit and jackets with oxygen falling for those who are in panic??

Calm down, dude, calm down.

The teacher, prof. Janis Valeinis is the type of teachers I value the most – he has a sparkle is his eyes. He loves the topic, he lives in the topic and statistics is his passion. It actually is a honor for me to study here and have this amazing chance to learn from a professor like he. After all, this is master’s degree level studies and this is just normal students are presumed to be skilled beforehand.

Look around, you see, each student is sitting with a poker face and I will also. I WILL!!! I was here 20 years ago as a successful student in love with differential equations and algebra of sets. I will domesticate integrals and Poisson again, and even more, I will put a bridle on R too.

So my mission is, should I choose to accept it:

  1. remember a lot of basics, from combinatorics to integrals
  2. be able to follow the study topics presuming all students know the concepts
  3. accelerate my R skills from 0 to 100
  4. do homeworks, pass quizzes and exam
  5. ah, yes, and keep working full day as DWH analyst – programmer.

I will not mention my other roles like servant of our cats. No discounts anyway.

Here comes my plan. NB: I do not know yet will it work :)

Step 1. Install RStudio (free, open source)


Step 2. Buy 1,001 Statistics Practice Problems For Dummies and set a goal solve 60 questions per day – I love learning by doing.


Example of question:


And answer explained:


(I like the idea and keep in waiting list ‘R for dummies’ and ‘Statistical Analysis with R For Dummies’)

Step 3. Subscribe in Coursera to Introduction to Probability and Data. If you choose “audit mode”, it is for free because you can listen but will not get any certificate of completion etc.


  • Install Coursera app – I can access course better via app. When using browser I see week is locked till day x but via app I can listen to any topic.
  • Download topics for offline usage. I am listening them in headphones on my way to work, to lunch and at any suitable moment.


Step 4. Plan and mark in you calendar at least 2 hours per day as learning hours – I use one hour in early mornings and two after work in evening. Weekends I devote for studying about 8 hours a day.

I just love doing it like others do knitting. Thanks to my family accepting my forever learning way of life.

Enjoy the happy moments when the feeling “I did it” rises like a phoenix.

Professor started the course with a so called birthday paradox.

How many students in class do you need to do a bet that at least two of them will have birthday in the same day of year, assumed that all 365 possible birthdays are equally likely?

365 students? 181 students? 100?

Hah, in room with 23 people there is a chance 50:50 two will share the same birthday.

And we learned the mathematics behind it.

First person always has a birthday and any day suits, so we can ignore it.

Probability that second random person

  • has the same birthday as first is 1/365 (0.3%).
  • does not have the same birthday as first is 364/365 (99.7%) as it may have any birthday other than the birthday of first person.

Probability that third person

  • has the same birthday as the first person is 1/365 * 1/365 and the same calculations if is has the same birthday as the second person (and probability would be 1/365*1/365*1/365 if we’d be looking all three of them having birthdays the same day). This number of possible birthday match combinations keeps growing with each next person in sequence, that’s why in “at least one”  task easier is to calculate from opposite to the probability that there are no two people in the room having the same birthday and subtract from 1 thus getting the probability of at least one matching birthday.
  • does not have the same birthday as the first and second is 363/365 as third person may have any of the birthdays not already taken by first and second persons

Probability that fourth person does not have the same birthday is 362/365 etc.

Total probability that

  • second person does not have the same birthday as first
  • AND third person does not have the same birthday as first and second
  • AND fourth person does not have the same birthday as first and second and third

is multiplication of probabilities.

365/365 * 364/365 * 363/365*…how many persons do we want to consider.

Here I will confess I started using R with an assumption this is just another C++ and started writing recursive loops like

birthday <- function(n)

y <- 1;
for (n in 1:10)
 y <- y*birthday(n);

Happily I did not succeed because if I did I might be adapted to a wrong mindset. The key to understanding for me was this sentence I found in one of blogs about R:

### In R you need to switch your mentality to thinking in vectors instead of for loops.###

When I realized that concept I became friends with R (moreover, now I am in love with R). Thanks, God and University degree, I am quite familiar with vectors.

In R everything is a vector. Single number is a length-one vector. You can choose when to calculate over a vector and when as a number. This little example helped me:

#simple function to multiply given number with the next in sequence
> test <- function(n)

# calculate when n is 1
> n <- 1 
> test(n)
[1] 2
#when n is 2
> n <- 2
> test(n)
[1] 6
# when n is 3
> n <- 3
> test(n)
[1] 12
# and now one of the powers of R - no looping needed:
> n <- 1:3
> test(n)
[1] 2 6 12

One more example of different usages

> testn <- function(n)
> testn(1:3)
# now you will have the results multiplied 2 * 6 * 12 = 144
[1] 144
# vectorize it and repeat - completely different result
> testn <- Vectorize(testn)
> testn(1:3)
[1] 2 6 12

After getting this idea I immediately deleted my clumsy loops and voila! my new R-life began. Learned by playing.

> prod(3)
[1] 3
> prod(3):1
[1] 3 2 1
> prod(3:1)
[1] 6

And birthdays now are easy – we instruct loop through set of given persons in class, like to calculate for 2 persons, for three, 4, 5, … as many as we want.

I must note here that The Pigeon-hole Principle (Dirihlē princips) clearly says that if we have more than 365 persons, eg, 366 persons, it guarantees that at lest two persons will have the same birthday. But it is not today’s topic. Today we’ll see that we mathematically don’t need 366 persons to have a match :)

birthdays <- function(n)

This vector stuff is not obvious, especially if you don’t have background understanding. I’ll show more examples. Let’s forget to vectorize function.

birthdays <- function(n)

n <- 1:3
[1] 0.0000000 0.9972603 0.9999925
Warning message:
In (365 - n + 1):365 :
 numerical expression has 3 elements: only the first used

What happened? Is it really that 2 persons in a class have 99.7% chance to match birthdays?  And three even 99%?

C’mon, something definitely wrong here.

We passed too much values to our built-in R loop, see this example:

So, what happened and why for n 1 2 3 (to calculate the needed 365,364,363)

we got crazy results 0.0000000 0.9972603 0.9999925?

Because in the first part of formula we try to get three elements up to 365:

365 to 365 = 365

364 to 365 = 364, 365

363 to 365 = 363, 364, 365

It is not possible to have all at once by my current code. Thus why R warned only the first value will be used of n = 1, resulting ‘365’ is taken as he result set for the first part of formula for any n

0.0000000 (this is for n=1) is the result of formula (1 – (365)/365^1) this is ok as only one occurrence here

0.9972603 (this is for n=2) is the result of formula 1 – (365)/365^2 while we need the result to be 1-(365/365)*(364/365) = 1-(365*364)/365^2 = 0.00273973

0.9999925 (this is for n=3) is the result of formula (1 – (365)/365^3) while we need the result to be 1-(365/365)*(364/365)*(363/365) = 1-(365*364*363)/365^3 = 0.00820417

Let’s have even more crazy example and follow up to getting their values:

n <- 3:4;n
[1] 3 4
> birthdays(n)
[1] 0.008204166 0.997282751
Warning message:
In (365 – n + 1):365 :
numerical expression has 2 elements: only the first used

now we have n 3 4 and after calculations (365-n+1) we have 363, 362 and as per our conditions we want loops



But again, we can loop only one and R says it will use the first.

0.008204166 (this is for n=3) is the result of formula 1 – (363*364*365)/365^3

0.997282751 (this is for n=4) is the result of formula 1 – (363*364*365)/365^4

Now we have more motivation to remember to vectorize function :) and

voila! we have The Results!

> n <- 1:4;n
[1] 1 2 3 4
> birthdays <- Vectorize(birthdays)
> birthdays(n)
[1] 0.000000000 0.002739726 0.008204166 0.016355912

Why it started working? Because now the function will work with products (364*365), (363*364*365) etc instead of previous approach to loop through one set.


0.000000000 is the probability to match birthday in class with one person (1-365/365). Of course, it is not possible as there is no other person to match with.

0.002739726 is the probability to match birthday when 2 persons in class (0.3% or 1-364/365*365/365)

0.008204166 is for 3 persons in class (0.8% or 1-((363/365)*(364/365)*(365/365))

0.016355912 is for 4 persons (1.6% or 1-((362/365)*(363/365)*(364/365)*(365/365)

Now let’s draw it as a chart and search for the person count when we reach probability 50:50 or 0.5

birthdays <- function(n)
n <- 1:50
birthdays <- Vectorize(birthdays)

plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red",
 lwd=2,lty=3,main="Dzimšanas dienas", ylab="varbūtība, ka vismaz diviem ir vienā dienā",
 xlab="cilvēku skaits", axes=F)


# draw line ar probability 0.5

legend(1, 1, legend="Varbūtības grafiks", col="dark red", lwd=2,lty=3, cex=0.6)

and enjoy the result.


We see that probability becomes 50:50 at person count 23. Let’s perform probability calculations for these two values:

[1] 0.4756953
> birthdays(23)
[1] 0.5072972

You might ask to perform the calculations in opposite way:

if we set the desired probability like 0.5, how do we find the count of persons when it occurs?

This type of calculations is done by approximations because the opposite formula you have to find the n is complex as n is the time of multiplication, decreasing and raising to the power.

In our situation the solution will be like engineers do :)

Option a: do the calculations, store in array and retrieve the nearest value by index. Read below how many persons to invite in class if you want 50% or 99% probability:

> # calculate values for 1 to 100 persons and store in an array
> answer <- birthdays(1:100); answer
 [1] 0.000000000 0.002739726 0.008204166 0.016355912 0.027135574 0.040462484 0.056235703 0.074335292 0.094623834
 [10] 0.116948178 0.141141378 0.167024789 0.194410275 0.223102512 0.252901320 0.283604005 0.315007665 0.346911418
 [19] 0.379118526 0.411438384 0.443688335 0.475695308 0.507297234 0.538344258 0.568699704 0.598240820 0.626859282
 [28] 0.654461472 0.680968537 0.706316243 0.730454634 0.753347528 0.774971854 0.795316865 0.814383239 0.832182106
 [37] 0.848734008 0.864067821 0.878219664 0.891231810 0.903151611 0.914030472 0.923922856 0.932885369 0.940975899
 [46] 0.948252843 0.954774403 0.960597973 0.965779609 0.970373580 0.974431993 0.978004509 0.981138113 0.983876963
 [55] 0.986262289 0.988332355 0.990122459 0.991664979 0.992989448 0.994122661 0.995088799 0.995909575 0.996604387
 [64] 0.997190479 0.997683107 0.998095705 0.998440043 0.998726391 0.998963666 0.999159576 0.999320753 0.999452881
 [73] 0.999560806 0.999648644 0.999719878 0.999777437 0.999823779 0.999860955 0.999890668 0.999914332 0.999933109
 [82] 0.999947953 0.999959646 0.999968822 0.999975997 0.999981587 0.999985925 0.999989280 0.999991865 0.999993848
 [91] 0.999995365 0.999996521 0.999997398 0.999998061 0.999998560 0.999998935 0.999999215 0.999999424 0.999999578
[100] 0.999999693
> which(answer > readline(prompt="Enter desired probability: "))[1]
Enter desired probability: 0.5
[1] 23
> which(answer > readline(prompt="Enter desired probability: "))[1]
Enter desired probability: 0.99
[1] 57

Option b: use R function uniroot to do actually the same – loop through values for the hit, but I haven’t learned it yet.

Now let’ s have some fun: let’s generate random birthdays and see what happens.

I assigned numbers 1 to 365 to the days, then I generate two random sample sets and take the intersected values and convert them to date of year 1970 (just for fun).

Let’s start with probability 50:50 or 23 persons:

smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")

sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))

And the results of 10 randoms for 23 people (probability to get a match in a single trial is 50%):

  • no match: 1 time
  • one common birthday: 4 times
  • two common birthdays: 3 times
  • three common birthdays: 2 times
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 23
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "1970-09-18" "1970-11-25"
[1] "Date of length 0" #no match
[1] "1970-03-11" "1970-04-03" "1970-07-25"
[1] "1970-01-07" "1970-12-29"
[1] "1970-01-25"
[1] "1970-01-26"
[1] "1970-04-01"
[1] "1970-01-23" "1970-06-06" "1970-08-27"
[1] "1970-09-30"
[1] "1970-08-16" "1970-08-27"

Let’s do 10 randoms for 10 people in class (probability to get a match in a single trial is 12%):

  • no match: 7 times
  • 1 common birthday:  2 times
  • 2 common birthdays: 1 time
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 10
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-12-19"
[1] "Date of length 0"
[1] "1970-07-04"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-02-06" "1970-11-09"

Let’s do 10 randoms for 5 people in class (probability to get a match in a single trial is 3%):

  • no match: 9 times
  • 1 match: 1 time
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 5
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "1970-09-07"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"

Let’s do 10 randoms for 2 people in class (probability to get a match in a single trial is 0.1%):

  • no matches
> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ")
Enter the size of class to generate random birthdays for?: 2
> sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"
[1] "Date of length 0"

And now one more fun: let’s add another chart:

probability to have the same day of a month in a class of n persons

(assuming for simplicity there are 31 days)

birthdays <- function(n)

birthdays <- Vectorize(birthdays)

birthdates <- function(n)

birthdates <- Vectorize(birthdates)

n <- 1:50

plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red",
 lwd=2,lty=3,main="Matching days two at least", ylab="probability",
 xlab="person count", axes=F)




legend(1, 1, legend=c("Days of year", "Days of month"), col=c("dark red", "blue"), lwd=2,lty=3, cex=0.8)

And we see that probability 50:50 the same day of month is reached at 7 persons.


Next blog expected to be about probabilities and binomial distribution, discussing solutions for tasks like

# Q1. A coin is tossed 20 times. What is the probability of getting exactly 3 heads?
# Q2. A die is rolled 20 times. What is the probability of getting exactly 10 six?

# Q3: 80% of people who purchase pet insurance are women. If 9 pet insurance owners are randomly selected, find the probability that exactly 6 are women.

P.S. Meanwhile – behind the scene:


My Oracle certifications what-ifs and lessons learned: have good skills and you’ll not need luck

My takeaway from the Big Data course “Data processing systems” I was blogging about is passing the course exam and reusing the knowledge for an Oracle certification.

I passed course exam in the end of December and I had a month till the next semester and I decided to use it for additional learning to apply for Oracle certification Oracle Big Data 2017 Certification Implementation Specialist.

I will return to Big Data topic when I have the next Big Data course in following semesters. This semester my challenges are

  • Data warehousing
  • Statistics.

I hope to start blogging about them soon.

Meanwhile in this blog entry may I share some my lessons learned from Oracle certifications,

they have about 200 different certifications in 10 categories. You can browse:

–Certifications and exams required (NB: combinations available)

–Exams and what certification this exam is useful for

I have done four (I know a guy holding 32) and it makes me think I have some things to say about the process. I have

Credentials are granted based on a combination of passing exams, training and performance-based assignments, depending on the level of certification.

Oracle has a good marketing:

  • “Get the job you want – earn an Oracle Certification”
  • “86% of hiring managers surveyed say that IT certifications are a priority during the candidate evaluation process” etc.


How to choose from exam set?

It depends on

  • your plans towards certification path
  • value (eg passing the exam 1Z0-071 suits for 4 certifications!)
  • project requirements
  • your new job or career change dreams (I did is several years ago and I know my OCA certificate was the trump when I, at that time Informix developer, was confirmed at my current Oracle expert position)
  • your current skills (why not go for some ‘done’? – I do that a lot)
  • Oracle environment available
  • Oracle versions covered in exam
  • exam price
  • accessibility (proctored (classroom)/non proctored (online))
  • exam length and passing score
  • for fresh exams there might be less training materials available
  • your attitude and readiness to learn and challenge yourself (I do that a lot)

If exams are similar, then search like : ‘1Z0-071 vs 1Z0-061’ etc. I usually choose the hardest one to learn more.

Some time ago SQL was only a tool for me. Like hammer, partially because I was working in Informix and had no feeling I can do miracles in SQL and Informix SPL (apologies for liking Oracle SQL and PLSQL better).

Little by little I realized I want more and decided to study Oracle in parallel to get ready for changing job profile. I read about the Order of the Wooden Pretzel, inspired by the famous Steven Feuerstein quote “SQL is not a complete language. Some people can perform seeming miracles with straight SQL, but the statements can end up looking like pretzels created by someone who is experimenting with hallucinogens”. To prove that, international SQL challenges were held and winner becomes Knight of the Order of the Wooden Pretzel.

May the quickest, most entertaining, most educational, most creative, and somewhat readable solution prevail!

In recognition that SQL is not the only language in which enterprising developers can create pretzels on hallucinogens, the challenge is also open to NoSQL solutions. Well, it seems this has been put on hold for a while but maybe I was not googling long enough.


Oracle strongly prohibits cheating

You sign an agreement before exam (example – including

Bad news:

  • you can accidentally stumble into brain dumps forums/topics (and you may not even know they are…) – just be aware
  • practice tests you buy might appear attractive wrapped dumps

Good news:

  • authorised practice tests available for most exams (but not for all, for example, Big Data exam does not have).

Some characteristics of the exams in SQL and PLSQL:


Exam day

  • Shortly: pay for exam and book time via PearsonVUE in BDA
  • Read confirmation email what you need to bring with you
  • Note, you must have TWO person documents with you:
    • passport and driver licence, for example
  • Arrive ~20 mins earlier (there is coffee and cookies in BDA)
  • You will be photographed AND face will be compared to previous exams if any (so might be several photos taken until matches)
  • You will place all belongings in a lockable wardrobe
  • They’ll give
    • a special pen like marker
    • 1-2 sheets of laminated A4 paper (depends on exam); you may not erase your notes
    • ear plugs
  • You will be guided to a computer room and set until you see [Start test]
  • You will be videomonitored all the time
  • You may leave the room to WC, time is not stopped

Exam software

Before starting exam you have option to choose intro.

Interface changes between versions but the idea stays the same:

  • Question content (may be with [Exhibit] or [View] buttons for popups)
  • Answer options – radio buttons or checkboxes
  • Checkbox [Mark] for later review
  • Buttons [Previous] and [Next] – you always can browse
  • Button [Review] (sub-options – marked ones or all)
  • Button [End exam] (you may press it any time; re-asks before quit)
  • Time counter showing how much time left like (00:46:32)
  • Question counter like ‘Question 17 of 63’

Do not leave marked unanswered question – points are not negated for qwrong answer and guessing comes for free – because exam time flies very fast.

English language, no dictionary available, no time discount.

Question types

  • May contain [Exhibit] or [View] buttons for popups, usually showing table descriptions, rarely questioned target report template
  • Radio buttons (one from four options usually)
  • Checkboxes, usually 4 to 6
    • Choose two (or three)
    • Choose all that apply
  • May contain options ‘all of above’ and/or ‘none of the above’
  • May be combined question statements and options like
    1. A-4, B-1, C-3, D-1, E-1, F-3
    2. A-3, B-4, C-1, D-1, E-1, F-3
    3. A-1, B-2, C-4, D-2, E-2, F-1
    4. A-1, B-3, C-3, D-2, E-2, F-2
    5. E. A-1, B-1, C-1, D-2, E-1, F-3

After exam

    1. You will be given a printout that your results will be available online within in about 30 min
    2. Usually there are within 10 minutes (if you have issue with CertView account you can some hours later call Oracle support and by your testing ID they will tell results by phone)


  • Earlier they used to send printed Certificates and cards but now are saving environment and does not send
  • You will receive emails. Approx 10 minutes after exam –  about availability of results, about 24 hours later about e-certificate and about 48 hours later about badge in Acclaim.


Associate level certification


Professional level certification



Learning – what?

    • I found Gints Plivna blog to be very useful, especially NULL:
    • Stay real and stay calm. Don’t try to memorize REGEXP
    • Enjoy the process and have fun
    • I subscribed to various a la ‘Daily SQL challenge’
      • in LinkedIN
      • via email
    • To SQL, PL/SQL groups in LinkedIN and Facebook

Learning – how?

    • Despite experience it takes time, if you target to add value for yourself
      • depends on experience. Eg, OCA about a month daily 1-2 hours
      • read topic by topic in manuals
      • play a lot in Oracle
      • google pros, cons, examples
      • drill practice test
    • Set mindset to do exercises and daily tasks correct at once
    • Consider buying practice tests and drilling daily by portions
    • Remember: passing score is never 100%! You MAY afford to have mistakes
      • Helpful: fast recognising of obvious errors to lessen options
      • Questions containing functions may serve as hints for others (like syntax of NVL2 or TO_CHAR or SUBSTR)
    • Internet full of incorrect examples and wrong answers, so it is crucial to distinguish right from wrong and test, test, test
    • if this is not Oracle page and there is a question and answer like ‘Correct: B’ without explanation it sounds stolen dump – be aware
    • often accompanied with note ‘any fool can see that correct is B’Many people enjoy writing blogs about preparation, I was reading them also.







Practice tests – see

Oracle Database 11g: SQL Fundamentals I 1Z0-051

(it was easier then real exam)

30 days online access

99 $



Oracle Database 11g: Program with PL/SQL 1Z0-144 30 days online access

99 $



Oracle Database 11g: Advanced PL/SQL 1Z0-146

(again – it was easier then exam)

30 days online access

99 $



There are plenty of practice tests if you google. Remember: You sign agreement to not use unauthorised materials

– Practice Exam:Oracle authorized practice exam from Transcender

Examples of challenges




  • There is always a chance something you do not know appears
  • There might be questions like ‘what is the 3rd parameter of DBMS_RLS.REFRESH_GROUPED_POLICY’
  • And, eg, which of the answers can be produced by a specific built in package (DBMS_REDEFINITION, DBMS_HPROF, DBMS_LOB, UTL_COLL etc)
  • Don’t panic – your background is good enough to do a good guess. Remember, you may afford to have mistakes

And, after all, even failing is not the end or world. Very many people fail. One of my friends passed PLSQL exam in 4th try.

Good skills and good luck!


Big Data: Phonetic Similarity : Soundex – words are similar if they sound the same

I guess you have seen surnames like Meier and Mayer or Smith, Smyth, Smithe, Smeth, Smeeth. These might be as well correct as misspelt surnames – if you’d dictate them to me by phone, who knows what I’d write.

So far I was blogging about similarity algorithms based on string writing. Today let’s discuss finding a match if the words sound the same. The most commonly used phonetic measure is Soundex. It has several extensions and my today’s topic is

Soundex for English language.

Soundex calculates a four character code from a word  based upon the pronunciation and considers two words as similar if their codes are equal. The idea is that similar sounding letters have are assigned the same soundex code. Widely used in genealogy, in archives, searching ancestors, relatives, families, heirs.

  1. The first character is the starting letter of a word. (in a variation called “Reverse Soundex” prefixes the last letter instead of the first)
  2. Drop all other occurrences of a, e, i, o, u, y, h, w.
  3. Replace consonants after the first letter with digits as follows:
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
    • If two or more letters with the same number were adjacent in the original name (before step 2), or adjacent except for any intervening h and w, then omit all but the first.
    • Return the first four padded with 0 (padding means replace blanks with 0, like ‘Ahoi’ will have code A000, ‘Dude’ will have D300 – always four characters code).

Let’s have an example set – surnames. I used Oracle RDBMS this time.


Now let’s compare similarities by three methods: Edit distance, Jaro-WInkler, Soundex.


and here are the results. Notice the combinations we have: if I set a similarity threshold by Edit distance or Jaro-Winkler to 50% then we have several combinations. including false positives and false negatives:

  • all three methods match – like ‘Mirhe’
  • Jaro-Winkler and soundex match, but Edit distance doesn’t – like ‘Meiyar’
  • Jaro-Winkler match but Soundex doesn’t – like ‘Mayes’
  • Edit distance and Jaro-Winkler match but Soundex doesn’t – like ‘Mimre’ or ‘Mirfe’


You see, Soundex is not a silver bullet and, as I have always been writing, we must try and test, test ad try.

I’ll show you one more weakness of Soundex:


From the three approaches I used Soundex is the only one which did not find similarity :)


Some of Soundex variants

  • The New York State Identification and Intelligence System (NYSIIS) algorithm maintains relative vowel positioning, while Soundex does not.
  • Daitch–Mokotoff Soundex (D–M Soundex) adaptation to Jews with Germanic or Slavic surnames, sometimes referred as “Eastern European Soundex”. Results of D-M Soundex are returned in an all-numeric format between 100000 and 999999, calculation is much more complex than Soundex.
  • Metaphone, Double Metaphone, Metaphone 3. Powerful and customisable rule set, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations.

I googled online Metaphone calculator, they say It’s more accurate than soundex – hardly can agree:

  • The metaphone code for Schwarzenegger is SXWRSNKR.
  • The metaphone code for Schvartzeneger is SXFRTSNJR.
  • These surnames do not have the same metaphone code.

Then I tried for one of my Soundex similarities and – again

  • The metaphone code for Meiyar is MYR.
  • The metaphone code for Mire is MR.
  • These surnames do not have the same metaphone code.

I was also searching for Soundex Latvian edition – I am quite sure it exists. I found this:

2008. g. izstrādāts un pilveidots elastīgs universālas leksikona sistēmas datubāzes
modelis, kas paredz vienotas infrastruktūras (kopīgu indeksēšanas un atgriezeniskās
saites mehānismu u.c.) un funkcionalitātes (šķirkļu izvērstas meklēšanas un
konfigurējamas atainošanas u.c.) pieejamību visām datubāzē izvietotajām vārdnīcām
neatkarīgi no to šķirkļu shēmām. Attiecībā uz indeksēšanu un meklēšanu, latviešu
valodai tika pielāgots Soundex algoritms, lai nodrošinātu neprecīzi ievadītu, bet pēc
izrunas līdzīgu vārdu atrašanu. (A. Spektors)

P.s. Tiem, kas lasa arī latviešu valodā – šeit ir maziņš un mīlīgs foruma ieraksts, kā cilvēks cenšas izveidot meklēšanas ieteikumu rīku (“vai jūs domājāt XXYZZX?”)


This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words

Some days ago I covered the topic about finding which two of these strings are most likely about the same real world entity – by recognizing requently used words and assigning them lower impact, thus first and third options were found as most similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

Let’s add a challenge: misspell the word

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Aple

And, as you might imagine, TF/IDF measure is not effective anymore because it cannot recognize Apple is similar to Aple as in classic implementation is looking for equality, not similarity. Today, similar like we did in generalised Jaccard, we will replace the requirement for words to be equal with requirement to be similar.

As I did before with TF/IDF, let’s remove commas, brackets and tokenize our collection into a bag of terms – documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {aple,corp}

Full term set T = {apple,corporation,usa,ibm,aple,corp}

Now we will use any string similarity measure of our choice (and business need). I have chosen today Needleman-Wunsch and Jaro-Winkler to illustrate the differences, eg, s(apple,aple) = 0.8 by Needleman-Wunsch and s(apple,aple) = 0.93 by Jaro Winkler. I used calculator

Let’s choose similarity (it is my choice, it could be any other value 0..1)

  • threshold k = 0.55 for Needleman-Wunsch measure
  • threshold k = 0.45 for Jaro-Winkler measure


Now we will reveal these terms which have similar term in the other document.

Soft TF/IDF based on Needleman-Wunsch

Initial steps are the same as per classic TF/IDEf. Calculate:

Frequency (TF)

is the number of times each term appears in each document.

Full term set T = {apple,corporation,usa,ibm,aple,corp} (as you see we will have six dimension vectors)

TF(apple,x) = 1 (document x has one time apple)
TF(apple,y) = 0
TF(apple,z) = 0
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 0 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0
TF(aple,x) = 0
TF(aple,y) = 0
TF(aple,z) = 1
TF(corp,x) = 0
TF(corp,y) = 0
TF(corp,z) = 1

Inverse Document Frequency

IDF(apple) = 3/1 (one of three documents contains apple) = 3
IDF(corporation) = 3/2 = 1.5
IDF(usa) = 3/2 = 1.5
IDF(ibm) = 3/1 = 3
IDF(aple) = 3/1 = 3
IDF(corp) = 3/1 = 3

Feature vectors (Feature = TF*IDF)


Let’s normalise these vectors: remember from trigonometry, it means getting the unit vector of length 1: divide the coordinates by the length of the vector.

length vector document x = sqrt((3*3) + (1.5*1.5) + (1.5*1.5)) = 3.67
length y = sqrt((1.5*1.5) + (1.5*1.5) + (3*3)) = 3.67
length z = sqrt((3*3) + (3*3)) = 4.24

Normalised vectors for documents (each coordinate divided by vector’s length):


Now Needleman-Wunsch scores come into a game.

We compute the close terms.

Close(x,y,0.55) = {(apple is not because its corporation in Document y is similar by 0.55 but there is another word – corporation which bonds by 1 and thus blocks apple’s similarity), corporation (because it has strongest bond with word corporation in document y), usa (because it has strongest bond to usa in document y)}


  • close(x,y,0.55) = {corporation,usa}
  • close(x,z,0.55) = {apple}
  • close(y,z,0.55) = {} (noone pair passed threshold)


In the final step we compute features but giving a weight to each component to the TF/IDF formula. We are looking for the most closest vectors. The more narrow the angle is, the larger is its cosine. Thus we have to calculate the cosine of the angle between vectors and pick the largest one. As our vectors are normalised, cosine formula now is simple computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Needleman-Wunsch similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.8 = 0.46 = 46%
  • similarity(y,z) = 0% (remember, that words pairs, incl. corp and corporation did not pass our threshhold)

Voila! By Needleman-Wunsch the most similar strings are

  • Apple Corporation, USA
  • Corp. Aple

Now let’s recalculate soft TF/IDF using Jaro-Winkler.

Threshold k = 0.45 for Jaro-Winkler (I have set different from Needleman-Wunsch just for more fun to learn differences better).

See in pictures how do we get to the closest results by keeping the strongest bonds:




  • close(x,y,0.45) = {corporation,usa}
  • close(x,z,0.45) = {apple,corporation}
  • close(y,z,0.45) = {corporation}

Now let’s calculate Features from normalized vectors (the same vectors we calculated)


As these vectors are normalised, I’ll remind cosine formula is simple computing the dot (scalar) product.

  • similarity(x,y) = x corporation coordinate 0.41 * y corporation coordinate 0.41 * Jaro-Winkler similarity weight 1 + x usa coordinate 0.41 * y usa coordinate 0.41 * 1 = 0.34 = 34%
  • similarity(x,z) = x apple 0.82 * z aple 0.71 * 0.93 + x corporation 0.41 * z corp 0.71 * 0.87 = 0.79 = 79%
  • similarity(y,z) = y corporation 0.41 * z corp 0.71 * 0.87 = 0.25 = 25%

Voila! By Jaro-Winkler the most similar strings again are

  • Apple Corporation, USA
  • Corp. Aple

Do you feel the power of the idea of this hybrid (string & set combined) similarity calculation method?


This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: combining string and set matching methods. One of hybrid similarity measures – generalised Jaccard index

It’s time to start to combine string and set matching methods – let’s have a look at one of Hybrid Similarity measures.

Classic Jaccard measure considers overlapping tokens (words, q-grams). To be considered as overlapped, the token must be identical. Jaccard index works very well when names are in mixed order like “Food and drinks”, “Drinks&Food”, “Serving foods, sweets, drinks”. However pure Jaccard is too restrictive when text contains errors.

I had an example in my Jaccard blog entry comparing classes by pupils’ names – Martins, Marta, Katrina, Ance etc. What if some of names were written with errors?

Generalized Jaccard measure helps.

First of all, we as usual convert the comparable string into tokens. I’ll reuse the example and put some errors in names

Class one pupils x = {Kartina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils y = {Annija,Martins,Matra,Karloina,Ance}

Today we’ll learn also soft overlap – using the most matching pairs.

First step. compare each pair

To compare we need a similarity measure s which returns values 0..1 (the closer to 1 the more similar).

For more fun let’s apply two for the same pairs- Edit distance (Levenshtein distance) and Jaro-Winkler measure – see, the result differs? :) I used (sorry, this page has a bug in Jaro-Winkler algorithm – because it is not true (janis,martins) has 0 by JW (it should be 0.67) – but I could not find any other online calculator and for our experiment this bug is acceptable and let’s use it as an example how easy is to misuse method when we simply believe to somebody’s programmed result without understanding)


Second step.

Choose threshold and keep only those who exceed. I have chosen threshold 0.5.

Threshold for Edit Distance


Threshold for Jaro-Winkler


Third step. Keep only the strongest bond.

To find that I draw all the upper threshold bonds at first.


Martins and Martins are of course bonded. It means no any other bonds possible from or to.

Karlina to Karloina has the next strongest remaining bond. Again, no other bonds from/to.

We have left Anna to Annija because all other bonds relate to “engaged” entities. For examle, Kartina cannot bond to Karloina with 0.71 because Karlina bonded with 0.88.

We calculate then weight of matching by adding all the match scores (2.55) and divide by the (all name in class X plus all names in class B minus matchinmg pairs) = 0.319 = 32% similarity when we hybridise Jaccard with Edit Distance.


Now let’s do the same for Jaro-Winkler. First of all, all bonds upper than threshold:


and keep only the strongest bonds. again you see, for example, Kartina cannot bond 0.88 with Martins because Martins bonded to Martins with 1. Kartina cannot also bond with 0.91 to Karloina because Karlina bonded to Karloina with 0.98.

And formula again – matching weight divided by sum of entities minues mathing pairs and – voila! = 0.707 = 71% similarity when we hybridise Jaccard with Jaro-Winkler measure.


I’ll remind that in my previous blog entry explaining Jaccard measure I showed that

  • similarity with correctly spelled names and requirement for name equality was 10%
  • similarity using bigrams was 50%
  • similarity with trigrams was 26%

Today we calculated (on a slightly modified set – errors added)

  • similarity with hybrid method with Edit distance was 32%
  • similarity with hybrid method with Jaro-Winkler measure was 71%

Isn’t it funny – five different results? There is no silver bullet in string and set matching. Try and test, try and test, try and test… repeat.


This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: set similarity : TF/IDF scores and Feature vectors to devaluate terms common in other documents

Term Frequency, Inverse Document Frequency and Feature Vectors. Another great concept which is easy when you start doing it.

Which two of these strings are most likely about the same real world entity?

  • Apple Corporation, USA
  • IBM (USA) Corporation
  • Corp. Apple

I believe you see: first and third. Why do you think so? Because of Apple there? Don’t you see the Corporation, Corp and USA? You somehow understand these are ancillary words, less meaningful.

How to teach computer to recognize that context? If we look at string level, Edit distance or Jaro-winkler, Affine Gap or Smith-Waterman, noone will solve it, they all will consider these are more similar:

  • Apple Corporation, USA
  • IBM (USA) Corporation

Well, Jaccard distance would spot the answer we are looking for but it is nice to learn another sophisticated method with added value – recognizing frequently used terms.

Today’s topic: two sets are similar if they share many frequent terms unless these terms are common in other strings.

TF score will show us how frequently is this term used.

IDF score will who us is this terms common in other strings.

Feature vectors will reveal the most similar strings. A feature vector is a vector that contains information describing an object’s important characteristics – in our case these characteristics (measurable properties) will be TF and IDF scores.

Let’s start with some data cleaning.

1) remove commas and brackets (I might also remove a, the, and, semicolons etc)

  • Apple Corporation USA
  • IBM USA Corporation
  • Apple Corp

2) convert all to lowercase

3) tokenize (split strings into words) and apply Smith-Waterman and/or Jaro-Winkler to calculate that Corp is similar to Corporation – I’ll assume that these words are equal (I might have set Corporation in my example but I wanted to make this example more challenging to show that data cleaning and applying rules is a normal step)

4) convert ‘corp’ to ‘corporation’ for me not to explain each time that I have a function which recognizes similarity of these terms

Let’s tokenize our collection into a bag of terms and call them documents x,y,z.

  • x = {apple,corporation,usa}
  • y = {ibm,usa,corporation}
  • z = {apple,corporation}

Term set T is {apple,corporation,usa,ibm}

Frequency (TF)

is the number of times each term appears in each document.

TF(apple,x) = 1 (see, document x contains one time term 'apple')
TF(apple,y) = 0 (this document does not contain 'apple')
TF(apple,z) = 1
TF(corporation,x) = 1 
TF(corporation,y) = 1 
TF(corporation,z) = 1 
TF(usa,x) = 1
TF(usa,y) = 1
TF(usa,z) = 0
TF(ibm,x) = 0
TF(ibm,y) = 1
TF(ibm,z) = 0

From this we can also read that if this document contains this term, the score is >0, otherwise 0. TF score can’t be negative.

Inverse Document Frequency (IDF)

is the number of documents in collection (N) divided by documents containing this term (Nd)


IDF(apple) = 3/2 (two of three contain apple) = 1.5
IDF(corporation) = 3/3 (all three contain corporation) = 1
IDF(usa) = 3/2 (two of three contain usa) = 1.5
IDF(ibm) = 3/2 (two of three contain ibm) = 1.5

Well, now we have numbers. And?

Feature vectors

Do you remember vectors from school? It was that something like an arrow. Plane flying to North with speed 850km/h is like one vector and the wind blowing North-West is like another vector. If wind’s velocity is higher the plane from the ground might seem to be slipping sideways a little.

Vectors are used in computing very, very often. Example: v = [R; G; B]; is a feature vector containing color components of a pixel or an object.

In our case

  • documents x,y,z will be the vectors
  • Term set T {apple,corporation,usa,ibm} represent the dimensions (yeah, if thousand of terms than we have thousand-dimensional vector, uhh!)
  • features will be calculated from TF and IDF scores.
  • we will find if vectors are blowing the same direction (their angles are close). If yes – we will consider these documents similar

Formula to calculate features for each of documents and each of terms is:


Now let’s calculate features to draw a table (matrix). Note that all our features are greater or equal to 0 (because negatives are not possible).

F(apple, x) = TF(apple, x) * IDF(apple) = 1 * 1.5 = 1.5
F(apple,y) =  0 * 1.5 = 0
F(apple,z) = 1 * 1.5 = 1.5
F(corporation,x) = 1 * 1 = 1
F(corporation,y) = 1 * 1 = 1
F(corporation,z) = 1 * 1 = 1
F(usa,x) = 1 * 1.5 = 1.5
F(usa,y) = 1 * 1.5 = 1.5
F(usa,z) = 0 * 1.5 = 0
F(ibm,x) = 0 * 1.5 = 0
F(ibm,x) = 1 * 1.5 = 1.5
F(ibm,x) = 0 * 1.5 = 0


These are coordinates of  our vector in four-dimensional space (as we have four terms).

X (1.5,1,1.5,0)

Our vectors can have features only greater or equal to 0 (because we can’t have negative term frequencies in a document) and it means they will never have angle greater than 90 and cosine will never be less than 0. See my example visualisation for a three three-dimensional vectors (


We have to find our three documens x,y,z feature vector pair which makes the closest angle. You remember trigonometry, don’t you? The more narrow the angle is, the larger is its cosine. Thus we have to calculate the cosine of the angle between vectors and pick the largest one.

The official formula – uh, looks impressive, doesnt it? (I wrote it using

Cosine_formula.PNGIt is not as complex: sum of scalar products divided by sum of vectors length (magnitude).

  • vector x = {apple,corporation,usa} coordinates (1.5,1,1.5,0)
  • vector y = {ibm,usa,corporation} (0,1,1.5,1.5)
  • vector z = {apple,corporation} (1.5,1,0,0)
  • Term set T is {apple,corporation,usa,ibm}

vector pair x,y scalar product and magnitude


Cosine of angle between vectors x,y is


vector pair xz


vector pair y,z


The larger cosine, the closer angle. And winner most similart pair, having the closest angle is…

  • Apple Corporation, USA
  • Corp. Apple


P.S. You might play also using this Feature formula:



This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: set similarity : q-grams, Overlap measure, Jaccard index, Jaccard distance

Today’s topic: is John Smith similar to Smith John? As you see, Edit distance 10 will measure as not similar, so as Needleman-Wunsch (score = – 5). Smith Waterman would just point to the two similar substrings. Hmm, are these strings really not similar?

I asked my kid: how could we find guys of which class have the most similar names to your class? He answered: well, we could take name by name and calculate Edit distance score for each and then sum results together?

Let’s have a look closer on some official methods.

Class one pupils {Katrina,Janis,Karlis,Martins,Anna,Karlina}

Class two pupils {Annija,Martins,Marta,Karolina,Ance}

First of all, let’s learn to obtain the set of comparable items – tokens.

Let’s start having names as our tokens. Set similarity operates with tokens, not “words” because words sound like are something meaningful for us. Tokens are… just tokens :) anything we have agree how to split the set – you’ll see several approaches.

Simple: if we use word a s token then string “David Alexander Smith” consists of three tokens: david, alexander, smith.

A bit more complex: we might agree that we during tokenisation remove common stop words like “of, the, and” or ” Mr, Ms, fon, da” etc.

Then string “Mr. David Alexander fon Smith” again consists of three tokens: david, alexander, smith.

q-grams or n-grams

This impressive name just means that we split the string in substrings of length q. I’ll use David and show you two approaches used – which approach you choose  later depends on your needs and further usage of results.

q=1 (unigram):

  • {D,a,v,i,d}

q=2 (bigram, sometimes called digram):

  • first approach: {#d,da,av,vi,id,d#} (I used the special character # to pad for the length)
  • second approach: {da,av,vi,id}

q=3 (trigram):

  • {##d,#da,dav,avi,vid,id#,d##}
  • {dav,avi,vid}

q=4 (four-gram):

  • {###d,##da,#dav,davi,avid,vid#,id##,d###}
  • {davi,avid}

q=5 (five-gram):

  • {####d,###da,##dav,#davi,david,#avid,##vid,###id,####d}
  • {david}

I used R for some tests, see it does the second approach:


Similar strings have many common q-grams.

You might find it interesting that translation software most likely detects language by comparing q-grams to the statistical results of languages. This my blog entry could be split in bigrams and would have result “English” returned.

Q-gramming may be applied to word level also.

The quick brown fox jumps over a lazy dog.


  • {# the,the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog,dog #}
  • {the quick, quick brown,brown fox, fox jumps,jumps over,over a,a lazy,lazy dog}


  • {# # the, # the quick,the quick brown,quick brown fox,brown fox jumps,fox jumps over,jumps over a,over a lazy,a lazy dog,lazy dog #,dog # #}


Note it is a analyst’s (my, yours) decision if the method used requires some data cleaning like removal of punctuation, whitespaces, letter de-capitalising etc.

The overlap measure

Let’s do some simply examples for learning the idea: string 1 will be Carlo, string 2 Carol.

If I use bigrams for tokenisation.

Let S1 = set of tokens of string 1.

  • {#c,ca,ar,rl,lo,o#}

Let S2 = set of tokens of string 2

  • {#c,ca,ar,ro,ol,l#}

Overlap(S1,S2), O(S1,S2) = number of common tokens.

In our case common tokens are #c,ca,ar

O(S1,S2) = 3

If I use trigrams for tokenisation.

  • {##c,#ca,car,arl,rlo,lo#,o##}
  • {##c,#ca,car,aro,rol,ol#,l##}

O(S1,S2) = 3

The Jaccard index a.k.a coefficient a.k.a measure

Well, you saw the Overlap measure. What is it? It actually has no value itself without context. We need to make it usable. Let’s have an index always 0..1 – the more to 1, the higher is set similarity. I’ll also multiply by 100 to have percents.

Jaccard(S1,S2) = count of common tokens  in strings S1,S2 / count of all unique tokens in both S1,S2

Or rephrasing a bit more scientific:

Jaccard(S1,S2) = intersect of tokens S1,S2 / union of tokens S1,S2

Note: ‘union’, not ‘union all’. Union are all unique, while union all are all.

Jaccard(S1,S2) = overlap of tokens / total count of unique tokens

Knockout now with The Formula :)


String 1 Carlo, string 2 Carol.

If I use bigrams for tokenisation:

  • S1 = {#c,ca,ar,rl,lo,o#}
  • S2 = {#c,ca,ar,ro,ol,l#}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 9

Jaccard(S1,S2) = 3/9 = 0.33 = 33%

Visualisation from


If I use trigrams for tokenisation:

  • S1 = {##c,#ca,car,arl,rlo,lo#,o##}
  • S2 = {##c,#ca,car,aro,rol,ol#,l##}
  • Overlap(S1,S2) = 3
  • total count of unique tokens = 11

Jaccard(S1,S2) = 3/11 = 0.27 = 27%

Interpreting the result:

  • Two sets having all tokens as common will be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
  • If they have no common tokens, they are 0% similar.
  • The half –  50% – means that the two sets have common half of the members.

The Jaccard distance

Measure of how different or dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index/measure/coefficient from 1 (or percentage from 100%).

For the above example with bigrams, the Jaccard distance is 1 – 33% = 67%.

Trigrams have Jaccards distance = 100% – 25% = 75%

So what?

Here you might ask: hey, but you showed they are not usable at all! Carlo IS similar to Carol!

Let me remind I was showing the idea on a very simplified example and we are talking about set similarity, not string similarity.

To treat as a string, let’s use Jaccard for Carlo and Carol having q=1

  • {c,a,r,l,o}
  • {c,a,r,o,l}

Voila! Jaccard measure = 1, Jaccard distance = 0%.

Because THE SETS OF SYMBOLS are equal!


Now back to sets and names in classes.

Tokenisation method 1 – names are tokens:



O = 1 (Martins)

Total count of unique tokens = 10

Jaccard measure = 1/10 = 0.1 = 10%

Jaccard distance = 90% (remember: not similarity, but dissimilarity)

Conclusion: by method 1 names in these classes are not similar.

And now let’s calculate the same using bigrams :)

Tokenisation method 2 – bigrams are tokens:

I used R for splitting to bigrams


and Excel for visualisation


Jaccard measure = 15/30 = 0.5 = 50%

Jaccard distance = 100% – 50% = 50%

Conclusion: by method 2 names in these classes are half similar.

Tokenisation method 3 – trigrams are tokens:


Jaccard measure = 11/43 = 0.26 = 26%

Jaccard distance = 100% – 26% = 74%

Conclusion: by method 3 names in these classes are quite a little similar.

Have you got the idea? Now you can apply the same method on different classes and have a reliable method to measure similarity.

Which one to choose?

The method which suits YOU the best. I can’t predict, will you consider Marta and Martins similar or dissimilar in YOUR business case.


This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: string similarity: dealing with typos (Jaro meausre, ups, measure)

Have you evre had a typo? How to instruct the computer that Luxmeb is most likely meant Luxembourg? (try to write in google and, see, it knows! :) let’s have look – how.

For this type of short strings like words Jaro measrue, ups, Jaro measure or Jaro distance or Jaro score helps.

The higher the Jaro distance is for two words is, the more similar these words are. The score 0 means no similarity and 1 means an exact match. Sometimes result is used in percents, I’ll show both.

The Jaro measure idea is assumption that

if one word contains near the same letters as the other and these letters can quite easily be reordered

to match – like length and lenght – or teh/the, dacnign/dancing

–  then words are similar.

Values used in the calculation of the Jaro Distance:

  1. Length of words being compared
  2. Count of common characters
  3. Count of transpose (exchange) operations needed to reorder common characters to equal order

Official formula looks complex, like scientists usually do to impress us:



It is really easy, let’s start with Jon and John.

  • s1 = Jon length 3
  • s2 = John length 4
  • m = 3 (three common characters J, o, n
  • t = 0 (no symbol transpositions needed Jon to John, because matching symbols are already in the same order)

jaro(Jon,John) = 1 / 3 x (3 matching sumbols /3 length of Jon + 3 matching symbols / 4 length of John + ((3 matching symbols – 0 transpositions)/3 matching symbols) = 1 / 3 x (3/3 + 3/4 + 3/3) = 1/3 x (1+ 0.75 + 1) = 1/3 x 2.75 = 0.917 (quite similar) or 92%

Edit distance of Jon, John = 1. Levenstein Distance Measure is  1 – d(x,y) / [max(length(x), length(y))] : s(Jon,John) = 1 – 1 / [max(3, 4)] = 1 – 1/4 = 0.75 or 75% similarity

You see, Jaro score 92% is better than Levenshtein’s 75%.

and also better than

  • Needleman-Wunsch 75% (scoring 1;-1;-1) (three matches 1×3 + one gap penalty -1 = 2)
  • Smith-Waterman 83% (Jo and Jo, n and n are the best matching substrings)

(calculator to play –

Let’s calculate Jaro now for Jon and Ojhn

  • s1 = Jon length 3
  • s2 = Ojhn length 4
  • m = 3 (three common characters J, o, n)
  • t = 1 (transpositions needed to reorder common symbols (ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-1)/3) = 0.81 or 81% similarity

Jaro for Jon and Ohnj (I reordered symbols)

  • s1 = Jon length 3
  • s2 = Ohnj length 4
  • m = 3 (three common characters J, o, n)
  • t = 2 (transpositions needed to reorder common symbols (onj – ojn – jon)

Jaro = 1 / 3 x (3/3 + 3/4 + (3-2)/3) = 0.70 or 70% similarity

Levenstein distance for JON, OHNJ is 3, measure is 1 – 3/4 = 0.25 or 25% similarity. See the difference?

Jaro for Luxmeb and Luxembourg

  • s1 = Luxmeb length 6
  • s2 = Luxembourg length 10
  • m = 6 (seven common characters L,u,x,e,m,b)
  • t = 1 (transpositions needed to reorder common symbols:  Luxmeb – Luxemb

Jaro = 1 / 3 x (6/6 + 6/10 + (6-1)/6) = 0.81 is 81% similarity

Edit distance result is 1 – 6 / [max(6,10)] = 1 – 6/10 = 0.4 is 40% similarity.

A variation to handle prefixes is Jaro – Winkler measure.

In real life this is a very common situation the strings have the same beginning (I bleieve you all konw tihs jkoe). Like beginnign, Luxmebogr, lenhgt, meausre. To take that into account and improve scoring, there is a measure


  • dw is Jaro-Winkler measure result
  • dj is Jaro distance – like we calculated before
  • l is the length of longest common exact matching prefix (often referred as max possible 4)
  • p is the weight given to the prefix (how important it is for us) (default use to be 0.1)

Another way to express the same formula is

  • jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW


  • PL = length of the longest common prefix
  • PW is a weight given to the prefix

Let’s calculate Jaro-Winkler for Luxmebogr and Luxembourg.

  • I’ll assume that prefix weight is recomended 0.1
  • Jaro distance (s1=9, s2=10, m=9, t=1) = 0.91 (91%)
  • Longest common exactly matching prefix is Lux

Jaro-Winkler distance is

  • 0.91 + (3×0.1x(1-0.91)) = 0.94 or 94% similarity (first formula)
  • (1-3×0.1)x0.91 + 3×0.1 = 0.94 or 94% similarity (second formula – of course, the same because formulas are equal, just written in another way)

Jaro-Winkler for Luxmeb and Luxembourg

Jaro was 0.81 or 81% similarity

0.81 + (3×0.1x(1-0.81)) = 0.87 or 87% similarity

Can’t believe it is so easy, can you?

There are many more string similarity measures and algorithms in the world, different time elapsed, different precision and a lot of adaptions for specialized cases like nucleotides.

However, if you have read all my string similarity posts, you should have a good foundation of the basic ones.


This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

%d bloggers like this: