There are two courses scheduled in the Big Data analytics module this semester:

Warehousing I dare to think I know quite well – however formal requirements of course are much harder one could expect – presentations, researches and a real DWH developed – and I love it as it means I will learn new things, not just reuse existing.

I wish I could share the same optimism about statistics theory and formulas. I faced the sad truth:

- I have forgotten how to calculate integrals and I deeply regret also other skills like Poisson Approximation of Binomial Probabilities have left my memories…
- I have no experience with programming in R.

### The first lecture started.

Professor told us the plan. Well, this course will be quite tough, I thought.

Imagine my face when I realized this was not a plan of the course. This was a plan for the first intro lecture. Kind of the easy one.

EG, the topic “binomial distributions” was covered for some seconds – “Binomial distributions you all obviously know”. Click, next slide – “Central Limit Theorem, you all know that also”.

Mamma Mia, where am I and where are my belongings? Why we had no the intro about the emergency exit and jackets with oxygen falling for those who are in panic??

Calm down, dude, calm down.

The teacher, prof. Janis Valeinis is the type of teachers I value the most – he has a sparkle is his eyes. He loves the topic, he lives in the topic and statistics is his passion. It actually is a honor for me to study here and have this amazing chance to learn from a professor like he. After all, this is master’s degree level studies and this is just normal students are presumed to be skilled beforehand.

Look around, you see, each student is sitting with a poker face and I will also. I WILL!!! I was here 20 years ago as a successful student in love with differential equations and algebra of sets. I will domesticate integrals and Poisson again, and even more, I will put a bridle on R too.

So my mission is, should I choose to accept it:

- remember a lot of basics, from combinatorics to integrals
- be able to follow the study topics presuming all students know the concepts
- accelerate my R skills from 0 to 100
- do homeworks, pass quizzes and exam
- ah, yes, and keep working full day as DWH analyst – programmer.

I will not mention my other roles like servant of our cats. No discounts anyway.

## Here comes my plan. NB: I do not know yet will it work :)

**Step 1. Install RStudio (free, open source)**

**Step 2. Buy 1,001 Statistics Practice Problems For Dummies and set a goal solve 60 questions per day **– I love learning by doing.

Example of question:

And answer explained:

(I like the idea and keep in waiting list ‘R for dummies’ and ‘Statistical Analysis with R For Dummies’)

**Step 3. Subscribe in Coursera to Introduction to Probability and Data.** If you choose “audit mode”, it is for free because you can listen but will not get any certificate of completion etc.

- Install Coursera app – I can access course better via app. When using browser I see week is locked till day x but via app I can listen to any topic.
- Download topics for offline usage. I am listening them in headphones on my way to work, to lunch and at any suitable moment.

**Step 4. Plan and mark in you calendar at least 2 hours per day as learning hours** – I use one hour in early mornings and two after work in evening. Weekends I devote for studying about 8 hours a day.

I just love doing it like others do knitting. Thanks to my family accepting my forever learning way of life.

Enjoy the happy moments when the feeling “I did it” rises like a phoenix.

## Professor started the course with a so called birthday paradox.

How many students in class do you need to do a bet that at least two of them will have birthday in the same day of year, assumed that all 365 possible birthdays are equally likely?

365 students? 181 students? 100?

Hah, in room with **23** people there is a chance 50:50 two will share the same birthday.

And we learned the mathematics behind it.

First person always has a birthday and any day suits, so we can ignore it.

Probability that second random person

- has the same birthday as first is 1/365 (0.3%).
- does not have the same birthday as first is 364/365 (99.7%) as it may have any birthday other than the birthday of first person.

Probability that third person

- has the same birthday as the first person is 1/365 * 1/365 and the same calculations if is has the same birthday as the second person (and probability would be 1/365*1/365*1/365 if we’d be looking all three of them having birthdays the same day). This number of possible birthday match combinations keeps growing with each next person in sequence, that’s why in “at least one” task easier is to calculate from opposite to the probability that there are no two people in the room having the same birthday and subtract from 1 thus getting the probability of at least one matching birthday.
- does not have the same birthday as the first and second is 363/365 as third person may have any of the birthdays not already taken by first and second persons

Probability that fourth person does not have the same birthday is 362/365 etc.

Total probability that

- second person does not have the same birthday as first
**AND**third person does not have the same birthday as first and second**AND**fourth person does not have the same birthday as first and second and third

is multiplication of probabilities.

365/365 * 364/365 * 363/365*…how many persons do we want to consider.

Here I will confess I started using R with an assumption this is just another C++ and started writing recursive loops like

birthday <- function(n) { 1-(365-n+1)/365 } y <- 1; for (n in 1:10) { y <- y*birthday(n); return(y); }

Happily I did not succeed because if I did I might be adapted to a wrong mindset. The key to understanding for me was this sentence I found in one of blogs about R:

**### In R you need to switch your mentality to thinking in vectors instead of for loops.###**

When I realized that concept I became friends with R (moreover, now I am in love with R). *Thanks, God and University degree, I am quite familiar with vectors.*

In R *everything* is a vector. Single number is a length-one vector. You can choose when to calculate over a vector and when as a number. This little example helped me:

#simple function to multiply given number with the next in sequence > test <- function(n) {n*(n+1)} # calculate when n is 1 > n <- 1 > test(n) [1] 2 #when n is 2 > n <- 2 > test(n) [1] 6 # when n is 3 > n <- 3 > test(n) [1] 12# and now one of the powers of R - no looping needed:> n <- 1:3 > test(n) [1] 2 6 12

One more example of different usages

> testn <- function(n) {prod(n*(n+1))} > testn(1:3) # now you will have the results multiplied 2 * 6 * 12 = 144 [1] 144 # vectorize it and repeat - completely different result > testn <- Vectorize(testn) > testn(1:3) [1] 2 6 12

After getting this idea I immediately deleted my clumsy loops and voila! my new R-life began. Learned by playing.

> prod(3) [1] 3 > prod(3):1 [1] 3 2 1 > prod(3:1) [1] 6

And birthdays now are easy – we instruct loop through set of given persons in class, like to calculate for 2 persons, for three, 4, 5, … as many as we want.

I must note here that The Pigeon-hole Principle (Dirihlē princips) clearly says that if we have more than 365 persons, eg, 366 persons, it guarantees that at lest two persons will have the same birthday. But it is not today’s topic. Today we’ll see that we mathematically don’t need 366 persons to have a match :)

birthdays <- function(n) { 1-prod((365-n+1):365)/365^n }

This vector stuff is not obvious, especially if you don’t have background understanding. I’ll show more examples. Let’s forget to vectorize function.

birthdays <- function(n) { 1-prod((365-n+1):365)/365^n n <- 1:3 birthdays(n) [1] 0.0000000 0.9972603 0.9999925 Warning message: In (365 - n + 1):365 : numerical expression has 3 elements: only the first used >

What happened? Is it really that 2 persons in a class have 99.7% chance to match birthdays? And three even 99%?

### C’mon, something definitely wrong here.

We passed too much values to our built-in R loop, see this example:

So, what happened and why for n 1 2 3 (to calculate the needed 365,364,363)

we got *crazy* results 0.0000000 0.9972603 0.9999925?

Because in the first part of formula we try to get three elements up to 365:

365 to 365 = 365

364 to 365 = 364, 365

363 to 365 = 363, 364, 365

It is not possible to have all at once by my current code. Thus why R warned only the first value will be used of n = 1, resulting ‘365’ is taken as he result set for the first part of formula for any n

0.0000000 (this is for n=1) is the result of formula (1 – (**365**)/365^1) *this is ok as only one occurrence here*

0.9972603 (this is for n=2) is the result of formula 1 – (**365**)/365^2 *while we need the result to be 1-(365/365)*(364/365) = 1-(365*364)/365^2 = 0.00273973*

0.9999925 (this is for n=3) is the result of formula (1 – (**365**)/365^3) *while we need the result to be 1-(365/365)*(364/365)*(363/365) = 1-(365*364*363)/365^3 = 0.00820417*

### Let’s have even more crazy example and follow up to getting their values:

n <- 3:4;n

[1] 3 4

> birthdays(n)

[1] 0.008204166 0.997282751

Warning message:

In (365 – n + 1):365 :

numerical expression has 2 elements: only the first used

now we have n 3 4 and after calculations (365-n+1) we have 363, 362 and as per our conditions we want loops

{363,364,365}

{362,363,364,365}

But again, we can loop only one and R says it will use the first.

0.008204166 (this is for n=3) is the result of formula 1 – (**363*364*365**)/365^3

0.997282751 (this is for n=4) is the result of formula 1 – (**363*364*365**)/365^4

Now we have more motivation to remember to vectorize function :) and

## voila! we have The Results!

> n <- 1:4;n [1] 1 2 3 4 >birthdays <- Vectorize(birthdays)> birthdays(n) [1] 0.000000000 0.002739726 0.008204166 0.016355912

*Why it started working? Because now the function will work with products (364*365), (363*364*365) etc instead of previous approach to loop through one set.*

Here:

0.000000000 is the probability to match birthday in class with one person (1-365/365). Of course, it is not possible as there is no other person to match with.

0.002739726 is the probability to match birthday when 2 persons in class (0.3% or 1-364/365*365/365)

0.008204166 is for 3 persons in class (0.8% or 1-((363/365)*(364/365)*(365/365))

0.016355912 is for 4 persons (1.6% or 1-((362/365)*(363/365)*(364/365)*(365/365)

Now let’s draw it as a chart and search for the person count when we reach probability 50:50 or 0.5

birthdays <- function(n) { 1-prod((365-n+1):365)/365^n } n <- 1:50 birthdays <- Vectorize(birthdays) plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red", lwd=2,lty=3,main="Dzimšanas dienas", ylab="varbūtība, ka vismaz diviem ir vienā dienā", xlab="cilvēku skaits", axes=F) axis(2,at=seq(0,1,0.1),labels=T) axis(1,at=seq(0,max(n),1),labels=T) # draw line ar probability 0.5 abline(h=0.5,lwd=1,col="brown") legend(1, 1, legend="Varbūtības grafiks", col="dark red", lwd=2,lty=3, cex=0.6)

and enjoy the result.

We see that probability becomes 50:50 at person count 23. Let’s perform probability calculations for these two values:

birthdays(

22)

[1]0.4756953

> birthdays(23)

[1]0.5072972

### You might ask to perform the calculations in opposite way:

if we set the desired probability like 0.5, how do we find the count of persons when it occurs?

This type of calculations is done by approximations because the opposite formula you have to find the n is complex as n is the time of multiplication, decreasing and raising to the power.

In our situation the solution will be like engineers do :)

Option a: do the calculations, store in array and retrieve the nearest value by index. Read below how many persons to invite in class if you want 50% or 99% probability:

> # calculate values for 1 to 100 persons and store in an array > answer <- birthdays(1:100); answer [1] 0.000000000 0.002739726 0.008204166 0.016355912 0.027135574 0.040462484 0.056235703 0.074335292 0.094623834 [10] 0.116948178 0.141141378 0.167024789 0.194410275 0.223102512 0.252901320 0.283604005 0.315007665 0.346911418 [19] 0.379118526 0.411438384 0.443688335 0.475695308 0.507297234 0.538344258 0.568699704 0.598240820 0.626859282 [28] 0.654461472 0.680968537 0.706316243 0.730454634 0.753347528 0.774971854 0.795316865 0.814383239 0.832182106 [37] 0.848734008 0.864067821 0.878219664 0.891231810 0.903151611 0.914030472 0.923922856 0.932885369 0.940975899 [46] 0.948252843 0.954774403 0.960597973 0.965779609 0.970373580 0.974431993 0.978004509 0.981138113 0.983876963 [55] 0.986262289 0.988332355 0.990122459 0.991664979 0.992989448 0.994122661 0.995088799 0.995909575 0.996604387 [64] 0.997190479 0.997683107 0.998095705 0.998440043 0.998726391 0.998963666 0.999159576 0.999320753 0.999452881 [73] 0.999560806 0.999648644 0.999719878 0.999777437 0.999823779 0.999860955 0.999890668 0.999914332 0.999933109 [82] 0.999947953 0.999959646 0.999968822 0.999975997 0.999981587 0.999985925 0.999989280 0.999991865 0.999993848 [91] 0.999995365 0.999996521 0.999997398 0.999998061 0.999998560 0.999998935 0.999999215 0.999999424 0.999999578 [100] 0.999999693 > which(answer > readline(prompt="Enter desired probability: "))[1]Enter desired probability: 0.5[1]23> which(answer > readline(prompt="Enter desired probability: "))[1]Enter desired probability: 0.99[1]57

Option b: use R function uniroot to do actually the same – loop through values for the hit, but I haven’t learned it yet.

## Now let’ s have some fun: let’s generate random birthdays and see what happens.

I assigned numbers 1 to 365 to the days, then I generate two random sample sets and take the intersected values and convert them to date of year 1970 (just for fun).

Let’s start with probability 50:50 or 23 persons:

smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ") 23 sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01"))

### And the results of 10 randoms for 23 people (probability to get a match in a single trial is 50%):

- no match: 1 time
- one common birthday: 4 times
- two common birthdays: 3 times
- three common birthdays: 2 times

> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ") Enter the size of class to generate random birthdays for?: 23 > sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01")) [1] "1970-09-18" "1970-11-25" [1] "Date of length 0" #no match [1] "1970-03-11" "1970-04-03" "1970-07-25" [1] "1970-01-07" "1970-12-29" [1] "1970-01-25" [1] "1970-01-26" [1] "1970-04-01" [1] "1970-01-23" "1970-06-06" "1970-08-27" [1] "1970-09-30" [1] "1970-08-16" "1970-08-27"

### Let’s do 10 randoms for 10 people in class (probability to get a match in a single trial is 12%):

- no match: 7 times
- 1 common birthday: 2 times
- 2 common birthdays: 1 time

> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ") Enter the size of class to generate random birthdays for?: 10 > sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01")) [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "1970-12-19" [1] "Date of length 0" [1] "1970-07-04" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "1970-02-06" "1970-11-09"

### Let’s do 10 randoms for 5 people in class (probability to get a match in a single trial is 3%):

- no match: 9 times
- 1 match: 1 time

> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ") Enter the size of class to generate random birthdays for?: 5 > sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01")) [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "1970-09-07" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0"

### Let’s do 10 randoms for 2 people in class (probability to get a match in a single trial is 0.1%):

- no matches

> smp <- readline(prompt="Enter the size of class to generate random birthdays for?: ") Enter the size of class to generate random birthdays for?: 2 > sort(as.Date(intersect(sample(365,smp,replace=TRUE), sample(365,smp,replace=TRUE))-1, origin = "1970-01-01")) [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0" [1] "Date of length 0"

And now one more fun: let’s add another chart:

### probability to have the same day of a month in a class of n persons

(assuming for simplicity there are 31 days)

birthdays <- function(n) { 1-prod((365-n+1):365)/365^n } birthdays <- Vectorize(birthdays) birthdates <- function(n) { 1-prod((31-n+1):31)/31^n } birthdates <- Vectorize(birthdates) n <- 1:50 plot(n,birthdays(n),type="l",ylim=c(0,1),col="dark red", lwd=2,lty=3,main="Matching days two at least", ylab="probability", xlab="person count", axes=F) axis(2,at=seq(0,1,0.1),labels=T) axis(1,at=seq(0,max(n),1),labels=T) abline(v=23,lwd=1,col=5) abline(h=0.5,lwd=1,col="brown") lines(n,birthdates(n),type="l",col="blue",lwd=2,lty=3) abline(v=6.8,lwd=1,col="violet") legend(1, 1, legend=c("Days of year", "Days of month"), col=c("dark red", "blue"), lwd=2,lty=3, cex=0.8)

And we see that probability 50:50 the same day of month is reached at 7 persons.

Next blog expected to be about probabilities and binomial distribution, discussing solutions for tasks like

# Q1. A coin is tossed 20 times. What is the probability of getting exactly 3 heads?

# Q2. A die is rolled 20 times. What is the probability of getting exactly 10 six?

# Q3: 80% of people who purchase pet insurance are women. If 9 pet insurance owners are randomly selected, find the probability that exactly 6 are women.

P.S. Meanwhile – behind the scene: