Big Data: CAT, oops, CAP theorem. Plus ACID and BASE transaction basics, also full of cats


Spoiler alert: a lot of cats today.

Picadilla

Warm-up intro while the cats are approaching: to work with large datasets, computers are connected into a distributed system as nodes that share data. Data records are replicated across nodes so that the system stays up when a node fails.

What to do when one or more nodes lose their connection to the distributed system is always a business decision. Should the whole system stop, or should those nodes keep operating as long as they are able to respond?

Examples

A YouTube node storing a copy of your video loses its connection to the others. You watch your video and the statistics show 100 views. Ten minutes later you watch the same video and it shows 242286 views (or vice versa). Oops: either synchronization happened or you are now watching the video on a different node. But would you be happier if no video were available at all? YouTube has chosen availability over view-count consistency.

Despacito

Video used: Despacito laukos (a Latvian parody)

Another example: one of the nodes in a ticket sales system loses its connection. You connect and see 4 free seats. You press [Book] and see ‘Dear Customers, please come back later, apologies’. You get upset and keep pressing refresh for hours until the site recovers, but by then there are no seats left. They have chosen consistency over availability.

Imagine they had chosen availability instead: you buy a ticket, come to the event and, whoops, there is another guy with the same seat. However, I must note that many companies have calculated that it is much cheaper to apologize and give out gift cards or pay penalties than to stop the whole business.

Why can’t they, with nearly unlimited money, just make everything ideal?

Cats proudly present the CAP theorem

You have one cat and you feed it. Single processor, single input/output.

Murmor

Then the era of Big Cats comes and you have three cats: Fred, Murmor and Picadilla. You feed them, write down in your notes which cat was fed and when, and live happily ever after.

One day you get sick and your cats go hungry. A single point of failure has happened.

Distributed System introduced

You ask your spouse Alex and your child Max to get involved. Now you are a distributed system with three nodes. They do the same as you: when they see a hungry cat, they feed it and write down in their own notebook that it was fed. (You will ask, why not on a common whiteboard? Because we are talking about the CAP theorem, which applies to distributed systems, and I must pretend that not all the feeding data can fit on one whiteboard.)

Some days later you start noticing that Murmor seems fat. You check the notes and find out that each of you has been feeding Murmor several times a day, as this hell boy was constantly pretending to be hungry. The same day you get notified that the Cats Care Government Agency will do regular audits.

You call a family meeting and discuss the issue: your data is not consistent, the cats are having a never-ending party, and the Agency is a threat.

Consistency

You decide: before any of you feeds any cat, you call the others and stay on the phone while each of them writes in their notes. Thus each of you will always have the latest feeding time in your notes. The cats are biting your leg, yelling, pretending to faint and sitting on your neck, but you are happy because Consistency is now solved: you all have the same data. Calls from the Cats Care Government Agency are highly welcome.

Everything is just perfect: the mobile network is fine, Alex and Max always pick up the phone, the pens write well and the notebooks have enough blank pages.

One day Alex leaves for an expedition to the jungle. When you call, Alex deeply regrets having left the notes at home. As you have agreed that data consistency is a must-have, that day you cannot feed the cats, because you and Max would update your notes but Alex would not. Just imagine the horror: the Cats Care Government Agency might call you and then Alex to ask for the latest feeding date, and come to save the cats by taking them away from these shameless liars. You (heh, the cats) have faced the Availability issue. They are not fed at all now.
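The phone-call rule can be sketched in a few lines of Python. This is only an illustration of the story, not a real replication protocol: the `Notebook` class, the `reachable` flag and the `feed_consistently` function are all names made up for this sketch.

```python
class Unreachable(Exception):
    """Raised when a family member cannot write in their notebook."""


class Notebook:
    def __init__(self, owner):
        self.owner = owner
        self.reachable = True   # False = notes left at home (a partition)
        self.last_fed = {}      # cat name -> latest feeding time


def feed_consistently(cat, when, notebooks):
    """Feed only if every notebook can record the feeding."""
    missing = [n.owner for n in notebooks if not n.reachable]
    if missing:
        # Consistency over availability: nobody writes, the cat stays hungry.
        raise Unreachable("cannot reach: " + ", ".join(missing))
    for n in notebooks:
        n.last_fed[cat] = when


you, alex, max_ = Notebook("You"), Notebook("Alex"), Notebook("Max")
family = [you, alex, max_]

feed_consistently("Murmor", "08:00", family)    # everyone is on the phone

alex.reachable = False                          # Alex is in the jungle
try:
    feed_consistently("Murmor", "18:00", family)
    fed = True
except Unreachable:
    fed = False                                 # refused, notes unchanged
```

After the failed evening feeding all three notebooks still agree on 08:00, which is exactly the point: the data stays consistent and the cat stays hungry.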

Availability

When Alex returns to her notes you call a family meeting and decide: if any of you cannot take notes, the others still feed the cats, update their own notes and leave red post-its for the absent ones. When they return home they copy all the post-its into their notes. Voilà, now you have Availability.

You accept the risk that, if the Cats Care Government Agency calls, the feeding data one of you reports might not be the latest. Still, any of you can share historical statistics based on your own notes: which food you used, how often the cats were fed, and so on.
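The post-it rule can be sketched in Python as well. Again, this is a toy model under my own assumptions: `Notebook`, `post_its`, `feed_available` and `come_home` are hypothetical names, and a real system would have to handle far more (ordering, duplicates, lost post-its).

```python
class Notebook:
    def __init__(self, owner):
        self.owner = owner
        self.reachable = True   # False = notes left at home
        self.last_fed = {}      # cat name -> latest feeding time
        self.post_its = []      # feedings this person has not copied yet


def feed_available(cat, when, notebooks):
    """Always feed; write where possible, leave a post-it otherwise."""
    for n in notebooks:
        if n.reachable:
            n.last_fed[cat] = when
        else:
            n.post_its.append((cat, when))


def come_home(notebook):
    """Copy the waiting post-its into the notebook, in the order they were left."""
    notebook.reachable = True
    for cat, when in notebook.post_its:
        notebook.last_fed[cat] = when
    notebook.post_its.clear()


you, alex = Notebook("You"), Notebook("Alex")
alex.reachable = False                       # Alex is in the jungle
feed_available("Fred", "18:00", [you, alex])
stale_answer = alex.last_fed.get("Fred")     # Alex knows nothing about it yet
come_home(alex)                              # post-its copied, notes agree
```

Between the feeding and `come_home`, an Agency call to Alex gets an outdated answer. That is the price of Availability here.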

So the CAP theorem postulates: when one of you has left the notes at home (a partition occurs in your distributed system, and surviving it is what CAP calls partition tolerance), you must choose:

  • Either you all guarantee to have the latest feeding date in your notes (and do not care that the cats are hungry and waiting) – Consistency
  • Or you feed the cats according to your own notes (and do not care if the beasts are overfed or the Agency might get old data) – Availability

Isn’t it obvious that during a partition we can’t have both Consistency and Availability at the same time?

Let’s exploit the cats to explain two very famous concepts.

ACID transactions – a pessimistic approach which enforces consistency. The ideal world for data-critical systems like banking (massive data quality checking, a lot of built-ins for transaction control, etc. My native RDBMS world).

  • Atomic: all tasks within a transaction succeed, or every task is rolled back. If Max does not succeed in writing the notes, then you and Alex also erase the date from your notes and return the food to the fridge. The cats go crazy.
  • Consistent: on completion of a transaction the database is structurally sound. The notes are up to date without any punctuation errors and all cats have eaten exactly the food written in the notes. No half-eaten chicken left.
  • Isolated: transactions running at the same time do not see each other’s unfinished work; the result is as if they had run one after another. There is no chance that you and Max are both feeding Picadilla while Fred eats Murmor’s fish.
  • Durable: once a transaction is complete, it cannot be undone, even in the presence of failure. When the food is eaten and suddenly the light goes off or Max steps on Fred’s tail, the food does not reappear in the bowl, and you cannot just decide to add a delicacy for Picadilla, because the transaction is over.
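Atomicity is the easiest of the four to try at home. Below is a small sketch using Python’s built-in `sqlite3` module; the `notes` table and the `record_feeding` helper are invented for this example. Max failing to write is simulated by a missing time, which violates the `NOT NULL` constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (owner TEXT NOT NULL, cat TEXT NOT NULL, "
    "fed_at TEXT NOT NULL)"
)


def record_feeding(conn, cat, fed_at_by_owner):
    """Write one row per notebook inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for owner, fed_at in fed_at_by_owner.items():
                conn.execute(
                    "INSERT INTO notes (owner, cat, fed_at) VALUES (?, ?, ?)",
                    (owner, cat, fed_at),
                )
        return True
    except sqlite3.IntegrityError:
        return False  # nothing from this feeding was kept


def rows_for(conn, cat):
    return conn.execute(
        "SELECT COUNT(*) FROM notes WHERE cat = ?", (cat,)
    ).fetchone()[0]


# Everybody writes: the transaction commits.
ok = record_feeding(
    conn, "Picadilla", {"You": "08:00", "Alex": "08:00", "Max": "08:00"}
)

# Max's pen breaks (no time recorded): You and Alex are rolled back too.
failed = record_feeding(
    conn, "Fred", {"You": "09:00", "Alex": "09:00", "Max": None}
)
```

After the second call there are three rows for Picadilla and none for Fred, even though the inserts for You and Alex had already run before Max’s failed.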

If you had enough patience to read this far, you might notice that with this level of checking it is very hard to operate on petabytes. It is like hoping to cut down a forest with a surgical scalpel.

The big data world is BASE transactions – an optimistic approach accepting that the database is in a state of flux (much looser than ACID, but much more scalable and big data friendly).

  • Basic Availability: the system appears to work most of the time. Either you or Max will always hang around near the fridge, so the cats have a chance to be fed often, even if Alex is in the jungle.
  • Soft state: there is no need for different nodes to be consistent all the time. You feed Picadilla, leave a post-it for Max and don’t care when Max updates the notes.
  • Eventual consistency: achieved lazily, later. Some day Alex returns from the jungle and copies all the dates from the post-its into the notes, so for some time all of you will actually have the same feeding dates in your notes.
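One possible way to picture eventual consistency in code is a merge rule that always keeps the latest feeding time per cat, so it does not matter in which order the notebooks are compared. This "latest wins" rule is my own assumption for the sketch, not something the family agreed on, and the timestamps are made up; they are written so that plain string comparison sorts them chronologically.

```python
def merge(mine, theirs):
    """Return notes that keep the latest feeding time seen for each cat."""
    merged = dict(mine)
    for cat, when in theirs.items():
        if cat not in merged or when > merged[cat]:
            merged[cat] = when
    return merged


# Soft state: three notebooks that currently disagree.
you = {"Fred": "2024-05-01T08:00", "Murmor": "2024-05-01T18:30"}
alex = {"Fred": "2024-05-01T12:15"}
max_ = {"Picadilla": "2024-05-01T09:00"}

# Eventually everyone compares notes, in whatever order they happen to meet.
one_order = merge(merge(you, alex), max_)
other_order = merge(max_, merge(alex, you))
```

Both orders end with the same notebook: Fred at 12:15, Murmor at 18:30 and Picadilla at 09:00. Until that sync happens, each of you answers the Agency from your own, possibly older, notes.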

Thank you all for your patience! Tomorrow is the deadline to apply for the end-of-semester practice, and I am going to draft and submit a proposal for a cat autofeeding system.
