Big Data: with respect to NoSQL Zoo


Relational databases have many advantages, mostly thanks to the fully structured way they store data in one fundamental, easily understood structure: the table. But! Despite the existence and advantages of RDBMSs, Google built Bigtable, Amazon developed Amazon DynamoDB, and the NSA built Accumulo, in part using Bigtable as an inspiration. Facebook built Cassandra, Powerset built HBase, LinkedIn built Voldemort, and so on.

There are currently more than 225 different databases: see http://nosql-database.org/, “Your Ultimate Guide to the Non-Relational Universe”.

It wouldn’t have happened if ultra popular and well-established relational databases had all the capabilities these brands were looking for, would it?

RDBMSs are still riding the crest of the wave. Wide, well-grounded usage over many years, combined with a strong scientific basis, financial muscle and a lot of lessons learned, has led vendors to invest resources for decades in order to

  • improve and polish built-in locking and transaction management,
  • prevent collisions between multiple users updating the same data,
  • provide highly customizable data access control solutions,
  • expand SQL capabilities (a reminder: outside core SQL there are very many nuances when querying different vendors’ databases; I’ll show some samples someday later),
  • offer a lot of metadata (data about data) and utilities (many of them seldom used).

RDBMSs will always be alive and in use; that is definitely not a concern. Let’s take a simple example to have some fun and also to increase our respect for them. We have

A = 5 (it could be your account balance or a product count in a store)

B = 12

Two operation sets are in the queue to be processed. Maybe Anna and Peter have both pressed the [Apply] button:

Operation set O1

C:=A+B

Write A:=C

Operation set O2

C:=A+B

Write B:=C

Let’s model what may happen:

Table-AB

Which answer is correct? Which is not correct? WHY?

The only answer is: first in, first served. Both are correct. Welcome to the world we live in :) If developers do not build any other means of prioritizing these operations, they rely on the RDBMS’s built-in mechanisms for data consistency, such as serializing and locking (heh, and one of the developers’ support tasks is to be able to track down and explain to end users why the result is 29 or 22).
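To see where 29 and 22 come from, the two serial orders can be replayed in a few lines of Python. This is only a minimal sketch: the dictionary stands in for the stored values, and the function names o1, o2 and run_serially are made up for illustration, not taken from any database API.

```python
def o1(state):
    # Operation set O1: C := A + B, then write A := C
    c = state["A"] + state["B"]
    state["A"] = c


def o2(state):
    # Operation set O2: C := A + B, then write B := C
    c = state["A"] + state["B"]
    state["B"] = c


def run_serially(operations):
    # Each operation set runs to completion before the next one starts,
    # which is what a serializing database guarantees.
    state = {"A": 5, "B": 12}
    for operation in operations:
        operation(state)
    return state


o1_first = run_serially([o1, o2])
o2_first = run_serially([o2, o1])
print(o1_first)  # {'A': 17, 'B': 29}
print(o2_first)  # {'A': 22, 'B': 17}
```

Both final states are valid serial outcomes; which one you get depends only on who was first in the queue.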

For some more entertainment, let’s do the same without a transaction and enjoy yet another result:

Table-AB-wt
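One possible interleaving without a transaction can be sketched like this, assuming that both operation sets read A and B before either of them writes. Other interleavings exist; this one is picked only because it produces a state that neither serial order can produce.

```python
state = {"A": 5, "B": 12}

# Both operation sets read the original values before anyone has written.
c1 = state["A"] + state["B"]  # O1 computes C from A=5, B=12
c2 = state["A"] + state["B"]  # O2 computes C from the very same values

# Now both write, each unaware of the other's update.
state["A"] = c1  # O1: Write A := C
state["B"] = c2  # O2: Write B := C

interleaved = dict(state)
print(interleaved)  # {'A': 17, 'B': 17}
```

Neither Anna first nor Peter first gives A = 17 and B = 17, so without isolation the result does not correspond to any order in which the two requests could have been served one after the other.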

You see, someone must implement that background logic: serializing, prioritizing, handling network failures, concurrent transactions and other megastuff. There is an enormous number of built-in RDBMS features; some of them are crucial, others are nice-to-have overhead, and that reminds me of a Swiss knife like this:

Swiss-knife

No one would dare say it lacks features. Its marketers will also tell you that this knife has a solution for nearly every situation. And actually they are right, aren’t they?

Would YOU dare say it has too few functions? Would YOU recommend it as the way to go for a restaurant chef, a manicurist or a car mechanic?

Of course, I can imagine you saying YES: when there is no other knife at all, when the only other option is a spade, or when this is the only tool you have seen in your life.

If you were Google, Amazon or Facebook, you would actually have to believe there are other ways. Otherwise you would choke and die, drowning in your data and watching your customers run away.

You then need to deal with consistency, serializing, scaling and so on yourself. Everything. Imagine having to program the chess game before you could play it. You sit and think: well… should I start by designing the knight, or by working out how to stay on the chessboard after a knight’s move?

Designing your own system is a grave decision and a serious amount of work, with many issues to be solved. This is no longer installing Oracle and writing ‘update emp set mgr_id=17’. This is a task of many person-years. It runs in parallel with existing systems and a growing business, it is eagerly awaited and pushed by management, and it must be fast, correct, stable, expandable and thousands of other must-bes.

In 2004 Google began developing Bigtable, its internal data storage system, in search of cheap, distributed ways to store and query data at massive scale: petabytes of data and thousands of machines. It uses a shared-nothing architecture and two different data structures, one for recent writes and one for long-lived data, with a mechanism for moving data from one form to the other.
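The idea of two data structures with a mechanism for moving data between them can be sketched as a toy in Python. This is not Bigtable’s implementation, just an illustration of the principle on one machine and in memory: the class TinyTwoTierStore and its flush_threshold parameter are invented for this example, and a real system would keep the long-lived part on disk.

```python
import bisect


class TinyTwoTierStore:
    """Toy store: a mutable buffer for recent writes plus immutable sorted runs."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.recent = {}  # structure 1: recent writes, cheap to update
        self.runs = []    # structure 2: long-lived data, sorted and never modified

    def put(self, key, value):
        self.recent[key] = value
        if len(self.recent) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The mechanism that moves data from one form to the other.
        if not self.recent:
            return
        items = sorted(self.recent.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.runs.append((keys, values))
        self.recent = {}

    def get(self, key):
        # Recent writes win; otherwise search the runs from newest to oldest.
        if key in self.recent:
            return self.recent[key]
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None


store = TinyTwoTierStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)  # second write reaches the threshold and triggers a flush
store.put("a", 3)  # newer value for "a" lives only in the recent-writes buffer
print(store.get("a"), store.get("b"), store.get("z"))  # 3 2 None
```

Writes stay fast because they only touch the small buffer, while the sorted runs are never changed after they are created, which makes them easy to store and search.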

Around the same time, Amazon’s business was growing and direct database access was one of the major bottlenecks. They developed and released Amazon DynamoDB as the result of 15 years of learning: a database that can store and retrieve any amount of data and serve any level of request traffic.

My deepest respect to all the developers all over the world!

NoSQL

There is no such thing as the one and only ‘NoSQL’ database: no single vendor, no single server, no single book, no silver bullet. As I wrote, there are currently more than 225 different databases.

Examples of basic classification by data model are:

Popularity and trends

To get an overall understanding of the current state and trends, we may look at sites that measure database popularity. One example mentioned was https://db-engines.com/en/ranking_definition. As you can see, they measure by:

  • Number of results in search engines queries
  • Frequency of searches in Google Trends.
  • Number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
  • Number of job offers, in which the system is mentioned
  • Number of profiles in LinkedIn and Upwork, in which the system is mentioned
  • Twitter tweets, in which the system is mentioned

Actual ranking

See here: https://db-engines.com/en/ranking

Overall_ranking_sep2017

Of course, Oracle leads there: it has been on the market for years and still has a broad range of usage, both where it is a perfect fit and where moving away from it would be a pain.

Querying by type is also a lot of fun.

Key-value https://db-engines.com/en/ranking/key-value+store

Key-value-ranking-Sep2017

Document oriented https://db-engines.com/en/ranking/document+store

Document-ranking-Sep2017

and also a lot of other reports like https://db-engines.com/en/ranking_categories

Categories-ranking-Sep2017

To my great pleasure, the plan of the ‘Data processing systems’ lectures reveals that we will have separate sessions on nearly every one of the most popular approaches. Can’t wait!

The next blog entry will hopefully be about the CAP theorem and ACID and BASE transactions. Fingers crossed that I have weekend time for blogging.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my own interpretation, possibly with value I added by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

One response to this post.

  1. Posted by MārtiņšŠ on 18/09/2017 at 16:10

    Lately I have ended up using the NoSQL database DBreeze. It is not the most popular one, but it is very fast on large volumes: https://github.com/hhblaze/DBreeze
