Big Data: basics of wide column store (column family) databases


Column family databases are designed for very, very large volumes of data where speed is crucial (millions of processed records per second and volume is terabytes – petabytes – …) such as performing

  • Analytics – logfiles, scientific, stock trends etc
  • Searching (like googling)
  • Social networking
  • Applications geographically distributed over multiple data centers

One of use cases – Facebook uses HBase for its messaging platform. Other popular databases (https://db-engines.com/en/ranking/wide+column+store):

Column-family_ranking_Oct2017

Wide column store is a sub-type of key-value database. It stores the data of a table not record after record but column by column. Actually the concept is interesting:

Traditional row based table:

Cats

ID, name, breed, colour, animal
1:picadilla, burmese, brown, null
2:murmor, siamese, null, cat
3:fred, null, grey, kitten

The column store stores columns together as ID and value, like this:
Cats.Name 1:picadilla,2:murmor,3:fred
Cats.Breed 1:burmese,2:siamese
Cats.Colour 1:brown,3:grey
Cats.Animal 2:cat,3:kitten

As you see, you anytime can easily add a new column like vaccination of Fred:

Cats.Vaccination 3:20-oct-2017

Or add a Date of birth for Picadilla and Fred

Cats.DOB 1:02-jul-2015,3:17-dec-2014

In a traditional table you should define the table structure before. If you add a column later, you add this column for the whole row like:

Table-Cats

Within Wide column store columns are created as needed when sending data to the database. Hence a record can have billions of columns and each record may have different set of columns.

Columns are grouped into column groups – families. Rule of thumbs are

  • to group into a family columns frequently used together,
  • to have a few families and unlimited number of columns.

Each row in the column family has a unique identifier key, like we had 1,2,3. These unique keys might be the ones you sometimes see in URLs like xEWfvS1-ffrEf-q4m. Or timestamps, or any other unique value.

To speed up querying, also derived data are stored, not calculated on fly. For example, store for each cat also average chicken liver amount eaten per day, per week, or weight of eaten food difference between yesterday and today. Or length of its tail proportion to length of its ears. Or any other information (see, how the data amount to be stored is growing?). Thus the time is saved – instead of calculations a very fast lookup by key is performed.

Should ‘Murmor, the son of a great Catapurr, born in tundra when the Sun was high‘ be stored as one string or by each word in a separate column? It depends on usecases. If you are expect database to query on any of these like ‘how many cats were born in tundra’ or ‘how many times we have word ‘the’ used in names’ then you should split them. If this is just a string then store it as a string in one column.

Joins are not used in wide column stores. Instead of that data denormalisation is used. If Picadilla favorite food  Chicken is the same as Murmor’s, then both Picadilla and Murmor will have a value ‘Chicken’ stored in their data. There will be no reference to food ‘Chicken’ lookup or colour lookup, or breed lookup.

Column-family-Cats

Here you see JSON notation for Cats family:

JSON-Cats

The columns themselves may store columns. Then is called Supercolumn family. However the more complex data structure you choose the harder it is to parse and process. Simplicity is the winning strategy of having advantage of column family databases.

Here you see JSON storing doctor checkups for Murmor:

JSON-Cats-superfamily

I hope you are as excited as I am. Magic behind ‘Big Data’ slowly reveals. No magic. Only programming :) 

By the way, here is a nice example how a tweet looks like stored in JSON. Beautiful tweets you see in Twitter is nothing else as end-user frontend programming, based on retrieving,  parsing and visually displaying these JSONs.
Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Advertisements

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: