Big Data: the curtain rises


This blog entry is inspired by the first lecture of the Data Processing Systems course, part of my journey through Lielo datu analītiķa modulis – tālākizglītības iespēja IT profesionāļiem (a Big Data analyst module, a continuing-education opportunity for IT professionals). My reflections on the warm-up session are here.

When we say Big Data, we imagine a lot of analytics, funny and serious findings (e.g. 37 Big Data case studies), a huge data center, and a distributed file system beneath the database.

I’ll note here that, despite all the big-data buzzwording, file databases are neither a fresh fashion trend nor a new invention. Flat file databases were a natural development early in computing history, long before relational databases flourished. Back then, however, file databases were treated more as theoretical concepts for gourmands; nowadays they are here as a must-have to rein in big data.

I’ll explain why.

Any computer science student is familiar with relational database concepts and SQL or PL/SQL (heh, the world where I'm like a duck in water). Relational databases (RDBMS) are widely used to perform transactional updates and are especially valued for their strength in handling the difficult issues of consistency during updates.

However, the relational database overheads required to support these complex update operations handicap them for other functions, like storing and retrieving video and unstructured data. So there has always been a market niche for operating on large amounts of poorly structured data. This niche appeared to be languishing because people were fond of relational databases for decades and expanded their usage in ways they shouldn’t have, simply because nobody could imagine the explosive burst of big data to come.

Once upon a time, data flow was somewhat predictable and controllable. The general truth was: you define the structure, you load data into it, and you reject whatever doesn’t fit.

Big Data burst wiped away this belief.

  • Old: the traditional RDBMS approach: first you must define the table, its columns and data types, and only then may you load data
  • New: NoSQL (MongoDB): the data may come first and its definition later, as a collection doesn’t have to be defined before data is added

Nowadays you collect the data first and think later about how to describe and use it.
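The contrast can be sketched in a few lines of Python – a minimal, illustrative example using the standard sqlite3 module as a stand-in for an RDBMS, and plain dictionaries as stand-ins for MongoDB documents (table, field and person names are made up):

```python
import sqlite3

# Schema-first (RDBMS): the table must exist, with fixed columns,
# before any row can be loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT NOT NULL, age INTEGER)")
conn.execute("INSERT INTO people (name, age) VALUES (?, ?)", ("Anna", 30))

# A row that doesn't fit the declared structure is rejected outright.
try:
    conn.execute("INSERT INTO people (name, age, hobby) VALUES (?, ?, ?)",
                 ("Bob", 25, "chess"))
    rejected = False
except sqlite3.OperationalError as e:
    rejected = True
    print("RDBMS rejected it:", e)

# Schema-later (document style): each "document" carries its own fields,
# so records of different shapes can live in the same collection.
collection = [
    {"name": "Anna", "age": 30},
    {"name": "Bob", "age": 25, "hobby": "chess"},  # extra field, no ALTER TABLE
]
names = [doc["name"] for doc in collection]
print(names)  # ['Anna', 'Bob']
```

The point is not that one approach is better – it is that the document side accepts data it has never been told about, and the relational side, by design, does not.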

Progress is impossible without change

A lot of today’s giants started their business on a relational database years ago and, while growing, realized they had to change architecture. Examples: LinkedIn moved from an RDBMS to NoSQL, Twitter moved from MySQL to Hadoop HDFS with Scalding, and Facebook’s photo storage system has already been completely rewritten several times as the site has grown. I won’t say FB moved away from RDBMS, because FB still uses MySQL – but primarily as a key-value persistent store.

Facebook is one of the most widely known big data (very big data) producers and consumers – it would never have become popular if you had to wait hours for your timeline to load (https://www.quora.com/What-is-Facebooks-architecture-6), as would happen if they used an RDBMS only.

There are businesses without enough capacity to change, and they agonize. One document management system, widely used in government institutions (let’s not name and shame), has supplemented its historically used RDBMS by storing documents split up across tables. It means that when one needs to retrieve a document, the implementation takes up to 70 joins across different tables to form that one output document – you can imagine how inconvenient and inefficient that is.

It’s like your selfie being stored as separate pieces. Instead of ‘my latest selfie’ you would query something like ‘take the face from the faces table where the face is mine and the photo date is latest, union the latest shoulders from the shoulders table, join the hands, the coffee mug and the background, then arrange them into a rectangle and display’. Despite the pain of writing that query, I believe it would run quite fast – until thousands of users try to do the same in parallel.
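A toy model of that anti-pattern, again with sqlite3 – all table and column names here are invented for illustration, and only three “body parts” are joined instead of the 70 tables mentioned above:

```python
import sqlite3

# Each piece of the "selfie" lives in its own table,
# keyed by a shared photo_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE faces     (photo_id INTEGER, owner TEXT, pixels TEXT);
    CREATE TABLE shoulders (photo_id INTEGER, pixels TEXT);
    CREATE TABLE hands     (photo_id INTEGER, pixels TEXT);
    INSERT INTO faces     VALUES (1, 'me', 'face-pixels');
    INSERT INTO shoulders VALUES (1, 'shoulder-pixels');
    INSERT INTO hands     VALUES (1, 'hand-pixels');
""")

# 'My latest selfie' becomes a multi-join reconstruction query.
row = conn.execute("""
    SELECT f.pixels, s.pixels, h.pixels
    FROM faces f
    JOIN shoulders s ON s.photo_id = f.photo_id
    JOIN hands     h ON h.photo_id = f.photo_id
    WHERE f.owner = 'me'
    ORDER BY f.photo_id DESC
    LIMIT 1
""").fetchone()
print(row)  # ('face-pixels', 'shoulder-pixels', 'hand-pixels')
```

One photo, three joins – now scale that to 70 tables and thousands of concurrent users.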

Puzzle
(picture created using http://www.jigsawplanet.com)

Back to the basics

You see, storage and retrieval of data can be implemented by means other than the tabular relations used in relational databases. However, while architectures and techniques keep evolving, the basic needs remain the same:

  • Data storing (files, partitions, tablespaces etc)
  • Data definition (create and alter tables, views etc)
  • Data manipulation (select, insert, update, delete)
  • Data control (grant, revoke access)
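The sections below show googled examples for the first three needs; the fourth, data control, in classic SQL terms looks like this (the user and table names here are made up):

```sql
-- Hypothetical: give user 'analyst' read-only access to a table,
-- then take that access back.
GRANT SELECT ON departments TO analyst;
REVOKE SELECT ON departments FROM analyst;
```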

So, IT guys have been working for decades to make all that possible. There is no single database type suitable for everything. There are hundreds of different systems (by the way, do you know Clusterpoint, a document-oriented database management system, search and Big Data analytics platform with a development center based in Latvia?). Each of them claims to be the best 🙂 and each has its strengths. The giants tend to write their own languages and their own architectures. There are custom-written systems, like Haystack, a highly scalable object store used to serve Facebook’s immense number of photos. Facebook Messages uses its own architecture. Facebook Query Language (FQL) was introduced in 2007 (it is no longer available).

Each software vendor may call the methods differently and have different syntax and a different approach. But once you get the idea, all the rest is a matter of mindset, technique and reading manuals.

Let’s have some googled examples.

Data definition – creating very simple tables

Oracle (Relational database) example

CREATE TABLE departments 
(  
 department_id number(10) NOT NULL,  
 department_name varchar2(50) NOT NULL,  
 CONSTRAINT departments_pk PRIMARY KEY (department_id) 
);

IBM InfoSphere Hadoop example

CREATE HADOOP TABLE t (
 i int, s string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE;

Hive example (Hive Query Language (HQL) statements are similar to standard SQL statements. Apache Hive is considered the standard for interactive SQL queries over petabytes of data in Hadoop.)

CREATE TABLE products (url STRING, category STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE
 TBLPROPERTIES ("skip.header.line.count"="1");

MongoDB example

Inside the database are collections, which are somewhat similar to SQL tables. Since Mongo documents exist independently, documents inside the same collection can have different fields, yet they can still be queried by the fields they share.

> db.books.insert( 
{ 
 "title" : "Kafka on the Shore", 
 "author" : "Haruki Murakami", 
 "publish_year" : 2002, 
 "genre" : ["fiction", "magical realism"] 
} )

Data Manipulation – inserting data

Oracle (RDBMS)

INSERT INTO categories (category_id, category_name) 
VALUES (150, 'Miscellaneous');

Mongo DB

db.inventory.insertMany([
  { item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
  { item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
  { item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
  { item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
  { item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);

Hive

LOAD DATA INPATH '/tmp/maria_dev/products.tsv' 
OVERWRITE INTO TABLE products;

Data Manipulation – querying data

Oracle (Relational database) – find all departments whose names start with T

SELECT department_id FROM departments 
WHERE UPPER(department_name) LIKE 'T%';

Elasticsearch big data query example – find all records whose author’s name begins with the letter ‘t’

POST /bookdb_index/book/_search 
{ "query": 
  { "wildcard" : 
    { "authors" : "t*" } 
  }, 
   "_source": ["title", "authors"], 
   "highlight": 
   { "fields" : 
       { "authors" : {} } 
   } 
}

MongoDB example

Select all documents in the collection where the status equals “A” and either quantity is less than 30 or item starts with the character p:

db.inventory.find( { 
    status: "A", 
    $or: [ { qty: { $lt: 30 } }, { item: /^p/ } ] 
} )
The operation corresponds to the following SQL statement:

SELECT * FROM inventory 
WHERE status = "A" 
AND (qty < 30 OR item LIKE "p%")

Hive example
SELECT to_date(o.ts) logdate, o.url, o.ip, 
o.city, upper(o.state) state,
o.country, p.category, 
CAST(datediff(from_unixtime(unix_timestamp()),
 from_unixtime(unix_timestamp(u.birth_dt, 'dd-MMM-yy'))) / 365 AS INT) age, 
 u.gender_cd
FROM omniture o
INNER JOIN products p
ON o.url = p.url
LEFT OUTER JOIN users u
ON o.swid = concat('{', u.swid , '}');

It seems Hive does not support UPDATE, only INSERT OVERWRITE.
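As a sketch of what that means in practice: an “update” in Hive is typically emulated by rewriting the whole table, keeping every row but swapping in the new value where some condition matches. The table and column names below reuse the products table from the earlier examples; the URL and new category value are invented:

```sql
-- Hypothetical: "update" one category by rewriting the entire table.
INSERT OVERWRITE TABLE products
SELECT url,
       CASE WHEN url = 'http://example.com/item'
            THEN 'toys'
            ELSE category
       END AS category
FROM products;
```

Rewriting a whole table to change one row is exactly the kind of trade-off that separates batch-oriented big data tools from transactional RDBMSs.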

Other Big Data examples – to be honest, they were quite hard to google without a proper understanding of what is what in the Big Data world.

Hopefully I’ll find answers later during the course. A challenge – and I should accept it.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation, plus whatever value I may have added by googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.
