Big Data: HADOOP ecosystem. There is no one ‘hadoop.zip’ to install


When I was a kid, there was a saying – we say ‘the Party’ and mean Lenin, we say ‘Lenin’ and mean the Party. Today we say Big Data and mean HADOOP, which is presumed to process about 50% of enterprise data around the world – clusters of hundreds to thousands of nodes and petabytes of data are nothing special.

In one of the training videos I heard about a HADOOP cluster of 8500 computers and 100 petabytes of data.

HADOOP is everywhere: social media, search tools, government, finance etc. – Yahoo, Facebook, Amazon, eBay, IBM, American Airlines and others. P.S. This also means a very high volume of job postings: a HADOOP admin earns about 90k–120k USD/year, a developer/data scientist 60k–150k USD/year.

There is no single “hadoop.zip” which you can download, install and happily have HADOOP.

In one sentence: the HADOOP framework is a collection of open source, LINUX based software that lets a big data flow be split and processed locally, using parallel, distributed (map-reduce) algorithms across many (one to thousands of) low-cost computers, with an application-layer file system (HDFS) handling hardware faults. HADOOP started its rise around Y2003.

Often referred to as the ‘HADOOP ecosystem’, ‘HADOOP umbrella’ or ‘HADOOP cluster’, it addresses:

  • lots of data flowing in at a very high speed,
  • large and increasing volume,
  • a variety of unstructured, heterogeneous data – logfiles, videos, messages, records, comments, chats,
  • if you lose a computer, it automatically rearranges and replicates the data – no data loss

HADOOP supports running applications on Big Data, such as a social networking portal, a recipes portal or retail trend analytics.

It took me a while to understand that Hadoop is like a batch data processing factory in the back office: a set of tools, each strong in a particular area. Outside HADOOP there are front-office applications that need something done with the data – queried, stored etc. – e.g. a search feature, the TOP 10 best selling sci-fi books this week, or retrieving a Facebook message. These applications send their work to HADOOP as tasks or jobs to be queued and performed by map reduce, and the results are sent back to the calling applications. All of this is done on data stored in the HDFS file system.

Examples of applications where HADOOP is behind the scenes:

  • Mining user behaviour to generate targeted advertisements and recommendations (Apache Mahout)
  • Searching and grouping documents based on certain criteria
  • Searching for uncommon patterns to detect fraud
  • And, of course, Facebook, which runs the world’s largest Hadoop cluster. Your messages, likes, shares and comments are there, in Facebook data centers and HADOOP. When their previous messaging platform was running into its limits, they spent weeks testing different frameworks, evaluating clusters of MySQL, Apache Cassandra, Apache HBase and other systems. Facebook selected Apache HBase, a member of the HADOOP family.

Programmers love HADOOP because they do not have to worry about where the data is, what happens if a computer fails, how to split big data or how to break down tasks. If the code works for a megabyte file, it will work for hundreds of gigabytes. If it works on one computer, it will work on 8500 computers. That is HADOOP’s business – scalability.

Teamwork. No superhero.

With the classical enterprise mindset, these data would be handled by one supercomputer. But everything has its limits, even supercomputers – be it a processing speed limit or a disk space limit.

HADOOP splits and conquers, combining the “weakness” of each separate computer (low price, so-called commodity hardware) into a powerful engine. You can add more and more computers (scalability) as your data grow or requirements change. The bright side of HADOOP scalability is that processing speed scales linearly with the number of computers: to double the processing speed, double the number of computers.

  • The data processing part of HADOOP is map-reduce.
  • The data storage part is HDFS (HADOOP Distributed File System). A minimal code sketch of map-reduce follows below.
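
To make this less abstract, here is a minimal sketch of what a map-reduce program looks like to a programmer – the classic word count, written against the Hadoop Java MapReduce API. The input and output paths are hypothetical HDFS locations passed on the command line; the point is that the same code runs unchanged on one laptop or on 8500 nodes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in an input line, emit the pair (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum up the 1s collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths, e.g. /cats/comments and /cats/wordcounts
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each map call processes one chunk of an input file on the node where that chunk is stored and emits (word, 1) pairs; the framework groups the pairs by word and the reducers add them up. Splitting the data, shipping the code to the data and retrying failed pieces is all HADOOP’s business.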

Slaves

Computers incorporated into the HADOOP framework are called slaves. Each has the following processes running on it:

  • Task Tracker: responsible for performing the piece of the task assigned to this computer
  • Data Node: responsible for managing the piece of data given to this computer

Master (or masters)

As the word slaves might have made you think, they have a master. NB: the master computer is not a supercomputer; it is commodity hardware like the others. This node has the same processes running as each slave – task tracker and data node – plus two additional processes:

  • Job Tracker: breaks tasks into smaller pieces, sends them to task tracker processes, receives the results, combines them and sends the result back to the calling application
  • Name Node: keeps an index of which data reside on which node (computer). It tells the calling application which node stores the required data, and the application contacts that node directly, not depending on the Name Node to return the dataset. The data flow itself never goes through the Name Node; it only points to the node storing the data.

Task Tracker and Job Tracker are the puzzle pieces of the Map Reduce component.

Name Node and Data Node are puzzle pieces of HDFS.

HDFS

HDFS software is built on the assumption that hardware failures will definitely happen. This is called built-in fault tolerance.

By default, HADOOP maintains three copies of each data file on different computers. When a node fails, the system keeps running and the missing copies are re-created on other nodes; if a node is added or restored later, data copies are spread onto it again.
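
A small sketch of how this looks from the HDFS Java client side – the name node address and file path below are hypothetical. It writes a file with a replication setting of three (the default anyway) and then lowers the replication of that one file:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for 3 copies of every block (this is the HDFS default).
        conf.set("dfs.replication", "3");

        // Hypothetical name node address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // The client asks the Name Node only for metadata;
        // the file blocks themselves go straight to the Data Nodes.
        Path file = new Path("/cats/comments/part-0001.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Picadilla is the best cat");
        }

        // Replication of an existing file can also be changed afterwards.
        fs.setReplication(file, (short) 2);

        fs.close();
    }
}

Whether three copies or two, it is HDFS, not the application, that keeps track of where the copies live and re-creates them when a node disappears.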

Interestingly, fault tolerance applies not only to classical “disk failure” but also to the failure of task tracker processes. If that happens, the Job Tracker asks another node to perform that piece of work.

Does it sound like the master computer is a single point of failure now?

HADOOP also has a solution for master failures. The tables in which masters store the data indexes (which data files are on which nodes) are also backed up to different computers. HADOOP can be set up with more than one master, where a backup master takes over in case the main master fails.

When trying to understand this universe you feel like you are in The Bold and the Beautiful, where everyone has married everyone else several times and has several backup wives :D

Some of the tools within the HADOOP ecosystem – be aware, the list is growing

Apache HIVE: data warehouse – query, analysis, summaries. It takes a SQL-like language (HiveQL) and compiles it into Map-Reduce jobs. A common HIVE pattern, unlike in plain SQL, is to write the query results straight into another table, as in the example below.

Facebook reportedly uses HIVE for up to 90% of their computations.

I wrote a HiveQL query which finds all Picadilla page views in October 2017 that were referred by my cat farm’s main competitor, BestCats.com:

INSERT OVERWRITE TABLE bestcats_refer_picadilla
SELECT picadilla_page_views.*
FROM picadilla_page_views
WHERE picadilla_page_views.date >= '2017-10-01' 
AND picadilla_page_views.date <= '2017-10-31' 
AND picadilla_page_views.url_referrer like '%bestcats.com';

Apache HBASE: people need to read and write their data in real time. HBASE meets that need: it is a noSQL big data store, providing real-time read/write access to large datasets in a distributed environment (as all our data are spread across the HADOOP cluster).

HBASE can be accessed by Hive, by Pig and by Map Reduce, and stores the data in HDFS.

Facebook messaging is one of the most famous examples of HBASE use. When you send a message to a friend on Facebook, that message is actually a row in an HBASE table. The more messages you send, the more nodes Zuckerberg has to add to the FB cluster :)
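
To give a feel for the “real-time read/write” part, here is a small sketch using the HBase Java client API. The messages table, its m column family and the row key scheme are made up for illustration; the pattern is a put followed by an immediate get:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageStoreSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table messages = connection.getTable(TableName.valueOf("messages"))) {

            // Write one "message" as a row keyed by sender and timestamp.
            Put put = new Put(Bytes.toBytes("alice#1509977640"));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"),
                    Bytes.toBytes("Have you seen Picadilla today?"));
            messages.put(put);

            // Read it back immediately - no batch job involved.
            Get get = new Get(Bytes.toBytes("alice#1509977640"));
            Result result = messages.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"))));
        }
    }
}

The data behind that table still end up as files in HDFS; HBASE adds the indexing and the millisecond-level access on top.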

HCatalog: table metadata storage, pulled out of HIVE (its metastore) and treated as a separate project, so that different data processing tools — Pig, MapReduce — can read and write the same table definitions.

Apache ZooKeeper: a centralized service for maintaining configuration information and coordinating distributed processes. It also stores some of the HBASE metadata.
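
A minimal sketch of how an application might keep one shared setting in ZooKeeper, using the ZooKeeper Java client; the ensemble addresses, the znode path and the setting itself are invented for illustration:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Wait until the session to the (hypothetical) ensemble is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of configuration under a znode...
        String path = "/catsite_config";
        byte[] value = "max_cats_per_page=25".getBytes(StandardCharsets.UTF_8);
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ...and every node in the cluster can read the same value back.
        byte[] stored = zk.getData(path, false, null);
        System.out.println(new String(stored, StandardCharsets.UTF_8));

        zk.close();
    }
}

HBASE itself relies on ZooKeeper in a similar way, for example to keep track of which of its servers are alive.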

Apache Mahout: machine learning for targeted advertisements and recommendations. It will scan the data in the system and recommend ‘movies you might like’ or ‘places for you to explore’. It lets you write map reduce applications focused on machine learning.

Apache Pig: a tool for analysing data. It provides a high-level language for programming Map-Reduce logic (Pig Latin – somewhat similar to SQL). Programmers write a high-level description and Pig translates it into Map-Reduce jobs and runs them.

Yahoo’s experience was that about 50% of tasks are written using Pig instead of programmers coding Map Reduce directly.

I wrote a very simple script which takes all comments from my imaginary cats site and filters the mentions of my favorite cat Picadilla into a new file, picadilla_mentions.

This action might be sent to HADOOP when you open my cats site and click on Picadilla’s photo. Then a page could open telling you what a famous cat she is:

catnotices = LOAD 'cats_site_comments';
picanotices = FILTER catnotices BY $0 MATCHES '.*PICADILL+.*';
STORE picanotices INTO 'picadilla_mentions';

Apache Oozie: a workflow scheduler to manage Hadoop batch map reduce job runs – e.g. run this job on this schedule, with this interval etc. It has other features too, such as triggering a map reduce job when the necessary data appear.

Apache Flume: collects large amounts of log data and provides a data model for log analytics. It loads log data into HDFS to be processed by map reduce.

Apache Sqoop: a tool for transferring bulk data between Apache Hadoop and structured data stores such as relational databases (Oracle, for example); under the hood it runs map reduce jobs to do the transfer.

Roles

HADOOP administrator and user roles overlap. In general, administrators perform:

  • Installation
  • Monitoring
  • Tuning (users help to do that)

Users perform:

  • Designing and coding apps for certain objectives (admins help with that)
  • Importing/exporting data
  • Working with HADOOP ecosystem tools and connectors

Pricing

HADOOP open source software is free.

Since Y2006 HADOOP software has been distributed under the Apache Software Foundation, an American non-profit corporation formed as a decentralized open source community of developers. The software they produce is distributed under the terms of the Apache License and is free and open-source. The idea of the license is that no particular company controls HADOOP.

ASF makes money from donations, sponsorships and running conferences.

HADOOP ecosystem vendors use different pricing models: service per node, per terabyte, subscription, hardware plus software deals, selling specialised connectors, selling cloud services and an interesting concept, freemium – free usage up to a threshold limit.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my own interpretation, possibly with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

2 responses to this post.

  1. Posted by McB on 07/11/2017 at 18:54

    It would be more interesting to watch the whale mating ritual on Discovery than this nightmare.
    The full, 5-hour version.


    • Posted by Mr. Meeseek on 08/11/2017 at 12:48

      For me, as a practical person, this post seems good – American salaries are mentioned, as well as concrete tools and a code example.
      In contrast to posts with only abstract buzzwords – big data, data scientist, cloud etc.

