Big Data: domesticating the MapReduce wildcat


MapReduce is a simple yet extremely powerful technique, or even a paradigm. The concept was introduced by Google and is nowadays widely used in Big Data systems, e.g. Apache Hadoop.

To be honest, it took me some time to understand, despite well-designed study materials, because once you start thinking, you have so many what-if questions. The hardest nut was data cleaning, because the samples I googled use unbelievably clean data, like ‘word count in a book’. Guys, please show me a book where w0rdzz a l1k3 ^|^hee$e and I’ll be happy to see your usable results. It’s as if they showed you only paper plane folding while you were hoping to learn about a rocket ship, or at least its paper model.

That’s what this blog is for – documenting my learning journey.

Back to MapReduce now. It does not change the original data in any way. It only performs logical operations and calculations on very large amounts of data stored on many different servers – operations that answer user questions like:

  • which is the longest word in these 10 books?
  • which is the most popular word in the 100 thickest books in this bookshop?
  • what percentage of words contain both r and n in the city library?
  • how many times more often are vowels used than consonants in Indo-European languages?
  • which is the longest word sequence common to the Twilight Saga and Harry Potter?
  • which of the friends that Alex, Max and I have in common comment on my public cat posts more than they comment on cat posts on average?
  • for each country, what proportion of its Twitter users tweet about their own national football team versus their direct neighbour’s team?

And note – all that must be done on terabytes and petabytes of data, and done fast. Put your Big Data hat on. You could count the words in one medium-size book by yourself, but your life would be too short to count a whole library, let alone ten libraries.

Ahh, books… You know I am all about cats, so we have a Cat Farm instead. We run our farm site: a cat souvenir online shop, cat shows, forums and charities, thousands of photos, and people liking and commenting on them.

Brief background of the Cat Farm site

When we started 10 years ago, it was a small hobby with articles, running on one SQL server, with plain-text comments limited to 254 characters each. As we posted more articles, introduced forums, and users commented more, we added a new disk to the server and altered the tablespace to add a datafile. That was scaling up.

After Madonna visited our Cat Farm 5 years ago, site popularity boomed, and our server and SQL database could not handle the social networking features we wanted to introduce. We moved away from the RDBMS to the Big Cat Big Data System (BCBDS) with the Hadoop file system (NB: according to the study plan, a lecture about it comes later) and the MapReduce framework.

There are 3 nodes in our BCBDS (3 because Alex, Max and I had 3 quite powerful computers that we were no longer using in this smartphone era). Each node now stores several gigabytes of user comments. As the user count grows and more comments flow in, we plan to add two more nodes (medium-class servers) soon, to parallelise the work and improve performance. That will be scaling out.

MapReduce Cat Example

The local newspaper has an anniversary and is looking for cute stuff to write about. Of course, cats. They are curious:

  • which of our cats is discussed the most in our Cat Farm page comments?
  • are tri-color cats discussed more than white or foxy ones?

Input

Let’s face reality: our page comments look like this:

Mango u soo butiful;;this cat looks so fat and ugly like my cousin Fred;;ahh I love Amber;; hi my name is FRED and I am 5 years old;;what a sweety your Piccy is;;mmm this sweet-kitt am-ber looks like real amber;;is his name Mingo or Minko?;;folks there feed poorMurmor;;I wanna adopt Amber;;soo cute FrEddy;;OMG OMG murrmorr my love he looks like my Angus;;could you please shut up with your cats

Cat Farm Analyst or Programmer: Define Map function

First, we must define which keywords – in this case, names – we are looking for. These rules will be distributed to each node as the map function, to be applied to its stored data – in our case, a lot of text files.

We know the names of our cats, so in this example we do not have to write complex name-recognition logic to decide whether a comment is related to our question.

  • We set a filter: process only the words Mango, Fred, Amber, Picadilla, Minko, Murrmor
  • We add a condition to ignore letter case (Fred=FRED=FrEd etc.)

To improve the results, we take a quick look at the data to find some typical typos.

For a long time I was in doubt: which filtering and transforming of cat names should be part of the Map function, and which of the Reduce function? I kept reading and kept being upset – why do internet people use such perfect samples? Then I found the article Top 50 MapReduce job interview questions and answers, and voila! Map: ‘… in which we specify all the complex logic/business rules/costly code’.

  • We add skipping of non-letters (am-ber=amber)
  • We add collapsing of doubled letters (Piccadilla=Picadilla)

We discussed the option of counting only ‘Amber’ and not ‘amber’, and maybe also cutting off ‘cousin Fred’, but found it too time-consuming for a local newspaper request.

[Magic] Here we use the MapReduce system manual to code: “Dear Node, please, for each word found satisfying these conditions, return me the key-value pair <word,count>.” I’ll skip the real framework code for now because my practical skills are not good enough yet. [/Magic]
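Still, to test my own understanding, here is a rough sketch of such a map function in plain toy Python – my own simplification of the rules above, definitely not real Hadoop code:

    import re

    # Our keyword filter: the cat names we are looking for.
    CAT_NAMES = ["Mango", "Fred", "Amber", "Picadilla", "Minko", "Murrmor"]

    def normalize(word):
        word = word.lower()                   # ignore letter case: Fred = FRED = FrEd
        word = re.sub(r"[^a-z]", "", word)    # skip non-letters: am-ber = amber
        word = re.sub(r"(.)\1", r"\1", word)  # collapse doubled letters: Piccadilla = Picadilla
        return word

    # Normalized form of each name -> the name itself, so typos land on the real cat.
    LOOKUP = {normalize(name): name.lower() for name in CAT_NAMES}

    def map_comment(comment):
        # "Dear Node, for each word satisfying these conditions, return (word, 1)."
        for raw_word in comment.split():
            key = LOOKUP.get(normalize(raw_word))
            if key is not None:
                yield (key, 1)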

The Map function is now defined. Each node will calculate zero or more key-value pairs.

Cat Farm Analyst or Programmer: Define Reduce function

The Reduce function performs lightweight processing, like aggregation or summation, to get the desired output from the key-value pairs each node has calculated. We will use two reducers: one for the cat name counts and the other for the colour counts. I believe this reduction could also be done within one function, but that is beyond my skills yet.

Function for Cat names: group by key and sum values.

Function for colours:

  • If key is ‘fred’ or ‘amber’ then add value to ‘foxy’ counter
  • If key is ‘mango’ or ‘minko’ or ‘murrmor’ then add value to ‘tri-color’ counter
  • If key is ‘picadilla’ then add value to ‘white’ counter

Both reduce functions return key-value pairs of the form <word,count>. In a real system, this result might then be processed by user interface software – for example, to capitalize the first letter and show the result in a large coloured font in the centre of the screen.
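Since I already cheated with toy Python above, here is how I picture the two reducers – again only a sketch of the idea, since a real framework would hand each reducer one key with its list of values, not the whole stream:

    from collections import Counter

    def reduce_names(pairs):
        # Group by key and sum the values.
        totals = Counter()
        for name, count in pairs:
            totals[name] += count
        return list(totals.items())

    # Which cat is which colour, straight from the rules above.
    COLOUR_OF = {"fred": "foxy", "amber": "foxy",
                 "mango": "tri-color", "minko": "tri-color", "murrmor": "tri-color",
                 "picadilla": "white"}

    def reduce_colours(pairs):
        # Add each name's count to its colour counter.
        totals = Counter()
        for name, count in pairs:
            totals[COLOUR_OF[name]] += count
        return list(totals.items())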

To digress a bit: the map and reduce function logic really depends on our needs and on what we treat as a key. Currently the key is the cat name, because that is what we are looking for. But if we were looking for the time of day when most comments come in, then the key might be the minute of the day and the value the comment count within that minute: (1,45), (2,49), …, (1440,34). The Reduce function might then be defined to group by hours, as in the sketch below.
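A toy sketch of that variant, assuming every comment carries a timestamp:

    from collections import Counter

    def map_comment_time(timestamp):
        # Key = minute of the day (1..1440), value = 1 per comment.
        yield (timestamp.hour * 60 + timestamp.minute + 1, 1)

    def reduce_by_hour(pairs):
        # Group minute counts into hours: minutes 1..60 -> hour 1, and so on.
        per_hour = Counter()
        for minute, count in pairs:
            per_hour[(minute - 1) // 60 + 1] += count
        return sorted(per_hour.items())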

Reduce functions defined. We are eager for the results. Lights… Camera… Go!

Framework: Split phase

The MapReduce framework distributes our mapping-rules function among the nodes – the mappers. Usually there is an orchestrator node, set up by the framework’s software processes.

NB: in our BCBDS, nodes also store backup copies of other nodes’ data (the Cat Farm is afraid of page comments becoming unavailable or even lost). The MapReduce framework’s built-in logic automatically takes care that data is not double- (triple-, …) mapped and counted, which is why I do not write about that here.

Nodes: Map phase

Each node applies the same map function to its stored data, mapping it to key-value pairs. All nodes do this in parallel.

Node1 scans its gigabytes of comments in txt files and does the mapping:

(fred,1)

(mango,1)

(fred,1)

(fred,1)

(amber,1)

Node2 and Node3 do the same with their stored data.
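To convince myself, I simulated a mini-node by feeding the first few comments from the Input section through the toy mapper above:

    comments = ("Mango u soo butiful;;this cat looks so fat and ugly "
                "like my cousin Fred;;ahh I love Amber")
    for comment in comments.split(";;"):
        for pair in map_comment(comment):
            print(pair)  # prints ('mango', 1), then ('fred', 1), then ('amber', 1)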

Nodes: Combine phase (sometimes called mini-reducer)

It would be a waste of time and network traffic if all the “single” pairs were sent as input to the reduce function, so it is reasonable to do basic aggregation on the nodes first. Node1, as a mapper, combines pairs with the same key, and the result is:

(mango,117)

(fred,568)

(amber,344)

Node2 scans its stored data, and its result after the map and combine phases is:

(picadilla,7)

(amber,768)

(minko,93)

Node3 scans its stored data, and its result after the map and combine phases is:

(murmor,76)

(amber,7)

(fred,701)
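Nicely, in my toy Python the combiner is the very same summation as the name reducer above, just run locally on each mapper node before anything travels over the network:

    from collections import Counter

    def combine(pairs):
        # A local mini-reduce: hundreds of (fred,1) pairs leave the node
        # as a single (fred,568) pair.
        totals = Counter()
        for name, count in pairs:
            totals[name] += count
        return list(totals.items())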

Framework: Shuffle and sort phase

During shuffling, the data is transferred from the map function to the reduce function. Without it, the final output is not possible, as there is no data – we cannot group and sum cat names if no names are provided.

I initially assumed that nodes send their result data over to some central processor, which then sends data back – and I am still digesting the actual paradigm: no, they don’t. This is the beauty of distributed computing frameworks – their processes orchestrate the flow, with internal algorithms deciding how the combined key-value pairs are distributed over the nodes that perform the reduce function (e.g., it must never happen that the reduce function misses the cat name counts from some node). We will have guest lecturers later – real system architects – and I hope they will openly share details.

The MapReduce framework’s built-in logic (framework software processes on the orchestrator node) shuffles and sorts the whole result set. It splits the results from the nodes (in their previous role as mappers) among the nodes (in their current role as reducers) to calculate the outcomes (to perform the Reduce function).
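I cannot see inside our framework, but from what I have read, a common default (e.g., Hadoop’s HashPartitioner) assigns keys to reducers by hashing – something like this sketch:

    import hashlib

    def partition(key, num_reducers=3):
        # A stable hash, so every node computes the same reducer number for the
        # same key - all 'amber' pairs meet on one reducer, none are left out.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % num_reducers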

Reduce phase for Cat names count

The orchestrator arranges that Node1 will reduce the key-value pairs

(amber,344)

(amber,7)

(amber,768)

Node2 will reduce

(fred,568)

(fred,701)

(mango,117)

Node3 will reduce

(minko,93)

(murmor,76)

(picadilla,7)

Note: all the ambers are on one reducer node, all the freds on another. I do not know yet what would happen if one group were disproportionately large for a single reducer.

Note: a careful analyst might wonder why Picadilla is so unpopular. It is because in comments people often write Piccy or Pica or Picady, but we did not notice that when defining the Map function. Yeah, keyword tuning is a real challenge with an uncleaned source like comments. Remember, this is not a traditional RDBMS or data warehouse, where we are used to strict data validation and cleaning rules at the entry point or in a regular process. This is the BIG DATA world – data just flows in.
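If we had caught those nicknames in time, a hypothetical fix in the toy mapper would be an alias table (reusing normalize and LOOKUP from the Map sketch above; note that normalize('Piccy') gives 'picy'):

    # Normalized nickname -> official name.
    ALIASES = {"picy": "picadilla", "pica": "picadilla", "picady": "picadilla"}

    def map_comment_v2(comment):
        for raw_word in comment.split():
            word = normalize(raw_word)
            key = LOOKUP.get(word) or ALIASES.get(word)
            if key is not None:
                yield (key, 1)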

Reduce phase for colours count

I assume shuffling will be done differently here, as conditions are applied to the nodes’ output. I am still learning this.

Final output

Final output of Reduce for colours is

(foxy,2388)

(tri-color,286)

(white,7)

The output of Reduce for cat names is

(amber,1119)

(fred,1269)

(mango,117)

(minko,93)

(murmor,76)

(picadilla,7)

We copy the data to Excel, remove the brackets, capitalize the first letters, and send it to the newspaper. Soon we are reading an article about our Farm with several fun facts:

  • people love foxy cats more than 340 times as much as white ones,
  • the most popular cat name in this Farm is Fred.

Folks, be careful with the statistics you provide. Somebody might take them seriously.

NB in conclusion

As we are talking about exa-, pexa-, schmexa-bytes and parallelisation among several nodes by the MapReduce framework, the natural question is: how do we balance the nodes’ load? It would not be OK if one node received a million words starting with ‘A’, ‘B’, …, ‘W’ to calculate and a second node only ten words starting with ‘Z’, because there would be a delay waiting for Node1’s results due to the unbalanced load.

You’ll also ask – well, how do we decide how many nodes should be used for the Map and Reduce phase calculations, and how should the key-value pairs be distributed in a balanced way?

Hehe, that’s what MapReduce framework developers and architects are paid for :) Our business here is to define things: our key will be the cat name and the value its count, this is our Map function, this is our Reduce function. The framework’s internal business is how to distribute these functions to nodes and how to split, shuffle, and balance the reducing. There is no single global standard for that.

Some developers keep their internal logic a commercial secret, while some proudly publish whitepapers to show their system’s strengths. There are rumours that some have patented their approach, while others are open source.

Thank you for your patience. I hope you also have more questions now :) Key-value databases and graph databases are next in my learning queue.

P.S. My application for the cat auto-feeding practice project was accepted. Have I developed it? No. Do I have a clue how to? No. Am I afraid? No. Will I be able to do it? I hope so.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation, perhaps with added value from googling to understand the topic better.
This is neither a formal review nor requested feedback, nor complete study material.
