Archive for the ‘Mācību lietas’ Category

Big Data: basics of document oriented databases


One of the most important concepts of NoSQL is that the term 'document' can be treated very broadly: my passport is a document, and so is this blog entry.

Use cases of document oriented databases:

  • Storing log files
  • Storing a high volume of varied incoming data – e.g., one machine measures rotations per minute and produces CSV, another measures temperature and produces XML, and yet another keeps track of rejected operations and produces fixed-length files
  • Efficient storage. For example, you have 30 attributes in your user profile like 'date of birth', 'preferred music', 'favorite sport' etc. One user might have provided only the date of birth, another only the favorite sport, and so on. You do not have to store the whole structure for each user saying that 'date of birth' and 'preferred music' were not provided (like you would store NULL or 'n/a' in an RDBMS table structure). You just store the data you have: user1 'favorite sport'.
  • Storing blogs and articles with comments and likes
  • Storing business data where searching by metadata and content is crucial
  • Storing documents whose structure can change at any time – e.g., you can easily add a new feature like reactions to your blog platform because you do not have to redefine the whole schema
  • Storing new types of documents. You were storing only cats, but today you can start storing and querying also dogs, trees, galaxies and stationery
  • Write operations are very fast, as document oriented databases (DOD) do not carry the overhead of RDBMS transaction and locking mechanisms
  • Practically unlimited database growth, since documents are stored as key-value pairs, with the document ID being the key and the document being the value.

The idea of a DOD is to provide a scalable framework for storing, inserting, querying, retrieving and updating (at the single-document level) an unlimited amount of self-describing structured or semi-structured data items, called "documents". Part of the data is the document content itself, and part is data about data (metadata).

The teacher used MongoDB as the example, and you can see why (https://db-engines.com/en/ranking/document+store). Highlighted in yellow is Clusterpoint, produced by a Latvian team:

document_db_ranking

Usually documents are stored as JSON or XML, or in binary formats like PDF, MS Word or Excel (e.g., MongoDB and CouchDB use JSON). Each document has a unique ID and can have its own structure; there is no predefined schema. However, we should understand that document databases are not designed to fit everything. For example, they are not the best choice for deeply nested data, because they cannot search it effectively.

As I am an SQL person, here comes my survival kit (https://www.slideshare.net/fabiofumarola1/9-document-oriented-databases):

RDBMS_docdb_terminology

Relations among data items can be represented in two ways: referencing and embedding. Which one is better? Hah, this is The Question over which every developer and architect has a lot of pain; lots of guidelines have been written and tons of stackoverflow.com posts exist. It is more art than science. Each approach has pros and cons. The good news is that you can always change your mind. This is one of the reasons I love programming: I can play a lot. If I were a surgeon, it would be much harder to recompile.

Referencing to food stored in another document:

{
 _id: "cat_1",
 name: "Picadilla",
 colour: "brown",
 food: "catfood_123",
 amount: 2,
 dateofbirth: "10-OCT-2010"
}
{
 _id: "catfood_123",
 name: "Tasty Chicken Liver",
 producer: "Catty Food Inc.",
 address: "Wildcat Boulevard 17"
}

Embedding food data in single document:

Notice that food has no ID itself – the _id field is a required field of the parent document, and is typically not necessary for embedded documents. You can add an _id field if you want.

{
 _id: "cat_1",
 name: "Picadilla",
 colour: "brown",
 food:
   {
    name: "Tasty Chicken Liver",
    producer: "Catty Food Inc.",
    address: "Wildcat Boulevard 17"
   },
 amount: 2,
 dateofbirth: "10-OCT-2010"
}

Some ideas collected about referencing vs embedding:

  • the more you keep in a single document, the better – it is easy to query
  • any data that is not useful apart from its parent document should definitely be part of the same document
  • data that is meant to be referred to from multiple places should be separated into its own collection
  • embedding is good for one-to-one or one-to-many relationships, and referencing is good for many-to-many relationships
  • embedded documents are easy to retrieve (everything is stored in one place). When querying by parts of them, limitations exist, e.g. sorting is limited to insertion order
  • to my surprise, I read that there are no big differences in insert and update speed (a small sketch of both retrieval styles follows this list)
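To make the difference tangible, here is a little sketch I put together – assuming a local MongoDB, the Python pymongo driver and collection names cats and foods matching my examples above (all of that is my assumption, not a prescription):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["catfarm"]  # hypothetical database name

# Referencing: two round trips - first the cat, then the food document it points to
cat = db.cats.find_one({"_id": "cat_1"})
food = db.foods.find_one({"_id": cat["food"]})
print(cat["name"], "eats", food["name"])

# Embedding: one round trip - the food travels inside the cat document
cat = db.cats.find_one({"_id": "cat_1"})
print(cat["name"], "eats", cat["food"]["name"])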

Consistency

When you design your schema, consider how you will keep your data consistent. Changes to a single document are atomic (the complete operation is guaranteed), but when updating multiple documents, it is likely that at a given moment in time the address of the same cat food producer may differ among cats (NB: there are a few databases, like Clusterpoint, which can handle multi-document updates in a transaction). In general, there is no way to lock a record on the server; the only way is to build locking of a field into the client's logic.
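For illustration, a hedged pymongo sketch of such a multi-document change (continuing the embedded design above; update_many touches the cats one by one, so a concurrent reader may briefly see the old address on some of them):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["catfarm"]  # hypothetical names

# The producer address is copied into every cat document that embeds this food,
# so changing it means many single-document updates, not one atomic transaction.
result = db.cats.update_many(
    {"food.producer": "Catty Food Inc."},
    {"$set": {"food.address": "Wildcat Boulevard 99"}},
)
print(result.modified_count, "cats updated")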

Remember, NoSQL systems by design support BASE, not ACID transactions. It is normal and expected that at a given moment you may see different addresses for the same food producer – or different comment content, or a different view count.

My favorite example is the Candy Crush daily free booster. If I spin it on a tablet and later try on a phone, I get 'come tomorrow'. But if I spin on all the devices at once, I get a booster on each device #lifehack. In an RDBMS, the transaction control mechanism would guarantee that once spun, it cannot be repeated.

Very powerful querying

Documents can be queried by any of their attributes, like

{
 name: "Picadilla",
 colour: ["brown", "amber"]
}

The querying language is powerful but not obvious, even though I am a guru of SQL querying. It would take a while to learn. Below is nice material showing SQL mapped to MongoDB queries.

As usual, the speed of each query should be checked before going live. Similar to RDBMS, a nice feature is calling 'explain' on a query to see what the database is doing when running it – which index it is using, etc.

MongoDB_SQL2
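For illustration, a small pymongo sketch of both things – querying by attributes and asking for the plan with explain (the collection and field names are again my cat examples, and the exact explain output depends on the MongoDB version):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["catfarm"]  # hypothetical names

# query by any attributes of the document
for cat in db.cats.find({"colour": {"$in": ["brown", "amber"]}}):
    print(cat["name"])

# ask the server what it actually did - full scan or index use
plan = db.cats.find({"name": "Picadilla"}).explain()
print(plan["queryPlanner"]["winningPlan"])  # COLLSCAN until an index on 'name' exists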

Inserting new records

Again with MongoDB as the example. Insert one:

db.inventory.insertOne(
   { item: "catfood", qty: 10, tags: ["chicken", "liver"], ingredients: { water: 50, meat: 30, fat: 30, unit: "pct" } }
 )

Insert many:

db.inventory.insertMany([
   { item: "catfood", qty: 10, tags: ["chicken", "liver"], ingredients: { water: 50, meat: 30, fat: 30, unit: "pct" } },
   { item: "dogfood", qty: 23, tags: ["beef"], ingredients: { water: 40, meat: 20, fat: 40, unit: "pct" } },
   { item: "kittenfood", qty: 3, tags: ["turkey", "fillet"], ingredients: { water: 55, meat: 30, fat: 25, unit: "pct" } }
])

Updating

A nice reason to learn the upsert option: if set to true, it creates a new document in case no document matches the query criteria. multi: true means that if matching documents exist, the operation updates all of them (note that with multi the update must use an operator like $set):

db.cats.update( { "name": "Meowcha" },
   { $set: { "colour": ["brown", "white", "black"] } },
   { upsert: true, multi: true } )

Deleting

We set deletion criteria and remove matching documents. E.g., the following operation removes a single document (the first match) from the collection cats where amount is greater than 2:

db.cats.remove( { amount: { $gt: 2 } }, true )

Well, I have had a look at the very, very basics. Can I apply for a DOD or MongoDB expert role now? No :) But I can honestly say I now know much more than I knew a month ago, and I definitely would not be afraid if told in my project 'we are going to start using a document oriented database tomorrow'.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.


Big Data: some of universal file formats


All the data – this blog, Facebook messages, comments, LinkedIn articles, anything – has to be stored somewhere somehow. How? It depends (here you can see how a tweet looks in JSON format), but there are some universal formats.

Besides writing my notes here, I am going to prove it is possible to just start and learn. You do not need any servers or installs to learn XML querying – just google for an online XPath tester, online XQuery or online JSON query tool and go, do, test, learn.

Sometimes I see young girls wasting their lives being bored at receptions or in empty shops, sitting at a computer with Solitaire or a gossip page open, and I think – if I were them, I swear I would learn programming online every free minute I have! When I was studying, we had to sit in libraries and subscribe in advance for an hour a day of mainframe access. No excuses nowadays, guys!

XML

This is one of The Formats you should know even if woken up at 3AM, because a lot of Big Data databases store data in XML format. Both XML and JSON (see below) are human-readable and machine-readable plain text file formats.

Database management systems whose internal data model corresponds to XML documents are called native XML DBMS, and they claim to use the full power of XML: they represent hierarchical data and support XML-specific query languages such as XPath, XQuery or XSLT.

NB: Native XML DBMS do not necessarily store data as XML documents, they can use other formats for better efficiency.

Databases which use other data models like relational and are capable of storing XML documents, are called XML-enabled DBMS.

Current ranking of native XML databases: https://db-engines.com/en/ranking/native+xml+dbms

NativeXMLDB_ranking_Oct2017

Lesson learned with self-made XMLs

XML data values have a beginning and an end, and are delimited by tags, you know –

XML example

Many years ago, I was working as a designer of XML files for data exchange. We were young and enchanted by the unlimited power of the any-structure container format, and we used very long tags. Our intentions were good – human-readable plain text for fixed order forms, like <MozzarellaWithFourCheesesPizzaPriceBeforeTaxesTipsNotIncludedCurrencyLVL>1.17</MozzarellaWithFourCheesesPizzaPriceBeforeTaxesTipsNotIncludedCurrencyLVL>.

We had to XML-ize hundreds of documents and do it very fast, so we worked like a factory.

We did it. But… What we got was:

  • Storage space consuming documents
  • Network traffic
  • Quite funny software code parsing those wonder-tags
  • The same business term and tag called in many variations like Pizza, Pica, Picca
  • Grammar errors in tags confusing users like MozcarelaWihtSieru
  • Mixed language and translation errors in tags like PicaArCheese
  • The at-a-glance readability of XML was misleading when a tag became inconsistent with its value
  • Documentation was not consistent, incl. curiosities when writers corrected grammar in tags in Word docs (thinking they were doing great work)
  • Unmaintainable structure – see the example with LVL in tag and ‘four cheeses’ – recipes do change

My lessons learned –

  • short and neutral tags
  • create structure using hierarchy, not tag names
  • include version attribute in the beginning of file
  • follow the same style (we used PascalCase), usually one of:

– Lower case: firstname – all letters lower case

– Upper case: FIRSTNAME – all letters upper case

– Underscore: first_name – underscore separates words

– Pascal case: FirstName – uppercase first letter in each word

– Camel case: firstName – uppercase first letter in each word except the first

Querying XML documents

One might ask – why should we query a plain text file if we can just search in Notepad? Answer: that is easy only on short samples. When you have a lot of data, you will lose track of whether you are looking at the first, second or hundredth value.

XPath language

A declarative "path-like" syntax to identify and navigate nodes in an XML document. Nice web page to play with online: https://www.freeformatter.com/xpath-tester.html NB: tags are case sensitive.

I played a bit there using self-made simple sample FoodCalendar.

It took a while with //CatName[2]/text() until I understood that [2] means the second element within its parent, not the second in the returned list. The correct query for what I wanted – the second cat in the list – was:

(//CatName/text())[2]

All foods eaten more than 2 times:

//*[@Amount>2]

Count cats:

count(//Cat)

All foods containing ‘Chicken’:

//*[contains(@FoodUsed, 'Chicken')]
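To make these queries reproducible without the screenshot, here is a hedged Python sketch: I rebuild a tiny FoodCalendar in code (its structure – Cat elements with a CatName child and FoodUsed/Amount/Date attributes – is my guess based on the queries) and run the same XPath expressions with lxml:

from lxml import etree  # lxml supports full XPath 1.0, unlike the stdlib ElementTree

xml = b"""<FoodCalendar>
  <Cat FoodUsed="ChickenLiver" Amount="5" Date="10-OCT-2017"><CatName>Picadilla</CatName></Cat>
  <Cat FoodUsed="Salmon" Amount="2" Date="10-OCT-2017"><CatName>Murmor</CatName></Cat>
  <Cat FoodUsed="ChickenFillet" Amount="3" Date="10-OCT-2017"><CatName>Fred</CatName></Cat>
</FoodCalendar>"""
tree = etree.fromstring(xml)

print(tree.xpath("(//CatName/text())[2]"))               # ['Murmor'] - second cat in the list
print(tree.xpath("//*[@Amount>2]"))                       # foods eaten more than 2 times
print(tree.xpath("count(//Cat)"))                         # 3.0
print(tree.xpath("//*[contains(@FoodUsed, 'Chicken')]"))  # Picadilla's and Fred's entries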

Extension for XPath is XQuery

It is a much more complex language for extracting and manipulating XML data and transforming it into HTML, CSV, SQL, or any other text-based format. I read some manuals of one of the native XML databases, https://exist-db.org/exist/apps/demo/examples/basic/basics.html, and wrote a very simple query in the online XQuery tester http://videlibri.sourceforge.net/cgi-bin/xidelcgi to find out what Picadilla is eating:

for $i in $catxml//Cat
where $i//CatName="Picadilla"
return ("CAT ", $i//CatName, "EATS", data($i//@FoodUsed), data($i//@Amount), " TIMES A DAY", data($i//@Date))

and the answer was:

CAT Picadilla EATS ChickenLiver 5 TIMES A DAY 10-OCT-2017

XQueryTest

Of course, this language is much more powerful – you can analyze data and write computation functions in it. My goal in playing was to see whether it is possible to learn this querying, and I see it is – just some more time is needed.

JSON

Another must-know is the JSON format, yet another way to store information as plain text in an organized and human-readable manner. Document databases such as MongoDB use JSON documents to store records, just as tables and rows store records in a relational database.

JSON format files can easily be sent to and from a server, and used as a data format by any programming language.

We can store any number of properties for an object in JSON format.

It is shorter than XML; however, they have similarities:

  • Both JSON and XML are “self describing” (human readable)
  • Both JSON and XML are hierarchical (values within values)
  • Both JSON and XML can be parsed and used by lots of programming languages

Differences:

  • JSON is shorter
  • JSON is quicker to read and write
  • JSON can use arrays

and the biggest difference is that XML has to be parsed with an XML parser, while JSON can be parsed by a standard JavaScript function. That helps explain the huge popularity of JSON.

JSONPath

Similarly to XPath, there is JSONPath which is a JSON query language.

I took my FoodCalendar XML and converted to JSON via https://www.freeformatter.com/xml-to-json-converter.html.

and wrote a simple query to filter all foods and dates where amount eaten is < 5

$.Cats..Meals[?(@.Amount<5)].[FoodUsed,Date]

http://jsonpath.com/

JSON_path
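The same filter can also be done without any JSONPath engine – a plain-Python sketch (the nesting of Cats and Meals below is my assumption about what the converter produced, not the actual file):

import json

doc = json.loads("""
{"Cats": {"Cat": [
  {"CatName": "Picadilla", "Meals": [{"FoodUsed": "ChickenLiver", "Amount": 5, "Date": "10-OCT-2017"}]},
  {"CatName": "Murmor",    "Meals": [{"FoodUsed": "Salmon",       "Amount": 2, "Date": "11-OCT-2017"}]}
]}}
""")

# all foods and dates where the amount eaten is < 5
small_meals = [(m["FoodUsed"], m["Date"])
               for cat in doc["Cats"]["Cat"]
               for m in cat["Meals"]
               if m["Amount"] < 5]
print(small_meals)  # [('Salmon', '11-OCT-2017')]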

CSV and FIXED LENGTH FILE – universal formats for tabular data set

At least a basic understanding of these universal tabular (not hierarchical) formats is crucial, because a lot of NoSQL Big Data files are stored in these formats.

CSV (comma-separated values) – the one you should know. It will never die, because of its simplicity.

CSV has become a kind of industry standard despite having no universal specification: a text file with one record per line and fields separated by a comma (or ; or TAB or another symbol). Fields containing the separator are enclosed in double quotes, literal double quote characters are doubled, and so on.

List of advantages is impressive:

  • compact size – write once the column headers and then no more additional tags in data needed
  • human readable,
  • easy for machines to generate and read,
  • widely used for tabular data
  • most applications support it.

Very popular amongst MS Excel users. Used to transfer data between programs, import and export data, sometimes used as a workaround to export, then modify and import back.

Disadvantages:

  • complex (nested) data do not fit well into CSV,
  • poor support for special characters,
  • no datatypes defined (text and numeric are treated the same),
  • when importing into SQL, there is no distinction between NULL and an empty quoted string.

As there is no universal standard, a widely hit issue is newline delimiters – Linux uses one, Windows another, etc.

Example of formatted XLSX file:

Excel_formatted_table

Saved As CSV:

NR,CatName,Diet,Food,Date,Amount,
1,Picadilla,,"Sausage ""The Best""",12/09/2017,1(?),"Peter, can you check, was it really only one??"
2,Murmor,Y,"Chicken,boiled ©",14-Sep-17,0.5,
3,Fred,N,"Salmon,,fresh",15.09.2017,2,

See:

  • the last comma in the first row – one column is without a heading
  • the first record contains a NULL value for 'Diet', quotes in Food and a comma in the comment
  • the second record contains a special symbol which might (heh, will) be lost when importing into other software
  • dates are still in different formats
  • colors lost, formatting lost
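Out of curiosity, a minimal Python sketch of how a CSV parser copes with those embedded commas and doubled quotes (assuming the lines above are saved as cats.csv in UTF-8):

import csv

with open("cats.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)

# "Sausage ""The Best""" comes back as:  Sausage "The Best"
# "Chicken,boiled ©" stays one field despite the comma,
# but the empty Diet cell is just '' - there is no way to tell NULL from an empty string.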

And the same open in Excel again

Cats-csv-Excel1

Double-clicked to expand columns

Cats-csv-Excel2

Fixed-length fields

As the name reveals, it is an agreement that the first column always has exactly X characters (5 or 10 or any other value), the second column exactly Y, the third exactly Z, and so on. The same for all rows.

If the value is shorter, it is padded with spaces or any other specified character. Padding can be done on either side or both.

When you use this format, be ready to do a lot of configuration, like defining each field's length, and even to write your own code.

I converted my cats.CSV to a fixed-length file (http://www.convertcsv.com/csv-to-flat-file.htm). I had to set several configuration options like field lengths and alignment:

NR   CatName   DietFood                     Date                     Amount

1    Picadilla     Sausage "The Best"       12/09/2017               1(?)  Peter, can you check, was it really only one??

2    Murmor    Y   Chicken,boiled �         14-Sep-17                0.5

3    Fred      N   Salmon,,fresh            15.09.2017               2

And after opening in Excel –

Excel-fixed-length-file

As you see, smarter reader software is necessary.
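That "smarter reader" is usually a handful of hand-written slicing code. A minimal Python sketch (the column positions below are the ones I configured in the converter, so treat them as assumptions):

# fixed-length reader: every field is cut out by position, not by a separator
FIELDS = [("NR", 0, 5), ("CatName", 5, 15), ("Diet", 15, 19),
          ("Food", 19, 44), ("Date", 44, 69), ("Amount", 69, 76)]

def parse_fixed(line):
    return {name: line[start:end].strip() for name, start, end in FIELDS}

with open("cats.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            print(parse_fixed(line))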

You might ask – hey, let's add a data definition to the file, and then the format can be read automatically by software. Why not :) it is called

DBF (database table file) format

It is a fixed-length file with its data definition at the beginning of the file in a machine-readable, standardized format. There is no single universal DBF standard; each file contains its own description. It is interesting to know this format exists, but I see no reason to learn the details.

I tried to convert online my CSV to DBF but failed. Then I saved CSV to XLSX and converted XLSX to DBF.

Opening converted result with Notepad:

Cats-DBF-Notepad

Excel:

Cats-DBF-Excel

Opening DBF online (http://www.dbfopener.com/):

Cats-dbf-viewer-online

These experiments were enough to illustrate that even small and simple fixed length files need some time to be converted.

And, as you see, the copyright symbol lives its own life :)

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: enchanted with the idea of graph database power


 

Why are there so many different database management systems? Because each of them is best at supporting something you need.

  • Payments 24/7 and queries like calculating the average price of the TOP 5 best-selling goods? – choose a relational database and enjoy its built-in transaction support and SQL

Relational database: predefined tables where one document is usually split among tables, and an SQL query must be written to join the parts back together into the "document".

  • Building an e-shop shopping cart? – key-value store: retrieve/write cart data by ID

Key-value store: the data value is an unknown black box for the store; it is located by its key and retrieved very fast.

  • Storing blog posts or messages? – document oriented database

Document-oriented store: similar to a KV store, but a document oriented database knows predefined metadata about the internal structure of the document. The document content itself may be anything autonomous – MS Word, XML, JSON, binary etc. – but the database engine uses some structure for organizing documents, providing security, and so on. In contrast to relational databases, a document is stored as a single object. (The next lecture will be about them, so expect a blog post.)

  • Navigating a user from point A to point B? Social networking? – graph database.

Graph database: networking. The database is a collection of nodes and edges. Each node represents an entity (such as a cat, person, product or business) and each edge represents a connection or relationship between two nodes – likes, follows, blocks, …

A NoSQL graph database does not need its schema redefined before adding new data – neither relationships nor the data itself. You can extend the network in any direction – billions of cats, oops, data items. You can add cats, farms, persons, foods, cars; you can add their likes, dislikes, hates, 'reads the same', 'sister of', 'attends the same lunch', 'checked in at the same place' – just anything.

Picadilla (likes) Fred. Murmor (hates) Minko. Picadilla (eats together with) Minko. Fred (meows) at Amber. Murmor (sleep in the same room as) Amber.

You see, as Murmor hates Minko, he had better avoid Picadilla. Amber should also be cautious of Picadilla, as she likes Fred, who meows at Amber, so there is a chance they will both meow at Amber.

Next day you add more observations.

Minko (likes) Murmor. It increases the chance that he will hate Minko and avoid Picadilla.

The more data you have, the better you can trace connections (King of the World). The more paths and path usages you can find – just imagine the power. Graph databases are meant for that. And Facebook… what a set of queries I could write there… mmmmm…

I fell in love when I realised we can query a graph database using a specialized query language.

Who likes Fred? Who are the friends of those who hate Minko? What do Picadilla and Murmor have in common? Which cats can't be in one room? How many cats on average like one cat?

The highest-ranking one is the Neo4j database. Its graph query language is Cypher.

To be honest, the Cypher language is human readable and seems quite easy to learn.

MATCH (cat)-[:likes]->(person) WHERE cat.name = 'Picadilla' RETURN person

I googled samples – https://neo4j.com/developer/cypher-query-language/

Find Someone in your Network Who Can Help You Learn Neo4j

MATCH (you {name:"You"})
MATCH (expert)-[:WORKED_WITH]->(db:Database {name:"Neo4j"})
MATCH path = shortestPath( (you)-[:FRIEND*..5]-(expert) )
RETURN db,expert,path

Cast of movies starting with "T":

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WHERE movie.title STARTS WITH "T"
RETURN movie.title AS title, collect(actor.name) AS cast
ORDER BY title ASC LIMIT 10;

It seems a matter of mindset and syntax if you already have SQL querying skills. I like graph databases.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: learning key-value store basics


Have you ever wondered how Facebook manages 20 million reads per second? Every 60 seconds: 510,000 comments, 293,000 status updates, 136,000 photos uploaded. Imagine – you are one of a billion active users there, and it works as fast as if you were the only one in the whole Universe! HOW?!?! (d'oh, by the time I understand it, many cats will be needed to explain it)

Big Data studies are (slowly) opening a whole new world for me. On my SQL planet, a query (well, quite a complex analytical one, like 'calculate the income within the last 30 years generated by users who have complained and whose complaints were declined but who have still returned as customers and used the service more than average') runs from 15 minutes to an hour.

The key-value concept is a very flexible data storing and retrieving approach used in NoSQL systems. A key-value store stores key-value pairs. Values are identified via a key and accessed via a direct request to the object in memory or on disk (no RDBMS layer, no overhead, no relationships with other pairs, no checks that a value matches a predefined format, etc.).

Being experienced in relational databases, I somehow expected that somewhere there would be a definition like 'key is a number and value is text'. No, no, no. There is no defined schema in a key-value store. There is just a pair of fields – a key and a value. This is called a content-agnostic database – store anything you want, either the data itself or a pointer to the data – from JSON to XML, from HTML to images, from log files to chat history, from books to videos. There is no need to learn a special data query language, and it is easy to move data to another system because of this simplicity.

There are some basic KV concepts:

  • You can find the value only by its key – and find it very fast (when a system provides fast access to data operations, it is called a low-latency system – you might see this term when googling KV). Key structure example: userID_messageID
  • The values are opaque (they are not self-descriptive); the key-value store does not know anything about the values (like a courier delivering something to your address does not know the content). The application (not the key-value store, but the application – like the Instagram frontend) has complete control over operations based on the value content – just as you might decide to open the package with a knife or a sword, or maybe it is a flower bouquet, or you throw the content away.
  • You always retrieve the full value from the key-value store. You cannot filter or control the value returned. What you do with this value is your application's business. Of course, the application can read the binary content, decode it to text and then filter content out of it.
  • Data access is performed by "simple" get, put and delete commands to the key-value store (a tiny sketch follows this list). As all the other operations are performed by applications, the key-value store in general gets one request to read and one request to write: a read when the application retrieves the value and a write when the user's changes are done in your application – e.g., a cart in an e-shop – operations are not written to disk in the key-value store each time you add a product or modify the desired amount – the data resides in the application cache.
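A tiny Python sketch of that shape of work, with a plain dict standing in for the store (a real key-value store such as Redis exposes the same get/put/delete style of interface, just over the network; the key naming here is my own invention):

import json

store = {}  # stand-in for the key-value store: it never looks inside the values

def put(key, value_bytes):
    store[key] = value_bytes          # the store sees only opaque bytes

def get(key):
    return store.get(key)

# the application decides what the bytes mean - here it chose JSON
put("user42_cart", json.dumps({"items": ["catfood", "toy mouse"], "qty": [10, 1]}).encode())

cart = json.loads(get("user42_cart"))            # always the full value back
cart["items"].append("scratching post")          # all editing happens in the application
put("user42_cart", json.dumps(cart).encode())    # one write when the user is done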

Cat Passport Key Value store

Hmm. What should I use for the key and the value? Relational database modelling is descriptive: I know what objects I have in the business and build a model – table CATS (CatID NUMBER, Name VARCHAR2(30), …), table GENDERS, table BREEDS.

In a key-value store it is crucial that the key be usable for retrieving data. I must think 'what do I want to know' instead of the relational approach 'describe what I know'.

Design of keys and values 

I could choose to store my cat data anyhow I want (remember, the value will be stored as binary 0028 e8e9 00fc 5cbb… and only the application decodes it to text after retrieving it from the store):

Key: Picadilla

Value: Born:20150921;Breed:Burma;Color:#FFBF00;Gender:1;Photo:IMG_20170925_121432.JPG

And I can decide to use the date of birth as the key and store in another format:

Key: 20150921

Value:(Name)Picadilla(/)(Breed)Burma(/)(Color)amber(/)(Gender)Female(/)(Photo)IMG_20170925_121432.JPG(/)

And I can decide to use the breed as the key and store in yet another format:

Key: Burma

Value:N=Picadilla|B=21-SEP-2015|C=255,191,0|G=F

And I can decide to use a randomly generated number as the key

Key:6756970977876576789097

Or URL

Key:https://mysite.cats.cc/thebestcats/Picadilla.htm

I can build my software to read these attributes and show the name in large red, the breed in blue, etc. I can add the latest vaccination date or free-form comments and save it in the key-value store. If I add a vaccination to Picadilla, there are no checks whether I also add the vaccination date to Murmor. Freedom of values.

Often the JSON format is used for describing the value structure – I will tell a bit about it someday.

P.S. Of course, at Facebook etc. there will never be a key like 'VitaKarnite' in the KV store. They use keys that are either random unique identifiers or calculated by hash functions from something like UserID and Username.

I believe any large photo storing system stores photos as key-value pairs. I am pretty sure these pairs do not contain the user name or any other additional, potentially changing data. I could assume that the upload time might be stored inside the value. I think there is a kind of combination where tables store metadata like user name, surname, last login, relationships to friends. There might also be a table UserPhotos storing UserID and a unique autogenerated PhotoID, and then the key-value pair would be the PhotoID and a pointer to the file location. When a new photo is uploaded, a new metadata record (UserID plus PhotoID) is generated and a key-value pair is added to the key-value store.

I had a look at the most popular key-value databases and googled use cases:

  • Nearly everywhere you will find this example – a cart in an e-shop
  • User preference and profile stores
  • Product recommendations; latest items viewed on a retailer website drive future customer product recommendations
  • Ad servicing; customer shopping habits result in customized ads, coupons, etc. for each customer in real-time

That's enough for today, even though I am behind the learning schedule, still have graph databases to learn, and tomorrow is the next lecture already.

P.S. Instead of counting sheep, I design approaches for CandyCrush – how I would store the board layout, record moves, build controls, define attention-loss detection logic – you know, when you are losing patience, suddenly there are 5 colour bombs. I am also figuring out user-characteristic metrics from their CandyCrush history. I am sure I could pull out a lot of interesting things :) and I am even more sure a lot of companies already do. I'd eat my hat if banks do not profile you by the style in which you use your internet bank.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: domesticating the MapReduce wildcat


MapReduce – a simple yet extremely powerful technique, or even paradigm. The concept was initially introduced by Google and is nowadays widely used in Big Data systems, e.g., Apache Hadoop.

To be honest, it took some time for me to understand, despite well-designed study materials, because when you start thinking you have so many what-if questions. The hardest nut was data cleaning, because the samples I googled use unbelievably clean data like 'word count in a book'. Guys, please show me a book where w0rdzz a l1k3 ^|^hee$e and I'll be happy to see your usable results. It is as if they showed you only paper plane folding and nothing more, while you are hoping to learn about a rocket ship. Or at least its paper model.

That's what this blog is for – I document my learning journey.

Back to MapReduce now. It does not change the original data in any way. It performs only logical operations and calculations on very large amounts of data stored on many different servers – operations that answer user questions like:

  • which is the longest word in these 10 books?
  • which is the most popular word in the 100 thickest books in this bookshop?
  • what percentage of words contains both r and n in the city library?
  • how many times vowels are used more than consonants in Indo-European languages?
  • which is the longest word sequence common between Twilight Saga and Harry Potter?
  • which of my, Alex's and Max's common friends have commented on my public posts about cats more than they comment on posts about cats on average?
  • for each country, what is the proportion of its Twitter users tweeting about their own national football team vs. that of a direct neighbour country?

I'll note – do all that on terabytes and petabytes of data, and do it fast. Put your Big Data hat on. You could count the words in one medium-size book alone, but your life would be too short to count them in a library, or ten libraries.

Ahh, books… You know I am limited to cats, so we have a Cat Farm instead. We are running our farm site, a cat souvenir online shop, cat shows, forums and charities, thousands of photos, and people liking and commenting on them.

Brief background of Cat Farm site

When we started 10 years ago, it was a small hobby with articles running on one SQL server and plain text comments, each limited to 254 characters. When we posted more articles, introduced forums and users started commenting more, we added a new disk to the server and altered the tablespace to add a datafile. That was scaling up.

After Madonna visited our Cat Farm 5 years ago, site popularity boomed, and our server and SQL could not handle the social networking features we wanted to introduce. We moved away from the RDBMS to the Big Cat Big Data System (BCBDS) with the Hadoop file system (NB: according to the study plan, a lecture about it is planned later) and the MapReduce framework.

There are 3 nodes in our BCBDS (3 because Alex, Max and I had 3 quite powerful computers we were no longer using in the smartphone era). Each node now stores several gigabytes of user comments. As the user count grows and more comments flow in, we are planning to add two more nodes (medium-class servers) soon to parallelise and improve performance. That will be scaling out.

MapReduce Cat Example

The local newspaper has an anniversary and is looking for cute stuff to write about. Of course, cats. They are curious:

*) which of our cats is discussed the most in our Cat Farm page comments.

*) are tri-color cats discussed more than white or foxy?

Input

Let’s face reality: our page comments are like

Mango u soo butiful;;this cat looks so fat and ugly like my cousin Fred;;ahh I love Amber;; hi my name is FRED and I am 5 years old;;what a sweety your Piccy is;;mmm this sweet-kitt am-ber looks like real amber;;is his name Mingo or Minko?;;folks there feed poorMurmor;;I wanna adopt Amber;;soo cute FrEddy;;OMG OMG murrmorr my love he looks like my Angus;;could you please shut up with your cats

Cat Farm Analyst or Programmer: Defines map function

First, we must define which keywords – in this case, names – we are looking for in the map. These rules will be distributed to each node as the map function to be applied to its stored data – in our case, a lot of text files.

We know the names of our cats, so in this example we do not have to write complex name-recognition logic to decide whether a site comment is related to our question.

  • We set a filter: process only words Mango,Fred,Amber,Picadilla,Minko,Murrmor
  • We add a condition to ignore letter case (Fred=FRED=FrEd etc)

To improve results, we have a quick look to find some typical typos.

For a long time I was in doubt: which filtering and transforming of cat names should be part of the Map function, and which of the Reduce function? I kept reading and was upset – why do internet people use such perfect samples? Then I found the article 'Top 50 MapReduce job interview questions and answers' and voila! Map: "… in which we specify all the complex logic/business rules/costly code."

  • We add non-letters skipping (amb-er=amber)
  • We add double-letters skipping (Piccadilla=Picadilla)

We discussed the option of not counting 'amber' but only 'Amber', and also maybe cutting off 'cousin Fred', but found it too time consuming for a local newspaper request.

[Magic] Here we use the MapReduce system manual to code “Dear Node, please, for each word found satisfying these conditions return me the key-value pair: <word,count>” I’ll skip real code now because my practical skills are not good enough yet.[/Magic]
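Still, here is a plain-Python sketch of the logic I have in mind for the map function – not real Hadoop code, and the normalised-spelling table is my own assumption about how the filter could be encoded:

import re

# normalised spelling -> canonical key to emit (my assumption)
CANONICAL = {"mango": "mango", "fred": "fred", "amber": "amber",
             "picadila": "picadilla", "minko": "minko", "murmor": "murmor"}

def normalise(word):
    word = re.sub(r"[^a-z]", "", word.lower())   # ignore case and non-letters: 'am-ber' -> 'amber'
    return re.sub(r"(.)\1+", r"\1", word)        # collapse doubled letters: 'Piccadilla' -> 'picadila'

def map_comments(chunk):
    """Yield a (cat name, 1) pair for every recognised mention in this node's chunk of comments."""
    for raw in chunk.replace(";;", " ").split():
        key = CANONICAL.get(normalise(raw))
        if key:
            yield (key, 1)

print(list(map_comments("Mango u soo butiful;;mmm this am-ber looks like real amber;;soo cute FrEd")))
# [('mango', 1), ('amber', 1), ('amber', 1), ('fred', 1)]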

Map function is now defined. Each node will calculate zero or more key-value pairs.

Cat Farm Analyst or Programmer: Define Reduce function

The Reducer function performs light-weight processing like aggregation/summation to get the desired output from the key-value pairs each node has calculated. We will use two reducers: one for the cat name counts and the other for the colour counts. I believe this reduction could also be done within one function, but that is beyond my skills yet.

Function for Cat names: group by key and sum values.

Function for colours:

  • If key is ‘fred’ or ‘amber’ then add value to ‘foxy’ counter
  • If key is ‘mango’ or ‘minko’ or ‘murrmor’ then add value to ‘tri-color’ counter
  • If key is ‘picadilla’ then add value to ‘white’ counter,

Both reduce functions return key-value pairs: <word,count>. In a real system this result might then be processed by user interface software – for example, the first letter turned to a capital and the result shown in a large coloured font in the centre of the screen.
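A plain-Python sketch of both reducers (again just the logic, not framework code; the colour mapping encodes the rules from the list above, and the sample input is the combined pairs from the three nodes described further below):

from collections import defaultdict

def reduce_names(pairs):
    """Group by cat name and sum the counts arriving from the mappers/combiners."""
    totals = defaultdict(int)
    for name, count in pairs:
        totals[name] += count
    return dict(totals)

COLOUR_OF = {"fred": "foxy", "amber": "foxy",
             "mango": "tri-color", "minko": "tri-color", "murmor": "tri-color",
             "picadilla": "white"}

def reduce_colours(pairs):
    """Fold the per-name counts into per-colour counters."""
    totals = defaultdict(int)
    for name, count in pairs:
        totals[COLOUR_OF[name]] += count
    return dict(totals)

combined = [("mango", 117), ("fred", 568), ("amber", 344),
            ("picadilla", 7), ("amber", 768), ("minko", 93),
            ("murmor", 76), ("amber", 7), ("fred", 701)]
print(reduce_names(combined))    # {'mango': 117, 'fred': 1269, 'amber': 1119, 'picadilla': 7, 'minko': 93, 'murmor': 76}
print(reduce_colours(combined))  # {'tri-color': 286, 'foxy': 2388, 'white': 7}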

To digress a bit: the map and reduce function logic really depends on our needs and on what we treat as the key. Currently the key is the cat name because that is what we are looking for. But if we were looking for the time of day when most comments come in, then the key might be the minute and the value the comment count within that minute: (1,45), (2,49), …, (1440,34). The Reduce function might then be defined to group by hours.

The Reduce functions are defined. We are eager for the results. Lights... Camera... Go!

Framework: Split phase

The MapReduce framework distributes our mapping function among the nodes – the mappers. Usually there is an orchestrator node set up by the framework's software processes.

NB: in our BCBDS the nodes also store backup copies of other nodes' data (the Cat Farm is afraid of page comments being unavailable or even lost). The MapReduce framework's built-in logic automatically takes care that data is not double (triple, …) mapped and counted, which is why I do not write about that here.

Nodes: Map phase

Each node, based on the same map function, maps its stored data to key-value pairs. All nodes do that in parallel.

Node1 scans its gigabytes of comments in txt files and does the mapping:

(fred,1)

(mango,1)

(fred,1)

(fred,1)

(amber,1)

Also, Node2 and Node3 perform the same with their stored data.

Nodes: Combine phase (sometimes called mini-reducer)

It would be a waste of time and traffic if all the "single" pairs were sent as input to the reduce function. So it is reasonable to do basic aggregation on the nodes. The node – as mapper – combines pairs with the same key, and the result is:

(mango,117)

(fred,568)

(amber,344)

Node2 scans its stored data, and the result after the map and combine phases is:

(picadilla,7)

(amber,768)

(minko,93)

Node3 scans its stored data, and the result after the map and combine phases is:

(murmor,76)

(amber,7)

(fred,701)

Framework: Shuffle and sort phase

During shuffling the data is transferred from the map function to the reduce function. Otherwise the final output is not possible, as there is no data – we cannot group and sum cat names if no names are provided.

I initially assumed the nodes send their result data over to some central processor which then sends the data back – and I am still trying to understand this paradigm: no, they don't. This is the beauty of distributed computing frameworks – their processes orchestrate the flow, with internal algorithms deciding how the combined key-value pairs are distributed over the nodes that perform the reduce function (e.g., it must not happen that the reduce function misses the cat name counts from some nodes). We will have guest lecturers later – real system architects – and I hope they will openly share details.

The MapReduce framework's built-in logic (framework software processes on the orchestrator node) does the shuffling and sorting of the whole result set. It splits the results from the nodes (previous role – mappers) across the nodes (current role – reducers) to calculate the outcomes (to perform the Reduce function).

Reduce phase for Cat names count

The orchestrator arranges that Node1 will reduce the key-value pairs:

(amber,344)

(amber,7)

(amber,768)

Node2 will

(fred,568)

(fred,701)

(mango,117)

Node3 will

(minko,93)

(murmor,76)

(picadilla,7)

Note: all the ambers are on one reducer node, all the freds on another. I do not yet know what happens if one group is disproportionately large for a single reducer.

Note: a careful analyst might wonder why Picadilla is so unpopular. Because in comments people often write Piccy or Pica or Picady, and that was not noticed when defining the Map function. Yeah, keyword tuning is a real challenge with an uncleaned source like comments. Remember, this is not a traditional RDBMS or data warehouse where we are used to strict data validation and cleaning rules at the entry point or by a regular process. This is the BIG DATA world – data just flows in.

Reduce phase for colours count

I assume the shuffling here will be done differently, as conditions are applied to the nodes' output. I am still learning this.

Final output

Final output of Reduce for colours is

(foxy,2388)

(tri-color,286)

(white,7)

The output of Reduce for cat names is:

(amber,1119)

(fred,1269)

(mango,117)

(minko,93)

(murmor,76)

(picadilla,7)

We copy the data to Excel, remove the brackets, capitalize the first letters and send it to the newspaper. And soon we are reading an article about our Farm with several fun facts:

  • people love foxy cats more than 340 times as much as white ones,
  • the most popular cat name in this Farm is Fred.

Folks, be careful with the statistics you provide. Somebody might take them seriously.

NB in conclusion

As we are talking about exa-, pexa-, schmexa-bytes and parallelisation among several nodes by the MapReduce framework, the natural question is: how do we balance the nodes' load? It would not be OK if one node received a million words starting with 'A','B',…,'W' to calculate and a second node ten words starting with 'Z', because there would be a delay waiting for Node1's results due to the unbalanced load.

You'll also ask – well, how do we decide how many nodes should be used for the Map and Reduce phase calculations, and how should the key-value pairs be distributed in a balanced way?

Hehe, that's what MapReduce framework developers and architects are paid for :) Our business here is to define: our key will be the cat name and the value will be its count, this is our Map function, this is our Reduce function. The framework's internal business is how to distribute these functions to nodes, how to split, shuffle and balance the reducing. There is no single global standard for that.

Some developers keep the internal logic a commercial secret, some proudly publish whitepapers to show their system's strengths. There are rumours some have patented their approach, while others are open source.

Thank you for your patience. I hope you also have more questions now :) Key-value databases and graph databases are in my learning queue now.

P.S. My application for the cat auto-feeding practice project was accepted. Have I developed it? No. Do I have a clue how to? No. Am I afraid? No. Will I be able to do it? I hope so.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: CAT, oops, CAP theorem. And ACID and BASE transaction basics, also full of cats


Spoiler alert: a lot of cats today.

Picadilla

Warm-up intro while the cats are approaching: to enable working with large datasets, computers are connected into a distributed system as nodes that share data. Data records are replicated across nodes to keep the system up.

It is always a business owner's decision what to do when one or more nodes lose connection to the distributed system. Shall the whole system stop, or shall those nodes still operate if they are able to respond?

Examples

A YouTube node storing a copy of your video loses connection to the others. You watch your video and the statistics show 100 views. 10 minutes later you watch the same video and there are 242,286 views (or vice versa). Oops (either synchronization happened or you are now watching the video on a different node). But would you feel happier with no video available at all? YouTube has chosen availability over view count consistency.

Despacito

Video used Despacito laukos (Latvian parody)

Another example: one of the nodes in a ticket sale system loses connection. You connect and see 4 free seats. You press [Book] and see 'Dear customers, please come back later, apologies'. You get upset and press refresh for some hours until the site recovers; however, there are no more seats. They have chosen consistency over availability.

Imagine you had bought tickets and come to the event – whoops – there is another guy with the same tickets. However – I must note many companies have calculated that it is much cheaper to apologize and give gift cards or pay penalties than to stop the whole business.

Why can't they, having nearly unlimited money, just do everything ideally?

Cats proudly present: the CAP theorem

You have one cat and you feed it. Single processor, single INPUT/OUTPUT.

Murmor

Then the era of Big Cats comes and you have three cats: Fred, Murmor and Picadilla. You feed them, write down in your notes which cat was fed and when, and live happily ever after.

One day you get sick and your cats are hungry. A single point of failure has happened.

Distributed System introduced

You ask your spouse Alex and your child Max to get involved. Now you are a distributed system with three nodes. They do the same as you: when they see a hungry cat, they feed it and write down in their notebooks that it was fed. (You will ask, why not on a common whiteboard? Because we are talking about the CAP theorem, which applies to distributed systems, and I must pretend that not all the feeding data can fit on one whiteboard.)

Some days later you start noticing that Murmor seems fat. You check the notes and find out that each of you has been feeding Murmor several times a day, as this hell boy was constantly pretending to be hungry. The same day you are notified that the Cats Care Government Agency will do regular audits.

You call a family meeting and discuss the issue: your data is not consistent, the cats are having a never-ending party, and the Agency is a threat.

Consistency

You decide: before any of you feeds any cat, you call the others and stay on the phone while each writes it in their notes. Thus, everyone will always have the latest feeding time in their notes. The cats are biting your leg, yelling, pretending to faint and sitting on your neck, but you are happy – because Consistency is now solved, you all have the same data. Calls from the Cats Care Government Agency to examine you are highly welcome.

Everything is just perfect – the mobile network is fine, Alex and Max always pick up the phone, the pens write well and the notebooks have enough blank pages.

One day Alex leaves for an expedition to the jungle. When you call, Alex deeply regrets having forgotten the notes at home. As you have agreed that data consistency is a must-have, it means that day you cannot feed the cats, because you and Max would update your notes but Alex would not. Just imagine the horror: the Cats Care Government Agency might call you and then Alex to ask for the latest feeding date, and come to save the cats by taking them away from these shameless liars. You (heh – the cats) have faced the Availability issue. They are not fed at all now.

Availability

When Alex returns to her notes, you call a family meeting and decide: if any of you cannot take notes, the others still feed the cats, update their own notes and leave red post-its for the rest. When the others return home, they copy all the post-its into their notes. Voilà, now you have Availability.

You accept the risk that if the Cats Care Government Agency calls, the latest feeding data might not be the very latest, but any of you can still share history statistics – which food you used, how often feeding happened, etc. – based on your notes.

So the CAP theorem postulates: when one of you has left the notes at home (a partition occurs in your distributed system – in CAP theorem terminology, a network partition has happened), you can choose:

  • Either you all guarantee to have the latest feeding date in your notes (and do not care that the cats are hungry and waiting) – Consistency
  • Or you feed the cats according to your notes (and do not care if the beasts are overfed or the Agency might get old data) – Availability

Isn't it obvious that, during a partition, we can't have both Consistency and Availability at the same time?

Let's exploit the cats to explain two very famous concepts.

ACID transactions – a pessimistic approach which forces consistency. The ideal world for data-critical systems like banking (massive data quality checking, a lot of built-ins for transaction control, etc. – my native RDBMS world):

  • Atomic: all tasks within a transaction succeed, or every task is rolled back. If Max does not succeed in writing the notes, then you and Alex also erase the date from your notes and return the food to the fridge. The cats go crazy.
  • Consistent: on completion of the transaction the database is structurally sound. The notes are up to date without any punctuation errors, and all cats have eaten exactly the food written in the notes. No half-eaten chicken left.
  • Isolated: transactions run as if sequentially. There is no chance that you and Max are both feeding Picadilla while Fred eats Murmor's fish.
  • Durable: once a transaction is complete, it cannot be undone, even in the presence of failure. When the food is eaten and suddenly the light goes off or Max steps on Fred's tail, the food does not reappear in the bowl, and you cannot just decide to add a delicacy for Picadilla – because the transaction is over.

If you had enough patience to read this far, you might notice that with this level of checks you just cannot operate on petabytes. It is like hoping to cut down a forest with a surgical scalpel.

The Big Data world uses BASE transactions – an optimistic approach accepting that the database state is in a state of flux (much looser than ACID, but much more scalable and Big Data friendly):

  • Basic Availability: the system appears to work most of the time. Either you or Max will always hang around near the fridge, so the cats have a chance to be fed, even if Alex is in the jungle
  • Soft state: there is no need for different nodes to be consistent all the time. You feed Picadilla, leave a post-it for Max and don't care when Max updates the notes
  • Eventual consistency: consistency is achieved lazily, later. Some day Alex returns from the jungle and writes all the dates from the post-its into the notes, so for some time you will all actually have the same feeding dates in your notes.

Thank you all for your patience! Tomorrow is the deadline to apply for the semester-end practice, and I am going to draft and submit a cat auto-feeding system proposal.

Big Data: with respect to NoSQL Zoo


Relational databases have many advantages, basically because of the completely structured way of storing data within a fundamental structure – the easily understood table. But! (c) Despite the existence and advantages of RDBMS, Google built Bigtable, Amazon developed Amazon DynamoDB, and the NSA built Accumulo, in part using Bigtable as inspiration. Facebook built Cassandra, Powerset built HBase, LinkedIn built Voldemort, etc.

Currently there are >225 different databases – see http://nosql-database.org/ – "Your Ultimate Guide to the Non-Relational Universe".

It wouldn't have happened if the ultra-popular and well-established relational databases had had all the capabilities these brands were looking for, would it?

RDBMS are still at the peak of the wave because wide, solid, well-grounded usage over many years, in combination with a strong scientific basis, financial capabilities and a lot of lessons learned, has led to vendors investing resources over decades:

  • to improve and polish built-in locking and transaction management,
  • to prevent collisions between multiple users updating the data,
  • to provide highly customizable data access control solutions,
  • to expand SQL capabilities (I'll note that outside of core SQL there are very many nuances when querying different vendors' databases with SQL; I'll show some samples someday),
  • to offer a lot of metadata (data about data) and utilities (many of them seldom used).

RDBMS will always be alive and in use; that is definitely not a concern. Let's look at a simple example to have some fun and also increase our respect for them. We have:

A = 5 (it could be your account balance or a product count in a store)

B = 12

Two operation sets are in the queue to be processed. Maybe Anna and Peter have both pressed the [Apply] button:

Operation set O1

C:=A+B

Write A:=C

Operation set O2

C:=A+B

Write B:=C

Let’s model what may happen:

Table-AB

Which answer is correct? Which is not correct? WHY?

The only answer is: first in, first served. They are both correct. Welcome to the world we live in :) If developers do not build any other means to prioritize these operations, they rely on the RDBMS's internal built-ins for data consistency through serializing, locking and other means (heh, and one of a developer's support tasks is to be able to track down and explain to end users why the result is 29 or 22).
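To see where 29 and 22 come from, here is a tiny Python sketch of the two possible serial orders (it models only the arithmetic, not the locking):

def o1(a, b):      # C := A + B; Write A := C
    return a + b, b

def o2(a, b):      # C := A + B; Write B := C
    return a, a + b

a, b = o2(*o1(5, 12))
print(a, b)        # O1 first, then O2: A = 17, B = 29

a, b = o1(*o2(5, 12))
print(a, b)        # O2 first, then O1: A = 22, B = 17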

For some more entertainment, let's do the same without transactions and enjoy yet another result:

Table-AB-wt

You see, someone must implement that background logic: serialising, prioritising, handling network failures, concurrent transactions and other mega-stuff. There is an enormous number of built-in RDBMS features, and among them are both crucial ones and nice-to-have overhead ones, which reminds me of a Swiss knife like this:

Swiss-knife

No one would dare to say it lacks features. Its marketers will also tell you that this knife has a solution for nearly every situation. And actually they are right, aren't they?

Would YOU dare to say there are too few functions in it? Would YOU recommend it as the way to go for a restaurant chef, a manicure or a car repair?

Of course, I can imagine you saying YES – when there is no other knife at all, or the other option is a spade, or if this is the only tool you have seen in your life.

If you were Google or Amazon or Facebook, you would actually believe there are other ways. Because otherwise you would choke and die, drowning in your data and watching your customers run away.

You then need to deal with consistency, serialising, scaling, etc. Everything. Imagine if you had to program the game of chess in order to play it. You sit and think: well… should I start with designing the knight, or with thinking about how to stay within the chessboard after a knight's move?

It is a grave decision, and a serious amount of work and issues to be solved, when designing your own system. This is no longer installing Oracle and writing 'update emp set mgr_id=17'. This is a task for many person-years. This happens in parallel with existing systems and a growing business; it is eagerly expected and pushed by management, and it must be fast, correct, stable, expandable and a thousand other must-bes.

In 2004 Google began developing their internal data storage system Bigtable, searching for cheap, distributed ways to store and query data at massive scale – petabytes of data and thousands of machines – using a shared-nothing architecture and two different data structures: one for recent writes and one for storing long-lived data, with a mechanism for moving data from one form to the other.

Around the same time, Amazon's business was growing, and direct database access was one of the major bottlenecks. They developed and released Amazon DynamoDB as the result of 15 years of learning, implementing a database that can store and retrieve any amount of data and serve any level of request traffic.

My deepest respect to all the developers all over the world!

NoSQL

There is no such thing as the one and only 'NoSQL' database – no single vendor, server, book or silver bullet. As I wrote, there are currently >225 different databases.

Examples of basic classification by data model:

Popularity and trends

To get an overall understanding of the as-is situation and trends, we may have a look at sites where database popularity is measured. As an example, https://db-engines.com/en/ranking_definition was mentioned. You can see they measure by:

  • Number of results in search engines queries
  • Frequency of searches in Google Trends.
  • Number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
  • Number of job offers, in which the system is mentioned
  • Number of profiles in LinkedIn and Upwork, in which the system is mentioned
  • Twitter tweets, in which the system is mentioned

Actual ranking

See here: https://db-engines.com/en/ranking

Overall_ranking_sep2017

Of course, Oracle leads there – it has been on the market for years and still has a broad range of usage, serving both as a perfect fit and as a pain to move away from.

It is also fun to query by type.

Key-value https://db-engines.com/en/ranking/key-value+store

Key-value-ranking-Sep2017

Document oriented https://db-engines.com/en/ranking/document+store

Document-ranking-Sep2017

and also a lot of other reports like https://db-engines.com/en/ranking_categories

Categories-ranking-Sep2017

To my great pleasure, the plan of the 'Data processing systems' lectures reveals that we will have separate sessions about nearly each of the most popular approaches. Can't wait!

The next blog entry will hopefully be about the CAP theorem and ACID and BASE transactions. Fingers crossed for weekend time for blogging.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.
