Big Data: HADOOP ecosystem. There is no one ‘hadoop.zip’ to install


When I was a kid, there was a saying: we say 'the Party' and mean Lenin, we say 'Lenin' and mean the Party. Today we say Big Data and mean HADOOP, which is presumed to process about 50% of enterprise data around the world – clusters of hundreds to thousands of nodes and petabytes of data are nothing special.

In one of the training videos I heard about a HADOOP cluster of 8500 computers and 100 petabytes.

HADOOP is everywhere: social media, search tools, government, finance etc. Yahoo, Facebook, Amazon, eBay, IBM, American Airlines etc. P.S. This also means a very high volume of job postings. A HADOOP admin earns about 90k – 120k USD/year, a developer/data scientist 60k – 150k USD/year.

There is no single "hadoop.zip" which you can download, install and happily have HADOOP.

In one sentence: the HADOOP framework is a collection of open source, Linux based software that lets a big data flow be split and processed locally, using parallel, distributed (map-reduce) algorithms across many (one to thousands of) low-cost computers, with an application layer (HDFS) handling hardware faults. HADOOP started its rise around 2003.

Often referred to as the 'HADOOP ecosystem', 'HADOOP umbrella' or 'HADOOP cluster', it addresses:

  • lots of data flowing in at a very high speed,
  • large and increasing volume,
  • variety of unstructured, different data – logfiles, videos, messages, records, comments, chats
  • if you lose a computer, it automatically rearranges and replicates the data – no data loss

HADOOP supports running applications on Big Data, like a social networking portal, a recipes portal or retail trend analytics.

It took me a while to understand that Hadoop is like a batch data processing factory in the back office – a set of tools, each strong in a particular area. Outside HADOOP there are front-office applications which need something done with data – queried, stored etc., e.g. a search feature, 'TOP 10 most sold sci-fi books this week', or retrieving a Facebook message. These applications send their work to HADOOP as jobs to be queued and performed by map-reduce, and the results are sent back to the calling applications. This is done on data stored in the HDFS file system.

Examples of applications where HADOOP is behind the scenes:

  • Mining user behaviour to generate targeted advertisements and recommendations (Apache Mahout)
  • Searching and grouping documents, based on certain criteria
  • Searching uncommon patterns to detect fraud
  • And, of course, Facebook, which runs the world's largest Hadoop cluster. Your messages, likes, shares and comments are there, in Facebook data centers and HADOOP. When their previous messaging platform was hitting its limits, they spent weeks testing different frameworks, evaluating clusters of MySQL, Apache Cassandra, Apache HBase and other systems. Facebook selected Apache HBase, one of the HADOOP family.

Programmers love HADOOP because they do not have to worry about where the data is, what happens if a computer fails, how to split big data or how to break down tasks. If the code works for a megabyte file, it will work for hundreds of gigabytes. If it works on one computer, it will work on 8500 computers. That is HADOOP's business – scalability.

Teamwork. No superhero.

With the classical enterprise mindset these data would be handled by one supercomputer. But everything has its limits, even supercomputers, be it processing speed or disk space.

HADOOP splits and conquers, combining the "weakness" of each separate computer (low price, so-called commodity hardware) into a powerful engine. You can add more and more computers (scalability) as your data grow or requirements change. The bright side of HADOOP scalability is that processing speed is linear in the number of computers: to double the processing speed, double the number of computers.

  • The data processing part of HADOOP is map-reduce (a toy sketch of the idea follows below).
  • The data storage part is HDFS (Hadoop Distributed File System).
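
Not real Hadoop code, just a toy Python sketch of the map-reduce idea: split the input, map each chunk independently (as the slave nodes would), then gather ("shuffle") the intermediate pairs and reduce them into one result.

from collections import defaultdict

# Toy illustration of the map-reduce idea (not real Hadoop code):
# count words in a "big" text split across several chunks,
# as if each chunk lived on a different slave node.

def map_phase(chunk):
    # each mapper emits (word, 1) pairs for its own chunk only
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    # the reducer sums the counts emitted by all mappers
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = [
    "Picadilla eats chicken liver",
    "Murmor eats chicken too",
    "Fred sleeps",
]

# "shuffle": gather all mapper outputs before reducing
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))   # {'picadilla': 1, 'eats': 2, 'chicken': 2, ...}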

Slaves

Computers incorporated into the HADOOP framework are called slaves. Each has these processes running on it:

  • Task Tracker: responsible for performing the piece of a task assigned to this computer
  • Data Node: responsible for managing the piece of data given to this computer

Master (or masters)

As the word slaves might have made you think, they have a master. NB: the master computer is not a supercomputer, it is the same commodity hardware as the others. This node has the same processes running as each slave – task tracker and data node – plus two additional processes:

  • Job Tracker: breaks tasks into smaller pieces, sends them to task tracker processes, receives the results, combines them and sends the result back to the calling application
  • Name Node: keeps an index of which data reside on which node (computer). It tells the calling application which node stores the required data, and the application contacts that node directly; it does not depend on the name node to return the dataset. The data flow itself never goes through the Name Node – it only points to the node storing the data.

Task Tracker and Job Tracker are the puzzle pieces of the Map Reduce component.

Name Node and Data Node are the puzzle pieces of HDFS.

HDFS

HDFS software is built on the assumption that hardware failures will definitely happen. This is called built-in fault tolerance.

By default, HADOOP maintains three copies of each data file on different computers. If a node fails, the system keeps running, and when the node is later restored HADOOP spreads the data copies back across the cluster.

Interestingly, fault tolerance applies not only to classic disk failures but also to failures of task tracker processes. If that happens, the Job Tracker asks another node to perform the job.
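
A toy sketch of the re-replication idea in plain Python (not HDFS internals): keep three copies of every block and, when a node dies, copy its blocks onto the surviving nodes.

import random

REPLICATION = 3
nodes = {f"node{i}": set() for i in range(1, 6)}   # five imaginary slave nodes

def store_block(block_id):
    # place REPLICATION copies of the block on distinct nodes
    for node in random.sample(sorted(nodes), REPLICATION):
        nodes[node].add(block_id)

def fail_node(dead):
    # copies that lived on the dead node are re-created elsewhere
    lost_blocks = nodes.pop(dead)
    for block_id in lost_blocks:
        holders = [n for n in nodes if block_id in nodes[n]]
        candidates = [n for n in nodes if block_id not in nodes[n]]
        missing = REPLICATION - len(holders)
        for node in random.sample(candidates, missing):
            nodes[node].add(block_id)

for b in ("block-1", "block-2", "block-3"):
    store_block(b)
fail_node("node1")
print(nodes)   # every block still has three copies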

Does it sound like the master computer is a single point of failure now?

HADOOP also has a solution for master failures. The tables where masters store the data indexes (which data files are on which nodes) are also backed up to different computers. HADOOP can be set up with more than one master, where a backup master takes over if the main master fails.

When trying to understand this universe you feel like you are in The Bold and the Beautiful, where everyone has married everyone else several times and has several backup wives :D

Some of the tools within the HADOOP ecosystem – be aware, the list is growing

Apache HIVE: data warehouse – query, analysis, summaries. It takes an SQL-like language (HiveQL) and converts it into Map-Reduce jobs. Unlike SQL, querying HIVE always inserts the results into a table.

Facebook reportedly uses Hive for up to 90% of its computations.

I wrote a HiveQL query which finds all Picadilla page views in October 2017 referred by my cat farm's main competitor, BestCats.com:

INSERT OVERWRITE TABLE bestcats_refer_picadilla
SELECT picadilla_page_views.*
FROM picadilla_page_views
WHERE picadilla_page_views.date >= '2017-10-01' 
AND picadilla_page_views.date <= '2017-10-31' 
AND picadilla_page_views.url_referrer like '%bestcats.com';

Apache HBASE: people need to read and write their data in real time. HBASE meets that need: it is a NoSQL big data store, providing real-time read/write access to large datasets in a distributed environment (as all our data are spread across the HADOOP cluster).

HBASE can be accessed by Hive, by Pig and by Map Reduce, and stores the data in HDFS.

Facebook messaging is one of the most famous examples of HBASE users. When you send a message to your friend on Facebook, that message actually is an object in an HBASE table. The more messages you send, the more nodes Zuckerberg has to add to the FB cluster :)

HCatalog: table metadata storage, pulled out of Hive and treated as a separate project so that different data processing tools — Pig, MapReduce — can access it, read and write.

Apache Zookeeper: a centralized service for maintaining configuration information. It also stores some of HBASE metadata.

Apache Mahout: machine learning for targeted advertisements and recommendations. It scans the data in the system and recommends 'movies you might like' or 'places for you to explore'. It lets you write map-reduce applications focused on machine learning.

Apache Pig: a tool for analysing data. It provides a high-level language for programming Map-Reduce logic (Pig Latin – somewhat similar to SQL). Programmers write a high-level description, and Pig translates it into Map-Reduce jobs and runs them.

Yahoo's experience was that 50% of tasks are written using Pig instead of programmers coding Map-Reduce directly.

I wrote a very simple script which takes all comments from my imaginary cats site and filters out mentions of my favorite cat Picadilla into a new file picadilla_mentions.

This job might be sent to HADOOP when you open my cats site and click on Picadilla's photo. Then a page could open telling you what a famous cat she is:

catnotices = LOAD 'cats_site_comments';
picanotices = FILTER catnotices BY $0 MATCHES '.*PICADILL+.*';
STORE picanotices INTO 'picadilla_mentions';

Apache Oozie: a workflow scheduler to manage Hadoop batch map-reduce jobs – e.g. run this job on this schedule, with this interval etc. It has other features too, like triggering a map-reduce job when the necessary data appear.

Apache Flume: collects large amounts of log data and provides a data model for log analytics – load log data into HDFS and process it with map-reduce.

Apache Sqoop: a tool for writing map-reduce applications that transfer bulk data between Apache Hadoop and structured databases, like an RDBMS (Oracle, for example).

Roles

HADOOP administrator and user roles overlap. In general, administrators perform:

  • Installation
  • Monitoring
  • Tuning (users help to do that)

Users perform:

  • Design and coding apps for certain objectives (admins help to do that)
  • Import/export data
  • Working with HADOOP ecosystem tools, connectors

Pricing

The HADOOP open source software is free.

Since 2006 HADOOP has been distributed under the Apache Software Foundation, an American non-profit corporation formed as a decentralized open source community of developers. The software they produce is distributed under the terms of the Apache License and is free and open-source. The idea of the license is that no particular company controls HADOOP.

ASF makes money from donations, sponsorship and running conferences.

HADOOP ecosystem vendors use different pricing models: service per node, per terabyte, subscription, hardware plus software deals, selling specialised connectors, selling cloud services, and an interesting concept, freemium – free usage up to a threshold limit.

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.


Big Data: basics of Time series databases


A topic painfully familiar to me as a data warehouse servant: data changing over time. Ticket valid from – to, policy effective from – to, with sub-positions like travel insurance from – to.

Some time ago this was mostly a concern of "serious" systems like banking, and among other constraints storage space was weighed carefully. Nowadays, with the Internet of Things, self-driving cars etc. coming in, the challenge of operating quickly and easily with time periods has entered daily life. E.g. how did the room temperature behave while the heater temperature was increasing and the fridge temperature decreasing?

Historically, updating was used much more. For example, when you logged into a system, the attribute 'last_successful_login' was updated.

Nowadays, with practically unlimited storage and high performance databases, each login is treated as an event and logged as a new record with a timestamp, so the system owner can track history: at which time of day are you most active, is your login count increasing or decreasing.

Time Series database growing popularity 

The paradigm switch to accumulating data, combined with the desire to use those data and the technology to support that desire, is why the time series approach and time series databases have experienced a boost within the last decade in many areas, like

  • monitoring software systems: virtual machines growing popularity, services, applications
  • monitoring physical systems: equipment, connected devices, homes, bodies
  • asset tracking applications: vehicles, trucks, packages delivered
  • financial systems: cryptocurrencies, stock prices
  • tracking applications: customer interaction data
  • Business Intelligence tools: key metrics as well as overall state of the business

[Image: time series DBMS ranking]

(https://db-engines.com/en/ranking/time+series+dbms)

A time series database (TSDB) is optimized for handling time series data: each entry is associated with a timestamp, so the database holds data for each point in time. A trivialized example to show the idea:

21-OCT-2017: ticket 12345 valid
22-OCT-2017: ticket 12345 valid
23-OCT-2017: ticket 12345 valid
24-OCT-2017: ticket 12345 valid

instead of classics

ticket 12345 valid_from 21-OCT-2017 valid_to 24-OCT-2017

A better example would be a pulse measurement once per second.

  1. Each measurement is inserted as a new data entry (well, there might be use cases to update or overwrite a previous record but, let's be real, who does that nowadays?)
  2. The data arrive in time order and are stored in time order
  3. Time intervals can be either regular (as in my once-per-second example) or irregular
  4. The data volume grows very, very fast (and nobody has the motivation, or the courage, to clear the history)

So we can define time-series data as a data set which represents how something – a measurement, a process etc. – changes over time. A price changing over time might be a price curve. Energy consumption within a period might be a load profile. Logged temperature values might be a temperature trace.

Time series data querying

TSDB queries are specialized for time-based querying, like (a small Python sketch of the bucketing idea follows the list):

  • language close to natural, like average per minute could be 'group by time (1 minute)'
  • flexible built-in time aggregations – per second, minute, hour…
  • easy comparison to the previous record (in an RDBMS we use workarounds like LAG, OFFSET or querying the previous ID with complex calculations)
  • joining time series automatically, like SELECT orders_per_hour.count / errors_per_hour.count FROM orders_per_hour INNER JOIN errors_per_hour.
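
Here is a tiny sketch of the 'group by time (1 minute)' idea in plain Python, assuming events arrive as (timestamp, value) pairs – real TSDBs do this natively and far faster, this only shows the concept.

from collections import defaultdict
from datetime import datetime

# toy pulse measurements: (timestamp, beats per minute)
events = [
    (datetime(2017, 10, 21, 12, 0, 5), 62),
    (datetime(2017, 10, 21, 12, 0, 35), 64),
    (datetime(2017, 10, 21, 12, 1, 10), 70),
    (datetime(2017, 10, 21, 12, 1, 40), 72),
]

# "group by time (1 minute)": average per minute bucket
buckets = defaultdict(list)
for ts, value in events:
    buckets[ts.replace(second=0, microsecond=0)].append(value)

for minute, values in sorted(buckets.items()):
    print(minute, sum(values) / len(values))   # 12:00 -> 63.0, 12:01 -> 71.0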

RDBMS

In relational databases developers tend to set up a lot of data integrity controls

  • from error messages in input forms to trigger checks before saving to the database,
  • from snapshotting at fixed moments like the end of the month to regular checks whether this type of data already exists in the data warehouse.

If the controls find, e.g., two valid documents for one day, or a day without any valid document, they automatically create an incident and assign it to data stewards for investigation, and may apply other business rules like marking both documents 'invalid' or moving them to a quarantine zone.

I know an RDBMS DWH where batch processes scan all key data every night and set indicators like 'data valid today Y/N' or 'most up-to-date version Y/N'. Then, instead of time intervals, SQL queries use them like

select salary
from annexes
where employee_id = 1717
and top_version_indicator = 'Y'

In a TSDB you would instead query something like

select employee_name / annex_salary 
from employees 
INNER JOIN annexes

High risk of wrong query

Among other limitations of the traditional RDBMS start_date + end_date approach, it requires a high level of accuracy, as it is so easy to build a query which selects the wrong data set. Just forget one condition or use > instead of >=…

select employees.name, annexes.salary
from employees, contracts, annexes
where employees.employee_id = contracts.employee_id
and contracts.contract_id = annexes.contract_id
and contracts.start_date = (select max(c2.start_date) from contracts c2
                            where c2.employee_id = contracts.employee_id)
and contracts.end_date >= sysdate
and annexes.start_date >= contracts.start_date
and (annexes.end_date is null or  …

I have seen so many wrong queries of this type, for many reasons (a tiny illustration of the inclusive/exclusive end-date trap follows the list):

  • one date truncated to the day and the other to the minute, so equality never holds if the developer forgets TRUNC
  • dates stored in different formats and the developer forgets to use TO_DATE
  • different conventions on whether the end date is inclusive – new developers could not imagine in their worst nightmares that this system stores the end date as non-inclusive and they must use < end_date instead of <= end_date
  • when the start date equals the end date – developers tend to write end_date > start_date
  • some leave NULL if the end date is unknown, some set it to 01-JAN-3000, some to 31-DEC-2500 etc. Some have total chaos. It is easy to overlook these placeholder values in queries and frontends and print a ticket valid till 01-JAN-3000
  • different time zones (the general suggestion is to store the data in universal time and then translate it to the user's time zone; there are opposite approaches too, storing the time of submission and the time zone it was submitted in)
  • data sometimes written to the database at the moment they appear and sometimes later. Some developers may unreasonably rely on data always being written at the moment they appear.
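
A tiny illustration in plain Python, with imaginary ticket data, of the inclusive vs exclusive end-date trap from the list above: the same data, two 'obvious' conditions, two different answers.

from datetime import date

# imaginary ticket: stored with end date 24-OCT-2017
ticket = {"id": 12345, "valid_from": date(2017, 10, 21), "valid_to": date(2017, 10, 24)}
asked_day = date(2017, 10, 24)

# developer A assumes the end date is inclusive
valid_inclusive = ticket["valid_from"] <= asked_day <= ticket["valid_to"]

# developer B assumes the end date is exclusive (valid strictly before valid_to)
valid_exclusive = ticket["valid_from"] <= asked_day < ticket["valid_to"]

print(valid_inclusive, valid_exclusive)   # True False - same data, different answers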

A good, documented database architecture and an accurate analyst-developer are highly valued. You can imagine why. Who wants key business data presented as the result of a wrong query?

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: basics of wide column store (column family) databases


Column family databases are designed for very, very large volumes of data where speed is crucial (millions of records processed per second, volumes of terabytes – petabytes – …), for tasks such as

  • Analytics – logfiles, scientific, stock trends etc
  • Searching (like googling)
  • Social networking
  • Applications geographically distributed over multiple data centers

One of the use cases – Facebook uses HBase for its messaging platform. Other popular databases (https://db-engines.com/en/ranking/wide+column+store):

[Image: wide column store ranking, October 2017]

A wide column store is a sub-type of key-value database. It stores the data of a table not record after record but column by column. The concept is actually interesting:

Traditional row based table:

Cats

ID, name, breed, colour, animal
1:picadilla, burmese, brown, null
2:murmor, siamese, null, cat
3:fred, null, grey, kitten

The column store stores columns together as ID and value, like this:
Cats.Name 1:picadilla,2:murmor,3:fred
Cats.Breed 1:burmese,2:siamese
Cats.Colour 1:brown,3:grey
Cats.Animal 2:cat,3:kitten

As you see, you can easily add a new column at any time, like a vaccination for Fred:

Cats.Vaccination 3:20-oct-2017

Or add a Date of birth for Picadilla and Fred

Cats.DOB 1:02-jul-2015,3:17-dec-2014

In a traditional table you have to define the table structure in advance. If you add a column later, you add it for every row, like:

[Image: the Cats table]

Within a wide column store, columns are created as needed when data are sent to the database. Hence a record can have billions of columns, and each record may have a different set of columns.
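
A toy Python sketch of the column-by-column layout above: each column is its own little map from row key to value, so new columns appear only for the rows that actually have them.

# each column stored separately as {row_key: value} - sparse by nature
cats = {
    "Name":   {1: "picadilla", 2: "murmor", 3: "fred"},
    "Breed":  {1: "burmese", 2: "siamese"},
    "Colour": {1: "brown", 3: "grey"},
    "Animal": {2: "cat", 3: "kitten"},
}

# adding a brand-new column only touches the rows that have a value
cats.setdefault("Vaccination", {})[3] = "20-oct-2017"
cats.setdefault("DOB", {}).update({1: "02-jul-2015", 3: "17-dec-2014"})

# reassemble "row 3" on demand from whichever columns mention it
row3 = {column: values[3] for column, values in cats.items() if 3 in values}
print(row3)   # {'Name': 'fred', 'Colour': 'grey', 'Animal': 'kitten', ...}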

Columns are grouped into column groups – families. The rules of thumb are

  • to group into a family columns frequently used together,
  • to have a few families and unlimited number of columns.

Each row in the column family has a unique identifier key, like we had 1,2,3. These unique keys might be the ones you sometimes see in URLs like xEWfvS1-ffrEf-q4m. Or timestamps, or any other unique value.

To speed up querying, derived data are also stored rather than calculated on the fly. For example, store for each cat the average amount of chicken liver eaten per day or per week, or the difference between yesterday's and today's food weight. Or the ratio of its tail length to its ear length. Or any other information (see how the amount of data to store keeps growing?). Time is saved this way – instead of calculations, a very fast lookup by key is performed.

Should 'Murmor, the son of a great Catapurr, born in tundra when the Sun was high' be stored as one string or with each word in a separate column? It depends on the use cases. If you expect the database to be queried on any of these, like 'how many cats were born in tundra' or 'how many times is the word 'the' used in names', then you should split them. If it is just a string, store it as a string in one column.

Joins are not used in wide column stores. Data denormalisation is used instead. If Picadilla's favorite food, Chicken, is the same as Murmor's, then both Picadilla and Murmor will have the value 'Chicken' stored in their data. There is no reference to a 'Chicken' food lookup, or a colour lookup, or a breed lookup.

[Image: column family layout of the Cats data]

Here you see JSON notation for Cats family:

[Image: JSON notation for the Cats family]

The columns themselves may store columns. This is then called a supercolumn family. However, the more complex the data structure you choose, the harder it is to parse and process. Simplicity is the winning strategy for keeping the advantage of column family databases.

Here you see JSON storing doctor checkups for Murmor:

[Image: JSON for Murmor's doctor checkups (supercolumn family)]

I hope you are as excited as I am. The magic behind 'Big Data' is slowly revealing itself. No magic. Only programming :)

By the way, here is a nice example of how a tweet looks when stored in JSON. The beautiful tweets you see in Twitter are nothing but end-user frontend programming, based on retrieving, parsing and visually displaying these JSONs.
Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: basics of document oriented databases


This is one of the most important concepts of NoSQL, as the term 'document' can be treated very broadly: my passport is a document, and so is this blog entry.

Use cases of document oriented databases:

  • Storing log files
  • Storing high volumes of varied incoming data – e.g. one machine measures rotations per minute and produces CSV, another measures temperature and produces XML, and yet another keeps track of rejected operations and produces fixed-length files
  • Efficient storage. For example, you have 30 attributes in your user profile like 'date of birth', 'preferred music', 'favorite sport' etc. One user might have provided only the date, another only the sport. You do not have to store the whole structure for each user saying 'date of birth' and 'preferred music' not provided (as you would store NULL or 'n/a' in an RDBMS table). You just store the data you have: user1 'favorite sport'.
  • Storing blogs and articles with comments and likes
  • Storing business data where searching by metadata and content is crucial
  • Storing documents whose structure can change at any time – e.g. you can easily add a new feature like emotions to your blog platform because you do not have to redefine the whole schema
  • Storing new types of documents. You were storing only cats, but today you can start storing and querying also dogs, trees, galaxies and stationery
  • Fast writes, as document-oriented databases have no RDBMS transaction and locking overhead
  • Unlimited growth of database size, since documents are key-value pairs with the document ID being the key and the document being the value.

The idea of a document-oriented database is to provide a scalable framework for storing, inserting, querying, retrieving and updating (at the single document level) an unlimited amount of self-describing structured or semi-structured data, called 'documents'. Part of the data is the document content itself and part is data about the data.

The teacher used MongoDB as an example and you can see why (https://db-engines.com/en/ranking/document+store). Highlighted in yellow is Clusterpoint, produced by a Latvian team:

[Image: document store ranking]

Usually documents are stored as JSON or XML, or as binary like PDF, MS Word or Excel (e.g. MongoDB and CouchDB use JSON-style documents). Each document has a unique ID and can have its own structure. There is no predefined schema. However, we should understand that document databases are not designed to fit everything. E.g. they are not the best for deeply nested data, because they cannot search it effectively.

As I am an SQL person, here comes my survival kit (https://www.slideshare.net/fabiofumarola1/9-document-oriented-databases):

[Image: RDBMS vs document database terminology mapping]

Relations among data items can be represented in two ways: referencing and embedding. Which one is better? Hah, this is The Question over which every developer and architect has a lot of pain; lots of guidelines have been written and tons of stackoverflow.com posts. More art than science. Each has pros and cons. The good news is that you can always change your mind. This is one of the reasons I love programming: I can play a lot. If I were a surgeon it would be much harder to recompile.

Referencing food stored in another document:

{
 _id: "cat_1",
 name: "Picadilla",
 colour: "brown",
 food: "catfood_123",
 amount: 2,
 dateofbirth: "10-OCT-2010"
}
{
 _id: "catfood_123",
 name: "Tasty Chicken Liver",
 producer: "Catty Food Inc.",
 address: "Wildcat Boulevard 17"
}

Embedding food data in single document:

Notice that food has no ID itself – the _id field is a required field of the parent document, and is typically not necessary for embedded documents. You can add an _id field if you want.

{
 _id: "cat_1",
 name: "Picadilla",
 colour: "brown",
 food:
   {
    name: "Tasty Chicken Liver",
    producer: "Catty Food Inc.",
    address: "Wildcat Boulevard 17"
   },
 amount: 2,
 dateofbirth: "10-OCT-2010"
}

Some ideas I collected about referencing vs embedding (a small Python sketch follows the list):

  • the more you keep in a single document, the better – it is easy to query
  • any data that is not useful apart from its parent document should definitely be part of the same document
  • separate data into its own collection when it is meant to be referred to from multiple places
  • embedding is good for one-to-one or one-to-many relationships; referencing is good for many-to-many relationships
  • embedded documents are easy to retrieve (everything is stored there), but when querying by parts of them limitations exist, like sorting being limited to insertion order
  • to my surprise, I am reading that there are no big differences in insert and update speed
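
A small Python sketch of the same cat/food example as plain dictionaries: with referencing the application does the extra lookup itself, with embedding everything arrives in one read. The names mirror the pseudo-documents above.

# referencing: the cat document only stores the food's id
foods = {"catfood_123": {"name": "Tasty Chicken Liver", "producer": "Catty Food Inc."}}
cat_ref = {"_id": "cat_1", "name": "Picadilla", "food": "catfood_123", "amount": 2}

# the application resolves the reference with a second lookup
food_of_picadilla = foods[cat_ref["food"]]

# embedding: one read returns everything, no second lookup needed
cat_emb = {
    "_id": "cat_1",
    "name": "Picadilla",
    "food": {"name": "Tasty Chicken Liver", "producer": "Catty Food Inc."},
    "amount": 2,
}
print(food_of_picadilla["name"], cat_emb["food"]["name"])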

Consistency

When you design your schema, consider how you will keep your data consistent. Changes to a single document are atomic (the complete operation is guaranteed), but when updating multiple documents it is likely that, at a given moment, the address of the same cat food producer may differ between cats (NB: there are a few databases, like Clusterpoint, which can handle multi-document updates in a transaction). In general, there is no way to lock a record on the server. The only way is to build field locking into the client's logic.

Remember, NoSQL systems by design support BASE, not ACID transactions. It is normal and expected that at a given moment you may see different addresses of food providers – or different comment content, or different view counts.

My favorite example is the Candy Crush daily free booster. If I spin it on a tablet and later try on a phone, I get 'come tomorrow'. But if I spin on all the devices at once, I get a booster on each of them #lifehack. In an RDBMS, the transaction control mechanism would guarantee that once spun, you may not repeat it.

Very powerful querying

Documents can be queried by any of their attributes, like

{
 name: "Picadilla",
 colour: ["brown", "amber"]
}

The querying language is powerful but not obvious, even though I am a guru of SQL querying. It would take a while to learn. Here is a nice material with SQL mapped to MongoDB queries.

As usual, each query's speed should be checked before going live. Similar to an RDBMS, a nice feature is calling 'explain' for a query to see what the database does when running it – which index it is using etc.

[Image: SQL to MongoDB query mapping]

Inserting new records

Again, MongoDB as the example. Insert one:

db.inventory.insertOne(
   { item: "catfood", qty: 10, tags: ["chicken", "liver"], ingredients: { water: 50, meat: 30, fat: 30, unit: "pct" } }
)

Insert many:

db.inventory.insertMany([
   { item: "catfood", qty: 10, tags: ["chicken", "liver"], ingredients: { water: 50, meat: 30, fat: 30, unit: "pct" } },
   { item: "dogfood", qty: 23, tags: ["beef"], ingredients: { water: 40, meat: 20, fat: 40, unit: "pct" } },
   { item: "kittenfood", qty: 3, tags: ["turkey", "fillet"], ingredients: { water: 55, meat: 30, fat: 25, unit: "pct" } }
])

Updating

A nice reason to learn the upsert option: if set to true, it creates a new document in case no document matches the query criteria. multi means that if matching documents exist, the operation updates all of them.

db.cats.update( { "name": "Meowcha" },
   { $set: { "colour": ["brown", "white", "black"] } },
   { upsert: true, multi: true } )

Deleting

We set deletion criteria and remove matching documents. E.g. the following operation removes a single (the first matching) document from the collection cats where amount is greater than 2:

db.cats.remove( { amount: { $gt: 2 } }, true )

Well, I have had a look at the very, very basics. Can I apply for a document database or MongoDB expert role now? No :) But I can honestly say I now know much more than I knew a month ago, and I definitely would not be afraid to hear in my project 'we are going to start using a document oriented database tomorrow'.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: some of universal file formats


All the data – this blog, Facebook messages, comments, LinkedIn articles, anything – has to be stored somewhere, somehow. How? It depends (here you can see how a tweet looks in JSON format), but there are some universal formats.

Besides writing my notes here, I am going to prove that it is possible to just start and learn. You do not need any servers or installs to learn XML querying – just google for an online XPATH tester, online XQUERY or online JSON query and go, do, test, learn.

Sometimes I see young girls wasting their lives being bored at receptions or in empty shops, sitting at a computer with Solitaire or a gossip page open, and I think – if I were them, I swear I would learn programming online every free minute I have! When I was studying we had to sit in libraries and subscribe in advance for an hour a day of mainframe access. No excuses nowadays, guys!

XML

This is one of The Formats you should know even when woken up at 3 AM, because a lot of Big Data databases store data in XML format. Both XML and JSON (see below) are human-readable and machine-readable plain text file formats.

Database management systems whose internal data model corresponds to XML documents are called native XML DBMS, and they claim to use the full power of XML: they represent hierarchical data and support XML-specific query languages such as XPath, XQuery or XSLT.

NB: Native XML DBMS do not necessarily store data as XML documents, they can use other formats for better efficiency.

Databases which use other data models, like relational, and are capable of storing XML documents are called XML-enabled DBMS.

Current ranking of native XML databases: https://db-engines.com/en/ranking/native+xml+dbms

[Image: native XML DBMS ranking, October 2017]

Lesson learned with self-made XMLs

XML data values have beginning and end, and are separated by tags, you know –

[Image: XML example]

Many years ago I was working as a designer of XML files for data exchange. We were young and enchanted by the unlimited power of the any-structure container format, and we used very long tags. Our intentions were good – human readable plain text for fixed order forms, like <MozzarellaWithFourCheesesPizzaPriceBeforeTaxesTipsNotIncludedCurrencyLVL>1.17</MozzarellaWithFourCheesesPizzaPriceBeforeTaxesTipsNotIncludedCurrencyLVL>.

We had to XML-ionize hundreds of documents and do it very fast, so we worked like a factory.

We did it. But… What we got was:

  • Storage-space-consuming documents
  • Network traffic
  • Quite funny software code parsing those wonder-tags
  • The same business term and tag spelled in many variations, like Pizza, Pica, Picca
  • Grammar errors in tags confusing users, like MozcarelaWihtSieru
  • Mixed languages and translation errors in tags, like PicaArCheese
  • The at-a-glance readability of XML was misleading when a tag became inconsistent with its value
  • Documentation was not consistent, incl. curiosities when writers corrected grammar in tags in Word docs (thinking they were doing great work)
  • Unmaintainable structure – see the example with LVL in the tag and 'four cheeses' – recipes do change

My lessons learned –

  • short and neutral tags
  • create structure using hierarchy, not tag names
  • include version attribute in the beginning of file
  • follow the same style (we used PascalCase), usually one of:

– Lower case: firstname – all letters lower case

– Upper case: FIRSTNAME – all letters upper case

– Underscore: first_name – underscore separates words

– Pascal case: FirstName – uppercase first letter in each word

– Camel case: firstName – uppercase first letter in each word except the first

Querying XML documents

One might ask – why should we query a plain text file if we can search it in Notepad? Answer: that is easy only on short samples. When you have a lot of data, you quickly lose track of whether a value is the first, the second or the hundredth.

XPath language

Declarative “path like” syntax to identify and navigate nodes in an XML document. Nice web page to play online: https://www.freeformatter.com/xpath-tester.html NB: tags are case sensitive

I played a bit there using self-made simple sample FoodCalendar.

It took a while with //CatName[2]/text() until I understood that the second element – [2] – means second within its parent tag, not second in the returned list. The correct query for what I wanted – the second cat in the list – was:

(//CatName/text())[2]

All foods eaten more than 2 times:

//*[@Amount>2]

Count cats:

count(//Cat)

All foods containing ‘Chicken’:

//*[contains(@FoodUsed, 'Chicken')]
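
If you prefer playing locally instead of the online testers, the same XPath expressions run in Python with the third-party lxml library. The FoodCalendar below is my guess at the structure behind the queries (Cat elements with a CatName child and FoodUsed/Amount/Date attributes), not the exact file I used.

from lxml import etree   # third-party library: pip install lxml

# my reconstruction of a FoodCalendar document matching the queries above
xml = b"""
<FoodCalendar>
  <Cat FoodUsed="ChickenLiver" Amount="5" Date="10-OCT-2017"><CatName>Picadilla</CatName></Cat>
  <Cat FoodUsed="Salmon" Amount="2" Date="10-OCT-2017"><CatName>Murmor</CatName></Cat>
  <Cat FoodUsed="ChickenFillet" Amount="3" Date="11-OCT-2017"><CatName>Fred</CatName></Cat>
</FoodCalendar>"""

doc = etree.fromstring(xml)
print(doc.xpath("(//CatName/text())[2]"))               # second cat in the list
print(doc.xpath("//*[@Amount>2]"))                      # foods eaten more than 2 times
print(doc.xpath("count(//Cat)"))                        # 3.0
print(doc.xpath("//*[contains(@FoodUsed, 'Chicken')]")) # foods containing 'Chicken'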

An extension of XPath is XQuery

It is a much more complex language to extract and manipulate XML data and transform it into HTML, CSV, SQL or any other text-based format. I read some manuals of one of the native XML databases https://exist-db.org/exist/apps/demo/examples/basic/basics.html and wrote a very simple query in the XQuery online tester http://videlibri.sourceforge.net/cgi-bin/xidelcgi to find what Picadilla is eating:

for $i in $catxml//Cat
where $i//CatName = "Picadilla"
return ("CAT ", $i//CatName, "EATS", data($i//@FoodUsed), data($i//@Amount), " TIMES A DAY", data($i//@Date))

and the answer was:

CAT Picadilla EATS ChickenLiver 5 TIMES A DAY 10-OCT-2017

[Image: XQuery test result]

Of course, this language is much more powerful; you can analyze data and write computation functions in it. My goal in playing was to see whether it is possible to learn this kind of querying, and I see that it is – just some more time is needed.

JSON

Another must-know is the JSON format, yet another way to store information as text in an organized and human-readable manner. Document databases such as MongoDB use JSON documents to store records, just as tables and rows store records in a relational database.

JSON format files can easily be sent to and from a server, and used as a data format by any programming language.

We can store any number of properties for an object in JSON format.

It is shorter than XML, but they have similarities.

  • Both JSON and XML are “self describing” (human readable)
  • Both JSON and XML are hierarchical (values within values)
  • Both JSON and XML can be parsed and used by lots of programming languages

Differences:

  • JSON is shorter
  • JSON is quicker to read and write
  • JSON can use arrays

and the biggest difference is that XML has to be parsed with an XML parser, while JSON can be parsed by a standard JavaScript function. That helps explain the huge popularity of JSON.

JSONPath

Similarly to XPath, there is JSONPath which is a JSON query language.

I took my FoodCalendar XML and converted it to JSON via https://www.freeformatter.com/xml-to-json-converter.html,

and wrote a simple query to filter all foods and dates where the amount eaten is < 5

$.Cats..Meals[?(@.Amount<5)].[FoodUsed,Date]

http://jsonpath.com/

[Image: JSONPath query result]
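
The same filter works in plain Python with the standard json module, assuming the converted document has the Cats → Meals shape that the JSONPath above implies (my reconstruction, not the exact converter output).

import json

# my reconstruction of the converted FoodCalendar - shape implied by the JSONPath query
doc = json.loads("""
{
  "Cats": [
    {"CatName": "Picadilla",
     "Meals": [{"FoodUsed": "ChickenLiver", "Amount": 5, "Date": "10-OCT-2017"},
               {"FoodUsed": "Salmon", "Amount": 2, "Date": "11-OCT-2017"}]},
    {"CatName": "Murmor",
     "Meals": [{"FoodUsed": "Tuna", "Amount": 1, "Date": "10-OCT-2017"}]}
  ]
}
""")

# all foods and dates where the amount eaten is < 5
small_meals = [(meal["FoodUsed"], meal["Date"])
               for cat in doc["Cats"]
               for meal in cat["Meals"]
               if meal["Amount"] < 5]
print(small_meals)   # [('Salmon', '11-OCT-2017'), ('Tuna', '10-OCT-2017')]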

CSV and FIXED LENGTH FILE – universal formats for tabular data sets

At least a basic understanding of these universal tabular (not hierarchical) formats is crucial, because a lot of NoSQL Big Data files are stored in them.

CSV (comma separated) – the one you should know. It will never die because of its simplicity.

CSV has become a kind of industry standard, despite having no universal standard: a text file with one record per line, with fields separated by a comma (or ; or TAB or another symbol). Fields containing the separator are wrapped in double quotes; double quote characters inside a field are escaped by doubling them and quoting the field, etc.

List of advantages is impressive:

  • compact size – write the column headers once and then no additional tags are needed in the data,
  • human readable,
  • easy for machines to generate and to read,
  • widely used for tabular data,
  • most applications support it.

It is very popular among MS Excel users. It is used to transfer data between programs, to import and export data, and sometimes as a workaround to export data, modify it and import it back.

Disadvantages:

  • complex data are too complex to transfer within CSV,
  • poor support for special characters,
  • no datatypes defined (text and numbers are treated the same),
  • when importing into SQL, no distinction between NULL and an empty quoted string.

As there is no universal standard, a widely hit issue is newline delimiters – Linux uses one convention, Windows another, etc.

Example of formatted XLSX file:

[Image: formatted table in Excel]

Saved As CSV:

NR,CatName,Diet,Food,Date,Amount,
1,Picadilla,,"Sausage ""The Best""",12/09/2017,1(?),"Peter, can you check, was it really only one??"
2,Murmor,Y,"Chicken,boiled ©",14-Sep-17,0.5,
3,Fred,N,"Salmon,,fresh",15.09.2017,2,

See (and the parsing sketch after this list):

  • the trailing comma in the first row – one column is without a heading
  • the first record contains a NULL value for 'Diet', quotes inside 'Food' and a comma in the comment
  • the second record contains a special symbol which might (heh, will) be lost when imported into other software
  • the dates are still in different formats
  • colours lost, formatting lost
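
Reading that same file with Python's standard csv module shows what survives and what does not: the quoting and embedded commas are handled for you, but every value still comes back as a plain string.

import csv, io

# the CSV exactly as saved above (note the unnamed trailing column)
data = '''NR,CatName,Diet,Food,Date,Amount,
1,Picadilla,,"Sausage ""The Best""",12/09/2017,1(?),"Peter, can you check, was it really only one??"
2,Murmor,Y,"Chicken,boiled ©",14-Sep-17,0.5,
3,Fred,N,"Salmon,,fresh",15.09.2017,2,
'''

for row in csv.DictReader(io.StringIO(data)):
    # embedded commas and doubled quotes are parsed correctly,
    # but dates and numbers are still just strings in three different formats
    print(row["CatName"], repr(row["Food"]), row["Date"], row["Amount"])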

And the same open in Excel again

[Image: the CSV opened in Excel]

Double-clicked to expand columns

[Image: the CSV in Excel with expanded columns]

Fixed-length fields

As the name reveals, it is an agreement that the first column always has exactly X (5 or 10 or any other value) characters, the second column exactly Y, the third exactly Z and so on. The same for all rows.

If a value is shorter, it is padded with spaces or any other specified character. Padding can be done on either side or both.

When you use this format, be ready to do a lot of configuration, like defining each field's length, and even to write your own code.
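
A minimal Python sketch of that 'write your own code' part: the field widths below are made-up numbers for illustration, every real file needs its own spec.

# made-up field widths for illustration - every fixed-length file needs its own spec
FIELDS = [("NR", 5), ("CatName", 10), ("Diet", 4), ("Food", 25), ("Date", 12), ("Amount", 6)]

def parse_fixed(line):
    record, pos = {}, 0
    for name, width in FIELDS:
        # slice out exactly `width` characters and strip the padding
        record[name] = line[pos:pos + width].strip()
        pos += width
    return record

# build a sample line with the same widths (ljust pads with spaces)
line = "".join(value.ljust(width) for value, (_, width) in
               zip(["1", "Picadilla", "", 'Sausage "The Best"', "12/09/2017", "1(?)"], FIELDS))
print(parse_fixed(line))   # {'NR': '1', 'CatName': 'Picadilla', 'Diet': '', ...}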

I converted my cats CSV to a fixed length file (http://www.convertcsv.com/csv-to-flat-file.htm). I had to set several configuration options like field lengths and alignment:

NR   CatName   DietFood                     Date                     Amount
1    Picadilla     Sausage "The Best"       12/09/2017               1(?)  Peter, can you check, was it really only one??
2    Murmor    Y   Chicken,boiled �         14-Sep-17                0.5
3    Fred      N   Salmon,,fresh            15.09.2017               2

And after opening in Excel –

[Image: fixed-length file opened in Excel]

As you see, smarter reader software is necessary.

You might ask – hey, let's add a data definition to the file, and then the format is readable automatically by software. Why not :) it is called

DBF (database table file) format

It is a fixed length file with its data definition at the beginning of the file in a machine readable, standardized format. There is no single universal DBF standard; each file contains its own description. It is interesting to know this format, but I see no reason to learn the fine details.

I tried to convert my CSV to DBF online but failed. Then I saved the CSV to XLSX and converted the XLSX to DBF.

Opening converted result with Notepad:

[Image: the DBF file opened in Notepad]

Excel:

[Image: the DBF file opened in Excel]

Opening DBF online (http://www.dbfopener.com/):

[Image: the DBF file in an online viewer]

These experiments were enough to illustrate that even small and simple fixed length files need some time to be converted.

And, as you see, the copyright symbol lives its own life :)

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: enchanted with the idea of graph database power


 

Why are there so many different database management systems? Because each of them is best at supporting something you need.

  • Payments 24/7 and queries like calculating the average price of the TOP 5 most sold goods? – choose a relational database and enjoy its built-in transaction support and SQL

Relational database: predefined tables where one 'document' is usually split among tables, and an SQL query must be written to join the parts back into the piece you want to retrieve

  • Building an e-shop shopping cart? – key-value store; retrieve/write the cart data by its ID

Key-value store: the data value is an unknown black box for the store; it is located by its key and retrieved very fast.

  • Storing blog posts or messages? – document oriented database

Document-oriented store: similar to a KV store, but a document oriented database knows predefined metadata about the internal structure of the document. The document content itself may be anything autonomous – MS Word, XML, JSON, binary etc. – but the database engine uses some structure for organizing documents, providing security, or other implementation details. In contrast to relational databases, a document is stored as a singular object. (The next lecture will be about them, so expect a blog post.)

  • Navigating a user from point A to point B? Social networking? – graph database.

Graph database: networking. The database is a collection of nodes and edges. Each node represents an entity (such as a cat or a person or a product or a business) and each edge represents a connection or relationship between two nodes – likes, follows, blocks, …

A NoSQL graph database does not need its schema redefined before adding new data – neither relations nor the data itself. You can extend the network in any direction – billions of cats, ups, data items. You can add cats, farms, persons, foods, cars; you can add their likes, dislikes, hates, reads-the-same, sister-of, attends-the-same-lunch, checked-in-at-the-same-place, just anything.

Picadilla (likes) Fred. Murmor (hates) Minko. Picadilla (eats together with) Minko. Fred (meows) at Amber. Murmor (sleep in the same room as) Amber.

You see, as Murmor hates Minko, he had better avoid Picadilla, who eats together with Minko. Amber should also be cautious of Picadilla: Picadilla likes Fred, who meows at Amber, so there is a chance they will both meow at Amber.

The next day you add more observations.

Minko (likes) Murmor. It increases the chance that Murmor will go on hating Minko and avoiding Picadilla.

The more data you have, the better you can trace connections (King of the World). The more paths, and the more useful paths, you can find – just imagine the power. Graph databases are meant for that. And Facebook… what a set of queries I could write there… mmmmm…
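
Before touching a real graph database you can feel the idea in a few Python lines: the cat observations above as an adjacency list and a breadth-first search for a path between two cats. A graph database does exactly this kind of traversal, only on billions of nodes and with indexes.

from collections import deque

# the observations above as an adjacency list (undirected, relationship type ignored for simplicity)
graph = {
    "Picadilla": {"Fred", "Minko"},
    "Murmor": {"Minko", "Amber"},
    "Minko": {"Picadilla", "Murmor"},
    "Fred": {"Picadilla", "Amber"},
    "Amber": {"Fred", "Murmor"},
}

def shortest_path(start, goal):
    # plain breadth-first search: expand neighbours level by level
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph[path[-1]] - seen:
            seen.add(neighbour)
            queue.append(path + [neighbour])

print(shortest_path("Picadilla", "Amber"))   # ['Picadilla', 'Fred', 'Amber']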

I fell in love when I realised we can query a graph database using a specialized query language.

Who likes Fred? Who are the friends of those who hate Minko? What is common between Picadilla and Murmor? Which cats can't be in one room? How many cats, on average, like one cat?

The highest ranking is the Neo4j database. Its graph query language is Cypher.

To be honest, the Cypher language is human readable and seems quite easy to learn.

MATCH (cat)-[:likes]->(person) WHERE cat.name = 'Picadilla' RETURN person

I googled samples – https://neo4j.com/developer/cypher-query-language/

Find Someone in your Network Who Can Help You Learn Neo4j

MATCH (you {name:"You"})
MATCH (expert)-[:WORKED_WITH]->(db:Database {name:"Neo4j"})
MATCH path = shortestPath( (you)-[:FRIEND*..5]-(expert) )
RETURN db,expert,path

Cast of movies starting with "T":

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WHERE movie.title STARTS WITH "T"
RETURN movie.title AS title, collect(actor.name) AS cast
ORDER BY title ASC LIMIT 10;

If you already have SQL querying skills, it seems to be just a matter of mindset and syntax. I like graph databases.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Big Data: learning key-value store basics


Have you ever wondered how Facebook manages 20 million reads per second? Every 60 seconds: 510,000 comments, 293,000 status updates, 136,000 photos uploaded. Imagine, you are one of a billion active users there, and it works as fast as if you were the only one in the whole Universe! HOW?!?! (d'oh, when I understand, many cats will be needed to explain)

Big Data studies are (slowly) opening a whole new world for me. On my SQL planet a query (well, a quite complex analytical one, like 'calculate the income within the last 30 years generated by users who have complained, whose complaints were declined, but who have still returned as customers and used the service more than average') runs 15 minutes to an hour.

The key-value concept is a very flexible data storing and retrieving approach used in NoSQL systems. A key-value store stores key-value pairs. Values are identified by key and accessed via a direct request to the object in memory or on disk (no RDBMS layer, no overhead, no relationships among other pairs, no control whether the value matches a predefined format etc.).

Being experienced in relational databases, I somehow expected that somewhere there is a definition like 'key is a number and value is text'. No, no, no. There is no defined schema in a key-value store. There is just a pair of fields – a key and the value. This is called a content-agnostic database – store anything you want, either the data itself or a pointer to the data – from JSON to XML, from HTML to images, from log files to chat history, from books to videos. There is no special data query language to learn, and it is easy to move data to another system because of this simplicity.

There are some basic KV concepts (a toy sketch in Python follows the list):

  • You can find the value only by its key – and find it very fast (when a system provides fast access to data operations, it is called a low latency system – you might see this term when googling KV). Key structure example: userID_messageID
  • The values are opaque (they aren't self-descriptive); the key-value store doesn't know anything about them (like a courier delivering something to your address does not know the content). The application (not the key-value store, but the application – like the Instagram frontend) has complete control over operations based on the value's content – just as you might decide to open the package with a knife or a sword, or maybe it is a flower bouquet, or you throw the content away.
  • You always retrieve the full value from the key-value store. You cannot filter or restrict the value returned. What you do with that value is your application's business. Of course, the application can read the binary content, decode it to text and then filter content out of it.
  • Data access is performed by "simple" get, put and delete commands to the key-value store. As all other operations are performed by applications, the key-value store generally gets one request to read and one request to write: a read when the application retrieves the value, and a write when the user's changes are done in your application – e.g. a cart in an e-shop – operations are not written to disk in the key-value store each time you add a product or modify the desired amount; the data reside within the application cache.
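
A toy in-memory sketch of those three operations in Python: the store sees only opaque bytes, and the application is the one that knows the value is JSON and decodes it.

import json

class TinyKVStore:
    """Toy key-value store: keys map to opaque bytes, nothing else is known."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):        # write: whatever bytes the application sends
        self._data[key] = value
    def get(self, key):               # read: the full value back, no filtering possible
        return self._data[key]
    def delete(self, key):
        self._data.pop(key, None)

store = TinyKVStore()

# the application decides the key layout (e.g. userID_cartID) and the value format
cart = {"items": [{"product": "cat food", "qty": 2}], "currency": "EUR"}
store.put("user17_cart", json.dumps(cart).encode())      # stored as opaque bytes

# on read, the application gets the whole value and decodes it itself
loaded = json.loads(store.get("user17_cart").decode())
loaded["items"].append({"product": "catnip", "qty": 1})
store.put("user17_cart", json.dumps(loaded).encode())    # write the whole value back

store.delete("user17_cart")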

Cat Passport Key Value store

Hmm. What should I use for the key and the value? Relational database modelling is descriptive: I know what objects I have within the business and build a model – table CATS (CatID number, Name VARCHAR2(30), …), table GENDERS, table BREEDS.

Within a key-value store it is crucial that the key be usable for retrieving data. I must think 'what do I want to know' instead of the relational approach 'describe what I know'.

Design of keys and values 

I could choose to store my cat data anyhow I want (remember, the value will be stored as binary 0028 e8e9 00fc 5cbb… and only the application decodes it to text after retrieving it from the store):

Key: Picadilla

Value: Born:20150921;Breed:Burma;Color:#FFBF00;Gender:1;Photo:IMG_20170925_121432.JPG

And I can decide to store it with the date of birth as the key, in another format:

Key: 20150921

Value:(Name)Picadilla(/)(Breed)Burma(/)(Color)amber(/)(Gender)Female(/)(Photo)IMG_20170925_121432.JPG(/)

And I can decide to store it with the breed as the key, in yet another format:

Key: Burma

Value:N=Picadilla|B=21-SEP-2015|C=255,191,0|G=F

And I can decide to store it with a randomly generated number as the key

Key:6756970977876576789097

Or URL

Key:https://mysite.cats.cc/thebestcats/Picadilla.htm

I can build my software to read these attributes and show the name in large red letters, the breed in blue etc. I can add the latest vaccination date or free-form comments and save it all in the key-value store. If I add a vaccination to Picadilla, there are no checks whether I also add a vaccination date to Murmor. Freedom of values.

Often the JSON format is used to describe the value structure – I will tell a bit about it someday.

P.S. Of course, in Facebook etc. there will never be a key like 'VitaKarnite' in the KV store. They use keys that are either random unique identifiers or calculated by hash functions from something like UserID and Username.

I believe any large photo storing system stores photos as key-value pairs. I am pretty sure these pairs do not contain the user name or any other additional, potentially changing data. I could assume that the upload time might be stored inside the value. I think there is a kind of combination, where tables store metadata like user name, surname, last login and relationships to friends. There might also be a table UserPhotos storing UserID and a unique autogenerated PhotoID, and then the key-value pair would be PhotoID and a pointer to the file location. When a new photo is uploaded, a new metadata record (UserID plus PhotoID) is generated and a key-value pair is added to the key-value store.

I had a look at the most popular key-value databases and googled use cases.

  • Nearly everywhere you will find this example – the shopping cart in an e-shop
  • User preference and profile stores
  • Product recommendations; the latest items viewed on a retailer website drive future customer product recommendations
  • Ad serving; customer shopping habits result in customized ads, coupons, etc. for each customer in real time

That's enough for today, despite being behind my learning schedule – I still have graph databases to learn, and tomorrow is the next lecture already.

P.S. Instead of counting sheep I design approaches for Candy Crush – how I would store the board layout, record moves, build controls, define attention-loss detection logic – you know, when you are losing patience and suddenly there are 5 color bombs. I am also figuring out user characteristics metrics from their Candy Crush history. I am sure I could pull out a lot of interesting things :) and I am even more sure a lot of companies already do. I'd eat my hat if banks do not profile you by the way you use your internet bank.

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.
