Big Data: some of universal file formats


All the data – this blog, Facebook messages, comments, Linkedin articles, anything – has to be stored somewhere somehow. How? It depends (here you can see how a tweet looks like in JSON format) but there are some universal formats.

Besides writing my notes here I am going to prove here it is possible to start and learn. You do not need any servers, any installs to learn XML querying – just google for online XPATH test or online XQUERY or online JSON query and go, do, test, learn.

Sometimes I see young girls wasting their life being bored at receptions or empty shops and sitting at a computer with Solitaire or gossip page open and I think – if I were them I swear I would learn programming online every free minute I have! When I was studying we had to sit in libraries and subscribe in advance for hour a day accessing mainframe. No excuses nowadays, guys!

XML

This is one of The Formats you should know when woken up 3AM because a lot of Big Data databases store data in XML format. Both XML and JSON (see below) are human-readable and machine-readable plain text file formats.

Database management systems, whose internal data model corresponds to XML documents, are called Native XML DBMS and they claim to use the full power of XML: represent hierarchical data and support XML-specific query languages ​​such as XPath, XQuery or XSLT.

NB: Native XML DBMS do not necessarily store data as XML documents, they can use other formats for better efficiency.

Databases which use other data models like relational and are capable of storing XML documents, are called XML-enabled DBMS.

Current ranking of native XML databases: https://db-engines.com/en/ranking/native+xml+dbms

NativeXMLDB_ranking_Oct2017

Lesson learned with self-made XMLs

XML data values have beginning and end, and are separated by tags, you know –

XML example

Many years ago, I was working as a designer of XML files for data exchange. We were young and enchanted by the unlimited power of any-structure container format and we used very long tags. Our intentions were good – human readable plain text for fixed orderforms, like [MozzarellaWithFourCheesesPizzaPriceBeforeTaxesTipsNotIncludedCurrencyLVL]1.17[/MozzarellaWithFourCheesesPizzaPriceBeforeTaxesTipsNotIncludedCurrencyLVL].

We were to XMLionize hundreds of documents and do it very fast, so we worked like a fabrics.

We did it. But… What we got was:

  • Storage space consuming documents
  • Network traffic
  • Quite funny software code parsing there wondertags
  • The same business term and tag called in many variations like Pizza, Pica, Picca
  • Grammar errors in tags confusing users like MozcarelaWihtSieru
  • Mixed language and translation errors in tags like PicaArCheese
  • At a glance easy XML readability was misleading when tag became inconsistent with value
  • Documentation was not consistent, incl. curiosities when writers corrected grammar in tags in Word docs (thinking they are doing great work)
  • Unmaintainable structure – see the example with LVL in tag and ‘four cheeses’ – recipes do change

My lessons learned –

  • short and neutral tags
  • create structure using hierarchy, not tag names
  • include version attribute in the beginning of file
  • follow the same style (we used PascalCase), usually one of:

– Lower case    firstname All letters lower case

– Upper case    FIRSTNAME All letters upper case

– Underscore    first_name    Underscore separates words

– Pascal case   FirstName Uppercase first letter in each word

– Camel case    firstName Uppercase first letter in each word except the first

Querying XML documents

One might ask – why should we query plain text file if we can search in notepad? Answer: it is easy only on short samples. But when you have a lot of data, you will get confused it is first, second or hundredth value.

XPath language

Declarative “path like” syntax to identify and navigate nodes in an XML document. Nice web page to play online: https://www.freeformatter.com/xpath-tester.html NB: tags are case sensitive

I played a bit there using self-made simple sample FoodCalendar.

It took a while with //CatName[2]/text() until I understood that second element  – [2] means second for its parent tag, not second in list returned. And correct query what I wanted – second cat in the list – was:

(//CatName/text())[2]

All foods eaten more then 2:

//*[@Amount>2]

Count cats:

count(//Cat)

All foods containing ‘Chicken’:

//*[contains(@FoodUsed, ‘Chicken’)]

Extension for XPath is XQuery

It is much more complex language to extract and manipulate XML data and transform them into HTML, CSV, SQL, or any other text-based format. I read some manuals of one of native XML database https://exist-db.org/exist/apps/demo/examples/basic/basics.html and wrote a very simple query in XQuery online test http://videlibri.sourceforge.net/cgi-bin/xidelcgi to find what is Picadilla eating:

for $i in $catxml//Cat

where  $i//CatName=”Picadilla”

return (“CAT “, $i//CatName, “EATS”, data($i//@FoodUsed), data($i//@Amount), ” TIMES A DAY”, data($i//@Date))

and answer was

CAT

Picadilla

EATS

ChickenLiver

5

TIMES A DAY

10-OCT-2017

XQueryTest

Of course, this language is much more powerful, you can analyze data and write computation functions there. My target from playing was to see if it is possible to learn this querying, and I see – it is, just some more time needed.

JSON

Another must-have-to-know is JSON format, yet another way to store information as text only in an organized and human-readable manner. Document databases such as MongoDB use JSON documents in order to store records, just as tables and rows store records in a relational database.

JSON format files can easily be sent to and from a server, and used as a data format by any programming language.

We can store any number of properties for an object in JSON format.

It is shorter as XML, however they both have similarities.

  • Both JSON and XML are “self describing” (human readable)
  • Both JSON and XML are hierarchical (values within values)
  • Both JSON and XML can be parsed and used by lots of programming languages

Differences:

  • JSON is shorter
  • JSON is quicker to read and write
  • JSON can use arrays

and the biggest difference is that XML has to be parsed with an XML parser. JSON can be parsed by a standard JavaScript function. That supports explaining huge popularity of JSON.

JSONPath

Similarly to XPath, there is JSONPath which is a JSON query language.

I took my FoodCalendar XML and converted to JSON via https://www.freeformatter.com/xml-to-json-converter.html.

and wrote a simple query to filter all foods and dates where amount eaten is < 5

$.Cats..Meals[?(@.Amount<5)].[FoodUsed,Date]

http://jsonpath.com/

JSON_path

CSV and FIXED LENGTH FILE – universal formats for tabular data set

At least basic understanding of these universal tabular (not hierarchical) formats is crucial because a lot of NoSQL Big Data files are stored in these formats.

CSV (comma separated) – the one you should know. It will never die because of its simplicity.

CSV has become kind of industry standard, despite it has no universal standard: text file having one record on each line and each field is separated by comma (or ; or TAB or other symbol). Built-in commas (or another separator) separated by double quote. Double quote characters surrounded by double quotes etc.

List of advantages is impressive:

  • compact size – write once the column headers and then no more additional tags in data needed
  • human readable,
  • machine easy generate and easy read,
  • widely used for tabular data
  • most applications support it.

Very popular amongst MS Excel users. Used to transfer data between programs, import and export data, sometimes used as a workaround to export, then modify and import back.

Disadvantages:

  • complex data are too complex to transfer within CSV,
  • poor support of special characters,
  • no datatypes defined (text and numeric is treated the same)
  • when importing into SQL, no distinction between NULL and quotes.

As there are no universal standards, widely hit issue is new line delimiters – Linux uses one, Windows another etc.

Example of formatted XLSX file:

Excel_formatted_table

Saved As CSV:

NR,CatName,Diet,Food,Date,Amount,
1,Picadilla,,"Sausage ""The Best""",12/09/2017,1(?),"Peter, can you check, was it really only one??"
2,Murmor,Y,"Chicken,boiled ©",14-Sep-17,0.5,
3,Fred,N,"Salmon,,fresh",15.09.2017,2,

See:

  • the last comma in the first row – one column is without heading
  • first record contains NULL value for ‘Diet’, contains quotes for Food and comma in comment
  • second record contains special symbol which might (heh, will) be lost when importing to another software
  • dates still different format
  • colors lost, formatting lost

And the same open in Excel again

Cats-csv-Excel1

Double-clicked to expand columns

Cats-csv-Excel2

Fixed-length fields

As name reveals, it is an agreement that first column always has exactly X (5 or 10 or any other value) characters, the second column has exactly Y, the third has exactly Z and so on. The same for all rows.

If the value is shorter than blanks are padded with spaces or any other specified character. Padding can be done on either side or both.

When you use this format be ready to do a lot of configuration like defining each field length and even writing your own code.

I converted my cats.CSV to fixed length file (http://www.convertcsv.com/csv-to-flat-file.htm). I had to do several configuration options like field length and alignment

NR   CatName   DietFood                     Date                     Amount

1    Picadilla     Sausage "The Best"       12/09/2017               1(?)  Peter, can you check, was it really only one??

2    Murmor    Y   Chicken,boiled �         14-Sep-17                0.5

3    Fred      N   Salmon,,fresh            15.09.2017               2

And after opening in Excel –

Excel-fixed-length-file

As you see, a smarter reader software is necessary.

You might ask – hey, let’s add some data definition to the file and then file format is readable automatically by software. Why not :) it is called

DBF (database table file) format

It is a fixed length file having its data definition in the beginning of file in machine readable standardized format. There is no one universal DBF standard. Each file contains a description. Interesting to know this format but I see no reason to learn very details.

I tried to convert online my CSV to DBF but failed. Then I saved CSV to XLSX and converted XLSX to DBF.

Opening converted result with Notepad:

Cats-DBF-Notepad

Excel:

Cats-DBF-Excel

Opening DBF online (http://www.dbfopener.com/):

Cats-dbf-viewer-online

These experiments were enough to illustrate that even small and simple fixed length files need some time to be converted.

And, as you see, copyright symbol lives its own life :)

Disclaimer
This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Advertisements

Mans viedoklis:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out / Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out / Mainīt )

Google+ photo

You are commenting using your Google+ account. Log Out / Mainīt )

Connecting to %s

%d bloggers like this: