Big Data studies

I am blogging about my journey studying in the Big Data analytics subprogram at the Computer Science Faculty of the University of Latvia.

Big Data: Phonetic Similarity: Soundex – words are similar if they sound the same 21.12.2017 Today let’s discuss finding a match when words sound the same, like Meier and Mire. Again, an example worked through with three methods.
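To make the idea concrete, here is a minimal sketch of the commonly described American Soundex rules. It is an assumption that this variant is close to what the post uses; the three methods compared in the post are not reproduced here, and the names `CODES` and `soundex` are mine.

```python
# Consonant groups of the commonly described American Soundex variant.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' for empty input)."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]  # pad with zeros, keep letter + three digits


print(soundex("Meier"), soundex("Mire"))
```

Both Meier and Mire come out as M600 here, which is exactly the "sounds the same" match.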
Big Data: hybrid similarity measure: the Soft TF/IDF Measure to deal with misspelt words 19.12.2017 Let’s misspell the word “Apple” as “Aple” and try to find which of these strings refer to the same real-world entity:
Apple Corporation, USA
IBM (USA) Corporation
Corp. Aple
I calculated the scores with two measures: Needleman-Wunsch and Jaro-Winkler.
Big Data: combining string and set matching methods. One of the hybrid similarity measures – generalised Jaccard index 18.12.2017 What if words contain errors like Kartina instead of Katrina and Matra instead of Marta? Can we still find out whether the names in two classes are similar despite these errors? You will find five different results here :)
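A rough sketch of the generalised Jaccard idea, under my own assumptions: the inner similarity is a normalised edit distance (`edit_similarity`), the threshold is 0.5, and the best matching is found by brute force over permutations, which only works for tiny sets. None of these choices are taken from the post, and the five results mentioned there are not reproduced.

```python
from itertools import permutations


def edit_similarity(s, t):
    """1 - Levenshtein distance / longer length (assumed inner measure)."""
    if not s and not t:
        return 1.0
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete from s
                               current[j - 1] + 1,       # insert into s
                               previous[j - 1] + cost))  # substitute / keep
        previous = current
    return 1 - previous[-1] / max(len(s), len(t))


def generalized_jaccard(xs, ys, sim=edit_similarity, threshold=0.5):
    """Matched weight / (|xs| + |ys| - matched pairs); sim must be symmetric."""
    xs, ys = list(xs), list(ys)
    if not xs and not ys:
        return 1.0
    if len(xs) > len(ys):
        xs, ys = ys, xs  # make xs the smaller side
    best_weight, best_size = 0.0, 0
    # Brute force: try every one-to-one assignment of xs into ys.
    for perm in permutations(range(len(ys)), len(xs)):
        kept = [w for w in (sim(x, ys[j]) for x, j in zip(xs, perm))
                if w >= threshold]  # pairs below the threshold stay unmatched
        if sum(kept) > best_weight:
            best_weight, best_size = sum(kept), len(kept)
    return best_weight / (len(xs) + len(ys) - best_size)


print(generalized_jaccard(["Katrina", "Marta"], ["Kartina", "Matra"]))
```

For real class lists a proper maximum-weight bipartite matching algorithm would replace the permutation loop.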
Big Data: set similarity: TF/IDF scores and feature vectors to devaluate terms common in other documents 15.12.2017 Which of these strings are most likely about the same real-world entity?

Apple Corporation, USA
IBM (USA) Corporation
Corp. Apple

How to teach the computer the context?
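One way to sketch the TF/IDF idea: weight each term by how often it appears in a string and how rare it is across all strings, then compare the weight vectors by cosine similarity. The tokenizer and the plain `log(N / df)` weighting below are my assumptions; the post may use a different variant of the formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: tf * idf} dict per document, with idf = log(N / df)."""
    token_lists = [tokenize(doc) for doc in documents]
    n = len(token_lists)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["Apple Corporation, USA", "IBM (USA) Corporation", "Corp. Apple"]
vectors = tfidf_vectors(docs)
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(docs[i], "|", docs[j], "->", round(cosine(vectors[i], vectors[j]), 3))
```

A term that occurs in every document gets weight zero, which is the "devaluation" at work. Keep in mind that with a corpus of only three strings the weights are crude, so the ranking can differ from what a larger collection would give.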

Big Data: set similarity: q-grams, Overlap measure, Jaccard index, Jaccard distance 10.12.2017 How could we find which class has names most similar to the names in your class? Let’s have a closer look at some established methods for set similarity: q-grams and Jaccard.
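A small sketch of the four measures named in the title. The padding character `#` and the default `q=2` are my assumptions; conventions vary on whether and how strings are padded.

```python
def qgrams(text, q=2, pad="#"):
    """Return the set of q-grams of text, padded so edge characters count."""
    padded = pad * (q - 1) + text.lower() + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def overlap(a, b):
    """Overlap measure: number of shared elements."""
    return len(a & b)


def jaccard(a, b):
    """Jaccard index: shared elements divided by all distinct elements."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance: 1 minus the Jaccard index."""
    return 1 - jaccard(a, b)


marta, matra = qgrams("Marta"), qgrams("Matra")
print(sorted(marta & matra), overlap(marta, matra), jaccard(marta, matra))
```

For Marta and Matra the shared 2-grams are `#m`, `ma` and `a#`, so the overlap is 3 and the Jaccard index is 3 out of 9 distinct 2-grams.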
Big Data: string similarity: dealing with typos (Jaro meausre, oops, measure) 05.12.2017 Have you evre had a typo? How do we instruct the computer that Luxmeb most likely means Luxembourg? (Try typing it into Google and see, it knows! :) Let’s have a look at how.
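Here is a minimal sketch of the Jaro measure. It uses the widely quoted matching window of `max(len) // 2 - 1`; some textbooks define the window differently, so treat that as an assumption rather than the post's exact formula.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        low = max(0, i - window)
        high = min(len(t), i + window + 1)
        for j in range(low, high):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    s_common = [c for c, m in zip(s, s_matched) if m]
    t_common = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))
print(jaro("Luxmeb", "Luxembourg"))
```

The classic MARTHA / MARHTA pair scores about 0.9444: all six letters match and one pair is swapped.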
Big Data: string similarity: best matching substrings between two strings (Smith-Waterman algorithm) 04.12.2017 Quite a simple change to the Needleman-Wunsch formula makes it possible to find the best-matching substrings between two strings.
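The "simple change" can be sketched like this: every cell is floored at zero, so an alignment may restart anywhere, and the answer is the best cell in the whole matrix instead of the bottom-right corner. The scores (match 2, mismatch -1, gap -1) are my assumed defaults, not necessarily the ones from the post.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best local score, matching part of x, matching part of y)."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # The extra 0 is the change compared with Needleman-Wunsch.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, x[i:end_i], y[j:end_j]


print(smith_waterman("avd", "dave"))
```

With these scores, "avd" and "dave" share the local match "av" with score 4.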
Big Data: string similarity: mind the Gap. The Affine Gap. 03.12.2017 Today’s topic is: how to instruct the computer that Alex Jso is similar to Alexander Johansson?
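A sketch of global alignment with an affine gap penalty: the first character of a gap costs `gap_open` and every further character only `gap_extend`, so one long gap is cheaper than many short ones. The three-matrix layout and the score values are my assumptions, not taken from the post.

```python
NEG_INF = float("-inf")


def affine_gap_score(x, y, match=2, mismatch=-1, gap_open=-1, gap_extend=-0.5):
    """Best global alignment score; a gap of length k costs open + (k-1)*extend."""
    n, m = len(x), len(y)
    # aligned[i][j]: best score where x[i-1] is aligned with y[j-1]
    # gap_in_y[i][j]: best score where x[i-1] is aligned with a gap
    # gap_in_x[i][j]: best score where y[j-1] is aligned with a gap
    aligned = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_in_y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_in_x = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    aligned[0][0] = 0
    for i in range(1, n + 1):
        gap_in_y[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        gap_in_x[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            aligned[i][j] = pair + max(aligned[i - 1][j - 1],
                                       gap_in_y[i - 1][j - 1],
                                       gap_in_x[i - 1][j - 1])
            gap_in_y[i][j] = max(aligned[i - 1][j] + gap_open,
                                 gap_in_y[i - 1][j] + gap_extend,  # extend gap
                                 gap_in_x[i - 1][j] + gap_open)
            gap_in_x[i][j] = max(aligned[i][j - 1] + gap_open,
                                 gap_in_x[i][j - 1] + gap_extend,  # extend gap
                                 gap_in_y[i][j - 1] + gap_open)
    return max(aligned[n][m], gap_in_y[n][m], gap_in_x[n][m])


print(affine_gap_score("Alex Jso", "Alexander Johansson"))
```

The design point is that a shortened name like Alex Jso needs only a few long gaps, so it is penalised far less than under a flat per-character gap score.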
Big Data: string similarity – match, mismatch and gap scores come into play (Needleman-Wunsch measure) 30.11.2017 Imagine big data of unpredictable quality flowing into your system. Reviews, articles, blogs, comments, logfiles… day by day. And there you are in the middle, trying to work out:
Could Dave Smith in a review be the same as David Smith in an article?
Is David Lee the same as Davod 1ee?
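For the two questions above, a minimal Needleman-Wunsch sketch could look like this. The scores (match 2, mismatch -1, gap -1) are assumed defaults; the post may use other values.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, gap=-1):
    """Return the best global alignment score of x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix with an empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            score[i][j] = max(diag,                   # match or mismatch
                              score[i - 1][j] + gap,  # gap in y
                              score[i][j - 1] + gap)  # gap in x
    return score[-1][-1]


print(needleman_wunsch("Dave Smith", "David Smith"))
print(needleman_wunsch("David Lee", "Davod 1ee"))
```

With these scores Dave Smith / David Smith gets 16 (nine matches, one mismatch, one gap) and David Lee / Davod 1ee gets 12 (seven matches, two mismatches). To compare pairs of different lengths you would still need to normalise the score somehow.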
Big Data: one of the string similarity measures: EDIT DISTANCE (or Levenshtein distance) 23.11.2017 Have you ever wondered how spellcheckers, search engines, translation, speech recognition and other software find replacement options for the word you entered?
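A compact sketch of the Levenshtein distance, keeping only two rows of the table at a time. The `edit_similarity` helper, which scales the distance into a 0..1 score, is one common convention and an assumption on my part.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions to turn s into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def edit_similarity(s, t):
    """Scale the distance to [0, 1]; 1.0 means identical strings."""
    if not s and not t:
        return 1.0
    return 1 - edit_distance(s, t) / max(len(s), len(t))


print(edit_distance("kitten", "sitting"))
```

The textbook pair kitten / sitting has distance 3: two substitutions and one insertion.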
Big Data: HADOOP ecosystem. There is no one ‘hadoop.zip’ to install 07.11.2017 Once there was a saying: we say ‘the Party’ and mean Lenin. Today we say Big Data and mean HADOOP, which is presumed to process 50% of enterprise data around the world, on hundreds to thousands of nodes holding petabytes of data.
Big Data: basics of Time series databases 22.10.2017 The paradigm switch from the historically used concept of updating data to accumulating it, combined with the desire and the support to use data changes over time, explains why time series databases have experienced a boost lately.
Big Data: basics of wide column store (column family) databases 20.10.2017 Column family databases are designed for very, very large volumes of data where speed is crucial (millions of records processed per second, with volumes of terabytes to petabytes).
Big Data: basics of document oriented databases 15.10.2017 Some use cases of document-oriented databases and a bit of insight into the very basics of MongoDB.
Big Data: some of universal file formats 10.10.2017 All the data – this blog, Facebook messages, comments, LinkedIn articles, anything – has to be stored somewhere somehow. How? XML, JSON, CSV, fixed length file: some of The Formats whose basics you should know even when woken up at 3 AM.
Big Data: enchanted with the idea of graph database power 27.09.2017 Social networking was definitely the rise to glory of NoSQL graph databases. A graph database does not need its schema redefined before new data is added, neither for the relations nor for the data itself. You can extend the network in any direction: billions of cats, oops, data items, and billions of their relations.
Big Data: learning key-value store basics 26.09.2017 The key-value store concept is based on storing data as a pair: a key and a value. This is called a content-agnostic database: store any value you want, from JSON to XML, from HTML to images, from log files to chat history, from books to videos.
Big Data: domesticating the MapReduce wildcat 24.09.2017 MapReduce – a simple yet extremely powerful technique for distributed Big Data processing. However, I had so many what-if questions when I started learning it. Here are some examples in the blog (and a few new cats introduced).
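To show the shape of the technique, here is a single-machine sketch of the map, shuffle and reduce steps counting words. A real framework runs these phases on many nodes; the function names and the tiny cat corpus are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce one after another on a list of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["black cat white cat", "white dog"]))
```

Each `map_phase` call looks at only one document and each `reduce_phase` call at only one key, which is what lets a cluster run them in parallel.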
Big Data: CAT, oops, CAP theorem. And ACID and BASE transactions basics, also full of cats 19.09.2017 Big Data’s little brothers and sisters – the CAP theorem and the basics of ACID and BASE transactions, explained using cats.
Big Data: with respect to NoSQL Zoo 15.09.2017 Despite the existence and advantages of RDBMS, Google built Bigtable, Amazon developed Amazon DynamoDB, the NSA built Accumulo, and so on. It wouldn’t have happened if the ultra-popular and well-established relational databases had all the capabilities these brands were looking for, would it?
Let’s get some insight into the popularity and trends of database management systems.
Big Data: the curtain rises 13.09.2017 Once upon a time data flow was kind of predictable and controllable. The general truth was that you define a structure, load data into it, and reject whatever doesn’t fit.
The Big Data burst wiped away this belief.
However, while the concepts of architecture and techniques are always evolving, the basic needs remain the same.
Big Data has sent you a friend request. Accept or Ignore? 12.09.2017 I will drop in here to reflect on my journey and the lessons learned while studying the Big Data analytics module in the LU Computer Science Master’s program.