Big Data: basics of document oriented databases

One of the most important concept of NoSQL as the term ‘document’ can be treated in a very broad way: as well as my passport as this blog entry.

Use cases of document oriented databases:

  • Storing log files
  • Storing high volume of different data coming in – eg, one machine is measuring rotations per minute and produces CSV, and the other one measures temperature and produces XML, and yet the one keeps track of rejected operations and produces fixed length files
  • Efficient storing. For example, you have 30 attributes in your user profile like ‘date of birth’, ‘preferred music’, ‘favorite sport’ etc. One of users might have provided only date, the other only sport etc. You do not have to store the whole structure for each user telling ‘date of birth’ and ‘preferred music’ not provided (like you would store NULL or ‘n/a’ in RDBMS table structure). You just store the data you have: user1 ‘favorite sport’.
  • Storing blogs and articles with comments and likes
  • Storing business data where searching by metadata and content is crucial
  • Storing documents which structure can be changed any time – eg, you can easily add a new feature like emotions in your blog platform because you do not have to redefine the whole schema
  • Store new types of documents. You were storing only cats but today you can start storing and querying also dogs, trees, galaxies and stationery
  • Write operation works very fast as DOD have no RDBMS transaction and locking mechanisms overhead
  • Unlimited increase of database size since documents are key value pairs with document id being the key and the document being the value.

The idea of DOD is to provide a scalable framework for storing, inserting, querying and retrieving and updating (at single document level) unlimited amount of self-describing structured or semi-structured data, called “document”. Part of data is the document content itself and part is data about data.

Teacher used MongoDB as example and you see why ( Yellow you see Latvian team’ produced Clusterpoint:


Usually documents are stored in JSON or XML or binary like PDF and MS Word, Excel (eg, MongoDB and CouchDB uses JSON). Each document has unique ID and can have its own structure. There is no predefined schema. However we should understand that document databases are not designed to fit anything. Eg, these databases are not the best for deep nested data because cannot search them effectively.

As I am SQL person, here comes my survival kit (


Relations among data items can be represented in two ways: referencing and embedding. Which one better? Hah, this is The Question any developer and any architect has a lot of pain and lots of guidelines are written and tons of posts. More art than a science. Each has pros and cons. Good news is that you can always change your mind. This is one of reasons why I love programming: I can play a lot. If I were surgeon it would be much harder to recompile.

Referencing to food stored in another document:

 _id: cat_1,
 name: “Picadilla”,
 colour: “brown”,
 food: “catfood_123”,
 amount: 2,
 dateofbirth: “10-OCT-2010”
 _id: catfood_123,
 name: “Tasty Chicken Liver”,
 producer: “Catty Food Inc.”,
 address: “Wildcat Boulevard 17”

Embedding food data in single document:

Notice that food has no ID itself – the _id field is a required field of the parent document, and is typically not necessary for embedded documents. You can add an _id field if you want.

 _id: cat_1,
 name: “Picadilla”,
 colour: “brown”,
    name: “Tasty Chicken Liver”,
    producer: “Catty Food Inc.”,
    address: “Wildcat Boulevard 17”
 amount: 2,
 dateofbirth: “10-OCT-2010”

Some ideas collected about referencing vs embedding:

  • the more that you keep in a single document the better – easy to query
  • any data that is not useful apart from its parent document definitely should be part of the same document
  • separate data into its own collection that are meant to be referred to from multiple places
  • embed is good if you have one-to-one or one-to-many relationships, and reference is good if you have many-to-many relationships
  • embedded documents are easy to retrieve (as everything stored there). When querying by parts of them, limitations exist, like sorting limited for insertion order
  • to my surprise I am reading that no big differences for inserts and updates speed


When you design your schema consider how you will keep your data consistent. Changes to a single document are atomic (complete operation guaranteed), but, when updating multiple documents, it is likely that in a moment of time address of the same Cat food producer may differ amongst cats (NB: but there are a few databases like Clusterpoint which can handle multiple document updates in a transaction). In general, there is no way to lock a record on the server. The only way is you can build into the client’s logic to lock a field.

Remember, NoSQL systems are by desing support BASE, not ACID transactions. It is normal and real that at a moment of time you may see different address of food providers – or different comment content, or different views count.

My favorite example is Candy Crush daily free booster. If I spin it on a tablet and later try on a phone, I get ‘come tomorrow’. But if I spin on all the devices at once then I get a booster in each of devices #lifehack. In RDBMS transaction control mechanism would guarantee that once spinned you may not repeat it.

Very powerful querying

Documents can be queried by any their attributes like

 name = “Picadilla”
 colour: [“brown”, “amber”]

The querying language is powerful and not obvious, even if I am guru of SQL querying. It would take a while to learn it. Here is nice material of SQL mapped to MongoDB queries.

As usual, each query should be checked its speed before going live. Smilar to RDBMS, nice feature is call ‘explain’ for a query to see what is database performing when running the query – which index its using etc.


Inserting new records

Again MongoDB as example.  Insert one:

   { item: "catfood", qty: 10, tags: ["chicken”, “liver"], ingredients: { water: 50, meat: 30, fat: 30, unit:"pct" } }

Insert many:

item: "catfood", qty: 10, tags: ["chicken”, “liver"], ingredients: { water: 50, meat: 30, fat: 30, unit:"pct" } }
item: "dogfood", qty: 23, tags: ["beef"], ingredients: { water: 40, meat: 20, fat: 40, unit:"pct" } }
item: "kittenfood", qty: 3, tags: ["turkey", “fillet”], ingredients: { water: 55, meat: 30, fat: 25, unit:"pct" } }


Nice reason to learn upsert clause: if set to true, creates a new document in case if no document matches the query criteria. multi means if matching documents exist, the operation updates all matching documents.

db.cats.update( { "name": "Meowcha"},
   { "colour": ["brown", "white", "black"] },
   { upsert: true, multi: true } )


We set deletion criteria and remove matching documents. Eg, the following operation removes the first document from the collection cats where amount is greater than 2:

db.cats.remove( { amount: { $gt: 2 } }, true )

Well, I have a had a look on the very, very basics. Can I apply for a DOD or MongoDB expert role now? No :) But I can honestly say I now know much more I knew a month ago and I definitely would not be afraid if told in my project ‘we are going to start using document oriented database tomorrow’.

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation and may be my added value by googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.

Mans viedoklis:

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Mainīt )

Google photo

You are commenting using your Google account. Log Out /  Mainīt )

Twitter picture

You are commenting using your Twitter account. Log Out /  Mainīt )

Facebook photo

You are commenting using your Facebook account. Log Out /  Mainīt )

Connecting to %s

%d bloggers like this: