Big Data: Basics of Time Series Databases


A topic painfully familiar to me as a data warehouse servant: data changing over time. A ticket valid from–to, a policy effective from–to, with sub-positions like travel insurance from–to.

Some time ago this was mostly the concern of “serious” systems like banking, where among other constraints storage space was weighed carefully. Nowadays, with the Internet of Things, self-driving cars etc. coming in, the challenges of operating fast and easily with time periods have entered daily life. E.g., how did the room temperature behave while the heater temperature was increasing and the fridge temperature decreasing?

Historically, updating was used much more. For example, when you logged into a system, the attribute ‘last_successful_login’ was updated.

Nowadays, in line with practically unlimited storage and high-performance databases, each login is treated as an event and logged as a new record with a timestamp, so the system owner can do historical tracking: at which time of day are you most active, is your login activity increasing or decreasing?
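The event approach described above can be sketched in a few lines of plain Python (a toy in-memory example; the user name and field names are my own invention):

```python
from collections import Counter
from datetime import datetime

# Append-only log: every login becomes a new event instead of
# overwriting a single 'last_successful_login' attribute.
login_events = []

def record_login(user, ts):
    login_events.append({"user": user, "ts": ts})

record_login("anna", datetime(2017, 10, 21, 8, 5))
record_login("anna", datetime(2017, 10, 22, 8, 10))
record_login("anna", datetime(2017, 10, 23, 20, 45))

# Historical tracking: at which hour of the day is the user most active?
by_hour = Counter(e["ts"].hour for e in login_events if e["user"] == "anna")
most_active_hour = by_hour.most_common(1)[0]  # hour 8 appears twice
```

With the old update-in-place style, only the last timestamp would survive; here the full history stays available for analysis.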

The growing popularity of time series databases

The paradigm shift towards accumulating data, combined with the desire to use that data and the tooling that supports this desire, explains why the time series approach and time series databases have experienced a boost within the last decade in many areas, such as:

  • monitoring software systems: virtual machines (with their growing popularity), services, applications
  • monitoring physical systems: equipment, connected devices, homes, bodies
  • asset tracking applications: vehicles, trucks, packages delivered
  • financial systems: cryptocurrencies, stock prices
  • tracking applications: customer interaction data
  • Business Intelligence tools: key metrics as well as overall state of the business

Time series DBMS popularity ranking: https://db-engines.com/en/ranking/time+series+dbms

A time series database (TSDB) is optimized for handling time series data: each entry is associated with a timestamp, so the database contains data for each point in time. A trivialized example to show the idea:

21-OCT-2017: ticket 12345 valid
22-OCT-2017: ticket 12345 valid
23-OCT-2017: ticket 12345 valid
24-OCT-2017: ticket 12345 valid

instead of the classic

ticket 12345 valid_from 21-OCT-2017 valid_to 24-OCT-2017

A better example would be a pulse measurement once per second.

  1. Each measurement is inserted as a new data entry (there might be a use case for updating or overwriting a previous record but, let’s be real, who does that nowadays?)
  2. The data arrive in time order and are stored in time order
  3. Time intervals can be either regular (as in my once-per-second example) or irregular
  4. The data volume grows very, very fast (and nobody has the motivation or the courage to clear history)
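The trivialized ticket example above, one entry per day instead of one interval row, can be sketched like this (my own illustration; it assumes the end date is inclusive, which is itself a convention that has to be agreed on):

```python
from datetime import date, timedelta

def expand_interval(ticket_id, valid_from, valid_to):
    """Turn one classic valid_from/valid_to row into per-day entries.

    Assumes valid_to is inclusive."""
    day = valid_from
    while day <= valid_to:
        yield (day, ticket_id, "valid")
        day += timedelta(days=1)

# ticket 12345 valid_from 21-OCT-2017 valid_to 24-OCT-2017
rows = list(expand_interval(12345, date(2017, 10, 21), date(2017, 10, 24)))
# -> four entries, one per day from 21-OCT to 24-OCT
```

The interval form is compact; the per-day form trades storage for trivially simple point-in-time queries.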

So we can define time series data as a data set which represents how something, a measurement, a process etc., changes over time. A price changing over time might be a price curve. Energy consumption within a period might be a load profile. Logged temperature values might be a temperature trace.

Time series data querying

TSDB query languages are specialized for querying over time, for example:

  • a language close to natural, e.g. an average per minute could be expressed as ‘group by time (1 minute)’
  • flexible built-in time aggregations – per second, minute, hour…
  • easy comparison with the previous record (in an RDBMS we use various workarounds like LAG, OFFSET or fetching the previous ID by complex calculations)
  • joining by time series automatically like SELECT orders_per_hour.count / errors_per_hour.count from orders_per_hour INNER JOIN errors_per_hour.
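The first and third bullets above can be imitated in plain Python: bucketing measurements into one-minute groups (the ‘group by time’ idea) and comparing each record with the previous one (what LAG does in SQL). The sample values are invented for illustration:

```python
from datetime import datetime
from itertools import groupby

# (timestamp, value) measurements, e.g. a pulse reading
samples = [
    (datetime(2017, 10, 21, 12, 0, 0), 62),
    (datetime(2017, 10, 21, 12, 0, 30), 64),
    (datetime(2017, 10, 21, 12, 1, 15), 70),
    (datetime(2017, 10, 21, 12, 1, 45), 72),
]

# "group by time (1 minute)": truncate each timestamp to its minute
def minute_bucket(sample):
    ts, _ = sample
    return ts.replace(second=0, microsecond=0)

averages = {}
for bucket, group in groupby(samples, key=minute_bucket):
    values = [v for _, v in group]
    averages[bucket] = sum(values) / len(values)

# LAG-style comparison: change relative to the previous record
deltas = [b[1] - a[1] for a, b in zip(samples, samples[1:])]
```

A TSDB does both natively and efficiently; here the point is only to show what the query primitives compute.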

RDBMS

In relational databases, developers tend to set up a lot of data integrity controls:

  • from error messages in input forms to trigger checks before saving to the database,
  • from snapshotting at fixed moments like end of month to regular checks whether this type of data already exists in the data warehouse.

If the controls find, e.g., two valid documents for one day, or a day without any valid document, they automatically create an incident and assign it to data stewards for investigation, and may also apply other business rules like marking both documents ‘invalid’ or moving them to a quarantine zone.
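A minimal sketch of such a control, detecting both kinds of conflicts (two valid documents at once, and days with no valid document) for one business key. The interval representation and the inclusive-end-date convention are my assumptions:

```python
from datetime import date, timedelta

def find_conflicts(intervals):
    """Given sorted-or-not (valid_from, valid_to) pairs for one
    ticket/policy, return overlapping spans and uncovered gap spans.
    Assumes valid_to is inclusive."""
    intervals = sorted(intervals)
    overlaps, gaps = [], []
    for (f1, t1), (f2, t2) in zip(intervals, intervals[1:]):
        if f2 <= t1:
            # two valid documents for the same day(s)
            overlaps.append((f2, min(t1, t2)))
        elif f2 > t1 + timedelta(days=1):
            # day(s) without any valid document
            gaps.append((t1 + timedelta(days=1), f2 - timedelta(days=1)))
    return overlaps, gaps

docs = [(date(2017, 1, 1), date(2017, 3, 31)),
        (date(2017, 3, 15), date(2017, 6, 30)),   # overlaps the first one
        (date(2017, 9, 1), date(2017, 12, 31))]   # leaves a July–August gap

overlaps, gaps = find_conflicts(docs)
```

In a real DWH this would run as a nightly batch and feed the incidents mentioned above.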

I know an RDBMS DWH where nightly batch processes scan all key data and set indicators like ‘data valid today Y/N’ or ‘most up-to-date version Y/N’. Then, instead of time intervals, SQL queries use these indicators like

select salary
from annexes
where employee_id = 1717
and top_version_indicator = 'Y'

In a TSDB you would instead query something like

select employee_name, annex_salary
from employees 
INNER JOIN annexes

High risk of wrong queries

Among other limitations of the traditional RDBMS start_date + end_date approach, it requires a high level of accuracy, because it is so easy to build a query which selects the wrong data set. Just forget one condition or use > instead of >=…

select employees.name, annexes.salary
from employees, contracts, annexes
where employees.employee_id = contracts.employee_id
and contracts.contract_id = annexes.contract_id
and contracts.start_date = (select max(contracts2.start_date)
                            from contracts contracts2
                            where contracts2.employee_id = contracts.employee_id)
and contracts.end_date >= sysdate
and annexes.start_date >= contracts.start_date
and (annexes.end_date is null or  …

I have seen so many wrong queries of this type, for many reasons:

  • one date truncated to the day and the other to the minute, so equality never holds if the developer forgets TRUNC
  • dates stored in different formats while the developer forgets to use TO_DATE
  • different conventions on whether the end date is inclusive – new developers could not imagine in their worst nightmares that this system stores the end date as non-inclusive and they must use < end_date instead of <= end_date
  • the edge case when the start date equals the end date, while developers habitually write and end_date > start_date
  • some leave NULL if the end date is unknown, some set it to 01-JAN-3000, some to 31-DEC-2500 etc. Some have total chaos. It is easy to overlook these placeholder values in queries and frontends and print a ticket valid till 01-JAN-3000
  • different time zones (a general suggestion is to store the data in universal time and translate it to the user’s time zone; there are opposite approaches as well, storing the local time of submission together with the time zone it was submitted in)
  • data sometimes written to the database at the moment they appear and sometimes later. Some developers may unreasonably rely on data always being written at the moment they appear.
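Two of the pitfalls above, the inclusive vs non-inclusive end date and mixed truncation, can be demonstrated in a few lines (the dates are invented):

```python
from datetime import date, datetime

# Pitfall: the system stores end_date as non-inclusive, i.e. the
# ticket's last valid day is actually 23-OCT, not 24-OCT.
end_date = date(2017, 10, 24)       # stored as non-inclusive
query_day = date(2017, 10, 24)

wrong = query_day <= end_date       # <= silently assumes inclusive
right = query_day < end_date        # the convention this system needs
# wrong reports the ticket as valid on a day it is not

# Pitfall: one timestamp truncated to the day, the other not.
a = datetime(2017, 10, 21, 0, 0, 0)       # truncated to day
b = datetime(2017, 10, 21, 14, 35, 12)    # full precision
naive_equal = (a == b)                    # never True
trunc_equal = (a.date() == b.date())      # TRUNC-style comparison works
```

Both bugs compile and run cleanly; they only show up as quietly wrong business data, which is exactly why they are so dangerous.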

A good, documented database architecture and an accurate analyst/developer are highly valued. You can imagine why. Who wants key business data produced by a wrong query?

Disclaimer

This blog is solely my personal reflections.
Any link I share and any piece I write is my interpretation, possibly with my own added value from googling to understand the topic better.
This is neither a formal review nor requested feedback and not a complete study material.
