DB-Engines repeats it every month: Time Series databases are the fastest-growing category in the database market. But what exactly is a Time Series Database?
A number of open source projects, as well as companies, have emerged in recent years with technologies collectively called Time Series Databases. Is this simply a trend, a new name for existing databases, or is there really something special about this kind of data?
Databases fall into broad categories: relational, document, column stores, graph, and now Time Series. With some effort, any database engine can be used to solve any problem; the challenge is how you model your data and how much pain you are ready to endure when it comes to performance. Regardless of how generic a database engine claims to be, trade-offs were always made in its design, and some data access patterns, whether read or write, fit a given engine better than others.
This is what gave birth to document stores: the need to access complex structures made of different fields, which would have required multiple JOIN operations in a relational database. Document-oriented databases largely removed the need for those costly JOINs, so much so that document stores are the most common type of NoSQL database.
Graph databases were also created because of specific access patterns, introducing the notion of relationships and the ability to query on them. Here again, other types of databases could do the trick, but you would have to tweak the model and the query language quite a bit.
Time Series databases follow the same path. They were created to address the challenges raised by sensors pushing data around the clock, possibly at very high frequency and from an ever-growing number of devices. Existing database technologies were used at first, and they worked well up to a certain level. That level can seem very high for data generated by humans, but it is reached very rapidly when dealing with machine data. Very soon, the mere process of maintaining a time-based index, mandatory when dealing with time series, would overwhelm the servers and bring them to a performance level too low for any purpose.
These are the findings that led multiple teams to develop solutions tailored for time series data, the famous time series databases, which are not limited by the ingestion rate, the number of series, or the size of the historical datasets they have to deal with. The purpose of time series databases is to deal magnificently with data indexed by time that will rarely (if ever) be updated. As time series databases matured, their query capabilities evolved from simple query languages such as SQL or SQL-like dialects to more complete data flow languages such as the recent Flux or the more advanced WarpScript™. The purpose of those languages is to let you perform complex analytics as close as possible to where the data is, which is why they are embedded into the time series database engines.
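To make the "indexed by time, rarely updated" access pattern concrete, here is a minimal Python sketch. It is not how any of the databases named in this post are implemented; the class `TinySeriesStore`, its methods, and the series name `sensor.temp` are hypothetical names chosen for illustration. It assumes points arrive in timestamp order, so each series can be kept as a sorted, append-only list and a time range can be located with a binary search.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TinySeriesStore:
    """Toy in-memory store (hypothetical): one append-only, time-sorted list per series."""

    def __init__(self):
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def append(self, series, timestamp, value):
        """Add a point at the end of a series; earlier timestamps are rejected."""
        ts = self._timestamps[series]
        if ts and timestamp < ts[-1]:
            # Assumption of this sketch: writes are appends, never updates.
            raise ValueError("out-of-order write; this sketch only supports appends")
        ts.append(timestamp)
        self._values[series].append(value)

    def fetch(self, series, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        ts = self._timestamps.get(series, [])
        vals = self._values.get(series, [])
        lo = bisect_left(ts, start)   # first index with timestamp >= start
        hi = bisect_right(ts, end)    # first index with timestamp > end
        return list(zip(ts[lo:hi], vals[lo:hi]))


store = TinySeriesStore()
for t in range(0, 100, 10):
    store.append("sensor.temp", t, 20.0 + t / 10)

print(store.fetch("sensor.temp", 30, 50))
# prints [(30, 23.0), (40, 24.0), (50, 25.0)]
```

Because writes only ever land at the end of a series, there is no general-purpose index to rebalance on every insert. A real engine adds persistence, compression, and out-of-order handling on top of this idea.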
In the same way that graph databases were created to solve graph-related problems which were hard to deal with using traditional databases, time series databases were created to address the unique challenges of time series data. But if you look closely, apart from some solutions which were built from scratch, time series databases were built on top of existing technologies, whether relational or column-oriented.
So when you conclude that you will store time series data in Cassandra, PostgreSQL, MongoDB, HBase or Accumulo, think about this decision for a second and try to understand why time series databases were built. If you don't, you will quickly find yourself writing yet another time series database of your own, because out of the box, the technology you have chosen cannot efficiently handle time series data. To get an idea of what you will have to do, view this video by Cisco about how they had to tweak MongoDB for time series data.
You may have a preference for one of the database technologies I just mentioned, maybe because your ops team is already familiar with it. But you need to do a little more homework and identify which time series database was built on the technology you chose, rather than assume you can do without a TSDB.
So if your heart beats for Cassandra, have a look at KairosDB. If HBase is your backend of choice, look at OpenTSDB or Warp 10. If Accumulo is your favorite, try Timely. If PostgreSQL is your ideal candidate, have a look at TimescaleDB. If Riak makes you tick, look at DalmatinerDB. Or look at a time series database built from scratch, such as InfluxDB or kdb+. If you want a hosted solution, there are also plenty of options, from Microsoft Time Series Insights to the recent Amazon Timestream.
Remember that if you do not seriously consider selecting a Time Series Database to add to your technology stack, you will ultimately end up writing your own, and this is probably neither what you intended to do in the first place nor what you really want to do!