Careless data modeling can lead to cardinality issues. Learn a few tips to avoid being trapped in a high cardinality universe.
No doubt Time Series Databases (TSDBs) are a hot topic these days. We, at SenX, have identified over 70 solutions, both commercial and Open Source, in this field!
Since TSDBs are hot, you are probably planning on using one for some of your projects. This word of advice might thus come just in time for you to avoid being jinxed by the infamous curse of cardinality.
What in the world is this reference to magic, you may ask? Well, we will take you on a journey to understand what this is all about, and we will give you a few hints to survive.
History of the TSDB data model
Not so long ago, at the beginning of the 21st century, a startup by the name of Google was facing tremendous growth in its infrastructure. Its engineers felt they needed a way to track what was going on with those thousands of servers they were operating. For this purpose they designed a solution called Borgmon, which provided monitoring for Borg-managed services.
Borgmon borrowed ideas from existing systems, but somehow made it easier and more flexible to collect and analyze metrics over time, i.e. time series. This flexibility was made possible by the introduction of a very simple data model which, through the use of labels, brings context to the thing you are measuring and tracking over time.
The data model introduced by Borgmon makes use of the following syntax:
metrics{label0=value0,label1=value1,...,labelN=valueN}
Here metrics is a name for the quantity you are measuring, for example cpu.load or disk.free. The comma-separated list of key/value pairs are labels which bring context information to the metrics. What you put into this context really depends on the vertical you are working in. Are you collecting metrics to monitor services executing on machines? You may have labels carrying information such as IP address, rack, datacenter, name of service, etc. If you are collecting data for industrial systems, you may have a machine name or a factory id in those labels.
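For instance, a monitoring series and an industrial series could look like the following (the label names and values here are purely illustrative):

cpu.load{ip=192.168.1.17,rack=r03,datacenter=dc1,service=billing}
temperature{machine=press-07,factory=lyon}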
As you can see, this data model is simple to grasp. It is not surprising that it was readily adopted as the model of choice for many solutions created by former Googlers once they left Google, since they were missing a Borgmon-like solution for their monitoring use cases. The first to adopt it was Benoît Sigoure, creator of OpenTSDB, shortly before I created Artimon (which inspired Warp 10). Both came a few years before Prometheus, even though the latter added quotes around label values, thus cluttering the syntax with useless characters and worsening global warming… But anyway, this data model is now the de facto standard for time series databases. There are of course some exceptions, such as TimescaleDB, which has a more relational model due to its foundations.
Understanding the cardinality issue
The flexibility brought by labels unfortunately has a downside. To understand it, we must detail a bit what happens behind the scenes in a TSDB.
The naming scheme of your time series, namely the metrics (or class) and the set of labels, is there to allow you to easily access your series by issuing queries which bear criteria on those elements. In order to identify the series matching your query criteria, a Time Series Database must maintain an index of sorts with all those class names and label names and values. And this is where things get tricky…
At the end of the day, the things you store and index will occupy some space, and unfortunately there is nothing you can do about it. This inevitable consumption of resources is what brings the cardinality issue. Cardinality, because it is related to the number of distinct series you have defined. And issue, well, because it is a major problem you might be facing!
This issue affects the various Time Series Databases in different ways. Some, like Warp 10, can scale to tremendous amounts of series. Others, like InfluxDB, struggle with a few million (they even withdrew a graph showing memory needs, probably because it was too scary). And despite announcing they had found a potion to support one billion series, two years later they still admit an upper bound of 30 million series, roughly 30x less than a billion.
How to get jinxed
Once you understand the benefit of labels, you might be tempted to use and abuse them. That's exactly when things go bad. Even though labels can technically bear many values, you should always keep in mind that each label combination counts as one series, and that the more series you have, the closer you get to a cardinality issue.
Having a high number of distinct values for labels is not really the problem. What really is problematic is how you combine labels and how predictable your valid combinations are. A few examples might help you understand better.
Example #1
As a first example, assume you are monitoring 1000 servers, spread across 5 datacenters and 30 racks, and you are measuring 5 parameters per server. The total number of series is 5000, since in the end you are measuring only 5 parameters on each of the 1000 servers. If for each server you store its IP address, the rack it is in and the datacenter where that rack is located as labels, you only have 1000 unique combinations of labels: your servers do not jump from rack to rack, and your racks do not jump from datacenter to datacenter. So even though there are 150,000 possible label combinations (1000 x 30 x 5), only 1000 of them will ever be used, one per server. In this case you do not have any cardinality issue.
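To make this concrete, the five series of a single server could look like the following (the parameter names and label values are made up for the illustration). Note that for a given server the ip, rack and datacenter labels always appear together, so they add context without adding combinations:

cpu.load{ip=192.168.1.17,rack=r03,datacenter=dc1}
cpu.idle{ip=192.168.1.17,rack=r03,datacenter=dc1}
mem.used{ip=192.168.1.17,rack=r03,datacenter=dc1}
disk.free{ip=192.168.1.17,rack=r03,datacenter=dc1}
net.bytes{ip=192.168.1.17,rack=r03,datacenter=dc1}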
Example #2
Our second example builds on the first one. We now want to track network traffic among those 1000 servers: we want to count the packets which flow between them, and to track the source and destination addresses and ports of each packet. Since we now understand how to use labels, we design a model with a single class, namely packets, which will track the count of packets in a given time interval, and a set of labels containing the source IP (srcip), the destination IP (dstip), and the source (srcport) and destination (dstport) ports.
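With this model, a single observation could look like the following (addresses and ports are purely illustrative):

packets{srcip=10.0.1.7,dstip=10.0.3.42,srcport=51234,dstport=443}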
Contrary to the previous example, the number of label combinations which could occur is harder to estimate: it really depends on the network traffic patterns in your infrastructure. To make things worse, the way the TCP/IP stack works means that connections are assigned essentially random source ports. So we can only estimate the upper bound of our label combinations, and this is where things become scary. We have 1000 servers which can each talk to 1000 servers, on a set of roughly 65535 ports from a set of roughly 65535 ports, so the possible number of combinations becomes 1000 * 1000 * 65535 * 65535, which equates to over 4000 trillion…
Now you have a cardinality problem, and a huge one! Not only would you be unable to store your data, even using Warp 10, but assuming you could store them, it would take forever to query them. Your data would be spread across many files, forcing the storage backend to open a lot of them and perform many seeks before it could retrieve the data for a given server, for example.
Breaking the curse
Now that you understand what the curse of cardinality is, it is time to learn how to break it, so you can make peaceful use of your Time Series Database.
The first rule to respect is to always have an idea of how many of the possible label combinations will actually be used. If you cannot answer this question, do some research or choose different labels, so you can!
The second rule is that if the first rule gave you an estimate that you deem too high, whether for technical or financial reasons, you should model your data differently. Depending on which Time Series Database you use, this might not be an easy task: some solutions might lack the tools you need to correctly manipulate the model you have to adopt!
One way to avoid being jinxed is to identify the labels which have exploding combinations and turn them into values of new series instead of labels. This allows you to limit the number of series you create while still keeping all the information handy. In the case of the network traffic example, you could store the packet counts in two series, srcpackets and dstpackets, one for each direction, with a single label ip, and store the source and destination ports associated with those counts in two other series, srcport and dstport, with the same ip label. The same information would be tracked. The only constraint would be to record the values with identical timestamps across series, so that you can match them at query time. This would lead to only 4000 series being created.
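For a single server, the remodeled series could look like the following (the address is purely illustrative):

srcpackets{ip=10.0.1.7}
dstpackets{ip=10.0.1.7}
srcport{ip=10.0.1.7}
dstport{ip=10.0.1.7}

With 4 classes and 1000 distinct values of the ip label, that is indeed 4 x 1000 = 4000 series.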
In a solution like Warp 10, which supports multivariate values, the fix is even simpler: you could store the packet count jointly with the source and destination ports. When retrieving those data, you can split them into separate series using MVINDEXSPLIT, and even regenerate in-memory series following the original model using PIVOT, without any risk of hitting the cardinality issue, since this is not done in the storage engine but only during the analytics phase.
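Conceptually, each data point would then carry the packet count and the two ports together as a single multivariate value, for instance with one packets series per server (the bracket notation below is only an illustration, not the actual Warp 10 encoding):

packets{ip=10.0.1.7}  ->  [ count, srcport, dstport ]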
Key takeaways
Hopefully by now you have a good understanding of what the curse of cardinality is and what you should do to avoid it.
Remember the two simple rules we gave you. Explore the solutions your Time Series Database of choice has to offer to tackle that cardinality issue. And if you have not yet done so, consider giving Warp 10 a try, as it offers the industry's most advanced Time Series platform, with proven production support for more series than any other solution.