Automatic Time Series archiving

Learn how to manage the life cycle of your Time Series data by setting up automatic archiving of old data to an external system such as S3

Managing the life cycle of your Time Series data is important: it helps ensure your system does not require permanent capacity upgrades as you keep adding data. Low-end time series databases provide simplistic ways of managing that life cycle, namely periodic downsampling and purges configured via what are usually called retention policies.

This basic approach works for low-value monitoring data, but it is unsuited to operational data: with data from industrial IoT, the original data should be retained for years for legal purposes.

This blog post will present a way to perform automatic archiving of Time Series data stored in Warp 10. This will allow you to limit the amount of disk space used and the overall cost of data storage, while still enabling access to the original data.

Overview

The principle behind the automatic archiving of Time Series data in Warp 10 is rather simple. A periodic WarpScript job runs to identify Geo Time Series which need to be archived. The data of those GTS is fetched for the period to archive, compressed, and stored in an external service such as Amazon S3. The data is then deleted from the Warp 10 instance.

Keeping track of the process

Warp 10 uses a data model rather common in the world of time series databases. Each series has metadata: a name (called a class in Warp 10) and a set of key/value pairs called labels (some other TSDBs call those tags). Collectively, the class and labels uniquely identify the Geo Time Series. This means that modifying or adding a label changes the series you refer to. For this reason, labels (or tags in other TSDBs) cannot be used for attaching mutable metadata to series. This limitation has been addressed in Warp 10 with attributes. Attributes are another set of key/value pairs attached to a Geo Time Series. But contrary to labels, attributes are not part of the series identifier and can therefore be modified at will.

We therefore use attributes for tracking the state of each series with respect to the archiving and purging process. The archive and purge stages end with the modification of two dedicated attributes, .lastarchive and .lastpurge, which contain the name of the last period for which archiving or purging was performed.

Compressing Time Series Data

Ever since Facebook published what is commonly known as the Gorilla paper, the TSDB world has been assuming that this type of compression was the panacea for time series data. Unfortunately, results degrade greatly if your time series do not have regular timestamps (evenly spaced by exactly 1 minute, for example) or if their values vary significantly.

WarpScript provides functions such as WRAPRAW and WRAPRAWOPT which compress Geo Time Series even in the event of irregular timestamps or values with great variability. In the ideal case for Gorilla (regular timestamps and values with minor changes), the achieved compression is a little less efficient than the delta-of-delta and XOR schemes used in the Gorilla paper. But it is more efficient in all other cases, so the trade-off is usually worth it, given the ideal case is rarely met.
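
For instance, compressing one calendar year of a hypothetical series into a blob boils down to a FETCH followed by WRAPRAW. In the sketch below, the token, class, labels and time range are placeholders:

// Fetch one year of data for a hypothetical series
[ 'READ_TOKEN' 'sensor.temperature' { 'site' 'plant1' }
  '2023-01-01T00:00:00.000000Z' '2023-12-31T23:59:59.999999Z' ] FETCH
// Compress the fetched Geo Time Series into opaque byte arrays
WRAPRAW
// The resulting byte arrays are the blobs to push to the object store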

The automatic archiving mechanism can therefore rely on those WRAP functions to create blobs (byte arrays) which can for example be stored in an external object store.

Storing Time Series data in S3

Through the use of the S3 extension, Warp 10 has the ability to interact with an S3 compatible object store directly from WarpScript.

The S3 extension provides functions to store and read data from S3 buckets. This means that the time series data can be archived in S3, but it can also later be retrieved from S3 and rematerialized in WarpScript just as if it were still stored in Warp 10.

Given the price of Amazon S3 storage and the compression ratios attained by the WRAP functions, this makes offloading time series data from Warp 10 to S3 a very cost-effective operation while retaining full analytics capabilities.

Once part of a time series is offloaded to S3, the read process in WarpScript should mix the results of a FETCH function with the retrieval of archived data via calls to S3LOAD, UNWRAP and MERGE. The resulting Geo Time Series can then be manipulated just as if retrieved by a single FETCH call.
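
As an illustration, the read path could look like the sketch below. The S3LOAD call itself is elided since its exact parameters are documented with the S3 extension; $wrapped is assumed to hold the blob it returned for the archived period, and the token, class and labels are placeholders targeting a single series.

// Recent, non-archived data still stored in Warp 10
[ 'READ_TOKEN' 'sensor.temperature' { 'site' 'plant1' } NOW 30 d ] FETCH
0 GET 'recent' STORE

// Rematerialize the archived chunk, assuming $wrapped holds the blob
// previously read from S3 with S3LOAD
$wrapped UNWRAP 'archived' STORE

// Merge both chunks into a single Geo Time Series
[ $recent $archived ] MERGE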

The bucket under which the data is stored is up to you; it is specified in the code doing the actual archiving.

Putting it all together

Thanks to the WarpFleet Resolver and the SenX WarpFleet macro repository, the glue needed to perform the automatic archiving of your data is readily available.

The senx/arch/archive macro is available if you have the SenX WarpFleet macro repository configured (https://warpfleet.senx.io/macros). It expects the following parameters as input:

@param RTOKEN The read token to use for fetching the data to archive
@param WTOKEN The write token to use for updating the attributes
@param CLASS_SELECTOR The class selector (a STRING) to use for identifying Geo Time Series™ to archive
@param LABELS_SELECTOR A map of label selectors to further identify GTS to archive
@param ARCHIVE_MACRO Macro or text representation of WarpScript™ code to execute for archiving data
@param PURGE_MACRO Macro or text representation of WarpScript code to execute for purging data
@param MIN_DELAY Delay (in platform time units) that must have elapsed after the end of a time period before it is archived

The ARCHIVE_MACRO and PURGE_MACRO are called with the following parameters:

@param GTS The Geo Time Series to archive or purge
@param PERIOD The specification of the period to archive or purge (either YYYY, YYYYMM or YYYYMMDD)

Those macros should produce an error if the archiving or purge process fails. If they do not end in error, the attributes described below will be updated to reflect the last period which was archived or purged.
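
A custom archive or purge macro therefore follows a skeleton along these lines. This is only a sketch: the stack order is assumed to match the parameter list above, with the period on top, and the error message is merely an example.

<%
  // The period specification is assumed to be on top of the stack,
  // with the Geo Time Series just below it
  'period' STORE
  'gts' STORE

  // ... archive or purge $gts for the period $period ...

  // On failure, raise an error so .lastarchive / .lastpurge are not advanced:
  // 'archiving failed for period ' $period + MSGFAIL
%>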

The senx/arch/period macro takes as input a period and outputs the start and end timestamps of the period and the name of the next and previous periods.
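
A quick way to see what it produces is to call it with an example period and gather the stack into a list ('202401' below is just an example of a monthly period):

// Call the period helper for an example monthly period
'202401' @senx/arch/period
// Gather everything the macro left on the stack into a single list for inspection
// (assuming the stack was empty beforehand)
DEPTH ->LIST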

The senx/arch/archive macro expects the Geo Time Series to have the following attributes:

  • .lastarchive Name of period which was last archived
  • .lastpurge Name of period which was last purged

The archiving will only take place if the values of both of these attributes are identical, meaning that the last archived period was also purged.

The purge will only take place if the value of .lastpurge is less than that of .lastarchive.

With these checks, a period that was just archived must be purged before the next period can be archived.

The values of those attributes should be initialized to the name of the period immediately preceding the first period to archive/purge.
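
A one-off snippet along the following lines can take care of that initialization. It is a sketch with placeholder tokens and selectors, where '202312' stands for the period immediately preceding the first one to archive:

// Retrieve the metadata of the series to initialize
[ 'READ_TOKEN' '~.*' { 'archid' '~.*' } ] FIND
// Mark '202312' as already archived and purged so processing starts with the next period
<% { '.lastarchive' '202312' '.lastpurge' '202312' } SETATTRIBUTES %> FOREACH
// SETATTRIBUTES pushed back each updated GTS, gather them into a list
// (assuming the stack was empty beforehand)
DEPTH ->LIST
// Persist the updated attributes
'WRITE_TOKEN' META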

The automatic execution of the process is then simply triggered by a runner script calling the senx/arch/archive macro:

'READ_TOKEN'
'WRITE_TOKEN'
'~.*' // Consider any class
{ 'archid' '~.*' } // Label selector - we only consider GTS with an 'archid' attribute or label
"'S3_ENDPOINT' 'S3_ACCESS_KEY' 'S3_SECRET_KEY' @senx/arch/s3'" // S3 archiving macro
'@senx/arch/purge' // Purge macro
1 d // Perform archive no earlier than one day after the period to archive has ended

// Call the archive macro
@senx/arch/archive

This will store each GTS blob under archid-period in S3.

Note that you could adapt this to archive your data in a completely different manner. You could for example downsample the data instead of storing the raw data in S3 if that is compatible with your use case.
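
For instance, an alternative ARCHIVE_MACRO could downsample the period to hourly means and write the result back to Warp 10 under a dedicated class instead of pushing the raw blob to S3. The sketch below assumes the parameter order described earlier; the '.arch' class suffix, the hourly bucketing and the WRITE_TOKEN placeholder are arbitrary choices:

<%
  // Period specification on top of the stack, Geo Time Series below it
  'period' STORE
  'gts' STORE

  // Downsample the data of the period to hourly means
  [ $gts bucketizer.mean 0 1 h 0 ] BUCKETIZE
  0 GET

  // Rename the resulting series so it does not collide with the raw data
  DUP NAME '.arch' + RENAME

  // Write the downsampled data back to Warp 10
  'WRITE_TOKEN' UPDATE
%>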

Takeaways

You were presented with a compression mechanism for Time Series which has better overall performance than the Gorilla-type compression commonly used in TSDBs.

You have also learned how to schedule periodic archiving of your Time Series data to S3 using only Warp 10, some WarpScript code, and WarpFleet macros.

The senx/arch/purge macro used in the example performs a call to DELETE. If you want to reclaim space immediately upon deletion when using the standalone version, please contact sales@senx.io to license the LevelDB extension which permits just that.