Data replication with Warp 10

Use data replication to secure your data. Configure Warp 10 to enable High Availability!

The easiest way to start a Warp 10 platform is by using the standalone version or the Docker image. Follow the Getting Started guide if you are not familiar with Warp 10 installation.

Great, you have a single instance running! But what if you want data replication to secure your data?

You can, of course, deploy the distributed version of Warp 10, but that implies also deploying Kafka, HBase, and ZooKeeper. These tools are great, but they require more skills and knowledge to operate.

A simpler option is to start two or more Warp 10 standalone instances and configure them to replicate data to each other. Here is a step-by-step review of how to set up this feature.

What is datalog?

The datalog system comes with two mechanisms:

  • A logger
  • A forwarder

The logger records each data operation, such as UPDATE, DELETE, or META, and writes it to a file. The forwarder transfers the data to one or several Warp 10 instances according to the configuration settings.

All the dedicated configuration keys are set in the ${standalone.home}/etc/conf.d/20-datalog.conf file.

Enable the logger mechanism

First, you have to enable the logger mechanism to track each modification of your data. Once enabled, the logger generates one file per request to the update/delete/meta endpoints. The same applies to calls to the UPDATE, DELETE, or META functions in WarpScript or FLoWS.

Now, we are going to set the directory where the modifications will be written, as well as a unique identifier for the first datalog instance:

//
// Datalogging directory. If set, every data modification action (UPDATE, META, DELETE) will produce
// a file in this directory with the timestamp, the token used and the action type. These files can then
// be used to update another instance of Warp 10
//
datalog.dir = ${standalone.home}/datalog

//
// Unique id for this datalog instance.
//
datalog.id = datalog-0

After a restart, Warp 10 begins to log each data modification.
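
For example, a simple push to the update endpoint, like the one below, now goes through the logger (the token, host, port, and series name are placeholders for your own setup):

# Push a data point; the request is logged by the datalog mechanism
curl -H 'X-Warp10-Token: WRITE_TOKEN' \
     --data-binary '1690000000000000// sensor.temperature{room=kitchen} 21.5' \
     'http://127.0.0.1:8080/api/v0/update'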

Logger behavior

  1. The logger writes one file per request in the datalog.dir.
  2. It then creates a hard link to the file in each forwarder source directory.
  3. Finally, the file is deleted from the datalog.dir.

As we have not configured any forwarder yet, you will not see any remaining file: no hard link has been created.

So this step alone is useless without a configured forwarder.

Enable the forwarding mechanism

If you take a look at the whole 20-datalog.conf, you will see a lot of keys starting with 'datalog.forwarder.'.

Each of these keys, when enabled, needs a suffix corresponding to the identifier of the forwarder you want to configure. If you have several forwarders, you have to repeat the key with the appropriate suffix.

Below, one instance uses two forwarders (this is a partial configuration):

...
//
// Comma separated list of datalog forwarders. Configuration of each forwarder is done via datalog configuration
// keys suffixed with '.name' (eg .xxx, .yyy), except for datalog.psk which is common to all forwarders.
//
datalog.forwarders = datalog-1, datalog-2

//
// Directory where datalog files to forward reside. If this property and 'datalog.forwarder.dstdir' are set, then
// the DatalogForwarder daemon will run.
// When running multiple datalog forwarders, all their srcdir MUST be on the same device as the 'datalog.dir' directory
// as hard links are used to make the data available to the forwarders
//
datalog.forwarder.srcdir.datalog-1 = ${standalone.home}/datalog/datalog-1
datalog.forwarder.srcdir.datalog-2 = ${standalone.home}/datalog/datalog-2
...

You can configure one instance with several forwarders. This gives you a flexible data forwarding model:

  • 1 to 1
  • 1 to many
  • Star model
  • Ring model

For each forwarder, you need to set at least the source and destination directories and all the endpoints.

Minimal configuration for one forwarder:

...
datalog.forwarders = datalog-1
datalog.forwarder.srcdir.datalog-1 = ${standalone.home}/datalog/datalog-1
datalog.forwarder.dstdir.datalog-1 = ${standalone.home}/datalog_done/datalog-1
datalog.forwarder.endpoint.update.datalog-1 = http://host:port/api/v0/update
datalog.forwarder.endpoint.meta.datalog-1 = http://host:port/api/v0/meta
datalog.forwarder.endpoint.delete.datalog-1 = http://host:port/api/v0/delete
...

You have to create the source and destination directories (see the sketch below); then, after a restart, your first forwarder is set up. If you perform some insertions and/or deletions, you will see files appearing in the source directory.
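
A minimal sketch for creating those directories, assuming an installation under /opt/warp10 (adapt the path to your own ${standalone.home}):

# Create the source and destination directories of the datalog-1 forwarder
mkdir -p /opt/warp10/datalog/datalog-1
mkdir -p /opt/warp10/datalog_done/datalog-1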

For the moment, we have not configured another instance, so the files remain in the source directory, and you will see it grow as requests arrive. And since datalog-1 is not ready yet, you will see some error messages because it is unreachable.

All the directories (datalog.dir, srcdir, dstdir) of one instance must be on the same device: the datalog mechanism relies heavily on hard links to avoid duplicating files.
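
A quick way to check this, with the same placeholder path as above:

# Both commands must report the same filesystem (device), otherwise the hard links cannot be created
df /opt/warp10/datalog
df /opt/warp10/datalog_done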

Configuring another instance

First, you have to set up a new instance of Warp 10. This instance must share the same secrets as the previous one to be able to receive data from it (and the forwarder must target its endpoints).

Copy the file ${standalone.home}/etc/conf.d/00-secrets.conf from the first instance to the second one.
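
For example (the host name and installation path are placeholders):

# Copy the shared secrets to the second instance
scp /opt/warp10/etc/conf.d/00-secrets.conf user@instance2:/opt/warp10/etc/conf.d/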

Restart the second instance, and that's it! Now, you have one instance replicating data to the other one. This is unidirectional data replication, but it's a start.

On the first instance, files in the forwarder source directory will be moved to the destination directory once the requests have been successfully sent.
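
To check that the replication works, you can push a data point to the first instance and read it back from the second one. This is only a sketch: the hosts, ports, tokens, and series name are placeholders for your own setup.

# Push a data point to the first instance
curl -H 'X-Warp10-Token: WRITE_TOKEN' \
     --data-binary '1690000000000000// replication.test{source=instance1} 42' \
     'http://instance1:8080/api/v0/update'

# Once the forwarder has done its job, fetch the data point back from the second instance
curl --data-binary "[ 'READ_TOKEN' 'replication.test' { 'source' 'instance1' } NOW -1 ] FETCH" \
     'http://instance2:8080/api/v0/exec'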

Bidirectional replication

To enable replication in the other direction, copy ${standalone.home}/etc/conf.d/20-datalog.conf from the first instance to the second one. Swap the identifiers: replace datalog-0 with datalog-1 and datalog-1 with datalog-0. Create the corresponding source and destination directories. Adapt the endpoint host and port to those where your datalog-0 instance is reachable.
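
As a sketch, the relevant part of the second instance's configuration could look like this (instance1 and port 8080 are placeholders for the host and port where the first instance is reachable):

//
// 20-datalog.conf on the second instance
//
datalog.dir = ${standalone.home}/datalog
datalog.id = datalog-1

datalog.forwarders = datalog-0
datalog.forwarder.srcdir.datalog-0 = ${standalone.home}/datalog/datalog-0
datalog.forwarder.dstdir.datalog-0 = ${standalone.home}/datalog_done/datalog-0
datalog.forwarder.endpoint.update.datalog-0 = http://instance1:8080/api/v0/update
datalog.forwarder.endpoint.meta.datalog-0 = http://instance1:8080/api/v0/meta
datalog.forwarder.endpoint.delete.datalog-0 = http://instance1:8080/api/v0/delete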

Restart the second instance and, congratulations, you now have bidirectional data replication.

Fine-tuning

Forward logging requests

By default, an instance does not forward what it has received from another instance. But for some complex replication architectures, you may want to enable this.

//
// Set this property to 'false' to skip logging forwarded requests or to 'true' if you want to log them to
// forward them to an additional hop.
//
datalog.logforwarded = true

This is a global parameter, so if you activate it, all received forwarded requests will be forwarded again… and you may create a loop!

Avoid loops

If you enable the parameter above, you have to deal with loops. For each forwarder, set which datalog identifiers must be ignored in order to break the loop.

//
// Comma separated list of datalog ids which should not be forwarded. This is used to avoid loops.
//
datalog.forwarder.ignored.datalog-1 = datalog-0,datalog-2
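
As an illustration, consider a bidirectional chain datalog-0 ↔ datalog-1 ↔ datalog-2 where the middle instance relays data in both directions with datalog.logforwarded enabled. A partial sketch of the middle instance configuration could look like this (the topology and forwarder names are assumptions; the srcdir, dstdir, and endpoint keys seen earlier are still required for each forwarder):

//
// On the middle instance (datalog-1): log and forward what was received from the neighbours,
// but never send a request back to the instance it originated from
//
datalog.logforwarded = true
datalog.forwarders = datalog-0, datalog-2
datalog.forwarder.ignored.datalog-0 = datalog-0
datalog.forwarder.ignored.datalog-2 = datalog-2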

Delete forwarded and/or ignored files

It is recommended not to enable these keys at first: this helps you understand what is happening with the log files. Of course, directory sizes will grow, but you can change these keys when moving to production. Again, these keys have to be suffixed with the identifier of the forwarder concerned.

//
// Set to 'true' to delete the datalog files which were forwarded. If set to false or absent, such files will be moved
// to 'datalog.forwarder.dstdir'.
//
datalog.forwarder.deleteforwarded.datalog-1 = true

//
// Set to 'true' to delete the datalog files which were ignored. If set to false or absent, such files will be moved
// to 'datalog.forwarder.dstdir'.
//
datalog.forwarder.deleteignored.datalog-2 = true

Restrictions and limitations

If you are using a version of Warp 10 older than 2.9.0, we strongly encourage you to update to the latest version because of a memory leak that appears when performing a lot of deletes.

Performance restrictions

Data are forwarded in a multithreaded fashion. However, to preserve ordering, modifications concerning the same (application, producer, owner) tuple are forwarded by the same thread. For example, if you update some data and then delete it, the delete operation cannot be forwarded first. This can lead to a bottleneck if your use cases put heavy load on a single (application, producer, owner) tuple.

We are aware of this issue, and the datalog mechanism has been completely rewritten to avoid this bottleneck, add more flexibility, and improve the overall behavior. This improvement is currently in beta.

Takeaways

You are now able to configure several Warp 10 instances for high availability using the datalog mechanism. We will see in a future blog post how to shard data across servers.

The new version of the datalog will be released with version 3.0 of Warp 10.

Let us know if you want to have a beta preview of the new Datalog system.