To address the data processing needs of the world of sensors and the Internet of Things, SenX has developed Warp 10™ and WarpScript™, a software suite based on data models known as “Time Series” and “Geo Time Series™”. In this article, we dive into the historical reasons why the industry is shifting toward time series models.
To cope with these demanding needs, organisations must transform the way they think about their information systems. Only by doing so will they be able to compete with the other major digital players in the world, which is the real underlying challenge. That is why the “all-connected” world we live in is witnessing the advent of time series databases.
Diversity of data types and their processing
Within companies, data is generally managed in large “relational” databases, also known as structured or SQL databases. Derived from business applications, they are structured primarily around business functions. Formats and semantics within these databases are specific to each sector according to its needs. Almost every company uses relational databases, which are managed by its IT department (or an external company).
Until the mid-2000s, the question of changing the data model did not arise, and any new project would naturally fit in an existing mold.
The growth of messaging and social media changed the situation. The volumes of data (texts, sounds, images, videos) soon became considerable, and this data escaped the constraints of business applications. As a result, the major players on the Internet (mainly Google and Facebook) had to devise other database structures: NoSQL databases.
The rise of sensors, and more generally of the Internet of Things, which generate volumes of data with a very variable frequency in each case, has created other constraints. Not only must we be able to process data that can be transmitted thousands of times per second, but we also need to cope with a large number of sensors which emit flows in parallel with different characteristics. In addition to a massive data processing capacity, it is necessary to analyze the considerable flows of raw information to be able to derive value.
This set of constraints has led to the emergence of a particular data model: time series. Time series are already known, in a special format, to economic statisticians. It is no longer a question of organizing data according to functional criteria linked to business types, but of focusing on their temporal occurrence. The primary key of each data point is its timestamp, whose accuracy and synchronization should be as close to perfect as possible.
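As a rough illustration of what "the timestamp is the key" means in practice, here is a minimal Python sketch. It is not how Warp 10™ stores data; the class `TimeSeries` and its methods are hypothetical names chosen for this example. Points are kept sorted by timestamp, so that a time-range query becomes a simple binary search.

```python
import bisect


class TimeSeries:
    """Hypothetical in-memory series: values indexed by timestamp only."""

    def __init__(self):
        self._ts = []      # timestamps, always kept sorted
        self._values = []  # values, in the same order as the timestamps

    def add(self, ts, value):
        # Insert at the position that keeps timestamps sorted,
        # so points may arrive out of order.
        i = bisect.bisect_right(self._ts, ts)
        self._ts.insert(i, ts)
        self._values.insert(i, value)

    def between(self, start, end):
        # Return all (timestamp, value) pairs with start <= timestamp <= end.
        lo = bisect.bisect_left(self._ts, start)
        hi = bisect.bisect_right(self._ts, end)
        return list(zip(self._ts[lo:hi], self._values[lo:hi]))


series = TimeSeries()
for ts, value in [(3000, 21.0), (1000, 20.5), (2000, 20.7)]:
    series.add(ts, value)

window = series.between(1000, 2000)
```

The only organizing criterion here is time: nothing in the structure depends on what the values mean for the business.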
To these different types of data (structured, unstructured, temporal) we must add a fourth one, which covers graphical representation in two or three dimensions (geographic information systems, 3D modeling, video games, virtual worlds, augmented reality …). Ultimately, these four families of data translate into four different technologies and four visions that have little in common.
Faced with strong growth in data volumes, all players claim to be doing “Big Data”. However, we use this terminology, now popularized by the media, to refer to very different realities.
Moving away from IT silos
In typical information systems, applications and professional services produce structured and formatted data according to business specifics. Each sector of activity has its modes of representation and semantics.
Relational databases have adapted to data manipulation requirements. Their structure makes it possible to confront two essential constraints:
- Manage the links between different types of data to facilitate the development of business applications (purchases, sales, financial movements, operations management, etc.)
- Guarantee the integrity of the data via mechanisms that guarantee the return to the initial state in the event of a problem in the middle of an operation.
This organization of business lines has led to a “silo-ed” architecture with dependency links between business needs, the applications that respond to them, and the data that stem from them. The world’s most widely used software suite of this kind is marketed by the German SAP group, which offers a large number of modules structured within an ERP.
Business Intelligence (BI) then sought to move beyond silo-based information systems by extracting value, in a transversal manner, from data coming from one or more business applications, albeit at great expense. For example, costly (and sometimes sophisticated) extraction mechanisms (ETL) have been implemented to load data into data warehouses, where it can be cross-referenced (e.g. online consultations by customers with their purchases, after-sales interventions …). These tools have made the success and fortune of companies like Business Objects, SAP or SAS.
This two-tiered computing with a main (business) function and a secondary (BI) function is bound to disappear. The idea is that the data will have multiple lives and multiple uses and that information systems must be organized accordingly.
The organization of data in silos that is still predominant today is not suited to this evolution.
Issues associated with relational architectures
The use of data for multiple purposes – and therefore the end of silos – is subject to several barriers in classical architectures:
This is the reality of traditional information systems that are based on relational databases suited to so-called transactional applications (e.g. updating of a customer account, a stock, etc.). However, these transactional data will soon represent less than 5% of the total volume of data that companies process.
Time series: a disruptive model for the organization of data
Initially intended for sensor data, time series databases are a subset of non-relational databases. They are “column-oriented” whereas relational databases are “row-oriented”. Without going into technical details, column-oriented databases offer advantages such as better suitability for data compression, easy addition of new data sources, optimized disk access that reduces the need for memory (whose cost can become prohibitive), and parallelization of computing jobs across several servers. The most widely used column-oriented technology to date is HBase, which is part of the group of projects popularized under the name of Hadoop. Together, they constitute an ecosystem that turns information systems into a new world made up of new approaches and tools.
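To make the difference between the two layouts more concrete, here is a small Python sketch. It is only an illustration under simplifying assumptions (in-memory lists, made-up sensor readings, hypothetical function names); real engines such as HBase are far more elaborate. It shows why a query on a single field reads less data in a column layout, and why regularly spaced timestamps compress well once they are stored together.

```python
# Hypothetical sensor readings: (timestamp in ms, sensor id, temperature in °C).
rows = [
    (1000, "s1", 20.5),
    (2000, "s1", 20.7),
    (3000, "s1", 21.0),
]

# Row-oriented layout: each record is stored as a whole (the list above).
# Column-oriented layout: each field is stored contiguously.
columns = {
    "ts": [r[0] for r in rows],
    "sensor": [r[1] for r in rows],
    "temp": [r[2] for r in rows],
}


def mean_temp_rows(records):
    # Walks through every full record even though one field is needed.
    return sum(r[2] for r in records) / len(records)


def mean_temp_columns(cols):
    # Reads the "temp" column only.
    return sum(cols["temp"]) / len(cols["temp"])


def delta_encode(values):
    # Keep the first value, then store only the differences between neighbours.
    # Regular timestamps turn into a run of identical small numbers.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


encoded_ts = delta_encode(columns["ts"])
```

With these sample readings, `encoded_ts` is `[1000, 1000, 1000]`: a repetitive sequence that a general-purpose compressor handles much better than the raw timestamps.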
With time (or geo time) series, the idea is that many kinds of data, beyond sensor data, can be modelled as series. Indeed, we can approach any business activity as a succession of unitary events in time and space. Therefore, rather than focusing on the business meaning of the data (with its object, its format, its semantics …) resulting from an activity, we should consider that the primary structure of the data lies in its precise date of emission, and optionally its location.
In doing so, any process becomes a succession of events or micro-events that generate data, each stream of which forms a time series or Geo Time Series™.
By relating all data back to time (and geolocation when available), it becomes possible to cross-reference many heterogeneous data streams much more easily than in conventional information systems.
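The following Python sketch shows one simple way to cross-reference two unrelated streams once both are reduced to timestamped values. It is an assumption-laden illustration, not WarpScript™: the function names `bucketize` and `cross`, the sample data, and the choice of averaging per fixed-size time bucket are all hypothetical.

```python
from collections import defaultdict


def bucketize(series, bucket_ms):
    """Average a series of (timestamp_ms, value) pairs per time bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in series:
        start = ts - ts % bucket_ms  # align the timestamp on its bucket start
        sums[start][0] += value
        sums[start][1] += 1
    return {start: total / count for start, (total, count) in sums.items()}


def cross(series_a, series_b, bucket_ms):
    """Pair the two series on the time buckets where both have data."""
    a = bucketize(series_a, bucket_ms)
    b = bucketize(series_b, bucket_ms)
    return [(start, a[start], b[start]) for start in sorted(a.keys() & b.keys())]


# Two streams with different sampling instants and different meanings.
power = [(0, 10.0), (500, 12.0), (1200, 20.0), (2100, 30.0)]
temperature = [(100, 21.0), (1100, 22.0), (1900, 24.0), (3500, 25.0)]

joined = cross(power, temperature, 1000)
```

Neither stream needs to know anything about the other's format or semantics: time alone is the join key. The same idea extends to location by bucketing on a spatial cell in addition to the time bucket.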
More generally, the five barriers mentioned above can be overcome much more easily:
This evolution represents a migration from a world of transactions to a world of flows.
The two-tiered architecture of information systems is giving way to a data hub (the datalake). It will aggregate all event measurements and feed applications such as business applications, financial applications, marketing … or others.
A few application cases:
- Electricity management – in particular with the growth of intermittent renewable energies – is increasingly based on applications and services. They use measurements across the entire network, from production to the smart meters that return consumption metrics. Managing the chain from production to consumption – including transmission, distribution, load management, storage, electric vehicles … – needs real-time adjustments that require cross-referencing data sources in significant volumes. Only time series based architectures are able to meet this challenge.
- Maintenance applied to industrial, telecom, IT, and building environments generally relies on subsystems, each monitoring a specific area (e.g. in a building: electricity, heating, air conditioning, security). These subsystems must be consolidated to obtain an overview, which is often partial. Time series make it possible to cross-reference all of the data in order, for example, to detect and understand defects.
- In agriculture and agro-industry, the usage of smart objects is growing: sensors in agricultural crops (dosage of fertilizers, water …), farm building management, animal feeding management … All this information needs to be cross-referenced with external information such as the weather forecast, which is typically a Geo Time Series™.
On the industry side, companies use more and more sensors to manage industrial machinery, logistics and retail processes.
Beyond specific applications, time series technologies make it possible to detect weak signals along the whole chain with regard to traceability issues.
- Connected vehicles – and more particularly the autonomous cars of the near future – will be equipped with hundreds of sensors. The data produced, characterized in time and space, will require complex embedded processing, both to meet the challenges of autonomous control and those of security.
- Cyber-security is increasingly going to require the ability to analyze a wide variety of data. This may concern network operation, access to the DNS servers that link the text address of a site to its IP address, illicit trafficking, hacking, location, system load … Too often these subjects are treated independently of each other. Mixing measurements in time series is the only way to detect weak signals, anomalies and other information, an approach that is too rarely explored in this context except in some specific projects.
Neutral and secured data infrastructures for smart cities
Beyond its technical advantages, the organization of data as time series makes it possible to more clearly formalize the distinction between the data and the applications that use them.
This data / application distinction allows:
- A separation between managing a data infrastructure – a “datalake” qualified as neutral – and application-driven services,
- The ability to perform data analysis horizontally, including cross referencing, aggregation over time or space, weak signal detection, anomaly detection, error prediction… regardless of the specifics of business applications,
- Implementing access security mechanisms that more clearly manage the rights that applications can have to access or aggregate data in time or space.
This approach makes it possible to imagine cross-referencing data from different sources – e.g. two business divisions within a group – while protecting the rights of each. This is one of Warp 10™'s major assets: it distinguishes the roles of producer, right-holder and user of the data, with a mechanism of time-limited encrypted tokens that manage these rights.
This facilitates cross-referencing between data from different players while managing access rights in accordance with the law and the specific demands of each actor. This approach goes beyond the open data framework that will not be sufficient to build “smart city” services.
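To give an idea of what a time-limited access token can look like, here is a deliberately simplified Python sketch. It is not Warp 10™'s actual token mechanism, whose internals are not described in this article: the functions `issue_token` and `check_token`, the claim names and the secret are hypothetical, and the token below is only signed (HMAC), not encrypted.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # hypothetical key held by the data infrastructure


def issue_token(owner, scope, ttl_s, now=None):
    """Create a signed token granting `scope` on `owner`'s data for `ttl_s` seconds."""
    now = time.time() if now is None else now
    claims = {"owner": owner, "scope": scope, "exp": now + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(signature).decode())


def check_token(token, scope, now=None):
    """Return True only if the token is authentic, unexpired and grants `scope`."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # malformed token (wrong shape or invalid base64)
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False  # payload was altered or signed with another key
    claims = json.loads(payload)
    return claims["scope"] == scope and now < claims["exp"]


# A read token for "division-a", valid for 60 seconds from t = 1000.
token = issue_token("division-a", "read", ttl_s=60, now=1000)
```

In this sketch, `check_token(token, "read", now=1030)` succeeds, while the same call after expiry (`now=1061`) or with another scope (`"write"`) fails. The point is the separation of roles: the right-holder decides what to grant and for how long, and the infrastructure enforces it without the application ever owning the data.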
Toward a structured evolution of data management and governance
The technical advantages of time series databases and the possibility of building neutral data infrastructures are likely to radically change the management of data.
For the time being, information systems retain a central architecture based on relational databases, onto which datalakes that meet the needs of innovative projects are grafted. Tomorrow, these datalakes will constitute the heart of the information systems of most companies, as is already the case for the digital giants.
This reconfiguration of data management is the operational engine of the digital transformation we are witnessing today.
Co-Founder & Chief Executive Officer