Navigating the rapids of Big Data stream opportunities
Big Data processing technologies are typically built to respond to human-generated data emanating from web-based systems, such as Facebook. Consequently, the standard approach is to batch-process data stored across distributed file systems. However, with ‘smart’ technologies such as car-to-car communications, the data volume generated from Machine-to-Machine (M2M) interactions far outstrips that coming from people. There is a need for a new approach that offers global scalability, speed and usability for non-experts, and that can implement complex analytic tasks in real time over distributed data sources. The EU-funded FERARI project was set up to provide such a fit-for-purpose system.
Developing a powerful, modular and elastic architecture
One of the most significant challenges in processing M2M data is that it is generated as a continuous stream at very high volume, which precludes storage. This means that the transient data is often processed on the fly, without being stored. Even if data could be sent to a central location (or to a cloud system), there would still be bottlenecks along the network, incurring further costs and delays. These hurdles are likely to become even more pronounced as the size of local sensors for collecting data also increases.
The project’s answer was to break its approach into a series of related objectives. The first was to cultivate ‘in-situ processing’, which the project coordinator Dr Michael Mock describes as, ‘Data stream processing which takes place close to the site where the data is generated, hence avoiding network congestion and delays.’ Allied to this, the project adopted Complex Event Processing (CEP): data from multiple sources is collated to detect patterns that identify pre-determined situations (events), which then immediately trigger programmed responses. Yet combining these two objectives, CEP technology with in-situ processing, proved to be one of the biggest challenges of the project.
As Dr Mock explains, ‘Existing CEP technology is not suited to run on distributed Big Data systems, instead, it is intended for use on single, mostly very powerful computers’. The project’s solution was to run the CEP engine (Proton, IBM’s PROactive Technology Online) on top of the Big Data streaming platform Apache Storm. Additionally, it developed a Query Planner that optimises the CEP engine, translating a single, global CEP ‘expression’ into a set of CEP expressions that can be distributed throughout the FERARI system for evaluation. To enable flexibility, the FERARI architecture is modular, with its framework components kept separate from the underlying Big Data streaming platform. Thus, the framework can be adapted to any underlying platform.
From scenario testing to machine learning
The FERARI approach was applied to two challenging test scenarios: the analysis of mobile phone fraud in telecommunication networks, and real-time health monitoring in clouds and large data centres. As Dr Mock concludes, ‘The scenarios have been successfully evaluated on real-world data. For instance, it was shown on anonymised mobile phone records, provided by the project partner HT Croatian Telekom, that fraud detection can be achieved with the FERARI system in sub-second latency.’ He goes on to say that, ‘These achievements will enable European industry to build leading products in various application domains, in which it is crucial to evaluate and monitor huge amounts of data being produced continuously, such as in the Internet of Things or in Industry 4.0.’
The FERARI framework has been released as open source, with Docker software containers for easy installation on any machine, from a personal computer to a cluster or cloud system, allowing scientific and business communities to explore and use it. The team have also made a guide available to explain installation and usage, as well as providing an instructive running example.
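To make the idea of distributing a global CEP expression more concrete, here is a minimal Python sketch of how one global rule might be split into a part evaluated in situ at each site and a part evaluated centrally. It is an illustration only, not FERARI code or Proton syntax: the `Call` record, the `local_expression` and `global_expression` functions, the site names and the threshold are all hypothetical. The assumed global rule is ‘flag any subscriber with at least a given number of premium-rate calls in one time window, counted across all sites’.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical call record; the field names are assumptions for this sketch."""
    site: str         # site where the record was produced
    subscriber: str   # anonymised subscriber identifier
    premium: bool     # True if the call went to a premium-rate number


def local_expression(calls):
    """Evaluated at one site: count premium-rate calls per subscriber in a window."""
    return Counter(c.subscriber for c in calls if c.premium)


def global_expression(partial_counts, threshold):
    """Evaluated centrally: add up the per-site counts and report who meets the threshold."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return sorted(s for s, n in total.items() if n >= threshold)


def detect(calls_by_site, threshold):
    """Run the local part per site, then combine only the small partial results."""
    partials = [local_expression(calls) for calls in calls_by_site.values()]
    return global_expression(partials, threshold)


# One time window of made-up records from two hypothetical sites.
window = {
    "site_a": [Call("site_a", "alice", True),
               Call("site_a", "bob", False),
               Call("site_a", "alice", True)],
    "site_b": [Call("site_b", "alice", True),
               Call("site_b", "bob", True)],
}

print(detect(window, threshold=3))  # prints ['alice']
```

In this sketch only the small per-site counters cross the network, not the raw call records, which mirrors the motivation the article gives for in-situ processing.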
Despite the superiority of this system over other technologies, it still relies on manual input from domain experts to create the algorithmic rules. Pointing to the future, Dr Mock posits that, ‘Another step forward would be to learn relevant rules with machine learning techniques from the data. Similarly, for configuring the in-situ processing methods. This is where we now want to put our energies.’
Keywords
FERARI, Big Data streams, smart technology, distributed systems, Internet of Things, Industry 4.0, high data volume