Real-Time Streaming Data Analytics For IoT

What is Fast Data?

A few years ago, we remembered the time when it was just impossible to analyze petabytes of data. The emergence of Hadoop made it possible to run analytical queries on our vast amount of historical data.

As we know, Big Data is a buzz from last few years, but Modern Data Pipelines are always receiving data at a high ingestion rate. So this constant flow of data at high velocity is termed as Fast Data.

So Fast data is not about just volume of data like Data Warehouses in which data is measured in GigaBytes, TeraBytes or PetaBytes. Instead, we measure volume but concerning its incoming rate like MB per second, GB per hour, TB per day. So Volume and Velocity both are considered while talking about Fast Data.

Real-Time Streaming Data

Nowadays, there are a lot of Data Processing platforms available to process data from our ingestion platforms. Some support streaming of data and other supports real streaming of data which is also called Real-Time data.

Streaming means when we can process the data at the instant as it arrives and then processing and analyzing it at ingestion time. But in streaming, we can consider some amount of delay in streaming data from ingestion layer.

But Real-time data needs to have tight deadlines regarding time. So we usually believe that if our platform can capture any event within 1 ms, then we call it as real-time data or real streaming.

But When we talk about taking business decisions, detecting frauds and analyzing real-time logs and predicting errors in real-time, all these scenarios comes to streaming. So Data received instantly as it arrives termed as Real-time data.

Real-Time Data Streaming Tools & Frameworks

So in the market, there are a lot of open sources technologies available like Apache Kafka in which we can ingest data at millions of messages per sec. Also Analyzing Constant Streams of data is also made possible by Apache Spark Streaming, Apache Flink, Apache Storm.

Apache Spark Streaming is the tool in which we specify the time-based window to stream data from our message queue. So it does not process every message individually. We can call it as the processing of real streams in micro batches.

Whereas Apache Storm and Apache Flink can stream data in real-time.

Why Real-Time Streaming?

As we know that Hadoop, S3 and other distributed file systems are supporting data processing in huge volumes and also we can query them using their different frameworks like Hive which uses MapReduce as their execution engine.

Why we Need Real-Time Streaming?

A lot of organizations are trying to collect as much data as they can regarding their products, services or even their organizational activities like tracking employees activities through various methods used like log tracking, taking screenshots at regular intervals.

So Data Engineering allows us to convert this data into basic formats and Data Analysts then turn this data into useful results which can help the organization to improve their customer experiences and also boost their employee's productivity.

But when we talk about log analytics, fraud detection or real-time analytics, this is not the way we want our data to be processed.The actual value data is in processing or acting upon it at the instant it receives.

Imagine we have a data warehouse like hive having petabytes of data in it. But it allows us to just analyze our historical data and predict future.

So processing of huge volumes of data is not enough. We need to process them in real-time so that any organization can take business decisions immediately whenever an important event occurs. This is required in Intelligence and surveillance systems, fraud detection, etc.

Earlier handling of these constant streams of data at high ingestion rate is managed by firstly storing the data and then running analytics on it.

But organizations are looking for the platforms where they can look into business insights in real-time and act upon them in real-time.

Alerting platforms are also built on the top of these real-time streams. But Effectiveness of these platform lies in the fact that how honestly we are processing the data in real-time.

Use Of Reactive Programming & Functional Programming

Now when we are thinking of building our alerting platforms, anomaly detectionengines, etc. on the top of our real-time data, it is vital to consider the style of programming you are following.

Nowadays, Reactive Programming and Functional Programming are at their boom.

What is Reactive Programming?

So, we can consider Reactive Programming as subscriber and publisher pattern. Often, we see the column on almost every website where we can subscribe to their newsletter, and whenever the newsletter is posted by the editor, whosoever have got subscription will get the newsletter via email or some other way.

So the difference between Reactive and Traditional Programming is that the data is available to the subscriber as soon as it receives. And it is made possible by using Reactive Programming model. In Reactive Programming, whenever any events occur, there are certain components (classes) that had registered to that event. So instead of invoking target elements by event generator, all targets automatically get triggered whenever any event occurs.

What is Functional Programming?

Now when we are processing data at high rate, concurrency is the point of concern. So the performance of our analytics job highly depends upon memory allocation/deallocation. So in Functional Programming, we don’t need to initialize loops/iterators on our own.

We will be using Functional Programming styles to iterate over the data in which CPU itself takes care of allocation and deallocation of data and also makes the best use of memory which results in better concurrency or parallelism.

Real-Time Big Data Streaming Architecture

While Streaming and Analyzing the real-time data, there are chances that some messages can be missed or in short, the problem is how we can handle data errors.

So, there are two types of architectures which are used while building real-time pipelines.

Lambda Architecture for Big Data

This architecture was introduced by Nathan Marz in which we have three layers to provide real-time streaming and compensate any data error occurs if any. The three layers are Batch Layer, Speed layer, and Serving Layer.

So data is routed to batch layer and speed layer by our data collector concurrently. So Hadoop is our batch layer, and Apache Storm is our speed layer. And NoSQL data store like Cassandra, MongoDB is our serving layer in which analyzed results will be stored.

So the idea behind these layers was that the seed layer would be providing real-time results into serving layer and if any data errors or any data is missed while stream processing, then batches job will compensate that and the MapReduce job will run at the regular interval and updates our serving layer, so providing accurate results.

Kappa Architecture for Big Data

Now The above Lambda architecture solves our problem for data error and also provide flexibility to provide real-time and accurate results to the user.

But Apache Kafka founders raise the question on this Lambda architecture, they loved the benefits provided by the lambda architecture, but they also state that it is tough to build the pipeline and maintain analysis logic in both batch and speed layer.

So If we use frameworks like Apache spark streaming, Flink, Beam they provide support for both batch and real-time streaming. So it will be straightforward for developers to maintain the logical part of the data pipeline.