Ingestion and Processing of Data for Big Data & IoT Solutions


Introduction

In the era of the Internet of Things and Mobility, with a vast volume of data becoming available at high velocity, there is a clear need for an efficient analytics system.
This data also arrives from a variety of sources and in a variety of formats, such as sensors, logs, and structured data from an RDBMS. In the past few years, the generation of new data has drastically increased: more applications are being built, and they are generating more data at a faster rate.
Earlier, data storage was costly, and there was no technology that could process the data efficiently. Now storage has become cheap, and the technology to process Big Data is widely available.

What is Big Data

According to Dr. Kirk Borne, Principal Data Scientist, Big Data is Everything, Quantified, and Tracked. Let’s pick that apart -
  • Everything – Every aspect of life, work, consumerism, entertainment, and play is now recognized as a source of digital information about you, your world, and anything else we may encounter.
  • Quantified – This means we are storing those "everythings" somewhere, mostly in digital form, often as numbers, but not always in such formats. The quantification of features, characteristics, patterns, and trends in all things is enabling data mining, machine learning, statistics, and discovery at an unprecedented scale on an unprecedented number of items. The Internet of Things is just one example, but the Internet of Everything is even more impressive.
  • Tracked – This means we don’t merely quantify and measure everything just once, but we do so continuously. This includes tracking your sentiment, web clicks, purchase logs, geo-location, and social media history, or tracking every car on the road, every motor in a manufacturing plant, and every moving part on an airplane. Consequently, we see the emergence of smart cities, smart highways, personalized medicine, personalized education, precision farming, and so much more.
All of these quantified and tracked data streams will enable
  • Smarter Decisions
  • Better Products
  • Deeper Insights
  • Greater Knowledge
  • Optimal Solutions
  • Customer-Centric Products
  • Increased Customer Loyalty
  • More Automated Processes, more accurate Predictive and Prescriptive Analytics
  • Better models of future behaviors and outcomes in Business, Government, Security, Science, Healthcare, Education, and more.

Big Data Defines the Three D2D's

  • Data-to-Decisions
  • Data-to-Discovery
  • Data-to-Dollars

The 10 V's of Big Data


Big Data Framework

The best way to approach a solution is to "split the problem." Big Data solutions can be well understood using a Layered Architecture, which is split into different layers, where each layer performs a particular function.
This architecture helps in designing a Data Pipeline with the different requirements of either a Batch Processing System or a Stream Processing System. It consists of six layers, which ensure a secure flow of data.
  1. Data Ingestion Layer - This layer is the first step for data coming from variable sources to start its journey. Here data is prioritized and categorized, which makes it flow smoothly into the further layers.
  2. Data Collector Layer - In this layer, the focus is on transporting data from the ingestion layer to the rest of the data pipeline. This is the layer where components are decoupled so that analytic capabilities may begin.
  3. Data Processing Layer - The focus of this layer is the Data Pipeline's processing system: the data collected in the previous layer is processed here. We route data to different destinations and classify the data flow, and it is the first point where analytics may take place.
  4. Data Storage Layer - Storage becomes a challenge when the size of the data you are dealing with becomes large, and several possible solutions address this problem. This layer focuses on where to store such large data efficiently.
  5. Data Query Layer - This is the layer where strong analytic processing takes place. The main focus is to gather the value from the data so that it is more helpful for the next layer.
  6. Data Visualization Layer - The visualization, or presentation, tier is probably the most important one, as this is where the users of the Data Pipeline feel the VALUE of the DATA. We need something that grabs people's attention, pulls them in, and makes your findings well understood.

1. Data Ingestion Layer


Data ingestion is the first step in building a Data Pipeline and also the most onerous task in a Big Data system. In this layer, we plan the way to ingest data flows from hundreds or thousands of sources into the Data Center, as the data comes from multiple sources at variable speeds and in different formats.
That's why we should ingest the data properly for successful business decision making. It is rightly said that "if the start goes well, half of the work is already done."

1.1 What is Big Data Ingestion?

Big Data Ingestion involves connecting to various data sources, extracting the data, and detecting the changed data. It's about moving data - and especially unstructured data - from where it originated into a system where it can be stored and analyzed.
We can also say that Data Ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. It is the beginning of the Data Pipeline, where data is obtained or imported for immediate use.
Data can be streamed in real time or ingested in batches. When data is ingested in real time, it is ingested as soon as it arrives. When data is ingested in batches, data items are imported in chunks at periodic intervals of time. Ingestion is the process of bringing data into the Data Processing system.
An effective Data Ingestion process begins by prioritizing data sources, validating individual files, and routing data items to the correct destination.

1.2 Challenges Faced with Data Ingestion

As the number of IoT devices increases, both the volume and variance of data sources are expanding rapidly. Extracting the data so that it can be used by the destination system is therefore a significant challenge in terms of time and resources. Some of the other challenges faced by Data Ingestion are -
  • When numerous Big Data sources exist in different formats, the biggest challenge for a business is to ingest data at a reasonable speed and process it efficiently so that it can be prioritized and improve business decisions.
  • Modern data sources and consuming applications evolve rapidly.
  • The data produced changes without notice, independently of the consuming application.
  • Data semantics change over time as the same data powers new use cases.
  • Detection and capture of changed data - This task is difficult, not only because of the semi-structured or unstructured nature of the data but also because of the low latency needed by the specific business scenarios that require this determination.
That's why the ingestion process should be well designed, assuring the following -
  • It is able to handle and upgrade to new data sources, technologies, and applications.
  • It assures that consuming applications work with correct, consistent, and trustworthy data.
  • It allows rapid consumption of data.
  • Capacity and reliability - The system needs to scale with the incoming load, and it should be fault-tolerant.
  • Data volume - Though storing all incoming data is preferable, there are some cases in which only aggregated data can be stored.

1.3 Data Ingestion Parameters

  • Data Velocity - Data velocity deals with the speed at which data flows in from different sources, such as machines, networks, human interaction, media sites, and social media. The flow of data can be massive or continuous.
  • Data Size - Data size implies an enormous volume of data. Data is generated by many different sources, and the volume may grow over time.
  • Data Frequency (Batch, Real-Time) - Data can be processed in real time or in batches. In real-time processing, data is processed further as soon as it is received; in batch processing, data is stored in batches at fixed time intervals and then moved on (a small sketch contrasting the two modes follows this list).
  • Data Format (Structured, Semi-Structured, Unstructured) - Data can come in different formats: structured (tabular), unstructured (images, audio, video), or semi-structured (JSON or CSV files, for example).
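To make the batch versus real-time distinction concrete, here is a minimal, purely illustrative Python sketch (the file path, chunk size, and interval are hypothetical): the first function hands records downstream in fixed-size chunks at scheduled intervals, while the second forwards each record as soon as it arrives.

    import time

    def ingest_in_batches(path, chunk_size=1000, interval_seconds=60):
        # Batch ingestion: collect records into fixed-size chunks and hand them over periodically.
        batch = []
        with open(path) as source:
            for record in source:
                batch.append(record.rstrip("\n"))
                if len(batch) >= chunk_size:
                    yield batch                   # a downstream layer processes a whole chunk
                    batch = []
                    time.sleep(interval_seconds)  # wait for the next scheduled run
        if batch:
            yield batch                           # flush the remaining records

    def ingest_in_real_time(source):
        # Streaming ingestion: forward every record as soon as it arrives.
        for record in source:                     # 'source' could be a socket, queue, or generator
            yield record.rstrip("\n")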

1.4 Big Data Ingestion Key Principles

To complete the process of Data Ingestion, we should use the right tools, and most importantly, those tools should be capable of supporting some of the key principles listed below -
  • Network Bandwidth - The Data Pipeline must be able to compete with business traffic for bandwidth. Traffic sometimes increases and sometimes decreases, so network bandwidth scalability is the biggest Data Pipeline challenge. Tools with bandwidth throttling and compression capabilities are required.
  • Unreliable Network - The Data Ingestion Pipeline takes in data with multiple structures, i.e. images, audio, video, text files, tabular files, XML files, log files, etc., and because of the variable speed of the incoming data, it might travel over an unreliable network. The Data Pipeline should be capable of supporting this as well.
  • Heterogeneous Technologies and Systems - Tools for the Data Ingestion Pipeline must be able to work with different data source technologies and different operating systems.
  • Choose the Right Data Format - Tools must provide a data serialization format: because data arrives in variable formats, converting it into a single format provides a more comfortable view for understanding and relating the data.
  • Streaming Data - Whether to process data in batches, streams, or real time depends on business necessity, and sometimes both kinds of processing are required. Tools must be capable of supporting both.

1.5 Data Serialization

Different types of users have different data consumption needs. Because we want to share data that arrives in variable formats, we must plan how users can access it in a meaningful way; a single, consistent view of that variable data optimizes it for human readability.
Approaches used for this are -
  • Apache Thrift - An RPC framework containing data serialization libraries.
  • Google Protocol Buffers - Uses generated source code to quickly write and read structured data to and from a variety of data streams, using a variety of languages.
  • Apache Avro - A more recent data serialization format that combines some of the best features of the previously listed approaches. Avro data is self-describing and uses a JSON schema description; the schema is included with the data itself, and the format natively supports compression. It may well become a de facto standard for data serialization (a small example follows this list).
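As a small illustration of schema-based serialization, here is a minimal sketch using the fastavro Python library (the schema, field names, and file name are hypothetical assumptions): it writes two sensor records into an Avro file that carries its own JSON schema and compression, then reads them back.

    from fastavro import parse_schema, writer, reader

    # Hypothetical schema for IoT sensor readings; Avro stores it alongside the data.
    schema = parse_schema({
        "name": "SensorReading",
        "type": "record",
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "temperature", "type": "float"},
            {"name": "event_time", "type": "long"},
        ],
    })

    records = [
        {"device_id": "sensor-1", "temperature": 21.4, "event_time": 1574230000},
        {"device_id": "sensor-2", "temperature": 19.8, "event_time": 1574230005},
    ]

    with open("readings.avro", "wb") as out:
        writer(out, schema, records, codec="deflate")   # self-describing, compressed file

    with open("readings.avro", "rb") as source:
        for record in reader(source):                   # the schema is read from the file itself
            print(record)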

1.6 Data Ingestion Tools

1.6.1 Apache Flume - Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
It uses a simple, extensible data model that allows for online analytic applications. Its functions are -
  • Stream Data - Ingest streaming data from multiple sources into Hadoop for storage and analysis.
  • Insulate System - Buffers the storage platform from transient spikes, when the rate of incoming data exceeds the rate at which it can be written to the destination.
  • Scale Horizontally - Ingests new data streams and additional volume as needed.
1.6.2 Apache NiFi - Apache NiFi provides an easy to use, powerful, and reliable system to process and distribute data. Apache NiFi supports robust and scalable directed graphs of data routing, transformation, and system mediation logic. Its functions are -
  • Track data flow from beginning to end.
  • The seamless experience between design, control, feedback, and monitoring
  • Secure because of SSL, SSH, HTTPS, encrypted content.
1.6.3 Elastic Logstash - Elastic Logstash is an open-source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your "stash," i.e., Elasticsearch.
It quickly ingests data from your logs, metrics, web applications, data stores, and various AWS services, all in a continuous, streaming fashion, and it can ingest data of all shapes, sizes, and sources.

2. Data Collector Layer


In this layer, the focus is on transporting data from the ingestion layer to the rest of the Data Pipeline. Here we use a messaging system that acts as a mediator between all the programs that can send and receive messages.
The tool used here is Apache Kafka, a newer approach to message-oriented middleware.

2.1 Apache Kafka

It is used for building real-time data pipelines and streaming apps. It can process streams of data in real-time and store streams of data safely in a distributed replicated cluster.
Kafka works in combination with Apache Storm, Apache HBase, and Apache Spark for real-time analysis and rendering of streaming data.

2.2 What is a Data Pipeline?

  • A Data Pipeline is the main component of Data Integration; all transformation of data happens in the Data Pipeline.
  • It is a Python-based tool that streams and transforms real-time data to the services that need it.
  • A Data Pipeline automates the movement and transformation of data. It is a data processing engine that runs inside your application.
  • It is used to transform all incoming data into a standard format so that we can prepare it for analysis and visualization. Data Pipeline is built on the Java Virtual Machine (JVM).
  • So, a Data Pipeline is a series of steps that your data moves through: the output of one step becomes the input of the next. Data, typically raw data, goes in on one side and passes through a series of steps.
  • The steps of a Data Pipeline can include cleaning, transforming, merging, modeling, and more, in any combination (see the small sketch after this list).
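To illustrate the idea that the output of one step becomes the input of the next, here is a minimal, purely illustrative Python sketch (the step names and sample records are hypothetical) that chains cleaning, transformation, and enrichment steps into one pipeline.

    def clean(records):
        for r in records:
            if r.get("value") is not None:              # drop incomplete records
                yield r

    def transform(records):
        for r in records:
            yield {**r, "value": float(r["value"])}     # normalise the type of 'value'

    def enrich(records):
        for r in records:
            yield {**r, "source": "demo"}               # attach simple metadata

    def run_pipeline(raw_records, steps):
        data = raw_records
        for step in steps:                              # output of one step feeds the next
            data = step(data)
        return list(data)

    raw = [{"value": "42"}, {"value": None}, {"value": "7.5"}]
    print(run_pipeline(raw, [clean, transform, enrich]))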
2.2.1 Functions of Data Pipeline
  • Ingestion - The Data Pipeline helps in bringing data into your system; it means taking unstructured data from where it originated into a system where it can be stored and analyzed for making business decisions.
  • Data Integration - The Data Pipeline also helps in bringing different types of data together.
  • Organization - Organizing data means arranging it, and this arrangement is also done in the Data Pipeline.
  • Refining the data - This is the process in which we enhance, clean, and improve the raw data.
  • Analytics - After refining the data, the Data Pipeline provides us the processed data, on which we can apply operations and make business decisions accurately.
2.2.2 Need Of Data Pipeline
A Data Pipeline is software that takes data from multiple sources and makes it available to be used strategically for making business decisions.
The primary reason a Data Pipeline is needed is that it is tough to monitor data migration and manage data errors by hand. Other reasons are listed below -
  • Business-Critical Analysis - Certain analyses are only possible when combining data from multiple sources; for making business decisions, we should have a single image of all the incoming data.
  • Connections - Data keeps increasing all the time: new data arrives and old data is modified, so each new integration can take anywhere from a few days to a few months to complete.
  • Accuracy - The only way to build trust with data consumers is to make sure that your data is auditable. One best practice that’s easy to implement is to never discard inputs or intermediate forms when altering data.
  • Latency - The fresher your data, the more agile your company’s decision-making can be. Extracting data from APIs and databases in real-time can be difficult, and many target data sources, including large object stores like Amazon S3 and analytics databases like Amazon Redshift, are optimized for receiving data in chunks rather than a stream.
  • Scalability - Data volumes grow and shrink over time; we cannot assume that less data will come on Monday and much more on the remaining days, so the usage of data is not uniform. What we can do is make our pipeline scalable enough to handle any amount of data arriving at variable speed.
2.2.3 Use cases for Data Pipeline


Data Pipeline is useful to several roles, including CTOs, CIOs, Data Scientists, Data Engineers, BI Analysts, SQL Analysts, and anyone else who derives value from a unified real-time stream of user, web, and mobile engagement data. So, use cases for data pipeline are given below -
  • For Business Intelligence Teams
  • For SQL Experts
  • For Data Scientists
  • For Data Engineers
  • For Product Teams

2.3 Apache Kafka is Good for 2 Things

  • Building Real-Time streaming Data Pipelines that reliably get data between systems or applications
  • Building Real-Time streaming applications that transform or react to the streams of data.
2.3.1 Common use cases of Apache Kafka -
  • Stream Processing
  • Website Activity Tracking
  • Metrics Collection and Monitoring
  • Log Aggregation
2.3.2 Features of Apache Kafka
  • One of the features of Kafka is durable Messaging.
  • Apache Kafka relies heavily on the filesystem for storing and caching messages: rather than maintaining as much as possible in memory and flushing it all out to the filesystem at the last moment, all data is immediately written to a persistent log on the filesystem without necessarily flushing to disk.
  • Apache Kafka solves the situation where the producer is generating messages faster than the consumer can reliably consume them.
2.3.3 How Apache Kafka Works


Kafka's system design acts as a distributed commit log, where incoming data is written sequentially to disk. There are four main components involved in moving data in and out of Apache Kafka (a minimal producer/consumer sketch follows the list) -
  • Topics - Topic is a user-defined category to which messages are published.
  • Producers - Producers publish messages to one or more topics.
  • Consumers - Consumers subscribe to topics and process the published messages.
  • Brokers - Brokers manage the persistence and replication of message data.
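The following minimal sketch shows these components together using the kafka-python client; the broker address, topic name, consumer group, and messages are hypothetical, and a broker is assumed to be reachable at localhost:9092.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish a few JSON messages to a topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for i in range(3):
        producer.send("sensor-events", {"reading": i})   # 'sensor-events' is a hypothetical topic
    producer.flush()

    # Consumer: subscribe to the same topic and process the published messages.
    consumer = KafkaConsumer(
        "sensor-events",
        bootstrap_servers="localhost:9092",
        group_id="demo-group",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        consumer_timeout_ms=10000,                       # stop iterating if nothing arrives for 10s
    )
    for message in consumer:
        print(message.topic, message.offset, message.value)

The brokers persist and replicate the 'sensor-events' topic, so the consumer can be restarted and resume from its last committed offset.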

3. Data Processing Layer



In the previous layer, we gathered the data from different sources and made it available to go through the rest of the pipeline.
In this layer, our task is to do some magic with the data: now that the data is ready, we only have to route it to the different destinations.
The focus of this main layer is the Data Pipeline's processing system; the data collected in the previous layer has to be processed here.
Processing can be done in three ways -

3.1 Batch Processing System

A pure batch processing system is used for offline analytics. The tool used for this is Apache Sqoop.

3.2 Apache Sqoop

It efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores.
Apache Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

3.2.1 Functions of Apache Sqoop are -
  • Import sequential data sets from mainframe
  • Data imports
  • Parallel data Transfer
  • Fast data copies
  • Efficient data analysis
  • Load balancing

3.3 Near Real-Time Processing System

A pure online processing system is used for online analytics. The tool used for this type of processing is Apache Storm. An Apache Storm cluster makes decisions about the criticality of an event and sends alerts to the alert system (dashboard, e-mail, other monitoring systems).
3.3.1 Apache Storm - It is a system for processing streaming data in real-time. It adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning, and continuous monitoring of operations.
3.3.2 Features of Apache Storm
  • Fast – It can process one million 100 byte messages per second per node.
  • Scalable – It can do parallel calculations that run across a cluster of machines.
  • Fault-tolerant – When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
  • Easy to operate – It consists of Standard configurations that are suitable for production on day one. Once deployed, Storm is easy to operate.
The third option is a Hybrid Processing System, which combines Batch and Real-Time processing capabilities. The tools used for this type of processing are Apache Spark and Apache Flink.

3.4 Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require quick iterative access to datasets.
With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared data set in Hadoop.
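As a small illustration, here is a minimal PySpark sketch (the input path and column names are hypothetical assumptions) that reads JSON events and computes a simple aggregate; the same DataFrame API also underpins Spark's streaming and machine-learning workloads.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ingestion-demo").getOrCreate()

    # Hypothetical input: JSON events with 'device_id' and 'temperature' fields.
    events = spark.read.json("hdfs:///data/events/")

    summary = (
        events.groupBy("device_id")
              .agg(F.avg("temperature").alias("avg_temperature"),
                   F.count("*").alias("readings"))
    )
    summary.show()

    spark.stop()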

3.5 Apache Flink

Flink is an open-source framework for distributed stream processing that provides accurate results, even in the case of out-of-order or late-arriving data. Some of its features are -
  • It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining an exactly-once application state.
  • Performs at large scale, running on thousands of nodes with excellent throughput and latency characteristics.
  • It has a streaming dataflow execution engine, APIs, and domain-specific libraries for Batch, Streaming, Machine Learning, and Graph Processing.
3.5.1 Apache Flink Use Cases
  • Optimization of e-commerce search results in real-time
  • Stream processing-as-a-service for data science teams
  • Network/Sensor monitoring and error detection
  • ETL for Business Intelligence Infrastructure

4. Data Storage Layer


Next, the primary issue is keeping data in the right place based on how it is used. Relational databases have been a successful place to store our data over the years.
But with new strategic Big Data enterprise applications, you should no longer assume that your persistence should be relational.
We need different databases to handle the different varieties of data, but using different databases creates overhead. That's why a new concept has been introduced in the database world - Polyglot Persistence.

4.1 Polyglot Persistence

Polyglot persistence is the idea of using multiple databases to power a single application. Polyglot persistence is the way to share or divide your data into various databases and leverage their strength together.
It takes advantage of the strength of different databases. Here various types of data are arranged in different ways. In short, it means picking the right tool for the right use case.
It’s the same idea behind Polyglot Programming, which is the idea that applications should be written in a mix of languages to take advantage of the fact that different styles are suitable for tackling various problems.
4.1.1 Advantages of Polyglot Persistence -
  • Faster response times - Here, we leverage all the features of databases in one app, which makes the response times of your app very fast.
  • Helps your app to scale well - Your app scales exceptionally well with the data. All the NoSQL databases scale well when you model databases accurately for the data that you want to store.
  • A rich experience - You get a rich experience when you harness the power of multiple databases at the same time. For example, if you want to search for products in an e-commerce app, you use Elasticsearch, which returns results based on relevance, something MongoDB cannot do (a small sketch of this pattern follows the list).
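Here is a minimal sketch of that e-commerce pattern with the pymongo and elasticsearch Python clients (connection URLs, database, collection, and index names are hypothetical, and elasticsearch-py 8.x style arguments are assumed): the product document is kept in MongoDB as the system of record, while a copy is indexed in Elasticsearch for relevance-ranked search.

    from pymongo import MongoClient
    from elasticsearch import Elasticsearch

    product = {"sku": "SKU-123", "name": "Trail Running Shoes", "price": 89.99}

    # MongoDB as the primary store for the product catalogue.
    mongo = MongoClient("mongodb://localhost:27017")
    mongo["shop"]["products"].insert_one(dict(product))   # copy, so '_id' is not added to 'product'

    # Elasticsearch as the search index for relevance-ranked queries.
    es = Elasticsearch("http://localhost:9200")
    es.index(index="products", id=product["sku"], document=product)

    hits = es.search(index="products", query={"match": {"name": "running"}})
    for hit in hits["hits"]["hits"]:
        print(hit["_source"]["name"], hit["_score"])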

4.2 Tools used for Data Storage

4.2.1 HDFS
  • HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.
  • HDFS also makes data available for parallel processing. HDFS is built to support applications with large data sets, including individual files that reach into the terabytes.
  • It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.
  • When HDFS takes in data, it breaks the information down into separate pieces and distributes them to different nodes in a cluster, allowing for parallel processing.
  • The file system also copies each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack.
  • HDFS and YARN form the data management layer of Apache Hadoop.
4.2.1.1 Features of HDFS
  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of the NameNode and DataNodes help users to quickly check the status of the cluster.
  • Streaming access to file system data.
  • HDFS provides file permissions and authentication. (A small client sketch follows this list.)
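As a small, hedged example, the sketch below uses the Python hdfs (WebHDFS) client to write and read a file; the NameNode URL, user, and paths are hypothetical assumptions, and the hdfs dfs command line would work just as well.

    from hdfs import InsecureClient

    # Hypothetical WebHDFS endpoint exposed by the NameNode.
    client = InsecureClient("http://namenode:9870", user="hadoop")

    # Write a small text file; HDFS splits and replicates blocks behind the scenes.
    client.write("/data/demo/greeting.txt",
                 data="hello from the ingestion layer\n",
                 encoding="utf-8", overwrite=True)

    # Read it back and list the directory.
    with client.read("/data/demo/greeting.txt", encoding="utf-8") as reader:
        print(reader.read())
    print(client.list("/data/demo"))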
4.2.2 Gluster file systems (GFS)
As we know, the right storage solutions must provide elasticity in both storage and performance without affecting active operations.
Scale-out storage systems based on GlusterFS are suitable for unstructured data such as documents, images, audio and video files, and log files. GlusterFS is a scalable network filesystem.
Using this, we can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks.
  • It's Open Source.
  • You can deploy GlusterFS with the help of commodity hardware servers.
  • Linear scaling of performance and storage capacity.
  • Scale storage size up to several petabytes, which can be accessed by thousands of servers.
4.2.2.1 Use Cases For GlusterFS include
  • Cloud Computing
  • Streaming Media
  • Content Delivery
4.2.3 Amazon S3
  • Amazon Simple Storage Service (Amazon S3) is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web.
  • It is designed to deliver 99.999999999% durability and scale past trillions of objects worldwide.
  • Customers use S3 as primary storage for cloud-native applications, as a bulk repository, or "data lake," for analytics, as a target for backup & recovery and disaster recovery, and with serverless computing.
  • It's simple to move large volumes of data into or out of S3 with Amazon's cloud data migration options.
  • Once data is stored in Amazon S3, it can be automatically tiered into lower-cost, longer-term cloud storage classes like S3 Standard - Infrequent Access and Amazon Glacier for archiving. (A minimal boto3 sketch follows this list.)
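A minimal boto3 sketch of that flow (the bucket name, keys, and local file are hypothetical, and AWS credentials are assumed to be configured in the environment):

    import boto3

    s3 = boto3.client("s3")

    # Upload a local file into the "data lake" bucket.
    s3.upload_file("events-2019-11-20.json", "my-data-lake-bucket", "raw/events/2019-11-20.json")

    # Store a small object directly, then read it back.
    s3.put_object(Bucket="my-data-lake-bucket", Key="raw/notes.txt", Body=b"ingested via boto3")
    obj = s3.get_object(Bucket="my-data-lake-bucket", Key="raw/notes.txt")
    print(obj["Body"].read().decode("utf-8"))

    # List what has landed under the raw/ prefix so far.
    for item in s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/").get("Contents", []):
        print(item["Key"], item["Size"])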

5. Data Query Layer


This is the layer where strong analytic processing takes place. This is a field where interactive queries are necessary, and it's a zone traditionally dominated by SQL expert developers. Before Hadoop, we had minimal storage and compute, which made analytics a long process.


What is Data Collection and Ingestion?

Data Collection and Data Ingestion are the processes of fetching data from a data source, which we can perform in two ways - in real time or in batches.
In today's world, enterprises are generating data from many different sources and building real-time data lakes, so we need to integrate the various sources of data into one stream.

In this blog, we are sharing how to ingest, store, and process Twitter data using Apache NiFi, and in coming blogs we will be sharing data collection and ingestion from the sources below -
  • Data ingestion From Logs
  • Data Ingestion from IoT Devices
  • Data Collection and Ingestion from RDBMS (e.g., MySQL)
  • Data Collection and Ingestion from ZiP Files
  • Data Collection and Ingestion from Text/CSV Files

 
Objectives for the Data Lake

  • A Central Repository for Big Data Management
  • Reduce costs by offloading analytical systems and archiving cold data
  • Testing Setup for experimenting with new technologies and data
  • Automation of Data pipelines
  • MetaData Management and Catalog
  • Tracking measurements with alerts on failure or violations
  • Data Governance with clear distinction of roles and responsibilities
  • Data Discovery, Prototyping, and experimentation

Goals of Data Ingestion



Apache NiFi

Apache NiFi provides an easy to use, powerful, and reliable system to process and distribute data across several resources.

Apache NiFi is used for routing and processing data from any source to any destination, and its processors can also perform data transformations.

It is a UI-based platform where we define the source from which we want to collect data, the processors for converting the data, and the destination where we want to store it.

Each processor in NiFi has relationships such as success, retry, failed, invalid data, etc., which we can use while connecting one processor to another. These links help in transferring the data to the next storage location or processor even after a processor failure.

Benefits of Apache NiFi

  • Real-time/Batch Streaming
  • Support both Standalone and Cluster mode
  • Extremely Scalable, extensible platform
  • Visual Command and Control
  • Better Error handling

Core Features of Apache NiFi

  • Guaranteed Delivery - A core philosophy of NiFi has been that even at very high scale, guaranteed delivery is a must. It is achievable through efficient use of a purpose-built persistent write-ahead log and content repository.
  • Data Buffering / Back Pressure and Pressure Release - NiFi supports buffering of all queued data, as well as the ability to provide back pressure as those queues reach specified limits or to age off data once it reaches a specified age (and its value has perished). A conceptual bounded-queue sketch of back pressure follows this list.
  • Prioritized Queuing - NiFi allows the setting of one or more prioritization schemes for how data from a queue is retrieved. The default is oldest first, but it can be configured to pull newest first, largest first, or some other custom scheme.
  • Flow Specific QoS - There are points in a data flow where the data is absolutely critical and loss-intolerant, and there are also times when it must be processed and delivered within seconds to be of any value. NiFi enables fine-grained, flow-specific configuration of these concerns.
  • Data Provenance - NiFi automatically records, indexes, and makes available provenance data as objects flow through the system even across fan-in, fan-out, transformations, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
  • Recovery / Recording a rolling buffer of fine-grained history - NiFi's content repository is designed to act as a rolling buffer of history. Data is removed from the content repository only as it ages off or as space is needed.
  • Visual Command and Control - NiFi enables the visual establishment of data flows in real time and provides a UI-based approach to building different data flows.
  • Flow Templates - NiFi also allows us to create templates of frequently used data flows, which can also help in migrating data flows from one machine to another.
  • Security - NiFi supports multi-tenant authorization. The authority level of a given data flow applies to each component, allowing the admin user to have a fine-grained level of access control. This means each NiFi cluster is capable of handling the requirements of one or more organizations.

  • Parallel Stream to Multiple Destinations - With NiFi we can move data to multiple destinations at one time. After processing the data stream, we can route the flow to various destinations using NiFi's processors. This can be helpful when we need to back up our data to multiple destinations.
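To picture back pressure, here is a purely conceptual Python sketch, not NiFi code: a bounded queue sits between a fast producer and a slower consumer, and the producer blocks once the queue reaches its limit, much as NiFi throttles upstream processors when a connection's queue is full (the queue size and timings are arbitrary assumptions).

    import queue
    import threading
    import time

    buffer = queue.Queue(maxsize=5)          # bounded "connection" between two processing steps

    def producer():
        for i in range(20):
            buffer.put(f"flowfile-{i}")      # blocks while the queue is full -> back pressure
            print("queued", i, "queue size:", buffer.qsize())

    def consumer():
        while True:
            item = buffer.get()
            time.sleep(0.2)                  # the slow downstream step
            buffer.task_done()

    threading.Thread(target=consumer, daemon=True).start()
    producer()
    buffer.join()                            # wait until everything queued has been processed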


 

NiFi Clustering

When we need to move a large amount of data, a single instance of NiFi is not enough to handle it. To handle this, we can cluster the NiFi servers, which helps us scale.

We just need to create the data flow on one node, and NiFi will make a copy of this data flow on each node in the cluster.

Apache NiFi 1.0.0 introduced the Zero-Master Clustering paradigm. Previous versions of Apache NiFi relied upon a single "Master Node" (more formally known as the NiFi Cluster Manager).

If the master node was lost, data continued to flow, but the application was unable to show the topology of the flow or show any stats. With Zero-Master clustering, we can make changes from any node of the cluster, and if the elected node disconnects, another active node is automatically elected in its place.

Each node has the same data flow, so all nodes work on the same tasks, but each operates on a different dataset.

In a NiFi cluster, one node is elected as the Master (Cluster Coordinator), and the other nodes send heartbeats/status information to it. This node is responsible for disconnecting the nodes that do not send any heartbeat/status information.

The election of the Cluster Coordinator is done via Apache ZooKeeper, and when the coordinator gets disconnected, Apache ZooKeeper elects any active node as the new coordinator.

 

Data Collection and Ingestion from Twitter using Apache NiFi to Build Data Lake

 

Fetching Tweets with NiFi’s Processor

NiFi's 'GetTwitter' processor is used to fetch tweets. It uses the Twitter Streaming API to retrieve tweets. In this processor, we need to define the endpoint which we want to use, and we can also apply filters by location, hashtags, or particular IDs.
  • Twitter Endpoint - Here we can set the endpoint from which data should be pulled. Available options -
 - Sample Endpoint - Fetches public tweets from all over the world.
 - Firehose Endpoint - The same as the streaming API, but it ensures 100% guaranteed delivery of tweets matching the filters.
 - Filter Endpoint - Used if we want to filter by any hashtags or keywords.
  • Consumer Key - Consumer key provided by Twitter.
  • Consumer Secret - Consumer Secret provided by Twitter.
  • Access Token - Access Token provided by Twitter.
  • Access Token Secret - Access Token Secret provided by Twitter.
  • Languages - Languages for which tweets should be fetched.
  • Terms to Filter - Hashtags or keywords for which tweets should be fetched.
  • IDs to follow - Twitter user IDs that should be followed.
Now the GetTwitter processor is ready to transmit the data (tweets). From here we can move our data stream anywhere, such as Amazon S3, Apache Kafka, Elasticsearch, Amazon Redshift, HDFS, Hive, or Cassandra, and NiFi can move data to multiple destinations in parallel.

 

Data Integration Using Apache NiFi and Apache Kafka

For this, we are using NiFi processor ‘PublishKafka_0_10’.
In the Scheduling tab, we can configure how many concurrent tasks to execute and schedule the processor.

In the Properties tab, we can set up our Kafka broker URLs, topic name, request size, etc. It will write data to the given topic. For best results, we can create the Kafka topic manually with a defined number of partitions.

Apache Kafka can be used to process data with Apache Beam, Apache Flink, Apache Spark.


 

Data Integration Using Apache NiFi to Amazon RedShift with Amazon Kinesis Firehose Stream


Now we integrate Apache NiFi with Amazon Redshift. NiFi uses an Amazon Kinesis Firehose Delivery Stream to store data in Amazon Redshift.

This Delivery Stream is used for moving data to Amazon Redshift, Amazon S3, or the Amazon Elasticsearch Service; we need to specify the destination while creating the Amazon Kinesis Firehose Delivery Stream.

Since we have to move data to Amazon Redshift, we first need to configure the Amazon Kinesis Firehose Delivery Stream. While delivering data to Amazon Redshift, the data is first written to an Amazon S3 bucket, and then the Amazon Redshift COPY command is used to move it into the Amazon Redshift cluster.

We can also enable data transformation while creating the Kinesis Firehose Delivery Stream, and we can back up the data to another Amazon S3 bucket besides the intermediate bucket.

For this, we use the processor PutKinesisFirehose. This processor uses the Kinesis Firehose Delivery Stream to deliver data to Amazon Redshift. Here we configure our AWS credentials and the Kinesis Firehose Delivery Stream.
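Outside of NiFi, the same Delivery Stream can also be fed directly with boto3; the sketch below uses put_record (the region, stream name, and record contents are hypothetical), and Kinesis Firehose then buffers the records to S3 and issues the Redshift COPY as configured.

    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    record = {"tweet_id": 1, "text": "hello data lake", "lang": "en"}

    response = firehose.put_record(
        DeliveryStreamName="twitter-to-redshift",                  # hypothetical stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
    print(response["RecordId"])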


Data Integration Using Apache NiFi to Amazon S3




PutKinesisFirehose sends data to Amazon Redshift and uses Amazon S3 as an intermediary. If someone wants to use Amazon S3 as the only storage, NiFi can also send data to Amazon S3 directly. For this, we need to use the NiFi processor PutS3Object, in which we configure our AWS credentials, bucket name, path, etc.

Continue Reading The Full Article at - XenonStack.com/Blog