XenonStack

A Stack Innovator


Friday, 29 November 2019

11/29/2019 04:51:00 pm

D3.js Library Overview, Best D3.js Use Cases — XenonStack


What is D3.js?


D3.js, short for Data-Driven Documents, is a front-end JavaScript library for creating interactive, dynamic, web-based data visualizations, and it is well known for the active community behind it. It uses HTML, CSS, and SVG to bring data to life. Its main job is manipulating DOM objects, so it focuses on low-level objects and concepts rather than on pre-built charts and graphs. It is compatible with most popular web browsers, such as Chrome and Firefox.
It can create shapes such as arcs, lines, rectangles, and points. The essential feature of D3.js is that it allows fully customized visualizations. It is a suite of small modules for data visualization and analysis. These modules work well together, but we should pick and choose only the parts we need. D3's most complex concept is selections; if we are using a virtual DOM framework like React (and don't need transitions), we don't need selections at all.

A basic introduction: how to use D3.js


To create a visualization with D3, we set up a workspace inside a container, create the x and y axes, process the data, and draw the chart using D3 functions. We can also add attributes and styles to data points or lines. Basic charts and graphs are not complicated in D3, but customization needs more code. More complex visualizations need a lot of logic, layout work, and data formatting, because these determine what the visualization communicates.
D3 can also be paired with a WebGL library, which extends its capabilities for dynamic and interactive graphics. We can even animate elements with transitions similar to those in CSS.
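The steps described above can be sketched without the library itself. The following is a minimal illustration, not D3's implementation: linearScale is a hand-written stand-in for the kind of scale D3 provides, and the sample data, chart size, and function names are assumptions made for this example.

```javascript
// Hypothetical sample data (not from the article).
const data = [4, 8, 15, 16, 23, 42];

// Workspace: the drawing area inside the container.
const width = 420;
const height = 120;

// A linear scale maps a data domain onto a pixel range.
// This is a simplified stand-in for a D3 scale, written by hand.
function linearScale([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Process the data: derive the x scale from the largest value.
const x = linearScale([0, Math.max(...data)], [0, width]);
const barHeight = height / data.length;

// Draw: one <rect> per data point, with its width driven by the data.
function renderBars(values) {
  const bars = values.map(
    (d, i) =>
      `<rect x="0" y="${i * barHeight}" width="${x(d)}" height="${barHeight - 1}" fill="steelblue"></rect>`
  );
  return `<svg width="${width}" height="${height}">${bars.join("")}</svg>`;
}

const svgMarkup = renderBars(data);
```

The resulting string could be placed inside any container element; with D3 itself, the same steps would be expressed through selections and scales rather than string building.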

Why do we use D3.js?


Many charting libraries and BI tools are available today, so why use D3.js for creating visualizations? The answer lies in its key features: versatility, full customization, and interactivity. With D3.js, developers can build exactly the visualization that a graphic designer has drawn.
Data visualization is much harder than we think. It is easy to draw shapes, but some visualizations require full customization, such as subtle ticks or smooth curves between data points. That is where D3.js comes in: it lets us explore the design space quickly. It is not only a charting library but also a versatile coding environment that is best suited to complex visualizations.

When to use D3.js?



D3.js can become complicated: visualizations are programmed from scratch, and the learning curve is steep. Because of its significant advantages it is still worth using, but we need to decide when it is the right library. We should use D3.js when our web application interacts with data, and we can explore its graphing capabilities to make that data more usable. It belongs in the front end of a web application: the back-end server generates the data, and the front end, where the user interacts with the data, uses D3.js.

Some of the use cases of D3.js

We have discussed data visualization basics, the concepts behind the D3.js front-end visualization library, and when to reach for it. Now let's go through some of its use cases. Although D3 code can be complex, it can be written as reusable code that serves other visualizations; D3 can be used with React; and storytelling with customized visualizations, the most crucial use case, can also be achieved with D3. Some of the use cases of the D3 library are discussed below:

Reusable charts with D3.js

When creating visualizations, we also need to take care of the reusability of the charts and other components we build. Before discussing how D3.js supports reusable charts, we should know what makes a chart reusable. A reusable chart has the following characteristics -
  • Built in an independent way - All chart elements associated with the data points of the dataset are made independently. This has to do with the way D3 associates data instances with DOM elements.
  • Repeatable - The chart can be instantiated more than once, for example to visualize different datasets.
  • Modifiable - The source code of the chart can easily be refactored by other developers according to different needs.
  • Configurable - Only the appearance and behavior of the chart need to be modified, without changing its code.

Some of the best practices to make reusable charts with d3.js

Built charts in an independent way
To make a chart reusable with D3.js, it should be repeatable, modifiable, configurable, and extensible.
To make a chart repeatable, we can take an object-oriented approach: define the drawing logic on the prototype, for example chart.prototype.render, and call it on each chart instance.
To make a chart modifiable, write its source code as simple transformations using D3.js built-in functions, so that the path to modification is clear and easy for other developers to follow.
An easy modification path can be achieved with the selection functions .enter(), .exit(), and .transition().
enter() selection - When a dataset contains more items than there are DOM elements, the surplus data items are stored in the enter selection.
For example -
Suppose we modify our dataset by adding one more item to the array while the bar chart still contains only four bars. The new data item is then found in the enter selection.
Picture a room in which the chairs are DOM elements and the guests are data items; a guest sitting on a chair is a data item joined to a DOM element. The enter selection is the waiting area for data items that enter the room but cannot be seated because there are not enough chairs. Arranging more chairs, that is, creating a new bar div and adding it to the DOM, is what we do with the enter selection.
exit() selection - We have seen how to add new items to a dataset dynamically and update the visualization. In the same way, we can remove items from the dataset and let D3 deal with the corresponding DOM elements or, following our room and chair analogy, take away the chairs that are no longer needed because some guests decided to leave. The exit selection does this. It contains only the DOM elements whose data items have left the dataset.
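The room and chair analogy can be made concrete with a small sketch of the bookkeeping behind a data join. This is not D3's implementation, only an illustration of the idea under simple assumptions: elements are represented by their keys, and the function and variable names are hypothetical.

```javascript
// Compute the three groups of a data join by key.
// `existingKeys` stands for the keys already bound to DOM elements (the chairs);
// `data` is the new dataset (the guests). Both names are illustrative.
function dataJoin(existingKeys, data, key = (d) => d.id) {
  const existing = new Set(existingKeys);
  const incoming = new Set(data.map(key));
  return {
    enter: data.filter((d) => !existing.has(key(d))), // data items without an element
    update: data.filter((d) => existing.has(key(d))), // data items that already have an element
    exit: existingKeys.filter((k) => !incoming.has(k)), // elements whose data item is gone
  };
}

// Four bars exist; the new dataset drops "d" and adds "e".
const bars = ["a", "b", "c", "d"];
const dataset = [{ id: "a" }, { id: "b" }, { id: "c" }, { id: "e" }];
const join = dataJoin(bars, dataset);
```

Here join.enter holds the item that needs a new bar, join.update the items that keep their bars, and join.exit the bar that should be removed.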
Configurable -
Consider a bubble chart. To make it reusable, only the size of the chart and the input dataset need to be customizable.
Define the size of the chart -
    // bubble_chart.js
    var bubbleChart = function () {
      var width = 500,
          height = 500;
      function chart(selection) {
        // drawing code goes here
      }
      return chart;
    };
We want to create charts of different sizes without changing the code. In bubble.html, we create a chart as follows -
    var chart = bubbleChart().width(500).height(500);
For this to work, we define accessors for the width and height variables in the bubble_chart.js file -
    // bubble_chart.js
    var bubbleChart = function () {
      var width = 500,
          height = 500;
      function chart(selection) {
        // drawing code goes here
      }
      // Getter when called without arguments, chainable setter otherwise.
      chart.width = function (val) {
        if (!arguments.length) { return width; }
        width = val;
        return chart;
      };
      chart.height = function (val) {
        if (!arguments.length) { return height; }
        height = val;
        return chart;
      };
      return chart;
    };

Role of Data Visualization in D3.js

Over the last few years, data visualization has become more and more a part of our day-to-day experience. We see it daily on social media, in news articles, and at work. Data visualization translates quantitative data into graphic representations using libraries such as D3.js. It includes the tasks of designing, creating, and deploying, so it is not a single profession but a combination of the work of designers, analysts, engineers, and journalists. Engineers use different JavaScript libraries for data visualization tasks, and analysts use various business intelligence tools.
Nowadays we are able to collect and analyze more data than ever before. Big data remains a hot topic and a major field of study. But to understand and digest all these numbers, we need visualization, and a platform or framework that makes all kinds of visualizations possible no matter how much data needs to be processed. That is where D3.js and other visualization tools come in.
Plotting charts with visualization libraries and tools is not enough; we also need the art of storytelling. Most libraries and tools do not display quantitative information effectively. This is where D3.js, one of the most successful of these libraries, comes in: it already tells half of the story when we start writing the code for it.

Data Visualization using D3.js with React

D3's enter, update, and exit pattern gives the developer full control over the DOM. We decide when an element is added to the screen, when it is removed, and how it is updated. This works fine when the updates are simple, but it gets complex when there are many data elements to keep track of and the elements to update vary from one user action to another. One solution is to track manually which elements require updates, but this too becomes complex: we cannot keep track of it in our heads, and it amounts to defining the DOM tree by hand, which is not recommended. So, for complex visualizations, there is a need to integrate React and D3.
React's virtual DOM updates work much like D3's enter, update, and exit pattern. So let's use D3 with React: React for the enter and exit operations and D3 for the update pattern. We discuss the React with D3 implementation in a few steps -

Enter and Exit pattern with React

React's concept of dynamic children (which also supports reusing code between components) is similar to D3's data binding. It lets us pass in a key for each child to track the order of the children, much like the key function passed to D3's selection.data(), and uses it to calculate what should be added and removed when the data changes.
Consider an example in which we need to render two rectangles and a text element for each data item. In D3, the code looks like this -
    var graph = d3.select('svg').append('g')
      .classed('graph', true);
    // Bind the data by id so elements can be matched to data items.
    var expenses = graph.selectAll('g.expense')
      .data(expensesData, (expense) => expense.id);
    // Enter: create a group with two rectangles and a text element per new item.
    var enter_ele = expenses.enter().append('g')
      .classed('expense', true);
    enter_ele.append('rect')
      .classed('expenseRect', true);
    enter_ele.append('rect')
      .classed('textBG', true); // second rectangle; class name is an assumption
    enter_ele.append('text');
    // Exit: remove groups whose data item is gone.
    expenses.exit().remove();
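Now we integrate the D3 code with React components, like this -
    // The JSX markup is a reconstruction (assumed): two rectangles and a text element per expense.
    class ExpenseComponent extends React.Component {
      render() {
        return (
          <g className="expense">
            <rect className="expenseRect" />
            <rect className="textBG" />
            <text />
          </g>
        );
      }
    }
    class GraphComponent extends React.Component {
      render() {
        var expenses_graph = expensesData && expensesData.map((expense) => {
          return (<ExpenseComponent key={expense.id} data={expense} />);
        });
        return (
          <svg>
            <g className="graph">
              {expenses_graph}
            </g>
          </svg>
        );
      }
    }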
At first glance the React code looks more complicated, but it offers some great benefits -
It lets us make components for elements, so code becomes easier to reuse. This kind of reuse was possible in the old D3 code too, but React makes it explicit.
It lets us see at a glance what a component looks like, because the code reflects the structure of the DOM.
We also no longer need to think about entering and exiting. When we show or hide parts of a component depending on the data, React draws only the elements we need, when we need them, in a straightforward manner, as the React code below shows -
    // The JSX markup is a reconstruction (assumed element names).
    class ExpComponent extends React.Component {
      render() {
        return (
          <g className="expense">
            <rect className="expenseRect" />
            {this.props.data.name && (<rect className="textBG" />)}
            {this.props.data.name && (<text />)}
          </g>
        );
      }
    }

Updating and transitioning with D3

With the enter and exit handling in place we have the structure of the components; what remains is to fill in the attributes, and D3 manages updating them. In the React component, we call the enter code from componentDidMount() and the update code from componentDidUpdate(). This way, as soon as the elements are inserted into the DOM, we can use D3 to set their starting attributes, and when the data changes, we use D3 transitions to move each element to its next set of attributes. We should also keep in mind that D3 must not manipulate whatever React keeps track of.
So the update and transition pattern helps keep ownership clearly divided between the two: React manages the structure and D3 maintains the attributes. In that way, D3 can transition an element's position, fill color, and size without conflicting with the React workflow.
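At its core, a transition is interpolation between a starting and an ending attribute value over time. The sketch below illustrates only that idea, assuming linear interpolation sampled at a fixed number of frames; D3's own transitions add easing, timing, and scheduling on top. The function names and values are illustrative.

```javascript
// Returns a function of t in [0, 1] that blends a into b linearly.
function interpolateNumber(a, b) {
  return (t) => a + (b - a) * t;
}

// Sample the interpolator at evenly spaced steps, as a transition does over time.
function frames(from, to, count) {
  const at = interpolateNumber(from, to);
  return Array.from({ length: count + 1 }, (_, i) => at(i / count));
}

// A bar growing from 100px to 300px wide in four steps.
const widths = frames(100, 300, 4);
```

In the division of labor described above, React would create the rect element once, and each sampled width would then be applied to its attribute by the D3 side.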

Storytelling with d3.js interactive visualizations

So far we have discussed the D3.js library, its features, why we need it, and reusability. When we make any visualization, we need to keep the target audience in mind, because our task is not only to render the visualization but to make it fully explainable so that the audience can understand it. That is achieved through storytelling. The interactive and beautiful visualizations of D3.js help in narrating data. Let's look at a complex visualization, the chord diagram, in an interactive way and see how it helps in storytelling -
Consider this problem: people in India use phones, and many switch to a new phone after some time. How do users switch, and how does this differ per brand? Questions like these can be answered by visualizing the dataset as a chord diagram in D3.js.
The chord diagram shows switching behavior between different phone brands. The circle is divided into eight brands, and the arc length of each group shows the brand's market share. The outer rim of the chord diagram shows the percentage per brand: Samsung has a 38% share, Apple is second with 19%, and Nokia is third with 16%. The chords are directed: 8.7% of users who now have a Samsung used to have a Nokia, while only 1.2% went the opposite way. The chords between the arcs visualize users' switching behavior between all brands in both directions.
For example, the blue chord connecting Samsung and Nokia in the left section shows the users that moved from Samsung to Nokia and from Nokia to Samsung. The visualization shows that Nokia lost share to Samsung, as 8.7% of users moved from Nokia to Samsung.
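A chord diagram is driven by a square matrix of flows between groups. The sketch below shows how arc sizes and net flows could be derived from such a matrix. The three brands and all numbers are made up for illustration and are not the survey data, and the convention that a row holds a brand's current users is an assumption of this example.

```javascript
// Hypothetical flow matrix: matrix[i][j] = percentage of all users who now
// own brand i and previously owned brand j. Values are invented for the example.
const brands = ["A", "B", "C"];
const matrix = [
  [30, 8, 2],
  [1, 25, 4],
  [3, 2, 25],
];

// Current share per brand = sum of its row.
const totals = matrix.map((row) => row.reduce((sum, v) => sum + v, 0));
const grandTotal = totals.reduce((sum, v) => sum + v, 0);

// The arc of each brand on the circle is proportional to its share.
const arcs = brands.map((name, i) => ({
  name,
  share: totals[i] / grandTotal,
  angle: (totals[i] / grandTotal) * 2 * Math.PI,
}));

// Net flow between two brands: a positive value means brand i gained from brand j.
function netFlow(i, j) {
  return matrix[i][j] - matrix[j][i];
}
```

With these invented numbers, netFlow(0, 1) is positive, which is the kind of one-sided chord the Samsung and Nokia example describes.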

Insights from a mobile consumer survey chord diagram visualization

Having built the customized visualization of the flow between brands, we can draw more conclusions and insights, which are discussed below -
  • Both Apple and Samsung are capturing users from Nokia and Other brands.
  • Apple loses only a few users; the number of users it gains is twice the number it loses.
  • HTC is acquiring users from Nokia and LG and losing users to Samsung and Huawei.
  • Nokia is acquiring more users than it loses.

Approaches to Data Visualization

Data visualization helps users translate quantitative data into graphic representations using leading data visualization techniques.

Wednesday, 20 November 2019

11/20/2019 12:06:00 pm

Ingestion and Processing of Data for Big Data & IoT Solutions


Introduction

In the era of the Internet of Things and mobility, a vast volume of data is becoming available at high velocity, which creates the need for an efficient analytics system.
The data also comes from a variety of sources in a variety of formats, such as sensors, logs, and structured data from an RDBMS. In the past few years, the generation of new data has increased drastically. More applications are being built, and they generate more data at a faster rate.
Earlier, data storage was costly, and there was no technology that could process the data efficiently. Now storage has become cheaper, and technology capable of processing Big Data is a reality.

What is Big Data

According to Dr. Kirk Borne, Principal Data Scientist, Big Data can be defined as Everything, Quantified, and Tracked. Let's pick that apart -
  • Everything - Every aspect of life, work, consumerism, entertainment, and play is now recognized as a source of digital information about you, your world, and anything else we may encounter.
  • Quantified - We are storing all of that "everything" somewhere, mostly in digital form, often as numbers, but not always in such formats. The quantification of features, characteristics, patterns, and trends in all things is enabling data mining, machine learning, statistics, and discovery at an unprecedented scale on an unprecedented number of items. The Internet of Things is just one example, but the Internet of Everything is even more impressive.
  • Tracked - We don't merely quantify and measure everything just once; we do so continuously. This includes tracking your sentiment, your web clicks, your purchase logs, your geolocation, your social media history, and so on, or tracking every car on the road, every motor in a manufacturing plant, every moving part on an airplane, and so on. Consequently, we see the emergence of smart cities, smart highways, personalized medicine, personalized education, precision farming, and so much more.
All of these quantified and tracked data streams will enable -
  • Smarter Decisions
  • Better Products
  • Deeper Insights
  • Greater Knowledge
  • Optimal Solutions
  • Customer-Centric Products
  • Increased Customer Loyalty
  • More Automated Processes, more accurate Predictive and Prescriptive Analytics
  • Better models of future behaviors and outcomes in Business, Government, Security, Science, Healthcare, Education, and more.

Big Data Defines Three D2Ds

  • Data-to-Decisions
  • Data-to-Discovery
  • Data-to-Dollars

The 10 V's of Big Data


Big Data Framework

The best way to approach a solution is to "split the problem." Big Data solutions can be well understood using a layered architecture, in which each layer performs a particular function.
This architecture helps in designing a data pipeline for the different requirements of either a batch processing system or a stream processing system. It consists of 6 layers that ensure a secure flow of data.
  1. Data Ingestion Layer - This layer is the first step in the journey of data coming from various sources. Data is prioritized and categorized here, which makes it flow smoothly through the further layers.
  2. Data Collector Layer - This layer focuses on transporting data from the ingestion layer to the rest of the data pipeline. It is the layer where components are decoupled so that analytic capabilities may begin.
  3. Data Processing Layer - The focus of this layer is the pipeline's processing system: the data collected in the previous layer is processed here. We route the data to different destinations and classify the data flow, and this is the first point where analytics may take place.
  4. Data Storage Layer - Storage becomes a challenge when the size of the data you are dealing with becomes large, and several possible solutions can help with such problems. This layer focuses on where to store such large data efficiently.
  5. Data Query Layer - This is the layer where heavy analytic processing takes place. The main focus here is to extract value from the data so that it is more useful to the next layer.
  6. Data Visualization Layer - The visualization, or presentation, tier is probably the most important one, because it is where users of the data pipeline experience the value of the data. We need something that grabs people's attention, pulls them in, and makes the findings well understood.

1. Data Ingestion Layer


Data ingestion is the first step in building a data pipeline and also the most onerous task in a Big Data system. In this layer, we plan how to ingest data flows from hundreds or thousands of sources into the data center, because the data comes from multiple sources, at variable speeds, and in different formats.
That is why we should ingest the data properly to support successful business decision-making. It is rightly said that "if the start goes well, then half of the work is already done."

1.1 What is Big Data Ingestion?

Big Data ingestion involves connecting to various data sources, extracting the data, and detecting changed data. It is about moving data, especially unstructured data, from where it originated into a system where it can be stored and analyzed.
We can also say that data ingestion means taking data from multiple sources and putting it somewhere it can be accessed. It is the beginning of the data pipeline, where data is obtained or imported for immediate use.
Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested as soon as it arrives. When data is ingested in batches, data items are ingested in chunks at periodic intervals. Ingestion is the process of bringing data into the data processing system.
An effective data ingestion process begins by prioritizing data sources, validating individual files, and routing data items to the correct destination.
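The three steps of an effective ingestion process, and the difference between batch and real-time ingestion, can be sketched as follows. This is a toy in-memory model, not a real ingestion tool; the source names, priorities, record shape, and destination names are all hypothetical.

```javascript
// Hypothetical source priorities: a lower number is ingested first.
const sourcePriority = new Map([
  ["sensor", 0],
  ["rdbms", 1],
  ["log", 2],
]);

// Validate: keep only records from a known source that carry a payload.
function isValid(record) {
  return sourcePriority.has(record.source) && record.payload !== undefined;
}

// Route: choose a destination per record (destination names are made up).
function route(record) {
  return record.source === "log" ? "archive" : "analytics";
}

// Batch ingestion: validate a chunk of records, order it by source priority,
// then route each record to its destination.
function ingestBatch(records, destinations = {}) {
  const ordered = records
    .filter(isValid)
    .sort((a, b) => sourcePriority.get(a.source) - sourcePriority.get(b.source));
  for (const record of ordered) {
    const target = route(record);
    if (!destinations[target]) destinations[target] = [];
    destinations[target].push(record);
  }
  return destinations;
}

// Real-time ingestion: handle each record as soon as it arrives (a batch of one).
function ingestStreaming(record, destinations) {
  return ingestBatch([record], destinations);
}

const store = ingestBatch([
  { source: "log", payload: "GET /index" },
  { source: "sensor", payload: 21.5 },
  { source: "unknown", payload: "?" }, // rejected: unknown source
  { source: "rdbms" }, // rejected: no payload
]);
ingestStreaming({ source: "sensor", payload: 22.0 }, store);
```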

1.2 Challenges Faced with Data Ingestion

As the number of IoT devices increases, both the volume and the variety of data sources are expanding rapidly. Extracting the data so that it can be used by the destination system is therefore a significant challenge in terms of time and resources. Some of the other challenges of data ingestion are -
  • When numerous Big Data sources exist in different formats, the biggest challenge for the business is to ingest data at a reasonable speed and process it efficiently so that data can be prioritized and business decisions improved.
  • Modern data sources and consuming applications evolve rapidly.
  • The data produced changes without notice, independently of the consuming application.
  • Data semantics change over time as the same data powers new use cases.
  • Detection and capture of changed data - This task is difficult, not only because of the semi-structured or unstructured nature of the data but also because of the low latency required by the business scenarios that depend on it.
That is why the ingestion layer should be well designed, assuring the following -
  • It can handle and adapt to new data sources, technologies, and applications.
  • It assures that consuming applications work with correct, consistent, and trustworthy data.
  • It allows rapid consumption of data.
  • Capacity and reliability - The system needs to scale with the incoming input, and it should be fault-tolerant.
  • Data volume - Though storing all incoming data is preferable, there are some cases in which aggregate data is stored.

1.3 Data Ingestion Parameters

  • Data Velocity - Data velocity deals with the speed at which data flows in from different sources such as machines, networks, human interaction, media sites, and social media. The flow of data can be massive or continuous.
  • Data Size - Data size implies an enormous volume of data. Data is generated by different sources, and its volume may increase over time.
  • Data Frequency (Batch, Real-Time) - Data can be processed in real time or in batches. In real-time processing, data is processed as soon as it is received; in batch processing, data is stored in batches at fixed time intervals and then moved on.
  • Data Format (Structured, Semi-Structured, Unstructured) - Data can be in different formats: structured, i.e. tabular; unstructured, i.e. images, audio, and video; or semi-structured, i.e. JSON files, XML files, etc.

1.4 Big Data Ingestion Key Principles

To complete the process of data ingestion, we should use the right tools, and most importantly those tools should support the key principles listed below -
  • Network Bandwidth - The data pipeline must be able to compete with business traffic. Traffic sometimes increases and sometimes decreases, so network bandwidth scalability is the biggest data pipeline challenge. Tools need bandwidth throttling and compression capabilities.
  • Unreliable Network - The data ingestion pipeline takes data with multiple structures, i.e. images, audio, video, text files, tabular files, XML files, log files, etc., and because data arrives at variable speed, it may travel over an unreliable network. The data pipeline should be able to cope with this as well.
  • Heterogeneous Technologies and Systems - Tools for the data ingestion pipeline must be able to work with different data source technologies and different operating systems.
  • Choose the Right Data Format - Tools must provide a data serialization format: because data arrives in variable formats, converting it into a single format makes the data easier to understand and relate.
  • Streaming Data - Whether to process data in batches, in streams, or in real time depends on business needs. Sometimes we may require both kinds of processing, so tools must support both.

1.5 Data Serialization

Different types of users have different data consumption needs. Since we want to share variable data, we must plan how users can access it in a meaningful way. That is why a single, unified representation of the variable data is used, which also optimizes the data for human readability.
Approaches used for this are -
  • Apache Thrift - An RPC framework that includes data serialization libraries.
  • Google Protocol Buffers - Uses specially generated source code to easily write and read structured data to and from a variety of data streams, using a variety of languages.
  • Apache Avro - A more recent data serialization format that combines some of the best features of the formats listed above. Avro data is self-describing and uses a JSON schema description. This schema is included with the data itself, and the format natively supports compression. It may well become a de facto standard for data serialization.
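To illustrate the idea of converting variable input into one format that travels with its own schema, here is a minimal sketch using plain JSON. It is not the Avro, Thrift, or Protocol Buffers API; the schema layout, field names, input formats, and helper functions are assumptions made for the example.

```javascript
// A hypothetical schema describing the single target format.
const schema = {
  name: "Reading",
  fields: [
    { name: "deviceId", type: "string" },
    { name: "temperature", type: "number" },
  ],
};

// Parsers for two incoming formats: a CSV line and a JSON string.
function fromCsv(line) {
  const [deviceId, temperature] = line.split(",");
  return { deviceId: deviceId.trim(), temperature: Number(temperature) };
}
function fromJson(text) {
  const raw = JSON.parse(text);
  return { deviceId: String(raw.id), temperature: Number(raw.temp) };
}

// Check that every field of a record has the type the schema declares.
function conforms(record) {
  return schema.fields.every((f) => typeof record[f.name] === f.type);
}

// Serialize: bundle the schema with the data so the output is self-describing.
function serialize(records) {
  if (!records.every(conforms)) throw new Error("record does not match schema");
  return JSON.stringify({ schema, records });
}

const message = serialize([
  fromCsv("dev-1, 21.5"),
  fromJson('{"id":"dev-2","temp":19}'),
]);
const decoded = JSON.parse(message);
```

A consumer that receives the message can read decoded.schema to learn the field names and types before touching decoded.records, which is the property the Avro description above refers to as self-describing.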

1.6 Data Ingestion Tools

1.6.1 Apache Flume - Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
It uses a simple, extensible data model that allows for online analytic applications. Its functions are -
  • Stream Data - Ingest streaming data from multiple sources into Hadoop for storage and analysis.
  • Insulate System - Buffer the storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination.
  • Scale Horizontally - Ingest new data streams and additional volume as needed.
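The "insulate system" function amounts to placing a buffer between a fast source and a slower destination. A toy sketch of that idea follows. It is not Flume's channel implementation; the class name, capacity, and drain size are hypothetical.

```javascript
// A bounded in-memory buffer that absorbs spikes between a source and a sink.
class BufferChannel {
  constructor(capacity) {
    this.capacity = capacity;
    this.events = [];
  }
  // Source side: returns false when the buffer is full, so the caller can retry later.
  put(event) {
    if (this.events.length >= this.capacity) return false;
    this.events.push(event);
    return true;
  }
  // Sink side: remove up to `max` events, at the pace the destination can sustain.
  take(max) {
    return this.events.splice(0, max);
  }
}

const channel = new BufferChannel(5);
// A spike: seven events arrive at once, but only five fit in the buffer.
const accepted = [1, 2, 3, 4, 5, 6, 7].map((e) => channel.put(e));
// The sink drains two events per cycle.
const firstDrain = channel.take(2);
```

How rejected events are handled (retry, spill to disk, or drop) is a design choice this sketch leaves open.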
1.6.2 Apache NiFi - Apache NiFi provides an easy-to-use, powerful, and reliable system to process and distribute data. Apache NiFi supports robust and scalable directed graphs of data routing, transformation, and system mediation logic. Its functions are -
  • Track data flow from beginning to end.
  • Seamless experience between design, control, feedback, and monitoring.
  • Secure because of SSL, SSH, HTTPS, and encrypted content.
1.6.3 Elastic Logstash - Elastic Logstash is an open-source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your "stash," i.e., Elasticsearch.
It easily ingests from your logs, metrics, web applications, data stores, and various AWS services, all in a continuous, streaming fashion. It can ingest data of all shapes, sizes, and sources.

2. Data Collector Layer


This layer focuses on transporting data from the ingestion layer to the rest of the data pipeline. Here we use a messaging system that acts as a mediator between all the programs that send and receive messages.
The tool used here is Apache Kafka, a new approach to message-oriented middleware.

2.1 Apache Kafka

It is used for building real-time data pipelines and streaming apps. It can process streams of data in real-time and store streams of data safely in a distributed replicated cluster.
Kafka works in combination with Apache Storm, Apache HBase, and Apache Spark for real-time analysis and rendering of streaming data.

2.2 What is a Data Pipeline?

  • A data pipeline is the main component of data integration. All transformation of data happens in the data pipeline.
  • Implementations vary: some data pipeline tools are Python-based and stream and transform real-time data to the services that need it.
  • A data pipeline automates the movement and transformation of data. It can be a data processing engine that runs inside your application.
  • It is used to transform all incoming data into a standard format so that we can prepare it for analysis and visualization. Some data pipeline engines are built on the Java Virtual Machine (JVM).
  • So, a data pipeline is a series of steps that your data moves through. The output of one step in the process becomes the input of the next. Data, typically raw data, goes in on one side and passes through a series of steps.
  • The steps of a Data Pipeline can include cleaning, transforming, merging, modeling, and more, in any combination.
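The statement that the output of one step becomes the input of the next can be sketched as function composition. The step names and sample rows below are hypothetical and only stand in for real cleaning, transforming, and merging logic.

```javascript
// Compose steps so that the output of one becomes the input of the next.
function pipeline(...steps) {
  return (input) => steps.reduce((data, step) => step(data), input);
}

// Hypothetical steps: clean, transform, merge.
const clean = (rows) => rows.filter((r) => r.value !== null); // drop incomplete rows
const transform = (rows) => rows.map((r) => ({ ...r, value: r.value * 2 })); // rescale values
const merge = (rows) => rows.reduce((total, r) => total + r.value, 0); // combine into one figure

const run = pipeline(clean, transform, merge);
const result = run([{ value: 1 }, { value: null }, { value: 3 }]);
```

Because each step is an ordinary function, steps can be reordered, removed, or combined in any way, which matches the "in any combination" point above.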
2.2.1 Functions of Data Pipeline
  • Ingestion - The data pipeline helps bring data into your system: it takes unstructured data from where it originated into a system where it can be stored and analyzed for making business decisions.
  • Data Integration - The data pipeline also helps in bringing different types of data together.
  • Organization - Organizing data means arranging it, and this arrangement is also made in the data pipeline.
  • Refining the data - This is the process in which we enhance, clean, and improve the raw data.
  • Analytics - After refining, the data pipeline provides processed data on which we can run analyses and make accurate business decisions.
2.2.2 Need Of Data Pipeline
A data pipeline is software that takes data from multiple sources and makes it available to be used strategically for making business decisions.
The primary reason a data pipeline is needed is that it is tough to monitor data migration and manage data errors. Other reasons are listed below -
  • Certain business-critical analysis is only possible when combining data from multiple sources. For making business decisions, we should have a single, unified view of all incoming data.
  • Connections - Data keeps growing all the time: new data arrives and old data is modified, and each new integration can take anywhere from a few days to a few months to complete.
  • Accuracy - The only way to build trust with data consumers is to make sure that your data is auditable. One best practice that’s easy to implement is to never discard inputs or intermediate forms when altering data.
  • Latency - The fresher your data, the more agile your company’s decision-making can be. Extracting data from APIs and databases in real-time can be difficult, and many target data sources, including large object stores like Amazon S3 and analytics databases like Amazon Redshift, are optimized for receiving data in chunks rather than a stream.
  • Scalability - Data volume can increase or decrease over time; we cannot say that on Monday less data will arrive and on the remaining days much more will need processing. The usage of data is not uniform. What we can do is make our pipeline scalable enough to handle any amount of data arriving at variable speed.
2.2.3 Use cases for Data Pipeline


Data Pipeline is useful to several roles, including CTOs, CIOs, Data Scientists, Data Engineers, BI Analysts, SQL Analysts, and anyone else who derives value from a unified real-time stream of user, web, and mobile engagement data. So, use cases for data pipeline are given below -
  • For Business Intelligence Teams
  • For SQL Experts
  • For Data Scientists
  • For Data Engineers
  • For Product Teams

2.3 Apache Kafka is Good for 2 Things

  • Building Real-Time streaming Data Pipelines that reliably get data between systems or applications
  • Building Real-Time streaming applications that transform or react to the streams of data.
2.3.1 Common use cases of Apache Kafka -
  • Stream Processing
  • Website Activity Tracking
  • Metrics Collection and Monitoring
  • Log Aggregation
2.3.2 Features of Apache Kafka
  • One of the features of Kafka is durable Messaging.
  • Apache Kafka relies heavily on the filesystem for storing and caching messages: rather than maintain as much as possible in memory and flush it all out to the filesystem, all data is immediately written to a persistent log on the filesystem without necessarily flushing to disk.
  • Apache Kafka solves the situation where the producer is generating messages faster than the consumer can reliably consume them.
2.3.3 How Apache Kafka Works


Kafka's system design acts as a distributed commit log, where incoming data is written sequentially to disk. There are four main components involved in moving data in and out of Apache Kafka -
  • Topics - A topic is a user-defined category to which messages are published.
  • Producers - Producers publish messages to one or more topics.
  • Consumers - Consumers subscribe to topics and process the published messages.
  • Brokers - Brokers manage the persistence and replication of message data.
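How the four components relate can be shown with a toy in-memory model. This is a sketch for illustration only and not the Kafka client API: the `Broker`, `Producer`, and `Consumer` classes and the `page-views` topic are hypothetical, and real Kafka concerns such as partitions, replication, and disk persistence are left out.

```python
from collections import defaultdict
from typing import Dict, List


class Broker:
    """Toy broker: keeps one append-only log (a list) per topic."""

    def __init__(self) -> None:
        self.logs: Dict[str, List[str]] = defaultdict(list)

    def append(self, topic: str, message: str) -> int:
        self.logs[topic].append(message)   # sequential write to the end of the log
        return len(self.logs[topic]) - 1   # offset of the new message

    def read(self, topic: str, offset: int) -> List[str]:
        return self.logs[topic][offset:]


class Producer:
    def __init__(self, broker: Broker) -> None:
        self.broker = broker

    def publish(self, topic: str, message: str) -> int:
        return self.broker.append(topic, message)


class Consumer:
    """Tracks its own offset per topic, so reading never removes messages."""

    def __init__(self, broker: Broker) -> None:
        self.broker = broker
        self.offsets: Dict[str, int] = defaultdict(int)

    def poll(self, topic: str) -> List[str]:
        messages = self.broker.read(topic, self.offsets[topic])
        self.offsets[topic] += len(messages)
        return messages


broker = Broker()
producer = Producer(broker)
consumer = Consumer(broker)

producer.publish("page-views", "user=1 path=/home")
producer.publish("page-views", "user=2 path=/pricing")
first_poll = consumer.poll("page-views")
producer.publish("page-views", "user=1 path=/docs")
second_poll = consumer.poll("page-views")

print(first_poll)
print(second_poll)
```

Because the consumer only advances an offset, the log itself stays intact, which is what lets a slow consumer catch up later without the producer having to wait.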

3. Data Processing Layer



In the previous layer, we gathered the data from different sources and made it available to go through the rest of the pipeline.
In this layer, our task is to work on the data itself: now that it is ready, we process it and route it to different destinations.
This is the main layer, where the focus is on the pipeline's processing system: the data collected in the previous layer is processed here.
Processing can be done in 3 ways, i.e. batch, near real-time, and hybrid processing.

3.1 Batch Processing System

A pure batch processing system is used for offline analytics. The tool used for this is Apache Sqoop.

3.2 Apache Sqoop

It efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores.
Apache Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

3.2.1 Functions of Apache Sqoop are -
  • Import sequential data sets from mainframe
  • Data imports
  • Parallel data Transfer
  • Fast data copies
  • Efficient data analysis
  • Load balancing

3.3 Near Real-Time Processing System

A pure online processing system is used for online analytics. The tool used for this type of processing is Apache Storm. The Apache Storm cluster makes decisions about the criticality of the event and sends the alerts to the alert system (dashboard, e-mail, other monitoring systems).
3.3.1 Apache Storm
It is a system for processing streaming data in real-time. It adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning, and continuous monitoring of operations.
3.3.2 Features of Apache Storm
  • Fast – It can process one million 100 byte messages per second per node.
  • Scalable – It can do parallel calculations that run across a cluster of machines.
  • Fault-tolerant – When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
  • Easy to operate – It ships with standard configurations that are suitable for production on day one. Once deployed, Storm is easy to operate.

Hybrid Processing System - This combines batch and real-time processing capabilities. The tools used for this type of processing are Apache Spark and Apache Flink.

3.4 Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning, or SQL workloads that require quick iterative access to datasets.
With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared data set in Hadoop.

3.5 Apache Flink

Flink is an open-source framework for distributed stream processing that provides accurate results, even in the case of out-of-order or late-arriving data. Some of its features are -
  • It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining an exactly-once application state.
  • Performs at large scale, running on thousands of nodes with excellent throughput and latency characteristics.
  • Its streaming dataflow execution engine, APIs, and domain-specific libraries cover Batch, Streaming, Machine Learning, and Graph Processing.
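The idea of accurate results with out-of-order data can be illustrated by counting events per window using the time stamped on each event rather than the order of arrival. The sketch below is plain Python, not the Flink API; the window size and the sample events are assumptions, and a real engine additionally needs a mechanism such as watermarks to decide when a window is complete.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 10  # assumed tumbling-window size


def window_start(event_time: int) -> int:
    """Returns the start of the tumbling window that an event time falls into."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_by_event_time(events: List[Tuple[int, str]]) -> Dict[int, int]:
    """Counts events per window by event time, independent of arrival order."""
    counts: Dict[int, int] = defaultdict(int)
    for event_time, _payload in events:
        counts[window_start(event_time)] += 1
    return dict(counts)


# (event_time_seconds, payload) in ARRIVAL order; the events stamped 7 and 3 arrive late.
arrivals = [(1, "a"), (12, "b"), (7, "c"), (15, "d"), (3, "e")]
print(count_by_event_time(arrivals))  # {0: 3, 10: 2}
```

Sorting the same events by event time first gives the same counts, which is the property that makes the result independent of delivery order.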
3.5.1 Apache Flink Use Cases
  • Optimization of e-commerce search results in real-time
  • Stream processing-as-a-service for data science teams
  • Network/Sensor monitoring and error detection
  • ETL for Business Intelligence Infrastructure

4. Data Storage Layer


Next, the primary issue is to keep data in the right place based on usage. For years, relational databases were a successful place to store our data.
But with the new strategic big data enterprise applications, you should no longer assume that your persistence should be relational.
We need different databases to handle the different varieties of data, but using different databases creates overhead. That is why a new concept has been introduced in the database world, i.e. Polyglot Persistence.

4.1 Polyglot Persistence

Polyglot persistence is the idea of using multiple databases to power a single application. Polyglot persistence is the way to share or divide your data into various databases and leverage their strength together.
It takes advantage of the strength of different databases. Here various types of data are arranged in different ways. In short, it means picking the right tool for the right use case.
It’s the same idea behind Polyglot Programming, which is the idea that applications should be written in a mix of languages to take advantage of the fact that different styles are suitable for tackling various problems.
4.1.1 Advantages of Polyglot Persistence -
  • Faster response times - Each kind of query is served by the database best suited to it within one app, which keeps the app's response times low.
  • Helps your app to scale well - Your app scales well with the data. NoSQL databases generally scale well when you model them accurately for the data that you want to store.
  • A rich experience - You get a richer experience when you harness the power of multiple databases at the same time. For example, if you want to search on Products in an e-commerce app, you can use ElasticSearch, which returns results ranked by relevance, a task that a general-purpose document store such as MongoDB is less suited to.
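A minimal sketch of the idea, assuming an e-commerce product catalog: one application-facing class writes each product to two stores and sends each kind of query to the store that suits it. Both stores are in-memory stand-ins (plain dictionaries), and the `ProductCatalog` class and its methods are hypothetical names, not the API of any database.

```python
from typing import Dict, List


class ProductCatalog:
    """One application-facing class backed by two stores with different strengths."""

    def __init__(self) -> None:
        self.document_store: Dict[str, dict] = {}      # stand-in for a document database
        self.search_index: Dict[str, List[str]] = {}   # stand-in for a search engine: word -> product ids

    def save(self, product_id: str, product: dict) -> None:
        self.document_store[product_id] = product      # the full record lives here
        for word in product["name"].lower().split():   # searchable terms live here
            self.search_index.setdefault(word, []).append(product_id)

    def get(self, product_id: str) -> dict:
        return self.document_store[product_id]

    def search(self, query: str) -> List[str]:
        """Ranks products by how many query words match their name."""
        scores: Dict[str, int] = {}
        for word in query.lower().split():
            for product_id in self.search_index.get(word, []):
                scores[product_id] = scores.get(product_id, 0) + 1
        return sorted(scores, key=lambda pid: (-scores[pid], pid))


catalog = ProductCatalog()
catalog.save("p1", {"name": "Red Running Shoes", "price": 59})
catalog.save("p2", {"name": "Blue Running Jacket", "price": 89})
catalog.save("p3", {"name": "Red Wool Scarf", "price": 19})

print(catalog.search("red shoes"))  # ['p1', 'p3']
print(catalog.get("p1")["price"])   # 59
```

The overhead mentioned earlier is visible even here: every write has to update both stores, and keeping them consistent becomes the application's responsibility.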

4.2 Tools used for Data Storage

4.2.1 HDFS
  • HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.
  • HDFS also makes data available to applications for parallel processing. HDFS is built to support applications with large data sets, including individual files that reach into the terabytes.
  • It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.
  • When HDFS takes in data, it breaks the information down into separate pieces and distributes them to different nodes in a cluster, allowing for parallel processing.
  • The file system also copies each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack.
  • HDFS and YARN form the data management layer of Apache Hadoop.
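The splitting and replication steps described above can be sketched in a few lines. This is a toy illustration, not HDFS's actual placement policy: the block size, replication factor, node names, and rack names are all assumptions chosen to keep the output small.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4    # bytes; tiny on purpose (real HDFS blocks are far larger)
REPLICATION = 3   # assumed number of copies per block

# Hypothetical cluster layout: node name -> rack name.
NODES: Dict[str, str] = {
    "node1": "rackA", "node2": "rackA",
    "node3": "rackB", "node4": "rackB",
}


def split_into_blocks(data: bytes) -> List[bytes]:
    """Breaks the data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def place_replicas(block_index: int) -> List[str]:
    """Picks REPLICATION distinct nodes so that the copies span at least two racks."""
    names = sorted(NODES)
    first = names[block_index % len(names)]  # rotate the first replica across nodes
    other_rack = [n for n in names if NODES[n] != NODES[first]]
    same_rack = [n for n in names if NODES[n] == NODES[first] and n != first]
    # One copy on the first node, one on a different rack, the rest wherever is left.
    candidates = [first, other_rack[0]] + same_rack + other_rack[1:]
    return candidates[:REPLICATION]


def store(data: bytes) -> List[Tuple[bytes, List[str]]]:
    return [(block, place_replicas(i)) for i, block in enumerate(split_into_blocks(data))]


for block, replicas in store(b"hello hdfs!"):
    print(block, replicas)
```

Because each block sits on several nodes, different blocks of the same file can be read and processed in parallel, and losing one node or one rack does not lose any block.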
4.2.1.1 Features of HDFS
  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of the NameNode and DataNode help users to quickly check the status of the cluster.
  • Streaming access to file system data.
  • HDFS provides file permissions and authentication.
4.2.2 Gluster File System (GlusterFS)
The right storage solution must provide elasticity in both storage and performance without affecting active operations.
Scale-out storage systems based on GlusterFS are suitable for unstructured data such as documents, images, audio and video files, and log files. GlusterFS is a scalable network filesystem.
Using this, we can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks.
  • It's Open Source.
  • You can deploy GlusterFS with the help of commodity hardware servers.
  • Linear scaling of performance and storage capacity.
  • Scale storage size up to several petabytes, which can be accessed by thousands of servers.
4.2.2.1 Use Cases For GlusterFS include
  • Cloud Computing
  • Streaming Media
  • Content Delivery
4.2.3 Amazon S3
  • Amazon Simple Storage Service (Amazon S3) is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web.
  • It is designed to deliver 99.999999999% durability and scale past trillions of objects worldwide.
  • Customers use S3 as primary storage for cloud-native applications, as a bulk repository, or "data lake," for analytics, as a target for backup & recovery and disaster recovery, and with serverless computing.
  • It's simple to move large volumes of data into or out of S3 with Amazon's cloud data migration options.
  • Once data is stored in Amazon S3, it can be automatically tiered into lower cost, longer-term cloud storage classes like S3 Standard - Infrequent Access and Amazon Glacier for archiving.
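The automatic tiering in the last point is driven by a lifecycle configuration attached to a bucket. The sketch below only builds and prints such a configuration as a Python dictionary; the rule ID, the `logs/` prefix, and the day counts are illustrative assumptions, and the exact shape should be checked against the current AWS documentation before use.

```python
import json

# Hypothetical rule: the ID, object prefix, and day counts are illustrative assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-down-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # S3 Standard - Infrequent Access
                {"Days": 90, "StorageClass": "GLACIER"},      # Amazon Glacier
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With the boto3 SDK, a dictionary of this shape would typically be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`, together with the bucket name; that call needs AWS credentials and is not executed here.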

5. Data Query Layer


This is the layer where strong analytic processing takes place. This is a field where interactive queries are necessary, and it’s a zone traditionally dominated by expert SQL developers. Before Hadoop, storage was very limited, which made analytics a long process.


Monday, 4 June 2018

6/04/2018 03:53:00 pm

Unit Testing, TDD and BDD in Machine Learning and Deep Learning


Introduction to Test Driven Development (TDD)

Test Driven Development (TDD) is a development pattern in which tests are written before the code they exercise. It is a process that enables developers to write code against an explicit statement of the application's intended behavior.
The requirements for the Test Driven Development process are mentioned below -
  • Detect changes in intended behavior.
  • A rapid iteration cycle that produces working software after each iteration.
  • Identify bugs. If no test is failing but a defect is still found, it is not treated as a bug; it is treated as a new feature.
Tests can be written for functions and methods, whole classes, programs, web services, whole machine learning pipelines, neural networks, random forests, mathematical implementations and many more.

Test Driven Development Lifecycle

The TDD cycle enables the programmer to write functions in small modules. The cycle consists of three steps that are described below -
  • Failed Test (RED) - The first step of TDD is to write a failing test for the application. In terms of Machine Learning, a failing test might check the output of an algorithm that always predicts the same thing. It is a kind of baseline test for Machine Learning algorithms.
  • Pass the Failed Test (GREEN) - After writing the failing test, the next move is to make it pass. The failing test is divided into a number of small failing tests, which are then exercised by passing random values and dummy objects.
  • Refactoring the Code - After the test passes, the code needs to be refactored. While refactoring, one must keep in mind that changes to the code should not affect its behavior.
If the developer adds special handling to the code, such as an if statement, the change no longer counts as refactoring. If a previously passing test changes during refactoring, the code has to go through the test cycle again.
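The baseline idea from the RED step can be sketched with Python's built-in `unittest` module. The tests would be written first, failing while `MajorityClassifier` does not exist yet (RED), and the class is then added to make them pass (GREEN); the block shows the GREEN state. The class, the `accuracy` helper, and the sample labels are hypothetical and only illustrate the cycle.

```python
import unittest
from collections import Counter
from typing import List


class MajorityClassifier:
    """Baseline model that always predicts the most common training label."""

    def fit(self, labels: List[str]) -> "MajorityClassifier":
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_samples: int) -> List[str]:
        return [self.prediction] * n_samples


def accuracy(predicted: List[str], actual: List[str]) -> float:
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


class BaselineTest(unittest.TestCase):
    def test_baseline_always_predicts_same_label(self):
        model = MajorityClassifier().fit(["spam", "ham", "ham"])
        self.assertEqual(model.predict(4), ["ham"] * 4)

    def test_baseline_accuracy_is_majority_share(self):
        actual = ["ham", "ham", "spam", "ham"]
        model = MajorityClassifier().fit(actual)
        self.assertAlmostEqual(accuracy(model.predict(len(actual)), actual), 0.75)


# Run the tests programmatically so the outcome is available as a value.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BaselineTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("GREEN" if result.wasSuccessful() else "RED")
```

Any real model added later can be tested against the same data with the requirement that its accuracy exceeds the baseline's 0.75, which turns the baseline into a regression check.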

Acceptance Test Driven Development (ATDD)

ATDD stands for Acceptance Test Driven Development. This technique is applied before development starts and brings customers, testers, and developers into the loop.
Together they decide the acceptance criteria and work accordingly to meet the requirements.
ATDD helps to ensure that all project members understand what needs to be done and implemented. The failing tests give quick feedback that a requirement is not being met.

Advantages of Acceptance Test Driven Development

  • Because ATDD comes first, it helps to reduce defects and bug-fixing effort as the project progresses.
  • ATDD focuses only on ‘What’ and not ‘How’, which makes it easier to meet the customer’s requirements.
  • ATDD makes developers, testers, and customers work together, which helps everyone understand what is required from the system.

 

Importance of Test Driven Development in Machine and Deep Learning

Many times, the code doesn’t raise an error, yet the output is not what we expected or wanted.
Let us assume that we want to use a package and we start by importing it. There is a chance that the package has already been imported and we are importing it again.
To avoid such a situation, we want to test whether the package is already imported. When we submit the whole code to the test case, the test case should be able to detect whether the package is already imported. This avoids duplication.
Similarly, when we use pre-trained models for predictions, the models can be huge and we want to load each model only once. If we load it multiple times, processing slows down because memory is occupied unnecessarily. In this case, too, duplication has to be avoided.
Other cases that we could look at are the sufficient conditions. If we create a function, the function takes an input and returns an output. So, when we use the concept of necessary and sufficient conditions, we’re interested in knowing the sufficient condition to say that the function is working properly. To give an example of a necessary condition, each step in the function should be error free.
If we create a function and it raises an error on a given input, for example an indentation error, the function is not well defined. So, one of the necessary conditions is error-free steps. But, if the function runs successfully and gives an output, does that mean we have the correct answer?
Let’s say we have two functions in a package, addition and multiplication, but the developer has actually given the code of addition for multiplication and vice versa (a typo while defining the functions).
If we use the functions directly, we will get a result, but not the expected one. So, we could create test cases with known inputs and known outputs, not just one but a few test examples, and set the condition that if all the test cases pass, then the given function is treated as correct.
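The swapped-function scenario can be sketched as below. The function names follow the text; the test cases and the `failing_cases` helper are illustrative assumptions. Note that the first case, 2 and 2, passes even with the swapped bodies, which is why a single example is not sufficient.

```python
import operator


def addition(a, b):
    return a * b   # deliberate bug: this is the body of multiplication


def multiplication(a, b):
    return a + b   # deliberate bug: this is the body of addition


# Known examples as (a, b, expected sum, expected product).
CASES = [(2, 2, 4, 4), (2, 3, 5, 6), (0, 5, 5, 0)]


def failing_cases(add, mul):
    """Returns the inputs for which either function gives the wrong answer."""
    return [
        (a, b)
        for a, b, expected_sum, expected_product in CASES
        if add(a, b) != expected_sum or mul(a, b) != expected_product
    ]


print(failing_cases(addition, multiplication))      # [(2, 3), (0, 5)]
print(failing_cases(operator.add, operator.mul))    # []
```

Both functions run without raising an error, so the necessary condition holds; only the known-output checks reveal that the results are wrong.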

Simple Testing Module in Python

First, a simple testing module implemented in Python is described; it is then used for TDD in Machine Learning and Deep Learning.
To start writing tests, one first has to write a failing test. A simple failing test is described below -
In the above example, a NumGues object is instantiated. Before running the tests, save the script under a name that ends with _tests.py. Then move to the script's directory and run the following command -
