XenonStack

A Stack Innovator

Showing posts with label machine learning. Show all posts

Thursday, 19 December 2019

12/19/2019 05:38:00 pm

Chatbot Development and Platform with Machine Learning




Overview of Building ChatBots with Deep Learning

A chatbot is a conversational interface that uses Machine Learning and Deep Learning as its backbone. Chatbots come in several forms, including text-, voice-, and image-based interaction.
The growth of chatbots has opened up new areas of customer engagement and new ways of doing business in the form of conversational commerce. Chatbots are a technology that businesses can rely on, and they may eventually supersede older models and make some apps and websites redundant. A chatbot is a computer program that mimics human communication in its natural format, including text or spoken language, using AI techniques such as Natural Language Processing, image and video processing, and audio analysis. A notable characteristic of learning-based bots is that they learn from past interactions and become more capable over time.
Chatbots work in two ways: rule-based and smart machine-based. Rule-based chatbots give predefined responses from a database, based on the keywords in the user's message. Smart machine-based chatbots, by contrast, draw their capabilities from Artificial Intelligence and Cognitive Computing and adapt their behavior based on customer interactions.
A chatbot can be one of two types: goal-oriented (such as Siri, Alexa, and Cortana) or general conversation (such as the Microsoft Tay bot).
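The rule-based approach described above can be sketched in a few lines of Python. This is only an illustration: the `RULES` table, the `FALLBACK` text and the `reply` function are hypothetical names, and a real bot would load its responses from a database rather than a hard-coded dictionary.

```python
import re

# Hypothetical keyword-to-response table (assumption for illustration only).
RULES = {
    "price": "Our pricing page lists all current plans.",
    "refund": "Refunds are processed within a few business days.",
    "hello": "Hello! How can I help you today?",
}
FALLBACK = "Sorry, I did not understand that. Could you rephrase?"


def reply(message: str) -> str:
    """Return the predefined response of the first rule (in table order) whose keyword appears."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    for keyword, response in RULES.items():
        if keyword in words:
            return response
    return FALLBACK
```

Calling `reply("What is the PRICE?")` returns the pricing response, while any message without a known keyword gets the fallback. This is exactly the "fixed set of answers" limitation that learning-based bots try to overcome.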
Conversation framework of ChatBot acts in three stages:

Business Challenge for Building Chatbots

  • A fixed set of answers
  • Integration of ChatBot
  • Understanding of Problem
  • Security Issues
  • Lack of Human Behaviour and Intentions
  • Managing a ChatBot
  • NLP limitations

Solution Offered for Implementing ChatBots with Deep Learning

Deep Learning, which is inspired by the functioning of the human brain, uses layered (composite) architectures to learn representations of data. The neural network is the elementary building block of Deep Learning. A neural network is an artificial model of the human brain's network, built using hardware and software.

ChatBots Implementation Techniques

  • Streaming of incoming data through the backend
  • Create a model using Natural Language Processing
  • Create a Natural Conversational Flow
  • Add features to automate the process
  • Invite Customers to Join
 

Monday, 16 December 2019

12/16/2019 05:20:00 pm

Predictive Maintenance Using Machine learning Techniques



Understanding Predictive Maintenance Applications

Predictive maintenance is gaining popularity among enterprises. It predicts the failure of a system so that action can be taken in advance; the actions could include corrective action, replacement of the system, or even a planned failure. This gives enterprises cost savings, greater predictability, and improved availability of their systems. Predictive maintenance avoids the limits of other maintenance approaches and maximizes the use of resources. It also detects irregularities and failure patterns and provides real-time alerts, which facilitate efficient maintenance of the affected components. AI-enabled predictive maintenance is distinctive in that, instead of just predicting impending failure, it also attempts to provide outcome-focused instructions for operations and maintenance from analytics.
Predictive maintenance covers diverse application areas, such as -
  • Manufacturing industry
  • Information and technology
  • Aerospace
  • Heavy-Machinery sector
  • Predicting the future performance of a subsystem or a component to make RUL (Remaining Useful Life) estimation.
In this use case, we will guide you through how to build a machine learning platform for predictive maintenance.
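As a rough illustration of the RUL (Remaining Useful Life) estimation mentioned above, the sketch below fits a straight line to a wear indicator and extrapolates to a failure threshold. The readings, the threshold value and the function names are assumptions made for the example; real degradation is rarely this linear, and the post does not state which model is used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_rul(hours, wear, failure_threshold):
    """Hours remaining until the fitted wear line crosses the failure threshold.

    Returns None when wear is not increasing, because a linear
    extrapolation then never reaches the threshold.
    """
    slope, intercept = fit_line(hours, wear)
    if slope <= 0:
        return None
    hours_at_failure = (failure_threshold - intercept) / slope
    return max(0.0, hours_at_failure - hours[-1])


# Hypothetical readings: a wear indicator sampled every 100 operating hours.
hours = [0, 100, 200, 300, 400]
wear = [1.0, 1.5, 2.0, 2.5, 3.0]
rul = estimate_rul(hours, wear, failure_threshold=5.0)
```

With these made-up numbers the wear grows by 0.5 per 100 hours, so the fitted line reaches the threshold of 5.0 at about 800 hours, leaving roughly 400 hours of useful life.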

Business Challenge for Enabling Predictive Maintenance

  • Monitoring of Assets in Real-Time via sensor data patterns to predict the breakdown of Assets.
  • Production systems deteriorate with time and need maintenance.
  • The usual way to keep a system in good condition is to apply preventive maintenance practices, or to repair it once malfunctions or equipment breakdowns are clearly detected. All of this affects quality, cost and, in general, productivity.
Beyond this, uncertainty about machine reliability at any given time also affects product and production delivery times.

Predictive Maintenance Analytics Pipeline

Collecting targeted data
The targeted data reside in remote locations and enter the analysis pipeline from sources such as sensors, meters, and supervisory control systems. Collect data from all of the remote data sources in order to learn and continually make better, more informed business decisions.
Determining Analytics Pipeline
Establish an Advanced Analytics Pipeline based on the specific operation. Cloud analytics should be balanced to reduce the burden of streaming perishable PdM (predictive maintenance) data to the Cloud deployment. Follow a distributed approach: detect and respond to local events at the Cloud Dataflow consumer step and take immediate action on streaming data, while simultaneously integrating additional data sources in the Cloud.
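Detecting local events on streaming data can be sketched with a rolling window, as below. This is a plain-Python illustration under stated assumptions: the window size, the three-sigma rule and the `detect_events` name are choices made for the example, not part of Cloud IoT Core, Cloud Pub/Sub or any other product named in this post.

```python
from collections import deque
from statistics import mean, pstdev


def detect_events(readings, window=5, n_sigmas=3.0):
    """Yield (index, value) for readings far from the recent rolling window."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > n_sigmas * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline window
        recent.append(value)


# Hypothetical vibration readings arriving one at a time from a sensor.
vibration = [0.50, 0.52, 0.49, 0.51, 0.50, 0.51, 1.40, 0.50, 0.49]
alerts = list(detect_events(vibration))
```

Because the function is a generator, it can consume readings as they arrive and raise an alert immediately, before the data is forwarded for heavier analysis elsewhere.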

Technology Stack -

  • Python
  • Flask
  • Cloud IoT Core
  • Cloud Pub/Sub

Source: 
XenonStack/Use-Cases

Friday, 13 December 2019

12/13/2019 05:03:00 pm

Advanced Threat Analytics and Intelligence

Overview of Advanced Threat Analytics and Intelligence

The security landscape has changed dramatically in recent years. Cyber-attacks have become more pervasive, persistent, and proficient than ever at evading and compromising traditional security architecture. Cyber threats have also become more complex. Many companies face stealthy attacks on their systems. These attacks target intellectual property and consumer information for theft, or encrypt important data for ransom. Therefore, to protect your IT assets, you must know what is coming, secure your digital interactions, detect and manage inevitable breaches, and safeguard business continuity and regulatory compliance.
Threat Detection is the art of identifying attacks on a computer. While there is a large variety of cyber security attacks, most of them fit into one of four categories -
  • Probe
  • Denial of Service (DoS)
  • User to Root
  • Remote to User
Hence, companies are looking for Cyber Security Services and Solutions to ensure the security of their IT network. In this use case, we will guide you through how we built an effective cybersecurity and threat detection system using machine learning.

Apache Metron Overview

Apache Metron is a cybersecurity application framework that provides the ability to ingest, process, and store various security data feeds at scale in order to detect cyber anomalies and enable organizations to act against them rapidly.

Apache Spot Architecture for Cyber Security

Apache Spot is a cybersecurity project that aims to bring Advanced Analytics to all IT telemetry data on an open, scalable platform. Apache Spot expedites threat detection, investigation, and remediation via machine learning and consolidates all enterprise security data into a comprehensive IT telemetry hub based on open data models.

Threat Detection Using Deep Learning

A multi-layered Deep Learning-based system is robust, scalable, and adaptable. All identified incidents and patterns are assigned a risk score, which helps to investigate a breach, control data loss, and take precautionary action for the future.

Threat Detection Using Machine Learning

A Machine Learning-based Threat Detection system automates the process of extracting insights from file samples through better generalization at identifying unknown variations. It also helps in reducing human analysis time.

Challenges to Real-Time Cyber Threat Intelligence

  • To perform Real-Time Threat Intelligence on trillions of messages per year.
  • Storing and Processing the unstructured security data.
  • Combine Machine Learning and Predictive Analytics to perform Real-Time Threat Analytics.

Solution Offerings for Threat Detection and Cyber Security

The solution delivers Threat Analytics and Intelligence by automating the process of Threat Detection and Analysis. The following steps are performed to automate the process -
  • Network Dataset
  • Pre-Processing of Data
  • Feature Extraction
  • Reduce Data Amount
  • Improve Accuracy
  • Avoid Overfitting

Training and Testing of Data Using Classification Models

  • Decision Tree
  • Random Forest
  • Naive Bayes
  • KNN
  • Result Analysis
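As a small illustration of training and testing a classifier from the list above, here is a from-scratch K-Nearest Neighbours (KNN) sketch in plain Python. The two features, the sample values and the labels are entirely made up for the example (the labels borrow the DoS and Remote-to-User category names); a real system would use a proper network dataset, scaled features and a tested library implementation.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, point, k=3):
    """Label a point by majority vote among its k nearest training samples."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (connections per second, failed-login ratio).
train_X = [(2, 0.0), (3, 0.1), (4, 0.0), (90, 0.1), (110, 0.0), (5, 0.9), (4, 0.8)]
train_y = ["normal", "normal", "normal", "dos", "dos", "r2l", "r2l"]

# Held-out samples used only for testing.
test_X = [(3, 0.05), (100, 0.05), (5, 0.85)]
test_y = ["normal", "dos", "r2l"]

predictions = [knn_predict(train_X, train_y, x) for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

The final `accuracy` value corresponds to the "Result Analysis" step. Note that the features are left unscaled here, so the connection rate dominates the distance; in practice features are usually normalized before KNN is applied.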

Wednesday, 11 December 2019

12/11/2019 05:15:00 pm

Sentiment Analytics Solutions with Deep Learning



Overview of Sentiment and Intent Analysis

  • Sentiment Analysis is the contextual mining of text to identify and extract information and to understand the social sentiment around a brand. It is a text classification tool that analyzes incoming messages and labels them as positive, negative, or neutral.
  • Sentiment Analysis using Natural Language Processing involves Supervised Learning and Neural Network approaches. Sentiment Analysis using Deep Learning includes the Visual Keras Deep Learning approach.
  • Intent Analysis involves understanding the emotions and intent of a user. It involves choosing the right events, tracking behavior against retention, identifying the user’s needs, and delivering real-time data insights that derive value from Predictive Analytics. Intent Analysis using automated text classification with Machine Learning involves supervised and unsupervised text classification. Intent Analysis using Deep Learning involves Convolutional Neural Networks.

Business Challenge for Sentiment Analysis Adoption

  • Sarcasm Detection
  • Evaluate text to predict emotions
  • Parallel Computing for Massive Data
  • To improve algorithm precision

Solution Offered for Real Time Analysis

A real-time solution focusing on Twitter trends and tweets involves –
  • Collecting data from Twitter using Tweepy, a Python client for the Twitter API.
  • Natural Language Processing to clean textual data and extract features.
The various steps included are –
  • Sentence Tokenization
  • Word Tokenization
  • Regular Expressions
  • Removing Stopwords
  • Working on n-grams
The algorithms and models are Supervised Learning algorithms for text mining, trained on a massive volume of data for better feature extraction and better accuracy in predicting an attribute of the author, such as sentiment.
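The text-cleaning steps listed above (sentence tokenization, word tokenization, regular expressions, stopword removal and n-grams) can be sketched with the standard library alone. The stopword list, the sample tweet and all function names below are assumptions for illustration; production pipelines typically rely on an NLP library with far more careful tokenization.

```python
import re

# A tiny illustrative stopword list; real projects use a much larger one.
STOPWORDS = {"a", "an", "the", "is", "it", "this", "to", "and", "i", "so"}


def sentence_tokenize(text):
    """Split text into sentences on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean(sentence):
    """Remove URLs and @mentions, and drop the '#' of hashtags."""
    sentence = re.sub(r"https?://\S+", "", sentence)
    sentence = re.sub(r"@\w+", "", sentence)
    return sentence.replace("#", "")


def word_tokenize(sentence):
    """Lower-case the sentence and extract alphabetic word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tweet = "I love this phone! The battery is great. @shop thanks https://example.com #happy"
features = []
for sentence in sentence_tokenize(tweet):
    tokens = remove_stopwords(word_tokenize(clean(sentence)))
    features.append((tokens, ngrams(tokens)))
```

Each entry of `features` pairs the cleaned tokens of one sentence with its bigrams, which is the kind of input a supervised sentiment classifier would then be trained on.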

Wednesday, 13 September 2017

9/13/2017 10:54:00 am

What is SEMANTIC ANALYSIS?

The word semantic is a linguistic term. It means something related to meaning in a language or logic.
In a natural language, semantic analysis relates the structures and occurrences of words, phrases, clauses, paragraphs, etc. in order to understand the idea of what is written in a particular text. Do the formation of the sentences and the occurrences of the words make any sense?
The challenge we face in a technologically advanced world is to make the computer understand language or logic as well as a human does.
Semantic analysis requires rules to be defined for the system. These rules mirror the way we think about a language, and we ask the computer to imitate them. For example, “apple is red” is a simple sentence. A human understands that there is something called an apple, that it is red in color, and that red is a color.
For a computer, this is an alien language. The linguistic concept here is that the sentence has a structure: subject-predicate-object, or s-p-o for short, where "apple" is the subject, "is" is the predicate and "red" is the object. Similarly, other linguistic nuances are used in semantic analysis.


Need for Semantic Analysis

We want the computer to understand as much as we do because we have a lot of data and have to make the most of it.
Let us restrict ourselves strictly to text data. Extracting the appropriate data (results) for a query is a challenging task. The result can be a whole document or just an answer to the query, depending on the query itself.
Assume that we have a million text documents in our database and a query whose answer is in those documents. The challenges are
  • Getting the appropriate documents
  • Listing them in the ranked order
  • Giving the answer to the query if it is specific

Difference between Keyword-based Search and Semantic Search

In a search engine, keyword-based search is a technique applied to text documents based on the words found in the query. The query first goes through text cleaning and preprocessing, and the documents are then searched for the words used in the query.
The documents are returned in order of the number of matches between the query words and each document.
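A minimal sketch of this keyword-matching behaviour is shown below. The three documents and the function names are made up for the example, and real search engines use more refined scoring than a raw match count.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(query, documents):
    """Rank document ids by how many query-word occurrences each document contains."""
    query_words = set(tokenize(query))
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[w] for w in query_words)
        if score:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical document collection.
docs = {
    "d1": "Usain Bolt is a Jamaican sprinter. Bolt holds the 100 m world record.",
    "d2": "A bolt is a threaded fastener used with a nut.",
    "d3": "Sprinters train for explosive speed.",
}
ranking = keyword_search("Usain Bolt", docs)
```

The document about fasteners is returned too, simply because it contains the word "bolt". Keyword matching alone cannot tell that the query is about an athlete.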
In semantic search, we take into account the frequency of the words, the syntactic structure of the natural language, and other linguistic elements. In semantic search, the system understands the actual requirement behind the search query.
When we search for “Usain Bolt” in Google, it returns the most appropriate documents and web pages about the famous athlete, even though many other people have the same name, because the search engine understands that we are searching for an athlete.
Keyword Based Search

Now, if we are a little more specific and search for Usain Bolt birthday, Google returns the following:
Semantic based Search

Since Usain Bolt is quite a famous figure, this may not surprise us. But there are a large number of other famous personalities, and it is close to impossible to store all of this information manually and return it accurately when a user submits a query.
Moreover, the search query is not constant. Each individual may phrase the query differently. Semantic techniques are applied here to store the data and fetch the results at query time.
Let us look at different ways of phrasing the above query on Google.
Different Ways of Searching on Google

From the above figures, it is evident that however the search query is phrased, the search engine understands the intent of the user.


Semantic Search based on Domain Ontology

Earlier, we saw the search efficiency of Google, which searches irrespective of any particular domain. Searches of this kind are based on open information extraction. What if we require a search engine for a specific domain?
The domain may be anything: a college, a particular sport, a specific subject, a famous location, tourist spots, etc. For example, suppose we have a college and we want to create a search engine only for that college, such that any text query regarding the college is answered by the search engine. For this purpose, we create a domain ontology.


What is Ontology?

An ontology is a set of concepts together with their definitions, descriptions, properties, and relations. The relations here are relations among concepts and relations among relations.


How do we create Ontology?

Before starting to create an ontology, we first choose the domain under consideration. We list all the concepts related to that domain along with their relations. A data structure is already defined to represent the ontology: ontologies are created as .owl files.
An OWL file represents concepts as classes, and classes have subclasses, properties, instances, data types, and much more. All this information is in XML form. For simplicity, tools such as Protege are available to create ontologies.

Creating Ontology


Storing the Unstructured Text Data in RDF Form

Once the ontology has been created from the concepts, we are ready to use it to find the appropriate document for a query in a search engine. The text documents, which are available in unstructured form, need a structure, which we call the semantic structure.
This is where RDF (Resource Description Framework) comes in. RDF is a structure in which we store the information given in text in the form of triples. These triples are similar to the triples discussed earlier, i.e. the s-p-o form.
Machine Learning and text analysis are used to extract the required data and store it in the form of triples. This way the knowledge base is ready: it consists of both the ontology and the structured form of the text data as RDF.
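To make the triple form concrete, the sketch below keeps a handful of s-p-o triples in a Python set and answers simple pattern queries. It is only a plain-Python analogy: the facts and the `match` helper are hypothetical, and it does not use RDF syntax or any RDF library.

```python
# Each fact is a (subject, predicate, object) triple, as in the "apple is red" example.
triples = {
    ("apple", "is", "red"),
    ("apple", "type", "fruit"),
    ("banana", "is", "yellow"),
    ("banana", "type", "fruit"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern, sorted; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for s, p, o in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# "Which things are fruits?" and "What do we know about apple?"
fruits = [s for s, _, _ in match(triples, predicate="type", obj="fruit")]
apple_facts = match(triples, subject="apple")
```

Leaving a position as a wildcard plays roughly the role that a variable plays in a triple-pattern query language: the store returns every triple that fits the fixed positions.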
Storing Unstructured Data in RDF Form

Architecture for Real-Time Semantic Search Engine
Implementation of the architecture on "Computer Science" Domain -
The complete architecture for the search engine would be delivered as "Platform as a Service (PaaS)". Let us consider "Computer Science" as an example domain. Here, the user can search for faculty CVs from the desired universities and research areas based on the query. The steps to build a Semantic Search Engine are -
  • Crawl the documents (DOC, PDF, XML, HTML etc) from various universities and classify faculty profiles
  • Convert the unstructured text present in various formats to structured RDF form as described in earlier sections.
  • Build Ontology for Computer Science Domain
  • Store the data (both the ontology and the RDF) in an Apache Jena triple store
  • Use the SPARQL query language on the data

Finally, the user can search for the required data by university or research area. Additionally, if a user has project information, i.e. a project of his/her own in Computer Science, the user can submit the project and the system analyzes it to identify appropriate faculty profiles working in a similar subject area. Various Big Data components are necessary to make real-time search feasible.

Linguistic and Semantic Search


Data Ingestion using our Web Crawler Service

Starting with the data extraction process, a web crawler was built that scrapes content from any university or educational website. This web crawler is built using the Akka framework, which is highly scalable, concurrent, and distributed. It also supports almost all file types, such as HTML, DOC, PDF, text files, and even images.
Continue Reading The Full Article at - XenonStack.com/Blog

Tuesday, 12 September 2017

9/12/2017 11:29:00 am

Introduction to Time Series Data

A time series is defined as a set of observations taken at successive points in time, typically at regular intervals. For example, a record of each user's login details captured at regular intervals can be treated as a time series. On the other hand, data collected all at once or at irregular times is not treated as time series data.
Time series data can be classified into two types -
  • Stock Series - It is a measure of attributes at a particular point in time and can be thought of as a stocktake.
  • Flow Series - It is a measure of activity over a specific interval of time. It contains effects related to the calendar.
A time series is a sequence of observations taken successively at equally spaced points in time. It appears naturally in many application areas such as economics, science, environment, medicine, etc. In many practical real-life problems, data are observed sequentially at equal intervals and are correlated with each other. This is because, when the same quantity is observed repeatedly at regular intervals, successive observations tend to depend on one another.

With time series, it becomes possible to anticipate what will happen in the future, since future events depend on the current situation. It is useful to divide the time series into a historical period and a validation period. The model is built on the historical data and is then applied to the validation set of observations. This process gives an idea of how the model will perform in forecasting.
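The historical/validation split described above can be sketched as follows. The observations are made up, and the "model" is deliberately the simplest possible baseline (repeat the last historical value); it stands in for whatever forecasting model is actually being evaluated.

```python
def chronological_split(series, validation_size):
    """Split a series into (historical, validation) without shuffling."""
    if not 0 < validation_size < len(series):
        raise ValueError("validation_size must be between 1 and len(series) - 1")
    return series[:-validation_size], series[-validation_size:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly observations, oldest first.
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
historical, validation = chronological_split(series, validation_size=3)

# Naive baseline model: every future value equals the last historical value.
forecast = [historical[-1]] * len(validation)
mae = mean_absolute_error(validation, forecast)
```

The key design choice is that the split is chronological: the validation period always comes after the historical period, so the model is never evaluated on data that precedes what it was built from.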

A time series can also be viewed as a stochastic process, as it represents a vector of stochastic variables observed at regular intervals of time.

Components of Time Series Data

To analyze time series data, we need to understand the underlying pattern of the time-ordered data. This pattern is composed of different components which together yield the observed series.
The components of time series data are given below -
  • Trend
  • Cyclical
  • Seasonal
  • Irregular
Components of Time Series Data


Trend - It is a long-term pattern present in the time series. Its effect need not be regular, and it can be positive or negative, linear or nonlinear. It represents the low-frequency variation in the series, with the high- and medium-frequency fluctuations filtered out.

If the time series does not contain any increasing or decreasing pattern, then the series is said to be stationary in the mean.

There are two types of the trend -
  1. Deterministic - In this case, the effects of shocks to the time series are eventually eliminated, i.e. the series reverts to the trend in the long run.
  2. Stochastic - In this case, the effects of shocks are never eliminated, as they permanently change the level of the time series.
A stochastic process that is stationary around a deterministic trend is known as a trend-stationary process.

Cyclic - A pattern that exhibits up and down movements around a given trend is known as a cyclic pattern. It is a kind of oscillation present in the time series. The duration of a cyclic pattern depends on the industry and the business problem being analysed, because the oscillations depend on the business cycle.

Cyclic variations are larger variations that repeat in a systematic way over time. Their period is not fixed and is usually at least 2 months in duration. The cyclic pattern is represented by a well-shaped curve and shows contraction and expansion of the data.

Seasonal - It is a pattern that reflects regular fluctuations. These short-term movements occur due to seasonal factors and people's customs. The data undergoes regular, predictable changes that occur at regular calendar intervals. A seasonal pattern always has a fixed and known period.
The main sources of seasonality are given below -
  • Climate
  • Institutions
  • Social habits and practices
  • Calendar
How is the seasonal component estimated?

If a deterministic analysis is performed, the seasonality is assumed to remain the same for each corresponding interval of time, so it can easily be modelled with dummy variables. A stochastic analysis does not make this assumption: the seasonal component changes throughout the time series, so dummy variables are not appropriate.

Different models to create a seasonal component in time series are given below -
  • Additive Model - In this model, the seasonal component is added to the trend component.
  • Multiplicative Model - In this model, the seasonal component is multiplied by the intercept if no trend component is present in the time series. If the time series has a trend component, the sum of the intercept and the trend is multiplied by the seasonal component.
Irregular - It is the unpredictable component of a time series. These fluctuations cannot be explained by any other component and are known as the random component. When the trend-cycle and seasonal components are removed, what remains is the residual time series. These are short-term fluctuations that are not systematic in nature and have no clear pattern.
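The additive and multiplicative models can be written as y_t = T_t + S_t + I_t and y_t = T_t * S_t * I_t respectively, where T is the trend (including the intercept), S the seasonal component and I the irregular component. The sketch below builds both kinds of series from made-up components and then recovers the additive seasonal pattern by averaging the detrended values per season. The quarterly period, the numbers and the function names are assumptions for illustration, and the trend is taken as known rather than estimated.

```python
PERIOD = 4  # hypothetical quarterly seasonality


def additive(trend, seasonal, irregular):
    """y_t = T_t + S_t + I_t"""
    return [t + s + i for t, s, i in zip(trend, seasonal, irregular)]


def multiplicative(trend, seasonal_factor, irregular_factor):
    """y_t = T_t * S_t * I_t, with factors centred on 1 instead of 0."""
    return [t * s * i for t, s, i in zip(trend, seasonal_factor, irregular_factor)]


def seasonal_means(series, trend, period):
    """Estimate an additive seasonal component as the mean detrended value per season."""
    detrended = [y - t for y, t in zip(series, trend)]
    return [
        sum(detrended[k::period]) / len(detrended[k::period]) for k in range(period)
    ]


n = 12
trend = [100 + 2 * t for t in range(n)]          # intercept 100, slope 2
seasonal = [[5, -3, -4, 2][t % PERIOD] for t in range(n)]
irregular = [0.0] * n  # set to zero so the recovered pattern is exact

y_add = additive(trend, seasonal, irregular)
y_mul = multiplicative(trend, [1 + s / 100 for s in seasonal], [1.0] * n)
recovered = seasonal_means(y_add, trend, PERIOD)
```

In the additive series the seasonal swing stays the same size as the trend grows, whereas in the multiplicative series the swing grows in proportion to the level of the series.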


Difference between Time Series Data and Cross-Section Data

Time series data is a collection of observations of one specific variable at successive intervals of time. Cross-section data, on the other hand, consists of observations on multiple variables from different sources at a particular point or period in time.
A company’s stock market data collected at regular intervals over a year is an example of time series data. A company’s sales revenue and sales volume collected for the past 3 months, taken together as a single snapshot, is an example of cross-section data.
Time series data is mainly used to obtain results over an extended period of time, whereas cross-section data focuses on information received from surveys at a particular time.


What is Time Series Analysis?

Performing analysis of time series data is known as Time Series Analysis. The analysis is performed in order to understand the structure and functions that produce the time series. Once the mechanism behind the data is understood, a mathematical model can be developed so that further prediction, monitoring, and control can be performed.
Two approaches are used for analyzing time series data -
  • In the time domain
  • In the frequency domain
Time series analysis is mainly used for -
  • Decomposing the time series
  • Identifying and modeling the time-based dependencies
  • Forecasting
  • Identifying and modeling the system variation


Need of Time Series Analysis

Understanding time series is important for successful modeling in machine learning and deep learning. Time series analysis is used to understand the internal structure and functions that produce the observations. Time series analysis is used for -
  • Description - Patterns are identified in correlated data. In other words, the variations in trend and seasonality in the time series are identified.
  • Explanation - The data is understood and modeled.
  • Forecasting - Short-term trends are predicted from previous observations.
  • Intervention Analysis - The effect of an event on the time series data is analyzed.
  • Quality Control - An alert is raised when a monitored measurement deviates from its specified size.


Applications of Time Series Analysis



Time Series Database and its types
A time series database is software used for handling time series data. Highly complex data, such as high-volume transactional data, is not always feasible to handle in a relational database management system, and many relational systems do not work well for time series data. Time series databases are therefore optimised for this kind of data. Various time series databases are given below -
  • CrateDB
  • Graphite
  • InfluxDB
  • Informix TimeSeries
  • Kx kdb+
  • Riak-TS
  • RRDtool
  • OpenTSDB
Types of Time Series Database


What is Anomaly?

An anomaly is something that deviates from normal behaviour or from what is expected. For clarity, let’s take the example of a bank transaction. Suppose you have a savings account and you mostly withdraw Rs 10,000, but one day Rs 6,00,000 is withdrawn from your account. This is unusual activity from the bank's point of view, since the usual withdrawal is Rs 10,000. For the bank's employees, this transaction is an anomaly.
An anomaly is a kind of contradictory observation in the data. It is evidence that a certain model or assumption does not fit the problem statement.

Different Types of Anomalies

Different types of anomalies are given below -
  • Point Anomalies - If a specific value within the dataset is anomalous with respect to the complete data, it is known as a point anomaly. The bank transaction above is an example of a point anomaly.
  • Contextual Anomalies - If a data occurrence is anomalous only in specific circumstances, it is known as a contextual anomaly. For example, the anomaly occurs only during a specific period.
  • Collective Anomalies - If a collection of data occurrences is anomalous with respect to the rest of the dataset, it is known as a collective anomaly. For example, a break in the trend observed in an ECG.
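A point anomaly such as the bank withdrawal above can be flagged with a simple robust statistic. The sketch below uses the modified z-score based on the median and the median absolute deviation (MAD), with the commonly used constant 0.6745 and cut-off 3.5. The withdrawal amounts are made up to mirror the example, and this is just one possible detection technique, not necessarily the one used in the full article.

```python
from statistics import median


def point_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score (median and MAD based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the values are identical: flag anything that differs.
        return [v for v in values if v != med]
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical withdrawal history in rupees.
withdrawals = [10_000, 9_500, 10_500, 10_000, 11_000, 9_000, 600_000]
anomalies = point_anomalies(withdrawals)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value would inflate the latter and could mask the very anomaly being looked for.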
Continue Reading The Full Article at - XenonStack.com/Blog

Wednesday, 14 June 2017

6/14/2017 06:27:00 pm

Data Preprocessing and Data Wrangling in Machine Learning and Deep Learning


Introduction


Deep Learning and Machine Learning are becoming more and more important in today's ERP (Enterprise Resource Planning). When building an analytical model using Deep Learning or Machine Learning, the data set is collected from various sources such as files, databases, sensors, and much more.

However, the collected data cannot be used directly for analysis. To solve this problem, Data Preparation is performed. It includes the two techniques listed below:

  • Data Preprocessing
  • Data Wrangling

Data Preparation is an important part of Data Science. It includes two concepts: Data Cleaning and Feature Engineering. Both are essential for achieving better accuracy and performance in Machine Learning and Deep Learning projects.

Data Preparation for Data Cleaning and Feature Engineering

Data Preprocessing


Data Preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.

Therefore, certain steps are executed to convert the data into a small, clean dataset. This technique is performed before the execution of Iterative Analysis. The set of steps is known as Data Preprocessing. It includes Data Cleaning, Data Integration, Data Transformation, and Data Reduction.

Data Wrangling


Data Wrangling is a technique executed at the time of making an interactive model. In other words, it is used to convert raw data into a format that is convenient for consumption.

This technique is also known as Data Munging. It follows certain steps: after the data is extracted from different data sources, it is sorted using a suitable algorithm, decomposed into a different structured format, and finally stored in another database.
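The wrangling steps above (extract, sort, restructure, store) can be sketched with the Python standard library. The raw CSV text, the table name and the column names are made up for the example, and an in-memory SQLite database stands in for "another database".

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from files, APIs or sensors.
RAW = """device,timestamp,reading
b,2019-12-01T10:05:00,21.5
a,2019-12-01T10:00:00,20.1
a,2019-12-01T10:05:00,20.4
"""

# 1. Extract the rows from the source.
rows = list(csv.DictReader(io.StringIO(RAW)))

# 2. Sort them (here by device, then by timestamp).
rows.sort(key=lambda r: (r["device"], r["timestamp"]))

# 3. Restructure: convert types and keep only the needed fields.
records = [(r["device"], r["timestamp"], float(r["reading"])) for r in rows]

# 4. Store the result in another database (an in-memory SQLite database here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, ts TEXT, reading REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
conn.commit()

stored = conn.execute(
    "SELECT device, ts, reading FROM readings ORDER BY rowid"
).fetchall()
```

Sorting on ISO-formatted timestamp strings works here because that format sorts chronologically as plain text; other timestamp formats would need to be parsed first.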

Need of Data Preprocessing


To achieve better results from the applied model in Machine Learning and Deep Learning projects, the data has to be in a proper format. Some Machine Learning and Deep Learning models need data in a specific format. For example, many implementations of the Random Forest algorithm do not support null values, so null values in the original raw data set have to be handled before the algorithm is executed.

Another aspect is that the dataset should be formatted in such a way that more than one Machine Learning or Deep Learning algorithm can be executed on the same dataset and the best of them chosen.
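One common way to handle the null values mentioned above is mean imputation, sketched below in plain Python. The rows, the column name and the `impute_mean` helper are hypothetical; dropping incomplete rows or using a model-based imputer are equally valid choices depending on the data.

```python
def impute_mean(rows, column):
    """Return a copy of rows with None in `column` replaced by the column mean."""
    present = [r[column] for r in rows if r[column] is not None]
    if not present:
        raise ValueError(f"column {column!r} has no observed values")
    fill = sum(present) / len(present)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical raw records with one missing temperature.
raw = [
    {"temperature": 20.0, "label": 0},
    {"temperature": None, "label": 1},
    {"temperature": 24.0, "label": 0},
]
clean_rows = impute_mean(raw, "temperature")
```

The function returns new dictionaries rather than modifying the input, so the raw data stays available for trying other strategies.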

Need of Data Wrangling


Data Wrangling is an important aspect of implementing a model. Data is converted to a proper, feasible format before any model is applied to it. By filtering, grouping, and selecting the appropriate data, the accuracy and performance of the model can be increased.

Another point is that when time series data has to be handled, each algorithm expects the data in a different form. Data Wrangling is therefore used to convert the time series data into the format required by the applied model. In simple words, complex data is converted into a usable format for analysis.

Why is Data Preprocessing used?


Data Preprocessing is necessary because real-world data is unformatted. Real-world data mostly contains -

  • Incomplete data (missing data) - There are many reasons for missing data, such as data not being collected continuously, mistakes in data entry, technical problems with biometrics, and much more.

  • Noisy data (erroneous data and outliers) - The reasons for noisy data include a technical problem with the device that gathers the data, a human mistake during data entry, and much more.

  • Inconsistent data - Inconsistencies arise from duplication within the data, human data entry, mistakes in codes or names (i.e. violations of data constraints), and much more.

Therefore, Data Preprocessing is performed to handle raw data.

Why is Data Wrangling used?


Data Wrangling is used to handle the issue of Data Leakage while implementing Machine Learning and Deep Learning. First of all, we have to understand what Data Leakage is.

What is Data Leakage in Machine Learning/Deep Learning?


Data Leakage produces an invalid Machine Learning/Deep Learning model because the applied model is over-optimized, so its estimated performance looks better than it really is.

Data Leakage is the term used when data from outside the training dataset is used in the learning process of the model. This additional information learned by the applied model invalidates the estimated performance of the model.

For example, if a model is trained on a specific feature for Predictive Analysis but that feature will not actually be available at the time predictions are made, data leakage is introduced into the model.

Data Leakage can occur in several ways, which are given below:

  • Leakage of data from test dataset to training dataset.
  • Leakage of computed correct prediction to the training dataset.
  • Leakage of future data into the past data.
  • Usage of data outside the scope of the applied algorithm.

In general, data leakage arises from two main sources in Machine Learning/Deep Learning algorithms: the feature attributes (variables) and the training dataset.

How to check the presence of Data Leakage within the applied model?


Data Leakage is typically observed when complex datasets are used. Some examples are described below:

  • Dividing a time series dataset into training and test sets is a complex problem.
  • Implementing sampling in a graph problem is a complex task.
  • Analog observations such as audio and images are stored in separate files that have a defined size and timestamp.
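
For the time series case, a minimal sketch in plain Python is given here. It assumes each observation is a (timestamp, value) pair; the function name `chronological_split`, the sample readings and the 80/20 ratio are illustrative assumptions, not something prescribed above. Sorting by time and cutting at a single point keeps every test observation after the training observations, so future data does not leak into the past.

```python
from datetime import date, timedelta


def chronological_split(records, train_fraction=0.8):
    """Split (timestamp, value) records so the test set never precedes the training set."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    # Sort by timestamp first, because the input order is not guaranteed.
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical daily readings, deliberately supplied in reverse order.
start = date(2019, 1, 1)
readings = [(start + timedelta(days=i), float(i)) for i in range(10)][::-1]

train, test = chronological_split(readings)
```

A random shuffle before splitting would mix later observations into the training set, which is the kind of leakage this split is meant to avoid.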

How is Data Preprocessing performed?


Data Preprocessing is performed to address the problems of unformatted real-world data discussed above.

First of all, let's discuss how missing data can be handled. There are three different approaches, which are given below:

  • Ignoring the missing record - This is the simplest method for handling missing data. However, it should not be used when the number of missing values is large or when the pattern of missing data is related to the underlying cause of the problem being studied.

  • Filling the missing values manually - This is one of the most commonly chosen methods. Its limitation is that for a large dataset with many missing values it is not efficient, as it becomes a time-consuming task.

  • Filling using computed values - The missing values can also be filled with the mean, mode or median of the observed values. Alternatively, they can be predicted using a Machine Learning or Deep Learning algorithm. One drawback of this method is that it can introduce bias into the data, as the computed values are not the actual observed values.
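
Filling with computed values can be sketched in plain Python as follows. This is only an illustration that assumes missing entries are represented by `None`; the helper name `fill_missing` and the sample column are hypothetical.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a fill value from")
    strategies = {"mean": mean, "median": median, "mode": mode}
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy}")
    fill_value = strategies[strategy](observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical column with two missing entries.
scores = [10, None, 30, 30, None, 70]
filled_mean = fill_missing(scores, "mean")
filled_median = fill_missing(scores, "median")
filled_mode = fill_missing(scores, "mode")
```

In this sample the mean (35) is pulled upward by the value 70 while the median and mode stay at 30, which shows how the choice of fill value can bias the data in different ways.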

Let's move further and discuss how we can deal with noisy data. The methods that can be followed are given below:

  • Binning method - In this method the data is sorted and divided into bins, and each value is smoothed using the neighboring values in its bin. This method is also known as local smoothing.
  • Clustering method - In this approach, similar data is grouped into the same cluster, and outliers can be detected as values that fall outside the clusters.
  • Machine Learning - A Machine Learning algorithm can be used to smooth the data. For example, a regression algorithm can smooth the data by fitting it to a specified linear function.
  • Removing manually - Noisy data can be removed manually, but this is a time-consuming process, so this method is usually not preferred.
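
One common form of the binning method, smoothing by bin means, can be sketched as below. The equal-sized bins, the function name `smooth_by_bin_means` and the sample values are assumptions made for illustration.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into bins of bin_size, and replace each by its bin mean."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        # Every value in the bin is replaced by the mean of its neighbors in that bin.
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical noisy measurements.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
```

Note that the result is returned in sorted order rather than in the original order of the input.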

Inconsistent data is dealt with manually, using external references and knowledge engineering tools.


How is Data Wrangling performed?


Data Wrangling is performed to minimize the effect of Data Leakage while executing the model. For example, if the complete dataset is used for normalization or standardization and cross-validation is then performed to estimate the performance of the model, data leakage is introduced.

Another problem is that the test data is also included in feature selection while executing each fold of cross-validation, which introduces bias into the performance analysis.

The effect of Data Leakage can be minimized by recalculating the required Data Preparation within each fold of the cross-validation process. This includes feature selection, outlier detection and removal, projection methods, scaling of selected features and more.
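
Recalculating the preparation inside each fold can be sketched in plain Python as follows. This is a simplified illustration that covers only the scaling step; the contiguous fold layout, the helper names `k_fold_indices` and `standardize_within_fold`, and the sample feature are assumptions.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if fold == k - 1 else start + fold_size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx


def standardize_within_fold(values, train_idx, test_idx):
    """Fit mean and standard deviation on the training part only, then apply to both parts."""
    train = [values[i] for i in train_idx]
    mu, sigma = mean(train), pstdev(train)
    if sigma == 0:
        raise ValueError("training fold has zero variance")
    scaled_train = [(values[i] - mu) / sigma for i in train_idx]
    scaled_test = [(values[i] - mu) / sigma for i in test_idx]
    return scaled_train, scaled_test


# Hypothetical feature; the last value is an extreme one.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
folds = [
    standardize_within_fold(feature, train_idx, test_idx)
    for train_idx, test_idx in k_fold_indices(len(feature), k=3)
]
```

Because the mean and standard deviation are computed from the training part of each fold only, the extreme value 100.0 does not influence the scaling in the fold where it belongs to the test part.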

Another solution is to divide the complete dataset into a training dataset, which is used to train the model, and a validation dataset, which is used to evaluate the performance and accuracy of the applied model.

However, the model is selected by looking at the results on the test dataset in the cross-validation process. This conclusion will not always hold, as the sample used as the test dataset can vary and the performance of the different models is evaluated only for that particular test dataset. Therefore, when the best model is selected this way, the selection overfits the test error.

To solve this problem, the variance of the test error is determined by using different samples of the test dataset. In this way, the most suitable model is chosen.

Steps to perform Data Wrangling

Difference Between Data Preprocessing and Data Wrangling


Data Preprocessing is performed before Data Wrangling. In Data Preprocessing, data is prepared immediately after it is received from the data source. These initial transformations include data cleaning and any aggregation of data. It is executed once.

For example, suppose we have data where one attribute holds three variables, and we have to split it into three attributes and delete the special characters from them. This is done before any iterative model is applied and is executed once in the project.
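
A minimal sketch of such a one-time transformation is shown here. The semicolon separator, the field names and the sample record are hypothetical, and which characters count as special is an assumption (anything other than letters, digits and spaces).

```python
import re


def split_and_clean(raw, separator=";"):
    """Split one combined attribute into parts and strip special characters from each."""
    parts = raw.split(separator)
    return [re.sub(r"[^A-Za-z0-9 ]", "", part).strip() for part in parts]


# Hypothetical record: name, city and country packed into a single attribute.
record = {"id": 1, "contact": "#Jane Doe#; New Delhi!; *India*"}
name, city, country = split_and_clean(record["contact"])
clean_record = {"id": record["id"], "name": name, "city": city, "country": country}
```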

On the other hand, Data Wrangling is performed during iterative analysis and model building. It is applied at the time of feature engineering. The conceptual view of the dataset changes as different models are applied to achieve a good analytic model.

For example, suppose we have data containing 30 attributes, where two attributes are used to compute another attribute and that computed feature is used for further analysis. In this way, the data can be changed according to the requirements of the applied model.
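
Computing a new attribute from two existing ones can be sketched as below. The ratio used here, the attribute names and the sample rows are hypothetical; any other combination of two attributes would follow the same pattern.

```python
def add_derived_feature(rows, numerator, denominator, new_name):
    """Return copies of rows with new_name = numerator / denominator added."""
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows stay unchanged
        divisor = row[denominator]
        # Division by zero is recorded as a missing value instead of raising.
        new_row[new_name] = row[numerator] / divisor if divisor else None
        enriched.append(new_row)
    return enriched


# Hypothetical rows; only two of the many attributes are shown.
houses = [
    {"price": 300000, "area_sqft": 1500},
    {"price": 450000, "area_sqft": 1800},
]
houses_with_ratio = add_derived_feature(houses, "price", "area_sqft", "price_per_sqft")
```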

Tasks of Data Preprocessing


Data Preprocessing involves several steps, which are described below:

  • Data Cleaning - This is the first step in Data Preprocessing. The main focus is on handling missing data and noisy data, detecting and removing outliers, and minimizing duplication and computed biases within the data.

  • Data Integration - This process is used when data gathered from various data sources is combined to form consistent data. After data cleaning, this consistent data is used for analysis.

  • Data Transformation - This step is used to convert the raw data into a specified format according to the needs of the model. The options used for transformation of data are given below:
    • Normalization - In this method, numerical data is rescaled into a specified range, for example between 0 and 1.
    • Aggregation - As the word suggests, this method combines features into one. For example, two categories can be combined to form a new category.
    • Generalization - In this case, lower-level attributes are converted into higher-level ones.

  • Data Reduction - After the transformation and scaling of data, duplication (i.e. redundancy) within the data is removed and the data is organized in an efficient manner.
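
Two of the transformation options, normalization and generalization, can be sketched in plain Python as follows. The min-max formula is one common way to rescale into the range 0 to 1; the age thresholds and category names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if low == high:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def generalize_age(age):
    """Map a low-level attribute (age in years) to a higher-level category."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


# Hypothetical ages.
ages = [10, 20, 40, 70]
normalized_ages = min_max_normalize(ages)
age_groups = [generalize_age(a) for a in ages]
```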

Different Tasks of Data Preprocessing

Tasks of Data Wrangling


The tasks of Data wrangling are described below -

  • Discovering - First, the data should be understood thoroughly to determine which approach will suit it best. For example, if examining weather data shows that all of it comes from one area, the main focus is on determining patterns.

  • Structuring - As the data is gathered from different sources, it will be present in different shapes and sizes. Therefore, the data needs to be structured in a proper format.

  • Cleaning - Data that can degrade the performance of the analysis should be cleaned or removed.

  • Enrichment - Extract new features or data from the given dataset in order to optimize the performance of the applied model.

  • Validating - This step improves the quality of the data by applying consistency rules, so that the transformations applied to the data can be verified.

  • Publishing - After the steps of Data Wrangling are completed, they can be documented so that the same steps can be repeated for similar kinds of data to save time.
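
The Validating task can be sketched as a set of consistency rules checked against each row. The rule names, the value ranges and the weather observations below are hypothetical and only illustrate the idea.

```python
def validate_rows(rows, rules):
    """Return (row_index, rule_name) pairs for every rule a row violates."""
    violations = []
    for index, row in enumerate(rows):
        for rule_name, check in rules.items():
            if not check(row):
                violations.append((index, rule_name))
    return violations


# Hypothetical consistency rules for weather observations.
weather_rules = {
    "temperature_in_range": lambda r: -90 <= r["temp_c"] <= 60,
    "humidity_is_percentage": lambda r: 0 <= r["humidity"] <= 100,
}
observations = [
    {"temp_c": 21.5, "humidity": 40},
    {"temp_c": 250.0, "humidity": 35},
    {"temp_c": 18.0, "humidity": 120},
]
problems = validate_rows(observations, weather_rules)
```

Keeping the rules in one place also helps with the Publishing task, since the same checks can be reused for similar data.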

Different Tasks of Data Wrangling