
Wednesday, 14 June 2017


Data Preprocessing and Data Wrangling in Machine Learning and Deep Learning


Introduction


Machine Learning and Deep Learning are becoming increasingly important in today's enterprise systems such as ERP (Enterprise Resource Planning). While building an analytical model with Machine Learning or Deep Learning, the dataset is collected from various sources such as files, databases, sensors and many more.

However, the collected data cannot be used directly for analysis. To solve this problem, Data Preparation is performed. It includes the two techniques listed below:

  • Data Preprocessing
  • Data Wrangling

Data Preparation is an important part of Data Science. It covers two concepts, Data Cleaning and Feature Engineering, both of which are essential for achieving better accuracy and performance in Machine Learning and Deep Learning projects.

Data Preparation for Data Cleaning and Feature Engineering

Data Preprocessing


Data Preprocessing is a technique used to convert raw data into a clean dataset. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.

Therefore, certain steps are executed to convert the data into a small, clean dataset. This set of steps, performed before the iterative analysis begins, is known as Data Preprocessing. It includes Data Cleaning, Data Integration, Data Transformation and Data Reduction.

Data Wrangling


Data Wrangling is a technique that is executed during iterative analysis and model building. In other words, it is used to convert raw data into a format that is convenient for consumption.

This technique is also known as Data Munging. It also follows certain steps: after extracting the data from different data sources, the data is sorted using a suitable algorithm, decomposed into a different structured format and finally stored in another database.
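As a rough sketch of these steps with pandas (the device/reading dataset, file name and column names below are hypothetical stand-ins for a real source), the data is extracted, sorted by timestamp, decomposed into separate structured columns and stored in another database:

```python
import sqlite3
import pandas as pd

# Stand-in for extraction from a source system (a real project would read
# from files, databases or sensors, e.g. pd.read_csv("sensor_readings.csv")).
raw = pd.DataFrame({
    "device": ["plant-A/unit-7", "plant-B/unit-2", "plant-A/unit-7"],
    "reading": [21.4, 19.8, 22.1],
    "recorded_at": ["2017-06-14 10:05", "2017-06-14 10:01", "2017-06-14 10:09"],
})

# Sort the records by timestamp.
raw["recorded_at"] = pd.to_datetime(raw["recorded_at"])
raw = raw.sort_values("recorded_at")

# Decompose a combined field into separate structured columns.
raw[["plant", "unit"]] = raw["device"].str.split("/", n=1, expand=True)

# Store the wrangled result in another database.
with sqlite3.connect("wrangled.db") as conn:
    raw.to_sql("readings", conn, if_exists="replace", index=False)
```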

Need for Data Preprocessing


To achieve better results from the applied model in Machine Learning and Deep Learning projects, the data has to be in a proper format. Some Machine Learning and Deep Learning models need data in a specific format; for example, the Random Forest algorithm, as commonly implemented, does not support null values, so null values have to be handled in the original raw dataset before the algorithm can be executed.

Another aspect is that the dataset should be formatted in such a way that several Machine Learning and Deep Learning algorithms can be executed on it, and the best one can be chosen.
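A minimal sketch of this idea, using a synthetic dataset and a few scikit-learn classifiers as stand-ins for the candidate algorithms:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A synthetic stand-in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Run every candidate on the same dataset and keep the best mean accuracy.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```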

Need for Data Wrangling


Data Wrangling is an important aspect of implementing a model. The data is converted into a proper, feasible format before any model is applied to it. By filtering, grouping and selecting appropriate data, the accuracy and performance of the model can be increased.

Another point is that when time series data has to be handled, every algorithm works with different aspects of it. Data Wrangling is therefore used to convert the time series data into the format required by the applied model. In simple words, complex data is converted into a usable format so that analysis can be performed on it.
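As a hedged sketch with pandas (the minute-level series below is synthetic), raw time series readings are resampled to hourly means and given a lagged column, a shape many supervised models expect:

```python
import pandas as pd

# Hypothetical minute-level time series of sensor values.
idx = pd.date_range("2017-06-01", periods=6 * 60, freq="min")
series = pd.Series(range(len(idx)), index=idx, name="value")

# Wrangle the raw series into the shape a model expects:
# hourly means plus a lagged feature for supervised learning.
hourly = series.resample("h").mean().to_frame()
hourly["value_lag_1"] = hourly["value"].shift(1)
hourly = hourly.dropna()
print(hourly.head())
```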

Why is Data Preprocessing used?


Data Preprocessing is necessary because of the presence of unformatted real-world data. Real-world data is mostly composed of:

  • Inaccurate data (missing data) - There are many reasons for missing data, such as data not being collected continuously, mistakes during data entry, technical problems with biometric devices and many more.

  • Noisy data (erroneous data and outliers) - The reasons for noisy data could be a technical problem with the gadget that gathers the data, a human mistake during data entry and many more.

  • Inconsistent data - Inconsistencies arise for reasons such as duplication within the data, human errors during data entry, mistakes in codes or names (i.e. violations of data constraints) and many more.

Therefore, to handle raw data, Data Preprocessing is performed.

Why is Data Wrangling used?


Data Wrangling is used to handle the issue of Data Leakage while implementing Machine Learning and Deep Learning models. First of all, we have to understand what Data Leakage is.

What is Data Leakage in Machine Learning/Deep Learning?


Data Leakage results in an invalid Machine Learning/Deep Learning model, because it leads to over-optimistic estimates of the applied model's performance.

Data Leakage is the term used when data from outside the training dataset is used in the learning process of the model. This additional information learned by the applied model invalidates the computed estimate of the model's performance.

For example, if we want to use a specific feature for Predictive Analysis but that feature will not actually be available when the trained model is applied, then including it during training introduces data leakage into the model.

Data Leakage can appear in many ways, some of which are given below:

  • Leakage of data from test dataset to training dataset.
  • Leakage of computed correct prediction to the training dataset.
  • Leakage of future data into the past data.
  • Usage of data outside the scope of the applied algorithm.

In general, data leakage is observed from two main sources in Machine Learning/Deep Learning algorithms: the feature attributes (variables) and the training dataset.
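The snippet below is a hedged illustration of leakage from the test set into the training set, using synthetic data: fitting a scaler on the complete dataset before splitting lets test-set statistics influence the training features, whereas fitting it on the training portion alone avoids this.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Leaky: the scaler sees the whole dataset, so test-set statistics
# leak into the features the model is trained on.
X_scaled = StandardScaler().fit_transform(X)
X_train_leaky, X_test_leaky, y_train, y_test = train_test_split(
    X_scaled, y, random_state=0)

# Safer: split first, then fit the scaler on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```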

How to check the presence of Data Leakage within the applied model?


Data Leakage is typically observed when complex datasets are used. Such situations are described below:

  • Dividing a time series dataset into training and test sets is a complex problem (see the sketch after this list).
  • Implementing sampling in a graph-based problem is a complex task.
  • Storing analog observations, in the form of audio and images, in separate files with a defined size and timestamp.
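For the time series case, a minimal sketch using scikit-learn's TimeSeriesSplit shows one way to split without letting future observations leak into the training folds:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical ordered observations (oldest first).
X = np.arange(20).reshape(-1, 1)

# TimeSeriesSplit always trains on earlier indices and tests on later ones,
# so future observations never leak into the training fold.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train up to index", train_idx.max(), "< test from index", test_idx.min())
```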

How is Data Preprocessing performed?


Data Preprocessing is performed to address the causes of unformatted real-world data discussed above.

First of all, let's discuss how missing data can be handled. Three different approaches can be taken, as given below:

  • Ignoring the missing record - This is the simplest and most efficient method of handling missing data. However, it should not be used when the number of missing values is large or when the pattern of missing data is related to the unrecognized root cause of the problem statement.

  • Filling the missing values manually - This is one of the best-chosen methods, but it has one limitation: when the dataset is large and there are many missing values, this method is not efficient because it becomes a time-consuming task.

  • Filling using computed values - The missing values can also be filled by computing the mean, mode or median of the observed values (as sketched after this list). Another option is to fill them with predicted values computed using a Machine Learning or Deep Learning algorithm. One drawback of this approach is that it can introduce bias into the data, since the computed values are not exact with respect to the observed values.
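A minimal sketch of the computed-value approach on a toy DataFrame, using scikit-learn's SimpleImputer (the column names are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A toy dataset with missing entries in a numerical and a categorical column.
df = pd.DataFrame({"age": [25, np.nan, 31, 29],
                   "city": ["A", "B", None, "B"]})

# Numerical column: replace missing values with the column median.
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical column: replace missing values with the most frequent value.
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
print(df)
```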

Let's move further and discuss how we can deal with noisy data. The methods that can be followed are given below:

  • Binning method - In this method, the data is first sorted and then smoothed using its neighbourhood values, for example by replacing each value with the mean of its bin (see the sketch after this list). This method is also known as local smoothing.
  • Clustering method - In this approach, outliers can be detected by grouping similar data into the same group, i.e. the same cluster.
  • Machine Learning - A Machine Learning algorithm can be executed to smooth the data. For example, a regression algorithm can smooth the data by fitting it to a specified linear function.
  • Removing manually - Noisy data can be removed manually by a human being, but this is a time-consuming process, so it is usually not given priority.
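As a hedged illustration of the binning method on toy values, each reading is replaced by the mean of its equal-width bin:

```python
import pandas as pd

# Noisy numeric readings (toy values).
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width binning: replace each value with the mean of its bin
# ("smoothing by bin means", a form of local smoothing).
bins = pd.cut(values, bins=3)
smoothed = values.groupby(bins).transform("mean")
print(pd.DataFrame({"raw": values, "bin": bins, "smoothed": smoothed}))
```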

To deal with inconsistent data manually, the data is managed using external references and knowledge engineering tools, such as a knowledge engineering process.


How is Data Wrangling performed?


Data Wrangling is performed to minimize the effect of Data Leakage while executing the model. In other words, if the complete dataset is used for normalization and standardization, then the cross-validation performed afterwards to estimate the performance of the model already suffers from data leakage.

Another problem is observed when the test data is also included in feature selection while executing each fold of cross-validation, which further biases the performance analysis.

The effect of Data Leakage can be minimized by recalculating the required Data Preparation within each fold of the cross-validation process; this includes feature selection, outlier detection and removal, projection methods, scaling of selected features and much more.

Another solution is to divide the complete dataset into a training dataset, which is used to train the model, and a validation dataset, which is used to evaluate the performance and accuracy of the applied model.
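One common way to keep the preparation steps inside the cross-validation loop is to wrap them in a pipeline, so that scaling and feature selection are refitted on the training part of every fold. The sketch below uses scikit-learn and synthetic data; it illustrates the idea rather than any specific project setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Scaling and feature selection live inside the pipeline, so in every
# cross-validation fold they are fitted on the training part only and
# the held-out fold never influences the preparation steps.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(model, X, y, cv=5).mean())
```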

However, the model is often selected by looking at the results on the test dataset in the cross-validation process. This conclusion will not always hold, because the sample used for the test dataset can vary and the performance of the different models has only been evaluated on that particular test dataset. Therefore, while selecting the best model, the test error itself is overfitted.

To solve this problem, the variance of the test error is estimated by using different samples of the test dataset. In this way, the most suitable model is chosen.

Steps to perform Data Wrangling

Difference Between Data Preprocessing and Data Wrangling


Data Preprocessing is performed before Data Wrangling. In Data Preprocessing, the data is prepared immediately after it is received from the data source. Initial transformations, data cleaning or any aggregation of data is performed at this stage. It is executed once.

For example, suppose we have data where one attribute holds three variables, and we have to convert it into three separate attributes and delete the special characters from them. This kind of step is performed before applying any iterative model and is executed once in the project.

On the other hand, Data Wrangling is performed during iterative analysis and model building, at the time of feature engineering. The conceptual view of the dataset changes as different models are applied to achieve a good analytical model.

For example, suppose we have data containing 30 attributes, where two attributes are used to compute a third attribute and that computed feature is used for further analysis. In this way, the data can be changed according to the requirements of the applied model.

Tasks of Data Preprocessing


Different steps are involved in Data Preprocessing. These steps are described below:

  • Data Cleaning - This is the first step implemented in Data Preprocessing. The main focus here is on handling missing data and noisy data, detecting and removing outliers, and minimizing duplication and computed biases within the data.

  • Data Integration - This process is used when data gathered from various data sources is combined to form consistent data. After data cleaning, this consistent data is used for analysis.

  • Data Transformation - This step is used to convert the raw data into a specified format according to the needs of the model. The options used for transforming the data are given below:
    • Normalization - In this method, numerical data is converted into a specified range, e.g. between 0 and 1, so that the features share a comparable scale (see the sketch after this list).
    • Aggregation - As the word itself suggests, this method combines features into one. For example, two categories can be combined to form a new category.
    • Generalization - In this case, lower-level attributes are converted into higher-level ones.

  • Data Reduction - After the transformation and scaling of the data, duplication (i.e. redundancy within the data) is removed and the data is organized in an efficient manner.
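As referenced in the Normalization bullet above, a minimal sketch of rescaling a numerical feature into the [0, 1] range (the income values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy numerical feature with a wide range of values.
incomes = np.array([[20_000.0], [35_000.0], [50_000.0], [120_000.0]])

# Min-max normalization rescales every value into the range [0, 1].
scaled = MinMaxScaler().fit_transform(incomes)
print(scaled.ravel())
```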

Different Tasks of Data Preprocessing

Tasks of Data Wrangling


The tasks of Data Wrangling are described below:

  • Discovering - First, the data should be understood thoroughly to examine which approach will suit it best. For example, with weather data, examining the data may show that it comes from one area, so the main focus will be on determining patterns.

  • Structuring - As the data is gathered from different sources, it will come in different shapes and sizes. Therefore, it needs to be structured in a proper format.

  • Cleaning - Data that can degrade the performance of the analysis should be cleaned or removed.

  • Enrichment - Extract new features or data from the given dataset in order to optimize the performance of the applied model.

  • Validating - This step is used to check data quality and consistency rules so that the transformations applied to the data can be verified.

  • Publishing - After completing the Data Wrangling steps, they can be documented so that similar steps can be applied to similar kinds of data, saving time.

Different Tasks of Data Wrangling


Tuesday, 2 May 2017


Understanding Log Analytics, Log Mining & Anomaly Detection


What is Log Analytics


Technologies such as Machine Learning and Deep Neural Networks (DNNs) employ next-generation server infrastructure that spans immense Windows and Linux cluster environments.

Additionally, for DNNs, these application stacks involve not only traditional system resources (CPUs, memory) but also graphics processing units (GPUs).

With a non-traditional infrastructure environment, the Microsoft Research Operations team needed a highly flexible, scalable service, compatible with both Windows and Linux, to troubleshoot and determine root causes across the full stack.

Log Analytics supports log search through billions of records, Real-Time Analytics Stack metric collection, and rich custom visualizations across numerous sources.

These out-of-the-box features, paired with the flexibility of available data sources, made Log Analytics a great option for producing visibility and insights by correlating across DNN clusters and components.

The relevance of a log file can differ from one person to another: specific log data can be beneficial for one user but irrelevant for another.

Useful log data can therefore get lost inside a large cluster, which makes log file analysis an important concern these days.

With real-time data management, the user can use the log files to make decisions.

But as the volume of data increases, say to gigabytes, it becomes impossible for traditional methods to analyze such a huge log file and determine the valid data. Ignoring the log data creates a huge gap in relevant information.

The solution to this problem is to train a Deep Neural Network as a classifier for the log data. With this, a human being is no longer required to read the whole log file.

By combining the useful log data with Deep Learning, it becomes possible to gain optimum performance and comprehensive operational visibility.

Along with analyzing the log data, there is also a need to classify the log file into relevant and irrelevant data.

With this approach, time and effort can be saved and near-accurate results can be obtained.


Understanding Log Data


Before discussing log file analysis, we should first understand what a log file is.

A log is data produced automatically by the system; it stores information about the events taking place inside the operating system and records this data at regular intervals.

Log data can be presented in the form of a pivot table or a file. In the log file or table, the records are arranged according to time.

Every software application and system produces log files. Some examples of log files are transaction logs, event logs, audit logs, server logs, etc.

Logs are usually application-specific; therefore, log analysis is a much-needed task for extracting valuable information from the log file.

Some common log types, their data sources and the information they contain are summarized below:

  • Transaction Log (Database Management System) - Contains information about uncommitted transactions, changes made by rollback transactions and changes that have not yet been updated in the database. This is done to retain the ACID (Atomicity, Consistency, Isolation, Durability) properties in the event of a crash.
  • Message Log (Internet Relay Chat (IRC) and Instant Messaging (IM)) - In the case of IRC, it consists of the server messages sent during the time the user is connected to a channel. IM, on the other hand, protects user privacy by storing messages in encrypted form as a message log; these logs require a password to decrypt and view.
  • Syslog (network devices such as web servers, routers, switches, printers, etc.) - Syslog messages provide information on where, when and why, i.e. IP address, timestamp and the log message. Each message carries two fields: facility (the source of the message) and severity (the degree of importance of the log message).
  • Server Log File (web servers) - Created automatically; contains information about the user, such as the IP address of the remote server, the timestamp and the document requested by the user.
  • Audit Logs (Hadoop Distributed File System (HDFS) and Apache Spark) - Record all HDFS access activities taking place on the Hadoop platform.
  • Daemon Logs (Docker) - Provide details about the interaction between containers, the Docker service and the host machine. By combining these interactions, the life cycle of the containers and disruptions within the Docker service can be identified.
  • Pods (Kubernetes) - A pod is a collection of containers that share resources such as a single IP address and shared volumes; its logs cover those containers.
  • Amazon CloudWatch Logs (Amazon Web Services (AWS)) - Used to monitor applications and systems using log data, i.e. to examine errors within the application and system, and also to store and access the system's log data.
  • Swift Logs (OpenStack) - These logs are sent to Syslog and managed by log level. They are used for monitoring the cluster, auditing records, extracting robust information about the server and much more.


Log Analysis Process


The steps in the Log Analysis process are described below:
  • Collection and Cleaning of data
  • Structuring of Data
  • Analysis of Data

Collection and Cleaning of data


Firstly, log data is collected from various sources. The collected information should be precise and informative, as the type of data collected can affect performance. Therefore, the information should be collected from real users. Each type of log contains a distinct type of information.

After collection, the data is represented in a Relational Database Management System (RDBMS). Each record is assigned a unique primary key, and an Entity-Relationship model is developed to interpret the conceptual schema of the data.

Once the log data is arranged properly, data cleaning has to be performed, because corrupted log data may be present.

The reasons for corruption of log data are given below:
  • Crashing of disk where log data is stored
  • Applications are terminated abnormally
  • Disturbance in the configuration of input/output
  • Presence of virus in the system and much more

Structuring of Data


Log data is large as well as complex. Therefore, the presentation of log data directly affects its ability to be correlated with other data.

An important aspect is that log data should correlate directly with other log data, so that team members can develop a deep understanding of it.

The steps implemented for the structuring of log data are given below:
  • Clarity about the usage of collected log data
  • Use the same assets across the data so that the values in the log data are consistent; this means naming conventions can be used
  • Correlation between objects is created automatically by the presence of nested files in the log data, so it is better to avoid nested files in the log data.

Analysis of Data: The next step is to analyze the structured form of the log data. This can be done using various methods such as Pattern Recognition, Normalization, Classification using Machine Learning, Correlation Analysis and much more.
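As a hedged example of the normalization step, the snippet below parses a hypothetical web-server access log line into structured fields with a regular expression; the log format and field names are assumptions, not a universal standard:

```python
import re

# A hypothetical web-server access log line (format and fields assumed).
line = '203.0.113.7 - - [02/May/2017:10:33:00 +0000] "GET /index.html HTTP/1.1" 200 1043'

# Normalization step: pull the raw line apart into structured fields
# so later classification or correlation analysis can work on columns.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)')

match = pattern.match(line)
if match:
    record = match.groupdict()
    print(record["ip"], record["status"], record["path"])
```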
Log Analysis

Importance of Log Analysis


Indexing and crawling are two important aspects. If the content is not crawled and indexed, then data will not be updated properly in time and the chance of duplicate values will increase.

With log analytics, it is possible to examine crawling and indexing issues, for example by examining how long Google takes to crawl the data and where Google spends most of its time.

In the case of large websites, it becomes difficult for the team to keep a record of the changes made to the website. With log analysis, changes can be tracked at regular intervals, which helps to determine the quality of the website.

From a business point of view, frequent crawling of the website by Google is important, as it points towards the value of the products or services. Log analytics makes it possible to examine how often Google visits the site.

Changes made to the site should be picked up quickly in order to maintain the freshness of the content; this, too, can be determined by log analysis.

Log analysis also helps in acquiring truly informative data automatically and in measuring the level of security within the system.

Knowledge Discovery and Data Mining


Today, the volume of data is increasing day by day. Because of this, there is a great need to extract useful information from large datasets, information that is then used for making decisions. Knowledge Discovery and Data Mining are used to solve this problem.

Knowledge Discovery and Data Mining are two distinct terms. Knowledge Discovery is a process used for extracting useful information from a database, and Data Mining is one of the steps involved in this process: it is the algorithmic step that extracts patterns from the data.

Knowledge Discovery involves various steps such as Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, Pattern Evaluation and Knowledge Presentation.

Knowledge Discovery is a process focused on deriving useful information from the database, interpreting how the data is stored, implementing optimal algorithms and visualizing the results.

This process puts more emphasis on finding understandable patterns in the data, which are then used to derive useful information.

Data Mining involves the extraction of patterns and the fitting of a model. The idea behind fitting the model is to determine what type of information can be inferred from it.

It works on three aspects: model representation, model estimation and search. Some of the common Data Mining techniques are Classification, Regression and Clustering.


Knowledge Discovery and Data Mining

 

Log Mining


After performing log analysis, the next step is log mining. Log Mining is a technique that uses Data Mining for the analysis of logs.

With the introduction of Data Mining techniques for log analysis, the quality of log data analysis increases.

In this way, the analytics approach moves towards software-driven and automated analytic systems.

However, there are a few challenges in performing log analysis using data mining. These are:
  • The volume of log data is increasing day by day, from megabytes to gigabytes or even petabytes, so advanced tools are needed for log analysis.
  • Essential information may be missing from the log data, so more effort is needed to extract useful data.
  • Logs from a number of different sources are analyzed to gain deeper knowledge, so logs in different formats have to be analyzed.
  • The presence of different logs creates the problem of data redundancy without any identification, which leads to synchronization problems between the sources of log data.


Log Mining

As shown in the figure, the process of log mining consists of three phases. First, log data is collected from various sources such as Syslog, message logs, etc. After collection, the log data is aggregated using a log collector. After aggregation, the second phase starts.

In this phase, data cleaning is performed by removing irrelevant or corrupted data that can affect the accuracy of the process. After cleaning, the log data is represented in a structured (integrated) form so that queries can be executed on it.

After that, a transformation process converts the data into the format required for normalization and pattern analysis. Useful patterns are obtained by performing Pattern Analysis.

Various data mining techniques, such as association rules, clustering, etc., are used to extract useful information from the patterns.
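A small, hedged sketch of the clustering idea: hypothetical log messages are vectorized with TF-IDF and grouped with k-means so that similar messages fall into the same pattern cluster:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# A handful of hypothetical log messages.
messages = [
    "connection timeout while reaching database",
    "database connection timeout",
    "user login successful",
    "user login failed: wrong password",
    "disk usage above 90 percent",
    "disk usage above 95 percent",
]

# Turn the free-text messages into TF-IDF vectors and group similar
# messages together; each cluster roughly corresponds to one pattern.
vectors = TfidfVectorizer().fit_transform(messages)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for msg, label in zip(messages, labels):
    print(label, msg)
```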

This information is used by the organization for decision-making and for alerting on unusual behavior in the patterns.

Define Anomaly


An anomaly is defined as unusual behavior or an unusual pattern in the data. This unusual behavior indicates the presence of an error in the system; it means that the obtained result differs from the expected result, and thus the applied model does not fit the given assumptions.


Anomalies are further divided into the three categories described below:

  • Point Anomalies

    A single data point is considered an anomaly when it lies far away from the rest of the data.
  • Contextual Anomalies

    This type of anomaly relates to behavior that is abnormal only within a particular context of the data. It is commonly observed in time series problems.
  • Collective Anomalies

    When a collection of related data instances is anomalous as a whole, even though the individual instances may look normal, it is considered a collective anomaly.

The system produces logs which contain information about its state. By analyzing the log data, anomalies can be detected so that the security of the system can be protected.

This can be performed using Data Mining techniques, because dynamic rules need to be used alongside the data mining approach.
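As an illustrative sketch (not the specific method described here), an Isolation Forest can flag point anomalies in a log-derived feature such as requests per minute; the traffic values below are synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical feature derived from logs: requests per minute.
rng = np.random.default_rng(0)
normal_traffic = rng.normal(loc=100, scale=10, size=(200, 1))
spikes = np.array([[400.0], [5.0]])            # unusually high and low rates
X = np.vstack([normal_traffic, spikes])

# Isolation Forest flags points that are far from the rest of the data
# (point anomalies) with a prediction of -1.
detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)
print("anomalous rows:", np.where(flags == -1)[0])
```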

Network Intrusion Detection using Data Mining


Today, the use of computers has increased, and with it the probability of cyber crime.

Therefore, a system known as Network Intrusion Detection has been developed, which provides security in computer systems.

Continue Reading The Full Article At - XenonStack.com/Blog