

Monday 18 September 2017


Overview

In this blog, we will cover how to build a Slack-like online chat using Rocket.Chat and deploy it on containers using Docker and Kubernetes.

Previously, we were running the Rocket.Chat application on OpenStack instances in an on-premises deployment.

We have since migrated that existing on-premises infrastructure to containers based on Docker and Kubernetes.

As per official Docker Documentation, Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications, whether on laptops, data center VMs, or the cloud.

Kubernetes is a container orchestration layer on top of the container runtime/engine that manages and deploys containers effectively.


Prerequisites for Rocket Chat Deployment on Kubernetes

For this deployment you need the following.

Kubernetes automates the deployment, scaling, management, and orchestration of containerized applications. You can use a full Kubernetes cluster, or Minikube for testing purposes; a quick way to bring up a test cluster is sketched below.
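A minimal sketch for bringing up a local test cluster, assuming Minikube and kubectl are already installed on your machine -

$ minikube start          # start a local single-node Kubernetes cluster
$ kubectl cluster-info    # confirm the API server is reachable
$ kubectl get nodes       # the minikube node should report STATUS Ready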

For Shared Persistent Storage, we are using GlusterFS. GlusterFS is a scalable network file system.

Rocket.Chat is a Web-based Chat Server, developed in JavaScript, using the Meteor full stack framework.

A Dockerfile is a text document that contains all the commands we need to configure an application in its container.

The registry is an online storage for container images and lets you distribute them.

We can use any container registry for storing images, such as Docker Hub, AWS ECR, Google Container Registry, or Azure Container Registry.

Kubectl is the command-line tool used to manage a Kubernetes cluster remotely; you can also configure it on your own machine.

Note - If you are using the official images of Rocket.Chat and MongoDB, then you can skip Steps 1-6 and move forward to the storage volume (Step 7).


Step 1 - Create a Rocket Chat Container Custom Image

Create a file named “Dockerfile” for the Rocket.Chat container image.

$ touch Dockerfile

Now add the following content to the Dockerfile of the Rocket.Chat application -

FROM node:4-slim
MAINTAINER XenonStack
COPY bundle/ /app/
RUN cd /app/programs/server && npm install
ENV PORT=3000 \
ROOT_URL=http://localhost:3000
EXPOSE 3000
CMD ["node", "/app/main.js"]

The Rocket.Chat application is based on Node.js, so we use the Node.js image from Docker Hub as the base image.

We then copy the custom Rocket.Chat application code into the container and install all of its required dependencies.


Step 2 - Build Rocket Chat Docker Custom Image

$ docker build -t rocketchat:v1.0 .

Run this command from the directory that contains the Dockerfile; the trailing dot passes that directory to Docker as the build context.


Step 3 - Create a MongoDB Container Custom Image

Create a file named “Dockerfile” for the MongoDB container image in a new folder named mongodb.

$ mkdir mongodb && cd mongodb

$ touch Dockerfile

Now add the following content to the Dockerfile of MongoDB -
FROM ubuntu:14.04
MAINTAINER XenonStack
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10 && \
echo "deb http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.0 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-3.0.list && \
apt-get update && \
apt-get install -y mongodb-org
VOLUME ["/data/db"]
WORKDIR /data
EXPOSE 27017
CMD ["mongod"]

This MongoDB image uses Ubuntu as its base image, although we could also use the official Docker image of MongoDB. We created this Dockerfile for MongoDB version 3.0 for compatibility reasons with the Rocket.Chat application.

Next, we declare the volume “/data/db” for the container's persistent storage.

Finally, we expose port 27017 for incoming requests to the MongoDB server and start the server in foreground mode so that its logs appear on the container's stdout.


Step 4 - Building a MongoDB Docker Custom Image

$ docker build -t mongo:v3.0 .


Step 5 - Adding Container Registry to Docker Daemon

If you are using a Docker registry other than Docker Hub to store images, then you need to add that container registry to your local Docker daemon and to the Kubernetes worker nodes as well.

There are several ways to add a container registry to the Docker daemon, depending on the operating system.

Below is the one I use on a daily basis.
$ docker version
Client:
Version: 17.03.1-ce
API version: 1.27
Go version: go1.7.5
Git commit: c6d412e
Built: Mon Mar 27 17:14:09 2017
OS/Arch: linux/amd64 (Ubuntu 16.04)

Now we need to create a “daemon.json” file in the location mentioned below.

$ sudo nano /etc/docker/daemon.json

And add the following content to it.
{
  "insecure-registries": ["<name of your private registry>"]
}

Now run the following commands to reload systemd and restart the Docker daemon.

$ sudo systemctl daemon-reload

$ sudo service docker restart

To verify that your container registry has been added to the local Docker daemon, run the command below.

$ docker info

In the output of the above command, you should see your container registry listed like this -

Insecure Registries:

<your container registry name>

127.0.0.0/8


Step 6 - Pushing the Custom Rocket.Chat and MongoDB Container Images to a Container Registry

Let's start uploading our custom images to the container registry.

If authentication is enabled on the container registry, you need to log in first before you can upload or download images.

To log in, use the command below -

$ docker login <name of your container registry>

Username : xxxx

Password: xxxxx

For AWS ECR, you get the registry URL, username, and password from the cloud provider when you launch the container registry.

Here is a shell script that configures your AWS credentials and logs Docker in to Amazon ECR.

#!/bin/bash
# Install or upgrade the AWS CLI for the current user
pip install --upgrade --user awscli
mkdir -p ~/.aws && chmod 755 ~/.aws
cat << EOF > ~/.aws/credentials
[default]
aws_access_key_id = XXXXXX
aws_secret_access_key = XXXXXX
EOF
cat << EOF > ~/.aws/config
[default]
output = json
region = XXXXX
EOF
chmod 600 ~/.aws/credentials
# Shell variable names cannot contain hyphens; fetch the docker login command and run it
ecr_login=$(aws ecr get-login --region XXXXX)
eval "$ecr_login"

Now we need to tag the Rocket.Chat and MongoDB images and push them to the container registry.

To Tag images

$ docker tag rocketchat:v1.0 <name of your registry>/rocketchat:v1.0

$ docker tag mongo:v3.0 <name of your registry>/mongo:v3.0

To Push Images

$ docker push <name of your registry>/rocketchat:v1.0

$ docker push <name of your registry>/mongo:v3.0

Similarly, we can push the images to any container registry, such as AWS ECR, Google Container Registry, or Azure Container Registry.


Step 7 - Create a Storage Volume (Using GlusterFS)

Using the commands below, we create a volume in the GlusterFS cluster for MongoDB. Since GlusterFS provides the persistent volume for the MongoDB container, the volume must be created in GlusterFS first. Replace k8-master and k8-1 with the IP addresses or DNS names of the GlusterFS nodes you specified during your GlusterFS installation.

$ gluster volume create mongodb-disk replica 2 transport tcp k8-master:/mnt/brick1/mongodb-disk k8-1:/mnt/brick1/mongodb-disk
$ gluster volume start mongodb-disk
$ gluster volume info mongodb-disk

Figure - Information of Gluster Volume


Step 8 - Deploy MongoDB on Kubernetes

Deploying a single MongoDB node on Kubernetes has the following prerequisites -
  • Docker Image: We created a Docker image for MongoDB in Step 4 and pushed it to Docker Hub or a private Docker registry; a sketch of the corresponding Kubernetes manifest is shown below.
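Below is only a minimal sketch (not the article's full manifest) of how the custom MongoDB image and the GlusterFS volume from Step 7 could be wired together; the registry placeholder and the glusterfs-cluster Endpoints object are assumptions you must replace with your own values.

$ cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: mongodb
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
      - name: mongodb
        image: <name of your registry>/mongo:v3.0   # image pushed in Step 6
        ports:
        - containerPort: 27017
        volumeMounts:
        - name: mongodb-disk
          mountPath: /data/db                       # matches the VOLUME in the Dockerfile
      volumes:
      - name: mongodb-disk
        glusterfs:
          endpoints: glusterfs-cluster              # assumed Endpoints object for your Gluster nodes
          path: mongodb-disk                        # Gluster volume created in Step 7
          readOnly: false
EOF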

Continue Reading The Full Article at - XenonStack.com/Blog

Friday 15 September 2017


Introduction

Go is a young language, and thanks to the Go community it has grown rapidly since the release of Go 1.0, while keeping its most fundamental principle of simplicity intact.

Like other programming languages, Go has its own approach to dependency management, comparable to npm in Node.js or pip in Python.

Go users and developers have been using go get all along, and many tools such as godep and glide appeared to compensate for the shortcomings of the go get tool. One of these tools is glide, and we will see how to use glide for package management.

In the future, the Go community may get an even better dependency tool, currently under development at github.com/golang/dep.

Here I will walk through how we use different tools for managing packages in Go.

Workspace

When we set up Go, we have a single directory that $GOPATH refers to. This directory is our workspace, where everything about Go begins. The workspace structure looks like -

-Work
|
-- bin/
|
-- pkg/
|
-- src/
|
-- project1/
|
-- project2/

bin - It contains executable commands.
pkg - It contains package objects.
src - It contains source files.


$GOPATH

It is an environment variable which specifies the location of our workspace. Inside the directory $GOPATH/src we create a new directory for each project we start to work on.

This $GOPATH exists because:
  • All the import declarations in Go code reference a package through its import path, and the go tool looks inside the directories under $GOPATH/src to resolve all the imported packages.

  • To store the dependencies retrieved by go get.
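For example, a quick check of where the go tool thinks the workspace is (the paths shown are only an illustration) -

$ export GOPATH=$HOME/work    # point the workspace at ~/work
$ go env GOPATH               # confirm which workspace the go tool will use
$ ls $GOPATH                  # typically shows bin, pkg, and src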


Simplicity of GO GET

go get is the official GO tool to fetch GO code from a repository and store it in $GOPATH/src.

It provides isolation of packages with different import paths.

In our source code, we just have to tell the compiler where it should go to get the latest sources -

import (
    "fmt"
    "github.com/gorilla/mux"
)

go get github.com/gorilla/mux

Figure - Dependency Management in GoLang

This installs the latest commit from the master branch of the GitHub repository; such packages are not limited to GitHub.

For example, golang.org/x/mobile provides libraries and build tools for Go on Android.
go get always fetches the latest code for any package that is not already on disk.


GO GET Flags

go get also accepts flags such as -u, -insecure, -d, -f, -t, -fix, and -v.
The -u flag instructs get to use the network to update the named packages and their dependencies.

go get -u github.com/gorilla/mux
The -insecure flag permits fetching from repositories and resolving custom domains using insecure schemes such as HTTP.

go get -insecure github.com/name/repo
The -d flag instructs get to stop after downloading the packages; that is, it instructs get not to install the packages.

The -f flag, valid only when -u is set, forces get -u not to verify that each package has been checked out from the source control repository implied by its import path. This can be useful if the source is a local fork of the original.

The -fix flag instructs get to run the fix tool on the downloaded packages before resolving dependencies or building the code.

The -t flag instructs get to also download the packages required to build the tests for the specified packages.


Shortcomings in GO GET

Looking at the way Go saves all its dependencies, there are some problems with this approach to dependency management:
  • We cannot pin the version of a package we need, because go get always fetches the latest version unless that version is hosted in a completely different repository. When working as a team, this means we might end up fetching different versions of a package.

  • Since go get installs packages into the single $GOPATH/src directory, all of our projects have to share the same version of a package; individual projects cannot keep their own versions of dependencies.


Package Management Using Glide

Now that we have seen how Go handles imports and manages packages, and some of the difficulties a developer faces when handling dependencies, let us look at how to solve them.

There are many tools that Go developers have used to handle packages, such as godep and glide, and we are going to explain the one we are using: Glide.


What is Glide?

Glide is a package management tool for the Go language. It downloads dependencies from different sources, locks their versions so that every team member downloads exactly the same version, and updates dependencies without breaking the project.
Figure - Glide in GoLang

Installing Glide

  • First, we install glide using its shell script: curl https://glide.sh/get | sh
It fetches the latest release of glide and puts the binary into your Go binaries directory ($GOPATH/bin or $GOBIN). A typical workflow after installation is sketched below.
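Once glide is on your PATH, the everyday workflow looks roughly like the following sketch (run inside your project directory under $GOPATH/src; the gorilla/mux path is only an example) -

$ glide create                        # inspect the project and generate glide.yaml
$ glide get github.com/gorilla/mux    # add a dependency and record it in glide.yaml
$ glide install                       # install the versions pinned in glide.lock into vendor/
$ glide up                            # update dependencies and regenerate glide.lock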

Continue Reading The Full Article at - Xenonstack.com/Blog

Thursday 14 September 2017


Introduction

Microservices is an approach to developing an application by splitting it into smaller services, where each module runs in its own process and communicates with the others through a lightweight mechanism. These services are independently deployed and fully automated.
Microservices is a relatively new term that has changed the architecture of software development; moreover, it changes the working culture of teams and the way they work together.


What are Microservices?

Microservices architecture divides a complex application into smaller modules. This provides a number of benefits over the monolithic architecture.

Each module in a microservices architecture is deployed independently of the other modules. Microservices take this approach all the way to independent services. With microservices, we can adopt new technology more quickly and understand how new advancements may help us.

In short, the microservices architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating through lightweight mechanisms, often an HTTP resource API.

These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.


Multiple services work together to form the whole application. Services such as service A, service B, and service C work independently and collaborate for the functioning of the application. These chunks of services are called microservices.


Benefits of Microservices

The benefits of microservices are many and varied. Many of these benefits can be laid at the door of any distributed system. Microservices, however, tend to achieve these benefits to a greater degree primarily due to how far they take the concepts behind distributed systems and service-oriented architecture.

Figure - Traditional Architecture Vs Microservices Architecture

Deployed Independently

The most important benefit of microservices is that each one is deployed independently of the other modules. If changes need to be made to one module, the other modules remain unaffected, and if one module stops working, only that module is affected rather than the application as a whole.


Parallelize Development

Because the application is broken down into smaller modules, the modules can be developed in parallel, with different developers working on different modules at the same time.


Need of Testing in Microservices

It is very important to test microservices so that we can be confident about the assumptions made for each service, i.e. that it does what it says it does. Testing microservices is the first step in making a service reliable for its users, and because microservices depend on each other functionally, testing matters all the more if the services are to stay robust.


Challenges of Microservices

Microservices architecture consists of numerous tiny, independent services that integrate with each other to form the whole application. These microservices interact with each other in the production environment, and although each small service is simple on its own, many complexities arise as they communicate with each other and as the granularity of the application increases.


Testing Strategies

It is essential that a microservice application is built with an awareness of how it can be tested. Having good test coverage gives you more confidence in your code and results in a better continuous delivery pipeline.
Because this is a new architectural approach, it requires a new approach to automated testing and quality assurance, one that divides the tests into specific layers. There are five layers of tests that are performed over microservices -
Figure - Different Testing Strategies in Microservices Architecture


Unit Testing

Tests a single class or a set of closely coupled classes. These unit tests can either be run using the actual objects that the unit interacts with or by employing the use of test doubles or mocks.

In Unit testing, the smallest piece of testable software is tested in the application to determine whether it behaves as expected or not. Tests are typically run at the class level or around a small group of related classes. In unit testing, an important distinction is seen based on whether or not the unit under test is isolated from its collaborators.
Figure - Unit Test Vs Production Code
Unit tests are usually written by the programmers using their regular tools - the only difference being the use of some sort of unit testing framework.
There are two further types of testing within unit testing -
  • Sociable Unit Testing
  • Solitary Unit Testing

Sociable Unit Testing - It focuses on testing the behaviour of modules by observing changes in their state. This treats the unit under test as a black box tested entirely through its interface.

Solitary Unit Testing - It looks at the interactions and collaborations between an object and its dependencies, which are replaced by test doubles.
Imagine you're testing an order class's price method. The price method needs to invoke some functions on the product and customer classes. If you like your unit tests to be solitary, you don't want to use the real product or customer classes here, because a fault in the customer class would cause the order class's tests to fail. Instead you use Test doubles for the collaborators.
Figure - Sociable Tests
Unit testing alone does not provide a guarantee about the behaviour of the system. Unit testing covers the core testing of each module, but it does not cover the cases where these modules collaborate and interact with remote dependencies.

Continue Reading The Full Article at - Xenonstack.com/Blog

Wednesday 13 September 2017


What is SEMANTIC ANALYSIS?

The word semantic is a Linguistic term. It means something related to meaning in a language or logic.
In natural language, semantic analysis relates the structures and occurrences of words, phrases, clauses, paragraphs, etc., and works out the idea of what is written in a particular text: does the formation of the sentences and the occurrence of the words make any sense?
The challenge we face in the technologically advanced world is to make the computer understand language and logic as well as a human does.
Semantic analysis requires rules to be defined for the system. These rules mirror the way we think about a language, and we ask the computer to imitate them. For example, “apple is red” is a simple sentence from which a human understands that there is something called an apple and that it is red in colour, knowing that red is a colour.
For a computer, this is an alien language. The linguistic concept here is that this sentence has a structure: subject-predicate-object, or s-p-o for short, where "apple" is the subject, "is" is the predicate, and "red" is the object. Similarly, other linguistic nuances are used in semantic analysis.


Need for Semantic Analysis

The reason why we want the computer to understand as much as we do is that we have a lot of data and we have to make the most of it.
Let us restrict ourselves strictly to text data. Extracting the appropriate data (results) for a query is a challenging task. The result can be a whole document or just an answer to the query, depending on the query itself.
Assume that we have a million text documents in our database and a query whose answer lies somewhere in those documents. The challenges are -
  • Getting the appropriate documents
  • Listing them in the ranked order
  • Giving the answer to the query if it is specific

Difference between Keyword-based Search and Semantic Search

In a search engine, keyword-based search is the technique of searching text documents based on the words found in the query. The query is first processed for text cleaning and preprocessing, and then the documents are searched based on the words used in the query.
The documents are returned based on the number of matches between the query words and the documents.
In semantic search, we also take into account the frequency of the words, the syntactic structure of the natural language, and other linguistic elements; the system understands the exact requirement of the search query.
When we search for “Usain Bolt” in Google, it returns the most appropriate documents and web pages about the famous athlete, despite many more people having the same name, because the search engine understands that we are searching for an athlete.
Figure - Keyword Based Search

Now, if we are a little more specific and search for Usain Bolt's birthday, Google returns it as -
Figure - Semantic based Search

So, since Usain Bolt is quite a famous figure, this might not be surprising to us. But there are a large number of other famous personalities, and it is close to impossible to store all of that information manually and surface it accurately when a user issues a query.
Moreover, the search query is not constant; each individual may phrase the query differently. Semantic techniques are applied here to store the data and fetch the results upon querying.
Let us see a different way of querying the above on Google.
Figure - Different Ways of Searching on Google

From above figures, it is evident that whatever way you give the search query, the search engine understands the intent of the user.


Semantic Search based on Domain Ontology

Earlier, we saw the search efficiency of Google, which searches irrespective of any particular domain. Searches of this kind are based on open information extraction. What if we need a search engine for a specific domain?
The domain may be anything: a college, a particular sport, a specific subject, a famous location, tourist spots, etc. For example, suppose we have a college and we want to create a search engine only for that college, such that any text query regarding the college is answered by the search engine. For this purpose, we create a domain ontology.


What is Ontology?

An ontology is a set of concepts, their definitions, descriptions, properties, and relations; the relations include both relations among concepts and relations among relations.


How do we create Ontology?

Before starting to create an ontology, we first choose the domain of consideration and list all the concepts related to that domain along with their relations. A data structure is already defined to represent the ontology; ontologies are created as .owl files.
An OWL file consists of concepts as classes, and for each class there are subclasses, properties, instances, data types, and much more. All of this information is in XML form. For simplicity, there are tools available to create ontologies, such as Protege.

Figure - Creating Ontology


Storing the Unstructured Text Data in RDF Form

The ontology is created from the concepts, and we are ready to use it to find the appropriate document for a query in a search engine. The text documents, which are available in unstructured form, need a structure, and we call it a semantic structure.
Thanks to RDF (Resource Description Framework), we can store the information contained in text in triples form. These triples take the same s-p-o form that we discussed earlier.
Machine learning and text analysis are used to extract the required data and store it in the form of triples. This way the knowledge base is ready: both the ontology and the structured form of the text data as RDF.
Figure - Storing Unstructured Data in RDF Form

Architecture for Real-Time Semantic Search Engine
Implementation of the architecture on the "Computer Science" domain -
The complete architecture for the search engine would be delivered as "Platform as a Service (PaaS)". Let us consider "Computer Science" as the domain. Here, the user can search for faculty CVs from the desired universities and research areas based on a query. The steps to build a semantic search engine are -
  • Crawl the documents (DOC, PDF, XML, HTML, etc.) from various universities and classify faculty profiles
  • Convert the unstructured text present in various formats into structured RDF form, as described in the earlier sections.
  • Build an ontology for the Computer Science domain
  • Store the data in the Apache Jena triple store (both the ontology and the RDFs)
  • Use the SPARQL query language on the data (a minimal sketch of these last two steps is shown after this list)
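To make the last two steps concrete, here is a minimal sketch using Apache Jena's command-line tools, assuming Jena is installed and on the PATH; the file names, the TDB location, and the example triple are placeholders rather than the project's actual data.

$ cat <<'EOF' > faculty.ttl
@prefix ex: <http://example.org/cs#> .
ex:JohnDoe ex:researchArea "Machine Learning" ;
           ex:worksAt      "Example University" .
EOF

$ tdbloader --loc=/data/cs-tdb faculty.ttl        # load the RDF triples into a TDB triple store

$ cat <<'EOF' > faculty.rq
PREFIX ex: <http://example.org/cs#>
SELECT ?faculty WHERE { ?faculty ex:researchArea "Machine Learning" . }
EOF

$ tdbquery --loc=/data/cs-tdb --query=faculty.rq  # run the SPARQL query against the store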

Finally, the user can search for the required data by university or research area. Additionally, if a user has project information, i.e. any project of his/her own regarding Computer Science, the user can submit the project and the system analyzes it to identify appropriate faculty profiles working in a similar subject area. Various Big Data components are necessary to make the search engine able to search in real time.

Figure - Linguistic and Semantic Search


Data Ingestion using our Web Crawler Service

Starting with the data extraction process, a web crawler was built which scrapes the content from any university or educational website. This web crawler is built using the Akka framework, which is highly scalable, concurrent, and distributed. It supports almost all types of files, such as HTML, DOC, PDF, and text files, and even images.
Continue Reading The Full Article at - XenonStack.com/Blog

Tuesday 12 September 2017


Introduction to Time Series Data

Time series is defined as a set of observations taken at particular periods of time. For example, a set of login details collected at regular intervals of time for each user can be categorized as a time series. On the other hand, when data is collected only once or irregularly, it is not treated as time series data.
Time series data can be classified into two types -
  • Stock Series - It is a measure of attributes at a particular point in time, taken as a stock-take.
  • Flow Series - It is a measure of activity over a specific interval of time. It contains effects related to the calendar.
A time series is a sequence taken successively at equally spaced points in time. It appears naturally in many application areas such as economics, science, environment, medicine, etc. There are many practical real-life problems where the data are observed sequentially at equal periods of time and are correlated with each other. This is because, if we repeatedly observe the data at regular intervals of time, it is natural that the observations will be correlated with each other.

With the use of time series, it becomes possible to predict what will happen in the future, as future events depend upon the current situation. It is useful to divide the time series into a historical period and a validation period. The model is built to make predictions on the basis of the historical data and is then applied to the validation set of observations. This process develops an idea of how the model will perform in forecasting.

A time series is also known as a stochastic process, as it represents a vector of stochastic variables observed at regular intervals of time.

Components of Time Series Data

In order to analyze the time series data, there is a need to understand the underlying pattern of data ordered at a particular time. This pattern is composed of different components which collectively yield the set of observations of time series.
The Components of time series data are given below -
  • Trend
  • Cyclical
  • Seasonal
  • Irregular


Trend - It is a long-term pattern present in the time series. It can produce irregular effects and can be positive, negative, linear, or nonlinear. It represents the low-frequency variations, with the high- and medium-frequency variations filtered out of the time series.

If the time series does not contain any increasing or decreasing pattern, then the time series is said to be stationary in the mean.

There are two types of the trend -
  1. Deterministic - In this case, the effects of the shocks present in the time series are eliminated, i.e. the series reverts to the trend in the long run.
  2. Stochastic - It is the process in which the effects of shocks are never eliminated, as they permanently change the level of the time series.
A stochastic process that is stationary around a deterministic trend is known as a trend stationary process.

Cyclic - The pattern of up and down movements around a specified trend is known as a cyclic pattern. It is a kind of oscillation present in the time series. The duration of the cyclic pattern depends upon the industry and the business problem being analysed, because the oscillations depend upon the business cycle.

These are larger variations that are repeated in a systematic way over time. The period is not fixed and is usually at least 2 months in duration. The cyclic pattern is represented by a well-shaped curve and shows the contraction and expansion of data.

Seasonal - It is a pattern that reflects regular fluctuations. These short-term movements occur due to seasonal factors and the customs of people. In this case, the data shows regular and predictable changes that occur at regular calendar intervals. It always consists of a fixed and known period.
The main sources of seasonality are given below -
  • Climate
  • Institutions
  • Social habits and practices
  • Calendar
How is the seasonal component estimated?

If a deterministic analysis is performed, then the seasonality remains the same for similar intervals of time, so it can easily be modelled by dummy variables. This does not hold for a stochastic analysis, where dummy variables are not appropriate because the seasonal component changes throughout the time series.

Different models to create a seasonal component in time series are given below -
  • Additive Model - It is the model in which the seasonal component is added to the trend component (for example, series = trend + seasonal + irregular).
  • Multiplicative Model - In this model the seasonal component is multiplied by the intercept if the trend component is not present in the time series. But if the time series has a trend component, the sum of the intercept and the trend is multiplied by the seasonal component.
Irregular - It is the unpredictable component of a time series. It cannot be explained by any other component, which is why these variational fluctuations are known as the random component. When the trend-cycle and seasonal components are removed, what remains is the residual time series. These are short-term fluctuations that are not systematic in nature and have unclear patterns.


Difference between Time Series Data and Cross-Section Data

Time series data is a collection of observations of one specific variable at particular intervals of time, whereas cross-section data is a collection of data on multiple variables from different sources at one particular point in time.
A company's stock market data collected at regular yearly intervals is an example of time series data, whereas a collection of the company's sales revenue and sales volume for the past 3 months is an example of cross-section data.
Time series data is mainly used for obtaining results over an extended period of time, while cross-section data focuses on the information received from surveys at a particular time.


What is Time Series Analysis?

Performing analysis of time series data is known as time series analysis. The analysis is performed in order to understand the structure and the functions that produce the time series. By understanding the mechanism behind time series data, a mathematical model can be developed so that further prediction, monitoring, and control can be performed.
The two approaches used for analyzing time series data are -
  • In the time domain
  • In the frequency domain
Time series analysis is mainly used for -
  • Decomposing the time series
  • Identifying and modeling the time-based dependencies
  • Forecasting
  • Identifying and modeling the system variation


Need of Time Series Analysis

Modelling a time series successfully is important in machine learning and deep learning. Time series analysis is used to understand the internal structure and the functions that produce the observations. Time series analysis is used for -
  • Description - Patterns are identified in the correlated data; in other words, variations in trend and seasonality in the time series are identified.
  • Explanation - Understanding and modelling of the data is performed.
  • Forecasting - The prediction of short-term trends from previous observations is performed.
  • Intervention Analysis - The effect of any event on the time series data is analyzed.
  • Quality Control - An alert is raised when a specific measurement deviates from its specified size.


Applications of Time Series Analysis

Figure - Applications of Time Series Analysis


Time Series Database and its types
A time series database is software used for handling time series data. Highly complex data, such as high-volume transactional data, is not feasible for a relational database management system, and many relational systems do not work well for time series data. Therefore, time series databases are optimised for time series data. Various time series databases are given below -
  • CrateDB
  • Graphite
  • InfluxDB
  • Informix TimeSeries
  • Kx kdb+
  • Riak-TS
  • RRDtool
  • OpenTSDB
Figure - Types of Time Series Database


What is Anomaly?

An anomaly is defined as something that deviates from normal behaviour or from what is expected. For more clarity, let's take the example of a bank transaction. Suppose you have a savings bank account and you mostly withdraw Rs 10,000, but one day Rs 6,00,000 is withdrawn from your account. This is unusual activity for the bank, as mostly Rs 10,000 is withdrawn from this account, so the transaction is an anomaly for the bank's employees.
An anomaly is a kind of contradictory observation in the data. It gives proof that a certain model or assumption does not fit the problem statement.

Different Types of Anomalies

Different types of anomalies are given below -
  • Point Anomalies - If a specific value within the dataset is anomalous with respect to the complete data, it is known as a point anomaly. The above-mentioned bank transaction is an example of a point anomaly.
  • Contextual Anomalies - If the occurrence of data is anomalous for a specific circumstance, it is known as a contextual anomaly. For example, an anomaly that occurs only at a specific interval or period.
  • Collective Anomalies - If a collection of occurrences of data is anomalous with respect to the rest of the dataset, it is known as a collective anomaly. For example, breaking the trend observed in an ECG.
Continue Reading The Full Article at - XenonStack.com/Blog

Monday 11 September 2017


What is Data Collection and Ingestion?

Data collection and data ingestion are the processes of fetching data from any data source, which we can perform in two ways -
In today's world, enterprises are generating data from different sources and building real-time data lakes; we need to integrate the various sources of data into one stream.

In this blog we are sharing how to ingest, store, and process Twitter data using Apache NiFi, and in coming blogs we will be sharing data collection and ingestion from the sources below -
  • Data ingestion From Logs
  • Data Ingestion from IoT Devices
  • Data Collection and Ingestion from RDBMS (e.g., MySQL)
  • Data Collection and Ingestion from ZiP Files
  • Data Collection and Ingestion from Text/CSV Files

 
Objectives for the Data Lake

  • A Central Repository for Big Data Management
  • Reduce costs by offloading analytical systems and archiving cold data
  • Testing Setup for experimenting with new technologies and data
  • Automation of Data pipelines
  • MetaData Management and Catalog
  • Tracking measurements with alerts on failure or violations
  • Data Governance with clear distinction of roles and responsibilities
  • Data Discovery, Prototyping, and experimentation

Goals of Data Ingestion

Different Objectives and Goals of Data Ingestion.


Apache NiFi

Apache NiFi provides an easy-to-use, powerful, and reliable system to process and distribute data across several resources.

Apache NiFi is used for routing and processing data from any source to any destination. The process can also perform some data transformation.

It is a UI-based platform where we define the source from which we want to collect data, the processors for converting the data, and the destination where we want to store it.

Each processor in NiFi has relationships such as success, retry, failed, invalid data, etc., which we can use while connecting one processor to another. These links help in transferring the data to storage or to another processor even after a processor failure.

Benefits of Apache NiFi

  • Real-time/Batch Streaming
  • Support both Standalone and Cluster mode
  • Extremely Scalable, extensible platform
  • Visual Command and Control
  • Better Error handling

Core Features of Apache NiFi

  • Guaranteed Delivery - A core philosophy of NiFi has been that even at very high scale, guaranteed delivery is a must. It is achievable through efficient use of a purpose-built persistent write-ahead log and content repository.
  • Data Buffering / Back Pressure AND Pressure Release - NiFi supports buffering of all queued data as well as the ability to provide back pressure as those queues reach specified limits, or to age off data as it reaches a specified age (when its value has perished).
  • Prioritized Queuing - NiFi allows the setting of one or more prioritization schemes for how data from a queue is retrieved. The default is oldest first, but it can be configured to pull newest first, largest first, or some other custom scheme.
  • Flow Specific QoS - There are points in a data flow where the data is absolutely critical and loss-intolerant. There are also times when it must be processed and delivered within seconds to be of any value. NiFi enables fine-grained, flow-specific configuration of these concerns.
  • Data Provenance - NiFi automatically records, indexes, and makes available provenance data as objects flow through the system even across fan-in, fan-out, transformations, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
  • Recovery / Recording a rolling buffer of fine-grained history - NiFi's content repository is designed to act as a rolling buffer of history. Data is removed from the content repository only as it ages off or as space is needed.
  • Visual Command and Control - NiFi enables the visual establishment of data flows in real time and provides a UI-based approach to building different data flows.
  • Flow Templates - It also allows us to create templates of frequently used data streams. It can also help in migrating the data flows from one machine to another.
  • Security - NiFi supports Multi-tenant Authorization. The authority level of a given data flow applies to each component, allowing the admin user to have a fine grained level of access control. It means each NiFi cluster is capable of handling the requirements of one or more organizations.

  • Parallel Stream to Multiple Destinations - With NiFi we can move data to multiple destinations at one time. After processing the data stream, we can route the flow to various destinations using NiFi's processors. This can be helpful when we need to back up our data to multiple destinations.

Figure - Benefits and Core Features of Apache NiFi

 

NiFi Clustering

When we need to move a large amount of data, a single instance of NiFi is not enough to handle it. To handle this we can cluster the NiFi servers, which helps us scale.

We just need to create the data flow on one node, and this will make a copy of this data flow on each node in the cluster.

NiFi introduced the Zero-Master Clustering paradigm in Apache NiFi 1.0.0. Previous versions of Apache NiFi relied upon a single “Master Node” (more formally known as the NiFi Cluster Manager).

If the master node was lost, data continued to flow, but the application was unable to show the topology of the flow or show any stats. With Zero-Master clustering we can make changes from any node of the cluster, and if the elected node disconnects, another active node is automatically elected in its place.

Each node has the same data flow, so all nodes work on the same task, but each operates on a different dataset.

In a NiFi cluster, one node is elected as the Cluster Coordinator, and every other node sends heartbeats/status information to it. This node is responsible for disconnecting nodes that do not send any heartbeat/status information.

This election is done via Apache ZooKeeper, and when the elected node gets disconnected, Apache ZooKeeper elects another active node in its place.

 

Data Collection and Ingestion from Twitter using Apache NiFi to Build Data Lake

 

Fetching Tweets with NiFi’s Processor

NiFi’s ‘GetTwitter’ processor is used to fetch tweets. It uses the Twitter Streaming API to retrieve tweets. In this processor, we need to define the endpoint we want to use. We can also apply filters by location, hashtags, or particular IDs.
  • Twitter Endpoint - Here we can set the endpoint from which data should get pulled. Available options -
 - Sample Endpoint - Fetch public tweets from all over the world.
 - Firehose Endpoint - This is the same as the streaming API, but it ensures 100% guaranteed delivery of tweets with filters.
 - Filter Endpoint - Use this if we want to filter by any hashtags or keywords.
  • Consumer Key - Consumer key provided by Twitter.
  • Consumer Secret - Consumer Secret provided by Twitter.
  • Access Token - Access Token provided by Twitter.
  • Access Token Secret - Access Token Secret provided by Twitter.
  • Languages - Languages for which tweets should be fetched.
  • Terms to Filter - Hashtags or keywords for which tweets should be fetched.
  • IDs to follow - Twitter user IDs that should be followed.
Figure - Fetching Tweets With NiFi Processor
Now the GetTwitter processor is ready to transmit the data (tweets). From here we can move our data stream anywhere, such as Amazon S3, Apache Kafka, Elasticsearch, Amazon Redshift, HDFS, Hive, Cassandra, etc. NiFi can move data to multiple destinations in parallel.

 

Data Integration Using Apache NiFi and Apache Kafka

For this, we use the NiFi processor ‘PublishKafka_0_10’.
In the Scheduling tab, we can configure how many concurrent tasks to execute and schedule the processor.

In the Properties tab, we can set up our Kafka broker URLs, topic name, request size, etc. The processor will write data to the given topic. For best results, we can create the Kafka topic manually with a defined number of partitions, as sketched below.
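A minimal sketch of creating such a topic ahead of time with the standard Kafka 0.10 command-line tool; the ZooKeeper address, topic name, and partition count are placeholders -

$ kafka-topics.sh --create \
    --zookeeper localhost:2181 \
    --replication-factor 1 \
    --partitions 3 \
    --topic tweets        # the topic that PublishKafka_0_10 will write to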

Apache Kafka can be used to process data with Apache Beam, Apache Flink, Apache Spark.

Figure - Data Integration Using Apache NiFi and Apache Kafka

 

Data Integration Using Apache NiFi to Amazon RedShift with Amazon Kinesis Firehose Stream

Figure - Data Integration Using Apache NiFi to Amazon Redshift with Amazon Kinesis Firehose Stream

Now we integrate Apache NiFi to Amazon Redshift. NiFi uses Amazon Kinesis Firehose Delivery Stream to store data to Amazon Redshift.

This delivery stream can be used for moving data to Amazon Redshift, Amazon S3, or Amazon Elasticsearch Service; we specify the destination while creating the Amazon Kinesis Firehose delivery stream.

Since we want to move data to Amazon Redshift, we first need to configure the Amazon Kinesis Firehose delivery stream accordingly. While delivering data to Amazon Redshift, the data is first written to an Amazon S3 bucket, and then the Amazon Redshift COPY command is used to move it to the Amazon Redshift cluster.

We can also enable data transformation while creating the Kinesis Firehose delivery stream, and we can back up the data to another Amazon S3 bucket in addition to the intermediate bucket.

For this we use the processor PutKinesisFirehose. This processor uses the Kinesis Firehose stream to deliver data to Amazon Redshift. Here we configure the AWS credentials and the Kinesis Firehose delivery stream; a quick way to confirm the stream exists is shown below.
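Before pointing PutKinesisFirehose at the stream, it can help to confirm with the AWS CLI that the stream exists and is ACTIVE; the stream name below is a placeholder for whatever you created in the Firehose console -

$ aws firehose describe-delivery-stream --delivery-stream-name twitter-to-redshift \
      --query 'DeliveryStreamDescription.DeliveryStreamStatus'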


Data Integration Using Apache NiFi to Amazon S3




PutKinesisFirehose sends data to Amazon Redshift and uses Amazon S3 as the intermediary. If someone only wants to use Amazon S3 as the storage, NiFi can also send data to Amazon S3 alone. For this, we need to use the NiFi processor PutS3Object, in which we configure our AWS credentials, bucket name, path, etc.

Figure - Data Integration Using Apache NiFi to Amazon S3
Continue Reading The Full Article at - XenonStack.com/Blog

Friday 8 September 2017



Kubernetes is an open-source container orchestration engine and also an abstraction layer for managing the full stack operations of hosts and containers: deployment, scaling, load balancing, and rolling updates of containerized applications across multiple hosts within a cluster. Kubernetes makes sure that your applications are in the desired state.
Kubernetes 1.7 was released on 29th June 2017 with new features that address the most demanding enterprise environments, including features related to security, stateful applications, and extensibility. With Kubernetes 1.7 we can now store secrets in namespaces in a much better way; we'll discuss that below.


Table of Contents -
  • Introduction to Twelve-Factor App
  • Kubernetes Architecture
  • Kubernetes Components
  • Kubernetes Security
  • Monitoring of Kubernetes
  • Open Source Tools For Kubernetes
  • Applications of Kubernetes
  • What's New in Kubernetes 1.7


Introduction To The Twelve-Factor App For Microservices

In the modern era, software is commonly delivered as a service: web apps, or software as a service. The twelve-factor app is a methodology for building software-as-a-service apps that -
  • Minimize time and cost for new developers joining the project.
  • Offer maximum portability between execution environments.
  • Are suitable for deployment on modern cloud platforms.
  • Obviate the need for servers and systems administration.
  • Minimize divergence between development and production.
  • Enable continuous deployment for maximum agility.
  • Can scale up without significant changes to tooling and architecture.
The twelve-factor methodology can be applied to apps written in any programming language and using any combination of backing services (database, queue, memory cache, etc.).


The Twelve Factors

1. Codebase
  • A twelve factor app should have only one codebase per app, but there will be many deploys of the app. A deploy is a running instance of an app.
  • A twelve-factor app is always tracked in a version control system. A copy of the revision tracking database is known as a code repository.
2. Dependencies
  • A twelve factor app never relies on implicit existence of system wide packages. It declares all dependencies, completely and exactly.
3. Config
  • A twelve factor app should store config in the environment. An app’s config is everything that is likely to vary between deploys (staging, production, developer environments etc).
4. Backing Services
  • A backing service is any service the app consumes over the network as part of its normal operation. The code for a twelve-factor app makes no distinction between local and third party services.
  • A deploy of the twelve-factor app should be able to swap out a local MySQL database with one managed by a third party without any changes to the app’s code.
5. Build, Release, Run
  • The twelve-factor app uses strict separation between the build, release, and run stages. Transformation of the code repo into an executable bundle is known as a build. The combination of the build and the current config makes a release, and the run stage runs the app in the execution environment.
6. Processes
  • The app is executed in the execution environment as one or more processes. Twelve-factor processes are stateless and share nothing. Any data that needs to persist must be stored in a stateful backing service typically a database.
7. Port Binding
  • The twelve-factor app is completely self-contained and does not rely on runtime injection of a webserver into the execution environment to create a web-facing service. The web app exports HTTP as a service by binding to a port, and listening to requests coming in on that port.
8. Concurrency
  • Processes in the twelve-factor app are scaled out via the process model. Using this model, the developer can architect their app to handle diverse workloads by assigning each type of work to a process type.
9. Disposability
  • The twelve-factor app’s processes are disposable, meaning they can be started or stopped at a moment’s notice. This facilitates fast elastic scaling, rapid deployment of code or config changes, and robustness of production deploys.
10. Development / Production Parity
  • We have to keep development, staging, and production as similar as possible, with as small a time gap, personnel gap, and tool gap as possible between development and production. This helps with continuous deployment.
11. Logs
  • Treat logs as event streams. A twelve-factor app never concerns itself with routing or storage of its output stream, and should not attempt to write to or manage log files. Instead, each running process writes its event stream, unbuffered, to stdout.
12. Admin Processes
  • Run admin or management tasks as one off processes. Admin tasks are such as running one time scripts, running database migrations, running a console to run arbitrary code.


Kubernetes Architecture 

The Kubernetes cluster operates in a master-worker architecture, in which the Kubernetes master receives all management tasks and dispatches them to the appropriate Kubernetes worker node based on the given constraints.
  • Master Node
  • Worker Node


Kubernetes Components

Below I have created two sections so that you can better understand the components of the Kubernetes architecture and where exactly we use them.


Master Node


Figure - Kubernetes Master Node Architecture

Figure - Kubernetes Architecture


Kube API Server

The Kubernetes API server is the central point of contact for the Kubernetes cluster, handling authentication, authorization, and all other operations against the cluster. The API server stores all of its information in the etcd database, which is a distributed data store.


Etcd

etcd is a database that stores data in the form of key-value pairs. It supports a distributed architecture and high availability with a strong consistency model. etcd was developed by CoreOS and is written in Go. Kubernetes components store all kinds of information in etcd, such as metrics, configuration, and other metadata about the pods, services, and deployments of the Kubernetes cluster.
Figure - etcd Structure
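Purely to illustrate the key-value model (Kubernetes writes these keys itself and you normally never edit them by hand), etcd can be exercised directly with its CLI; this sketch assumes an etcd v3 endpoint, and on a real cluster you would also pass the endpoint and TLS certificate flags -

$ ETCDCTL_API=3 etcdctl put /demo/message "hello"                  # store a key-value pair
$ ETCDCTL_API=3 etcdctl get /demo/message                          # read it back
$ ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only | head  # peek at the keys Kubernetes stores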


Kube Controller Manager

The Kube Controller Manager is the component of the Kubernetes cluster that manages the replication and scaling of pods. It continuously works to bring the Kubernetes system to the desired state by using the Kubernetes API server.
There are other controllers in the Kubernetes system as well, such as -
  • Endpoints controller
  • Namespace controller
  • Service accounts controller
  • DaemonSet Controller
  • Job Controller


Kube Scheduler

The Kube Scheduler is another main component of the Kubernetes architecture. It checks the availability, performance, and capacity of the Kubernetes worker nodes and plans the creation/destruction of pods within the cluster, so that the cluster remains stable in terms of performance, capacity, and availability for new pods.
It analyses the cluster and reports back to the API server, which stores all metrics related to cluster resource utilisation, availability, and performance.
It also schedules pods to specific nodes according to the submitted manifest for the pod.


Worker Node

Figure - Kubernetes Worker Node Architecture


Kubelet

The kubelet is the worker-node component of the Kubernetes architecture responsible for node-level pod management.
The API server sends HTTP requests to the kubelet API to execute the pod definitions from the manifest file on the worker nodes, and the kubelet also makes sure the containers are running and healthy. The kubelet talks directly to container runtimes such as Docker or rkt.


Kube Proxy

The Kube Proxy is the networking component of the Kubernetes architecture. It runs on every node of the Kubernetes cluster.
  • It handles DNS entries for services and pods.
  • It provides hostnames and IP addresses to pods.
  • It forwards traffic from the Cluster/Service IP address to the specified set of pods.
  • It alters iptables rules on all nodes so that different pods can talk to each other and to the outside world.


Docker

Docker is an open-source container runtime developed by Docker, Inc. to build, run, and share containerized applications. Docker is focused on running a single application in one container, with the container as the atomic unit of the building block.
  • Lightweight
  • Open-Source
  • Most Popular
Figure - Docker Runtime Engine


RKT

Rocket (rkt) is another container runtime for containerized applications. Rocket was developed by CoreOS, has a stronger focus on security, and follows open standards for building the Rocket runtime.
  • Open-Source
  • Pod-native approach
  • Pluggable execution environment


Supervisor

It is a lightweight process management system that keeps the kubelet and the container engine in a running state.


Fluentd

Fluentd is an open source data collector for kubernetes cluster logs.


Terminology

  • Nodes

Kubernetes nodes are the worker nodes in the Kubernetes cluster. A Kubernetes worker node can be a virtual machine or a bare-metal server.

A node has all the required services to run any kind of pod and is managed by the master node of the Kubernetes cluster.

The following are a few of the services that run on a node -
1. Docker
2. Kubelet
3. Kube-Proxy
4. Fluentd
  • Containers

A container is a standalone, executable package of a piece of software that includes everything like code, runtime, libraries, configuration.
1. Supports both Linux and Windows based apps
2. Independent of the underlying infrastructure.

Docker and CoreOS are the main leaders in the container race.
  • Pods

Pods are the smallest unit of the Kubernetes architecture. A single pod can contain more than one container. A pod is modelled as a group of Docker containers with shared namespaces and shared volumes.

Example:- pod.yml

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
  • Deployment

A Deployment is JSON or YAML file in which we declare Pods and Replica Set definitions. We just need to describe the desired state in a Deployment object, and the Deployment controller will change the actual state to the desired state at a controlled rate for you.
We can
  • Create new resources
  • Update existing resources
Example:- deployment.yml

align="justify">apiVersion: apps/v1beta1
kind: Deployment
metadata:
name: nginx-deployment
spec:
replicas: 3
template:
metadata:
labels:
app: nginx
spec:
containers:
-  name: nginx
image: nginx:1.7.9
ports:
- containerPort: 80
  • Service

A Kubernetes Service is also defined in YAML or JSON format. It creates a logical set of pods and defines policies for that set, such as which ports and which IP address are assigned. The Service identifies its set of target Pods by using a Label Selector.

align="justify">Example: - service.yml
kind: Service
apiVersion: v1
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
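A minimal sketch of applying these manifests with kubectl, assuming the examples above are saved as deployment.yml and service.yml in the current directory -

$ kubectl apply -f deployment.yml     # create or update the nginx Deployment
$ kubectl apply -f service.yml        # create or update the my-service Service
$ kubectl get pods                    # the three nginx replicas should reach Running
$ kubectl get svc my-service          # shows the ClusterIP assigned to the Service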

Continue Reading The Full Article at - XenonStack.com/Blog