XenonStack

A Stack Innovator


Wednesday, 24 May 2017


Overview of Artificial Intelligence and Role of Natural Language Processing in Big Data



Artificial Intelligence Overview




AI stands for ‘Artificial Intelligence’, which means making machines capable of performing intelligent tasks the way human beings do. AI carries out automated tasks by applying intelligence.

The term Artificial Intelligence has two key components -
    • Automation  
    • Intelligence

Goals of Artificial Intelligence







Stages of Artificial Intelligence


Stage 1 - Machine Learning - A set of algorithms used by intelligent systems to learn from experience.

Stage 2 - Machine Intelligence - Advanced sets of algorithms used by machines to learn from experience, e.g. Deep Neural Networks.

Artificial Intelligence technology is currently at this stage.

Stage 3 - Machine Consciousness - Self-learning from experience without the need for external data.





Types of Artificial Intelligence



ANI - Artificial Narrow Intelligence - It covers basic, single-role tasks such as those performed by chatbots and personal assistants like Siri by Apple and Alexa by Amazon.

AGI - Artificial General Intelligence - It covers human-level tasks such as those performed by Uber's self-driving cars or Tesla's Autopilot, and it involves continual learning by the machines.

ASI - Artificial Super Intelligence - It refers to intelligence far beyond human capability.

What Makes a System AI-Enabled









Difference Between NLP, AI, ML, DL & NN



AI or Artificial Intelligence - Building systems that can do intelligent things.

NLP or Natural Language Processing - Building systems that can understand language. It is a subset of Artificial Intelligence.

ML or Machine Learning - Building systems that can learn from experience. It is also a subset of Artificial Intelligence.

NN or Neural Network - Biologically inspired network of Artificial Neurons.

DL or Deep Learning - Building systems that use Deep Neural Network on a large set of data. It is a subset of Machine Learning.



What is Natural Language Processing?


Natural Language Processing (NLP) is "the ability of machines to understand and interpret human language the way it is written or spoken".

The objective of NLP is to make computer/machines as intelligent as human beings in understanding language.



The ultimate goal of NLP is to fill the gap between how humans communicate (natural language) and what computers understand (machine language).

There are three different levels of linguistic analysis performed before NLP (a small sketch follows this list) -

Syntax - Is the given text grammatically well-formed?
Semantics - What is the meaning of the given text?
Pragmatics - What is the purpose of the text?
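
As a small illustration of the syntactic level, here is a minimal sketch using the NLTK library (assuming the nltk package and its tokenizer/tagger data are installed) that tokenizes a sentence and assigns part-of-speech tags:

import nltk

# One-time resource downloads (needed on the first run)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "I found my wallet near the bank."

tokens = nltk.word_tokenize(sentence)   # split the text into words
tags = nltk.pos_tag(tokens)             # attach a part-of-speech tag to each word
print(tags)
# e.g. [('I', 'PRP'), ('found', 'VBD'), ('my', 'PRP$'), ('wallet', 'NN'), ...]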

NLP deals with different aspects of language, such as

  • Phonology - The systematic organization of sounds in a language.
  • Morphology - The study of word formation and the relationships between words.


Approaches of NLP for understanding semantic analysis

  • Distributional - It employs large-scale statistical techniques from Machine Learning and Deep Learning.
  • Frame-Based - Sentences which are syntactically different but semantically the same are represented inside a data structure (frame) for the stereotyped situation.
  • Theoretical - This approach is based on the idea that sentences refer to the real world (the sky is blue) and that parts of a sentence can be combined to represent its whole meaning.
  • Interactive Learning - It involves a pragmatic approach in which the user is responsible for teaching the computer to learn the language step by step in an interactive learning environment.


The true success of NLP lies in humans being deceived into believing that they are talking to humans instead of computers.

Why Do We Need NLP?


With NLP, it is possible to perform tasks like automated speech recognition and automated text writing in much less time.

With so much text data around us, why not use the computer's untiring willingness and ability to run several algorithms and perform these tasks in almost no time?

These tasks include other NLP applications like Automatic Summarization (generating a summary of a given text) and Machine Translation (translating one language into another).

Process of NLP


If the input is speech, speech-to-text conversion is performed first.

The mechanism of Natural Language Processing involves two processes:
  • Natural Language Understanding
  • Natural Language Generation

Natural Language Understanding


NLU or Natural Language Understanding tries to understand the meaning of the given text. The nature and structure of each word inside the text must be understood. To do so, NLU tries to resolve the following ambiguities present in natural language:

  • Lexical Ambiguity - A word has multiple meanings.
  • Syntactic Ambiguity - A sentence has multiple parse trees.
  • Semantic Ambiguity - A sentence has multiple meanings.
  • Anaphoric Ambiguity - A phrase or word refers back to something mentioned earlier, and the reference can be resolved in more than one way.


Next, the meaning of each word is understood by using lexicons (vocabulary) and a set of grammatical rules.

However, there are different words with similar meanings (synonyms) and words with more than one meaning (polysemy).
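
To see this polysemy concretely, here is a small sketch using NLTK's WordNet interface (assuming nltk and its WordNet corpus are installed); every sense returned for "bank" is a meaning the NLU layer has to choose between:

from nltk.corpus import wordnet

# nltk.download("wordnet") may be needed on first use
for synset in wordnet.synsets("bank"):
    print(synset.name(), "-", synset.definition())

# "bank" returns several senses (sloping land beside a river, a financial
# institution, ...) - exactly the lexical ambiguity described above.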

Natural Language Generation


It is the process of automatically producing text from structured data in a readable format, with meaningful phrases and sentences. Natural language generation is a hard problem to deal with, and it is a subset of NLP.

Natural language generation is divided into three stages (a toy sketch follows the list):

1. Text Planning - The basic content is ordered from the structured data.
2. Sentence Planning - Sentences are combined from the structured data to represent the flow of information.
3. Realization - Grammatically correct sentences are finally produced to represent the text.
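
A toy, purely illustrative sketch of the three stages (the record, field names and templates below are all hypothetical) could look like this in Python:

# Hypothetical structured record to verbalize
record = {"city": "Chandigarh", "sky": "clear", "temp_c": 31}

# 1. Text Planning - choose which facts to express and in what order
plan = ["city", "sky", "temp_c"]

# 2. Sentence Planning - map each fact onto a phrase
phrases = {
    "city": "In {city}".format(**record),
    "sky": "the sky is {sky}".format(**record),
    "temp_c": "the temperature is {temp_c} degrees Celsius".format(**record),
}

# 3. Realization - join the phrases into one grammatical sentence
parts = [phrases[key] for key in plan]
print(parts[0] + ", " + " and ".join(parts[1:]) + ".")
# -> In Chandigarh, the sky is clear and the temperature is 31 degrees Celsius.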

Difference Between NLP and Text Mining or Text Analytics


Natural language processing is responsible for understanding meaning and structure of given text.

Text Mining or Text Analytics is a process of extracting hidden information inside text data through pattern recognition.



Natural language processing is used to understand the meaning (semantics) of given text data, while text mining is used to understand structure (syntax) of given text data.

As an example, take the sentence "I found my wallet near the bank." The task of NLP is to work out whether 'bank' refers to a financial institution or a river bank.

What is Big Data?


According to Dr. Kirk Borne, Principal Data Scientist, Big Data can be described as "everything, quantified and tracked".

For More Details on Big Data, Please Read - Ingestion And Processing of Data For Big Data and IoT Solutions

NLP for Big Data is the Next Big Thing


Today, around 80% of all data is available only in raw form. Big Data comes from information stored in big organizations and enterprises: employee information, company purchase and sale records, business transactions, previous records of the organization, social media, and so on.

Although human language is ambiguous and unstructured from a computer's point of view, with the help of NLP this huge amount of unstructured data can be harnessed to surface the patterns inside it and better understand the information it contains.

NLP can solve big problems of the business world by using Big Data, whether the business is retail, healthcare, or a financial institution.

What is a Chatbot?


Chatbots or Automated Intelligent Agents


  • These are computer programs you can talk to through messaging apps, chat windows or voice-calling apps.
  • These are intelligent digital assistants used to resolve customer queries in a cost-effective, quick, and consistent manner.

Importance of Chatbots

Chatbots matter because digital customer care services are changing, and many routine queries are asked again and again.

Chatbots are most useful when customer service requests are specific to an area and highly predictable, involve a high volume of similar requests, and can be answered with automated responses.

Working of Chatbot





Knowledge Base - It contains the database of information used to equip the chatbot with everything it needs to respond to customer queries.

Data Store - It contains the interaction history of the chatbot with users.
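
To make the roles of the knowledge base and the data store concrete, here is a minimal rule-based sketch in Python (the entries and keyword matching are hypothetical; a production bot would use NLU instead of keyword matching):

# Knowledge Base - information the bot can answer from
knowledge_base = {
    "hours": "Our support desk is open 9am to 6pm IST, Monday to Friday.",
    "pricing": "You can find pricing details at xenonstack.com.",
}

# Data Store - interaction history of the chatbot with users
data_store = []

def reply(message):
    text = message.lower()
    answer = next((answer for keyword, answer in knowledge_base.items() if keyword in text),
                  "Sorry, let me connect you to a human agent.")
    data_store.append((message, answer))   # log the interaction
    return answer

print(reply("What are your support hours?"))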

Continue Reading About AI & NLP  At: XenonStack.com/Blog



Tuesday, 2 May 2017


Understanding Log Analytics, Log Mining & Anomaly Detection


What is Log Analytics


Technologies such as Machine Learning and Deep Neural Networks (DNNs) employ next-generation server infrastructure that spans immense Windows and Linux cluster environments.

Additionally, for DNNs, these application stacks involve not only traditional system resources (CPUs, memory) but also graphics processing units (GPUs).

With a non-traditional infrastructure environment, the Microsoft Research Operations team needed a highly flexible, scalable, and Windows and Linux compatible service to troubleshoot and determine root causes across the full stack.

Log Analytics supports log search through billions of records, Real-Time Analytics Stack metric collection, and rich custom visualizations across numerous sources.

These out of the box features paired with the flexibility of available data sources made Log Analytics a great option to produce visibility & insights by correlating across DNN clusters & components.

The relevance of a log file can differ from one person to another: specific log data may be useful for one user but irrelevant for another.

Useful log data can therefore get lost inside a large cluster, which makes log file analysis an important task these days.

When log data is managed in real time, it can be used for making decisions.

But as the volume of data increases, say to gigabytes, it becomes impossible for traditional methods to analyze such a huge log file and pick out the valid data, and ignoring the log data creates a huge gap in relevant information.

The solution to this problem is to train a Deep Learning Neural Network as a classifier for the log data, so that a human being no longer has to read the whole log file.

By combining the useful log data with Deep Learning, it becomes possible to gain optimum performance and comprehensive operational visibility.

Along with the analysis of log data, there is also a need to classify the log file into relevant and irrelevant data.

With this approach, time and effort are saved and close-to-accurate results can be obtained.
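
The post proposes a Deep Neural Network for this; as a minimal stand-in, here is a sketch that trains a simple relevant/irrelevant classifier with scikit-learn on TF-IDF features (the sample lines and labels are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled sample: 1 = relevant, 0 = irrelevant
lines = [
    "ERROR disk /dev/sda1 is full",
    "WARN memory usage above 90 percent",
    "INFO health check ok",
    "DEBUG heartbeat received",
]
labels = [1, 1, 0, 0]

# TF-IDF features + logistic regression stand in for the DNN classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(lines, labels)

# Classify a new log line as relevant (1) or irrelevant (0)
print(model.predict(["ERROR out of memory on node k8-1"]))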


Understanding Log Data


Before discussing the analysis of log files, we should first understand what a log file is.

A log is data produced automatically by the system; it stores information about the events taking place inside the operating system and keeps accumulating over time.

Log data can be presented in the form of a pivot table or a file, with the records arranged according to time.

Almost every software application and system produces log files. Examples include transaction logs, event logs, audit logs, server logs, etc.

Logs are usually application-specific; therefore, log analysis is a much-needed task for extracting valuable information from the log file.

  • Transaction Log (source: Database Management System) - It contains information about uncommitted transactions, changes made by rollback operations and changes that are not yet updated in the database. It is kept to preserve the ACID (Atomicity, Consistency, Isolation, Durability) properties in the event of a crash.
  • Message Log (source: Internet Relay Chat (IRC) and Instant Messaging (IM)) - In the case of IRC, it contains the server messages sent during the time the user is connected to a channel. To protect user privacy, IM tools can store messages in encrypted form as a message log; these logs require a password to decrypt and view.
  • Syslog (source: network devices such as web servers, routers, switches, printers, etc.) - Syslog messages provide information on where, when and why, i.e. IP address, timestamp and the log message. Each message carries two fields: facility (the source of the message) and severity (the degree of importance of the log message).
  • Server Log File (source: web servers) - It is created automatically and records information about each request in three parts: the IP address of the remote host, a timestamp and the document requested by the user.
  • Audit Logs (source: Hadoop Distributed File System (HDFS) and Apache Spark) - They record all the HDFS access activities taking place on the Hadoop platform.
  • Daemon Logs (source: Docker) - They provide details about the interaction between containers, the Docker service, and the host machine. By combining these interactions, the life cycle of the containers and disruptions within the Docker service can be identified.
  • Pod Logs (source: Kubernetes) - A pod is a collection of containers that share resources such as a single IP address and shared volumes; its logs cover those containers.
  • Amazon CloudWatch Logs (source: Amazon Web Services (AWS)) - They are used to monitor applications and systems using log data, i.e. to examine errors in the application and system, and also to store and access the system's log data.
  • Swift Logs (source: OpenStack) - These logs are sent to Syslog and managed by log level. They are used for monitoring the cluster, auditing records, extracting robust information about the server and much more.


Log Analysis Process


The steps for the processing of Log Analysis are described below:
  • Collection and Cleaning of data
  • Structuring of Data
  • Analysis of Data

Collection and Cleaning of data


First, log data is collected from various sources. The collected information should be precise and informative, since the type of data collected can affect performance; for this reason, information should be collected from real users. Each type of log contains a distinct type of information.

After collection, the data is represented in the form of a Relational Database Management System (RDBMS). Each record is assigned a unique primary key, and an Entity-Relationship model is developed to interpret the conceptual schema of the data.

Once the log data is arranged in a proper manner, data cleaning has to be performed, because corrupted log data may be present (a small parsing and cleaning sketch follows the list below).

Common reasons for corrupted log data are given below:
  • Crashing of disk where log data is stored
  • Applications are terminated abnormally
  • Disturbance in the configuration of input/output
  • Presence of virus in the system and much more
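
As a small sketch of collection and cleaning (the syslog-style pattern and sample lines below are assumptions), well-formed records are parsed into fields and corrupted lines are dropped:

import re

# Simplified syslog-style pattern: timestamp, host, process, message
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\s(?P<process>[\w\-/]+)(\[\d+\])?:\s(?P<message>.*)"
)

raw_lines = [
    "Mar 22 10:15:01 k8-master sshd[1022]: Accepted password for root",
    "###corrupted line###",
]

records = []
for line in raw_lines:
    match = LOG_PATTERN.match(line)
    if match:
        records.append(match.groupdict())   # keep well-formed records
    # else: drop corrupted or irrelevant lines during cleaning

print(records[0]["host"], "-", records[0]["message"])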

Structuring of Data


Log data is large as well as complex, so the way it is presented directly affects its ability to correlate with other data.

An important goal is that the log data can correlate directly with other log data, so that team members can build a deep understanding of it.

The steps implemented for structuring log data are given below:
  • Be clear about how the collected log data will be used.
  • Keep the same assets consistent across the data so that log values stay consistent; in other words, use naming conventions.
  • Correlation between objects is created automatically when nested files are present in the log data; it is better to avoid nested files in log data.

Analysis of Data: The next step is to analyze the structured log data. This can be performed with various methods such as Pattern Recognition, Normalization, Classification using Machine Learning, Correlation Analysis and much more.
Log Analysis




  

Importance of Log Analysis


Indexing and crawling are two important aspects. If content is not crawled and indexed, data will not be updated properly in time and the chance of duplicate values increases.

With log analytics, it is possible to examine crawling and indexing issues, for example by looking at how long Google takes to crawl the data and where Google spends most of its time.

In the case of large websites, it becomes difficult for the team to keep a record of the changes made on the website. With log analysis, updated changes can be tracked at regular intervals, which helps determine the quality of the website.

From a business point of view, frequent crawling of the website by Google is important, as it points towards the value of the products or services. Log analytics makes it possible to examine how often Google visits the site.

Changes made on the site should be picked up quickly in order to maintain the freshness of the content. This, too, can be determined through log analysis.

Log analysis also helps in acquiring genuinely informative data automatically and in measuring the level of security within the system.

Knowledge Discovery and Data Mining


The volume of data is increasing day by day, so there is a great need to extract useful information from large data sets that can then be used for making decisions. Knowledge Discovery and Data Mining are used to solve this problem.

Knowledge Discovery and Data Mining are two distinct terms. Knowledge Discovery is the overall process used for extracting useful information from a database, and Data Mining is one of the steps involved in this process: the algorithms used for extracting patterns from the data.

Knowledge Discovery involves various steps such as Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, Pattern Evaluation, and Knowledge Presentation.

Knowledge Discovery is a process focused on deriving useful information from the database, interpreting how the data is stored, implementing the optimum algorithms and visualizing the results.

This process places more importance on finding understandable patterns in the data that can then be used to extract useful information.

Data Mining involves the extraction of patterns and the fitting of a model. The idea behind fitting the model is to determine what type of information can be inferred from processing the model.

It works on three aspects: model representation, model estimation, and search. Some of the common Data Mining techniques are Classification, Regression, and Clustering.



Log Mining


After performing log analysis, the next step is log mining. Log Mining is a technique that uses Data Mining for the analysis of logs.

With the introduction of Data Mining techniques for log analysis, the quality of the analysis of log data increases.

In this way, the analytics approach moves towards software-driven, automated analytic systems.

But there are a few challenges in performing log analysis using data mining:
  • The volume of log data is increasing day by day, from megabytes to gigabytes or even petabytes, so advanced tools are needed for log analysis.
  • Essential information is often missing from the log data, so more effort is needed to extract useful data.
  • Logs from a number of different sources have to be analyzed together to gain deeper knowledge, which means handling logs in different formats.
  • Different logs create redundant data without any common identification, which leads to problems synchronizing the sources of log data.



As shown in the figure, the process of log mining consists of three phases. First, log data is collected from various sources such as Syslog, message logs, etc. After collection, the log data is aggregated together using a log collector, and then the second phase starts.

In this phase, data cleaning is performed by removing irrelevant or corrupted data that could affect the accuracy of the process. After cleaning, the log data is represented in a structured (integrated) form so that queries can be executed on it.

After that, a transformation process converts the data into the format required for normalization and pattern analysis. Useful patterns are obtained by performing Pattern Analysis.

Various data mining techniques, such as association rules and clustering, are used to grasp the useful information in these patterns.

This information is used by the organization for decision-making and for alerting on unusual behavior in the patterns.
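
As one possible sketch of the pattern-analysis phase (clustering is just one of the mining techniques mentioned above, and the sample messages are made up), similar log messages can be grouped with scikit-learn:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned, aggregated log messages (hypothetical sample)
messages = [
    "connection timeout to db host",
    "connection timeout to cache host",
    "user login successful",
    "user login successful from new device",
]

features = TfidfVectorizer().fit_transform(messages)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Messages that fall into the same cluster form one recurring pattern
for message, cluster in zip(messages, clusters):
    print(cluster, message)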

Define Anomaly


An anomaly is unusual behavior or an unusual pattern in the data. It indicates the presence of an error in the system: the actual result differs from the expected result, so the applied model does not fit the given assumptions.


The anomaly is further divided into three categories, described below:

  • Point Anomalies

    A single data point is considered an anomaly when it lies far away from the rest of the data.
  • Contextual Anomalies

    This type of anomaly is behavior that is abnormal only within a particular context in the data. It is commonly observed in time-series problems.
  • Collective Anomalies

    When a collection of related data instances is anomalous as a group, even though the individual points may look normal, it is considered a collective anomaly.

The system produces logs which contain information about the state of the system. By analyzing this log data, anomalies can be detected so that the security of the system can be protected.

This can be performed using Data Mining techniques, since dynamic rules need to be applied alongside the data mining approach.
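
For point anomalies, a minimal sketch with scikit-learn's IsolationForest (assuming a simple numeric feature, such as requests per minute, has already been extracted from the logs) looks like this:

from sklearn.ensemble import IsolationForest

# Hypothetical metric extracted from log data: requests handled per minute
requests_per_minute = [[120], [118], [125], [122], [119], [121], [980]]

detector = IsolationForest(contamination=0.15, random_state=0)
labels = detector.fit_predict(requests_per_minute)   # -1 = anomaly, 1 = normal

for value, label in zip(requests_per_minute, labels):
    if label == -1:
        print("Point anomaly detected:", value[0])   # the sudden spike stands out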

Network Intrusion Detection using Data Mining


The use of computers keeps increasing, and with it the probability of cyber crime has also increased.

Therefore, a system known as Network Intrusion Detection has been developed, which enables security in the computer system.

Continue Reading The Full Article At - XenonStack.com/Blog

Wednesday, 26 April 2017


Enabling Real Time Analytics For IoT



What is Fast Data?


A few years ago, it was simply impossible to analyze petabytes of data. Then the emergence of Hadoop made it possible to run analytical queries on huge amounts of historical data.

Big Data has been a buzzword for the last few years, but Modern Data Pipelines constantly receive data at a high ingestion rate. This constant flow of data at high velocity is termed Fast Data.

So Fast Data is not just about the volume of data, as in Data Warehouses where data is measured in gigabytes, terabytes or petabytes.

Instead, we measure volume with respect to its incoming rate: MB per second, GB per hour, TB per day. Both volume and velocity are considered when talking about Fast Data.

What is Streaming and Real-Time Data


Nowadays, there are a lot of data processing platforms available to process data from our ingestion platforms. Some support streaming of data, and others support true streaming of data, which is also called Real-Time data.

Streaming means we are able to process the data the instant it arrives, processing and analyzing it at ingestion time; however, in streaming we can accept some amount of delay between the ingestion layer and processing.

Real-Time data, on the other hand, has tight deadlines in terms of time. We normally consider that if our platform is able to capture any event within 1 ms, we call it real-time data or true streaming.

When we talk about taking business decisions, detecting fraud, analyzing real-time logs and predicting errors in real time, all of these scenarios call for streaming. So data processed instantly as it arrives is termed Real-Time data.
 

Stream & Real Time Processing Frameworks


In the market there are a lot of open-source technologies available, like Apache Kafka, with which we can ingest data at millions of messages per second. Analyzing constant streams of data is also made possible by Apache Spark Streaming, Apache Flink and Apache Storm.
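
As a small sketch of ingestion with Kafka (assuming the kafka-python client is installed and a broker is running on localhost:9092; the topic name and event fields are made up), a producer can push a stream of events like this:

from kafka import KafkaProducer
import json
import time

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Push a constant stream of sensor-style events into the ingestion layer
for reading in range(5):
    producer.send("iot-events", {"sensor": "s1", "value": reading, "ts": time.time()})

producer.flush()   # make sure buffered messages reach the broker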


Spark Streaming



















Apache Spark Streaming is the tool in which we specify a time-based window to stream data from our message queue, so it does not process every message individually.

We can call it processing of real streams in micro-batches, whereas Apache Storm and Flink have the ability to stream data in true real time.
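
A minimal Spark Streaming sketch in PySpark (assuming a Spark installation and something writing text to localhost:9999) shows the micro-batch idea: results are produced once per batch window, not per message:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchWordCount")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batch window

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # one output per micro-batch, not per incoming message

ssc.start()
ssc.awaitTermination()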

Why Real-Time Streaming


As we know, Hadoop, S3 and other distributed file systems support data processing in huge volumes, and we are able to query them using frameworks like Hive, which uses MapReduce as its execution engine.

Why Do We Need Real-Time Streaming?


A lot of organizations are trying to collect as much data as they can regarding their products, services or even their organizational activities, for example tracking employee activities through methods like log tracking or taking screenshots at regular intervals.

Data Engineering allows us to convert this data into structured formats, and Data Analysts then turn it into useful results which can help the organization improve customer experience and also boost employee productivity.

But when we talk about log analytics, fraud detection or real-time analytics, this is not the way we want our data to be processed. The actual value of the data lies in processing it, or acting upon it, the instant it is received.

Imagine we have a data warehouse like Hive holding petabytes of data. It only allows us to analyze our historical data and predict the future from it.

So processing huge volumes of data is not enough. We need to process the data in real time so that the organization can take business decisions immediately whenever an important event occurs. This is required in intelligence and surveillance systems, fraud detection, etc.

Earlier, these constant streams of data arriving at a high ingestion rate were handled by first storing the data and then running analytics on it.

But organizations are now looking for platforms where they can see business insights in real time and act upon them in real time.

Alerting platforms are also built on top of these real-time streams, and the effectiveness of these platforms lies in how truly we are processing the data in real time.

Use Of Reactive Programming & Functional Programming


Now, when we are thinking of building our alerting platforms, anomaly detection engines, etc. on top of our real-time data, it is very important to consider the style of programming we are following.

Nowadays, Reactive Programming and Functional Programming are booming.

We can think of Reactive Programming as a subscriber and publisher pattern. On almost every website we see a box where we can subscribe to a newsletter, and whenever the publisher posts the newsletter, everyone who has a subscription gets it via email or some other channel.

The difference between Reactive and Traditional Programming is that the data is made available to the subscriber as soon as it is received, and this is made possible by the Reactive Programming model.

In Reactive Programming, certain components (classes) register themselves for an event. So instead of the event generator invoking each target component, all registered targets are triggered automatically whenever the event occurs.

Now, when we are processing data at a high rate, concurrency is the point of concern, and the performance of our analytics job depends heavily on memory allocation and deallocation. In Functional Programming, we don't need to initialize loops and iterators on our own.

We use Functional Programming styles to iterate over the data, in which the runtime takes care of allocation and deallocation and makes the best use of memory, which results in better concurrency and parallelism.
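
A toy Python illustration of both ideas (the event bus and the readings below are made up, not a real reactive framework):

# Reactive style: a tiny publish/subscribe event bus
subscribers = {}

def subscribe(event, handler):
    subscribers.setdefault(event, []).append(handler)

def publish(event, payload):
    for handler in subscribers.get(event, []):   # every registered target fires automatically
        handler(payload)

subscribe("newsletter", lambda issue: print("Email sent:", issue))
publish("newsletter", "May 2017 edition")

# Functional style: describe the transformation, let the runtime drive the iteration
readings = [3, 18, 7, 42, 11]
alerts = list(map(lambda value: value * 2, filter(lambda value: value > 10, readings)))
print(alerts)   # [36, 84, 22]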

Streaming Architecture Matters


While streaming and analyzing real-time data, there are chances that some messages will be missed; in short, the problem is how we can handle data errors.

There are two types of architecture used while building real-time pipelines.
  • Lambda Architecture:

    This architecture was introduced by Nathan Marz, and it uses three layers to provide real-time streaming and compensate for any data errors that occur: the Batch Layer, the Speed Layer, and the Serving Layer.

Continue Reading the full Article At - XenonStack.com/Blog

Wednesday, 22 March 2017


How To Deploy PostgreSQL on Kubernetes


What is PostgreSQL?


PostgreSQL is a powerful, open source Relational Database Management System.

PostgreSQL is not controlled by any organization or any individual. Its source code is available free of charge. It is pronounced as "post-gress-Q-L".

PostgreSQL has earned a strong reputation for its reliability, data integrity, and correctness.
  • It runs on all major operating systems, including Linux, UNIX (AIX, BSD, HP-UX, SGI IRIX, MacOS, Solaris, Tru64), and Windows.
  • It is fully ACID compliant, has full support for foreign keys, joins, views, triggers, and stored procedures (in multiple languages)
  • It includes most SQL:2008 data types, including INTEGER, NUMERIC, BOOLEAN, CHAR, VARCHAR, DATE, INTERVAL, and TIMESTAMP.
  • It also supports storage of binary large objects, including pictures, sounds, or video.
  • It has native programming interfaces for C/C++, Java, .Net, Perl, Python, Ruby, Tcl, ODBC, among others, and exceptional documentation.


Prerequisites


To follow this guide you need -


Step 1 - Create a PostgreSQL Container Image

Create a file named “Dockerfile” for PostgreSQL. This image contains our custom configuration; the Dockerfile will look like this -

FROM ubuntu:latest
MAINTAINER XenonStack

RUN apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys B97B0AFCAA1A47F044F244A07FCC7D46ACCC4CF8

RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main" > /etc/apt/sources.list.d/pgdg.list

RUN apt-get update && apt-get install -y python-software-properties software-properties-common postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6

RUN /etc/init.d/postgresql start &&\
 psql --command "CREATE USER root WITH SUPERUSER PASSWORD 'xenonstack';" &&\
 createdb -O root xenonstack

RUN echo "host all  all 0.0.0.0/0  md5" >> /etc/postgresql/9.6/main/pg_hba.conf

RUN echo "listen_addresses='*'" >> /etc/postgresql/9.6/main/postgresql.conf

# Expose the PostgreSQL port
EXPOSE 5432

# Add VOLUMEs to allow backup of databases
VOLUME  ["/var/lib/postgresql"]

# Set the default command to run when starting the container
CMD ["/usr/lib/postgresql/9.6/bin/postgres", "-D", "/var/lib/postgresql", "-c", "config_file=/etc/postgresql/9.6/main/postgresql.conf"]

This Postgres image uses Ubuntu as its base image. On top of that, we create a superuser and a default database. Exposing port 5432 lets external systems connect to the PostgreSQL server.

Step 2 - Build PostgreSQL Docker Image


$ docker build -t dr.xenonstack.com:5050/postgres:v9.6 .

Step 3 - Create a Storage Volume (Using GlusterFS)

Using the below-mentioned commands, create a volume in GlusterFS for PostgreSQL and start it.

As we don't want to lose our PostgreSQL database data just because a Gluster server in the cluster dies, we use a replica count of 2 or more for higher availability of the data.


$ gluster volume create postgres-disk replica 2 transport tcp k8-master:/mnt/brick1/postgres-disk  k8-1:/mnt/brick1/postgres-disk
$ gluster volume start postgres-disk
$ gluster volume info postgres-disk





Step 4 - Deploy PostgreSQL on Kubernetes

Deploying PostgreSQL on Kubernetes has the following prerequisites -
  • Docker Image: We have created a Docker Image for Postgres in Step 2
  • Persistent Shared Storage Volume: We have created a Persistent Shared Storage Volume in Step 3
  • Deployment & Service Files: Next, we will create Deployment & Service Files

Create a file named “deployment.yml” for PostgreSQL. The deployment file will look like this -

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: postgres
  namespace: production
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: postgres
    spec:
      containers:
      - name: postgres
        image: dr.xenonstack.com:5050/postgres:v9.6
        imagePullPolicy: "IfNotPresent"
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_USER
          value: postgres
        - name: POSTGRES_PASSWORD
          value: superpostgres
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: postgredb
      volumes:
      - name: postgredb
        glusterfs:
          endpoints: glusterfs-cluster
          path: postgres-disk
          readOnly: false
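
Once the Deployment is running and exposed through a Kubernetes Service (the Service manifest itself is not shown here), an application can connect to it. A minimal Python sketch with psycopg2, assuming the Service is named "postgres" and using the superuser created in the Dockerfile above, would be:

import psycopg2

# Assumptions: a Service called "postgres" points at this Deployment inside the
# cluster, and the credentials match those created in the Dockerfile.
connection = psycopg2.connect(
    host="postgres",
    port=5432,
    dbname="xenonstack",
    user="root",
    password="xenonstack",
)

with connection.cursor() as cursor:
    cursor.execute("SELECT version();")
    print(cursor.fetchone())

connection.close()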

Continue Reading The Full Article At - XenonStack.com/Blog

Tuesday, 14 March 2017


Why We Need Modern Big Data Integration Platform




Data is everywhere, and we are generating data from different sources like Social Media, Sensors, APIs, and Databases.

Healthcare, Insurance, Finance, Banking, Energy, Telecom, Manufacturing, Retail, IoT, and M2M are the leading domains for data generation. Governments are using BigData to improve their efficiency and the distribution of services to the people.

The biggest challenge for enterprises is to create business value from the data coming from existing systems and from new sources. Enterprises are looking for a Modern Data Integration platform for Aggregation, Migration, Broadcast, Correlation, Data Management, and Security.

Traditional ETL is going through a paradigm shift towards business agility, and the need for a Modern Data Integration Platform is arising. Enterprises need Modern Data Integration for agility and for end-to-end operations and decision-making, which involves integrating data from different sources, processing it in batch, streaming and real time, along with BigData Management, BigData Governance, and Security.


BigData Type Includes:
  • What type of data it is
  • Format of content of data required
  • Whether data is transactional data, historical data or master data
    • The Speed or Frequency at which the data is made available
  • How to process the data i.e. whether in real time or in batch mode


5 V’s to Define BigData




Additional 5V’s to Define BigData



Data Ingestion and Data Transformation


Data Ingestion comprises integrating structured or unstructured data from where it originated into a system where it can be stored and analyzed for making business decisions. Data Ingestion may be continuous or asynchronous, real-time or batched, or both.

Defining the BigData Characteristics: The different BigData types help us define the BigData characteristics, i.e. how the BigData is collected, processed and analyzed, and how we deploy that data on-premises or in a public or hybrid cloud.

  • Data type: Type of data
    • Transactional
    • Historical
    • Master Data and others

  • Data Content Format: Format of data
    • Structured (RDBMS)
    • Unstructured (audio, video, and images)
    • Semi-Structured

  • Data Sizes: Data size, like Small, Medium, Large and Extra Large, which means we can receive data with sizes ranging from bytes and KBs to MBs or even GBs.

  • Data Throughput and Latency: How much data is expected and at what frequency does it arrive. Data throughput and latency depend on data sources:
    • On demand, as with Social Media Data
    • Continuous feed, Real-Time (Weather Data, Transactional Data)
    • Time series (Time-Based Data)

  • Processing Methodology: The type of technique to be applied for processing data (e.g. Predictive Analytics, Ad-Hoc Query and Reporting).

  • Data Sources: Data generated Sources
    • The Web and Social Media
    • Machine-Generated
    • Human-Generated etc

  • Data Consumers: A list of all possible consumers of the processed data:
    • Business processes
    • Business users
    • Enterprise applications
    • Individual people in various business roles
    • Part of the process flows
    • Other data repositories or enterprise applications


Major Industries Impacted with BigData




What is Data Integration?


Data Integration is the process of Data Ingestion - integrating data from different sources, i.e. RDBMS, Social Media, Sensors, M2M, etc., and then using Data Mapping, Schema Definition and Data Transformation to build a data platform for analytics and further reporting. You need to deliver the right data in the right format at the right time.

BigData integration provides a unified view of data for Business Agility and Decision Making and it involves:

  • Discovering the Data
  • Profiling the Data
  • Understanding the Data
  • Improving the Data
  • Transforming the Data

A Data Integration project usually involves the following steps (a small sketch follows this list):

  • Ingest Data from different sources, where the data resides in multiple formats.
  • Transform Data, which means converting the data into a single format so that one can easily manage the problem with unified data records. The Data Pipeline is the main component used for Integration or Transformation.
  • MetaData Management: centralized data collection.
  • Store the Transformed Data so that analysts can get exactly what the business needs, when it needs it, whether in batch or real time.
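
As a small sketch of these steps (the file names and field names below are hypothetical), two sources in different formats can be ingested, transformed into one schema and stored:

import csv
import json

# Ingest: two hypothetical sources in different formats
with open("customers.csv") as csv_file:
    csv_records = list(csv.DictReader(csv_file))

with open("orders.json") as json_file:
    json_records = json.load(json_file)

# Transform: map both sources onto one unified schema
unified = [
    {"source": "crm", "id": row["customer_id"], "name": row["name"]}
    for row in csv_records
] + [
    {"source": "orders", "id": str(record["customer_id"]), "name": record.get("customer_name", "")}
    for record in json_records
]

# Store: write the transformed records so analysts can query a single format
with open("unified_records.json", "w") as out_file:
    json.dump(unified, out_file, indent=2)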


Why Data Integration is required


  • Make Data Records Centralized: Data is stored in different formats - tabular, graphical, hierarchical, structured, unstructured. To make a business decision, a user has to go through all these formats before reaching a conclusion, so a single view combining the different formats helps in better decision making.
  • Format Selecting Freedom: Every user has a different way or style of solving a problem, and users are flexible to use the data in whatever system and whatever format suits them best.
  • Reduce Data Complexity: When data resides in different formats, complexity increases with data size, which degrades decision-making capability and means much more time is consumed in understanding how to proceed with the data.
  • Prioritize the Data: With a single view of all the data records, it is easy to find out what is very useful and what is not required for the business.
  • Better Understanding of Information: A single view of the data also helps non-technical users understand how effectively the data records can be utilized. While solving any problem, one can win the game only if a non-technical person is able to understand what is being said.
  • Keeping Information Up to Date: Data keeps increasing on a daily basis, and new things constantly need to be added to the existing data, so Data Integration makes it easy to keep the information up to date.

Continue Reading The Full Article At - XenonStack.com/Blog

Thursday, 9 March 2017


Healthcare is Drowning in Data, Thirst For Knowledge



The Amount of Data in Healthcare is increasing at an astonishing rate. However, in general, the industry has not deployed the level of data management and analysis necessary to make use of those data.

As a result, healthcare executives face the risk of being overwhelmed by a flood of unusable data.

Consider the many sources of data. Current medical technology makes it possible to scan a single organ in 1 second and complete a full-body scan in roughly 60 seconds. The result is nearly 10 GB of raw image data delivered to a hospital’s Picture Archive and Communications System (PACS).

Clinical areas in their digital infancy, such as pathology, proteomics, and genomics, which are the key to personalized medicine, can generate over 2TB of data per patient.

Add to that the research and development of advanced medical compounds and devices, which generate terabytes over their lengthy development, testing and approval processes.


 

Doctors Are Drowning In Data


Technology isn't enough to improve healthcare. Doctors must be able to distinguish between valuable data and information overload.

One of the hopes of Electronic Health Records (EHRs) is that they will revolutionize medicine by collecting information that can be used to improve how we provide care. Getting good data from EHRs can occur if good data is input.

This doesn't always happen. To see patients, document encounters, enter smoking status, create coded problem lists, update medication lists, e-prescribe medications, order tests, find, open, and review multiple prior notes, schedule follow-up appointments, search for SNOMED codes, search for ICD-9 codes, and find CPT codes to bill encounters (tasks previously delegated to a number of people) and compassionately interact with patients, providers have to take shortcuts.

But we have to say that healthcare is drowning in data elements that are not yet interoperable on one platform.

First, the Data Exchange and Interoperability between EMRs, HIEs, Hospitals, Nursing Homes, Home care, ERs, portals, etc., must be addressed and industry standards need to emerge on the technology, but also the costs need to be defined. Who is going to pay for what and when?

It seems like the deepest pockets in the industry - pharmaceuticals and insurance - have hardly put a dime into technology solutions or Big Data. Yet they have the most to gain. This is a huge disconnect, because physicians and hospitals cannot afford to capitalize this start-up by themselves.

I believe that they will need to be influenced to contribute to this effort, in kind or with cash, for this system to be made whole and meaningful.

HIT industry leaders need to sit down with busy clinicians to create a workflow of automated Big Data in a way that provides all the stakeholders with the data to improve all levels of efficiencies and outcomes.











Decisions Through Data - Small Data, Predictive Modeling expansion, and Real-Time Analytics are three forms of data analytics.

Healthcare data will continue to accumulate rapidly. If practices, hospitals, and healthcare systems do not actively respond to the flood of unstructured data, they risk forgoing the opportunity to use these data in managing their operations.

Small data and Real-Time Analytics are two methods of data analytics that allow practices, hospitals, and healthcare organizations to extract meaningful information.

Predictive Modeling is best suited for organizations managing large patient populations. With all three methods, the applicable information mined from raw data supports improvements in the quality of care and cost efficiency.

The use of Small Data, Real-Time Analytics, and Predictive Modeling will revolutionize the healthcare field by increasing those opportunities beyond reacting to emerging problems.





 

About RayCare:

RayCare is an integrated healthcare platform, ranging from connecting doctors, labs, medicines and dieticians and getting healthy-life tips, to creating health profiles, medical reports and daily health tracking, through to predictive diagnostic analytics and second-opinion consultation & recommendations. Know More!

For More, Visit - XenonStack.com/Blog