XenonStack

A Stack Innovator


Tuesday, 24 December 2019


Cloud Data Migration from On-Premises to Cloud 


Best Practices of Hadoop Infrastructure Migration

Migration involves moving data and business applications from an organization's on-premises infrastructure to the cloud for -
  • Disaster recovery
  • Creating backups
  • Storing large volumes of data
  • High security
  • Reliability
  • Fault tolerance

Challenges of Building On-Premises Hadoop Infrastructure

  • Limitation of tools
  • Latency issues
  • Architecture modifications required before migration
  • Lack of skilled professionals
  • Integration complexity
  • Cost
  • Loss of transparency

Service Offerings for Building Data Pipeline and Migration Platform

Understand the requirements, covering data sources, data pipelines, and related services, for migrating the platform from on-premises to Google Cloud Platform.
  • Data Collection Services on Google Compute Engine. Migrate all data collection services, the REST API, and other background services to Google Compute Engine (VMs).
  • Update the data collection jobs to write data to Google Cloud Storage buckets. The data collection jobs, developed in Node.js, currently write data to Ceph Object Storage, which serves as the Data Lake. Update the existing code to write to Google Cloud Storage buckets so that they become the new Data Lake.
  • Use Apache Airflow to build the data pipelines and build the Data Warehouse using Hive and Spark. Develop a set of Spark jobs that run every 3 hours, check for new files in the Data Lake (Google Cloud Storage buckets), run the transformations, and store the data in the Hive Data Warehouse (see the DAG sketch after this list).
  • Migrate the Airflow data pipelines to Google Compute Engine, and run Hive on HDFS using a Cloud Dataproc cluster for Spark and Hadoop.
  • Migrate the REST API, which serves prediction results to dashboards and acts as the data access layer for data scientists, to Google Compute Engine instances (VMs).
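
A minimal sketch of what such an Airflow DAG could look like, assuming Airflow 1.10 with the Spark submit operator; the DAG id, bucket path, and job script are hypothetical (in the migrated setup, the equivalent Dataproc operator could be swapped in).

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Runs every 3 hours: picks up new files from the Data Lake bucket and
# writes the transformed records into the Hive Data Warehouse.
default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=10)}

dag = DAG(
    dag_id="datalake_to_hive",          # hypothetical DAG id
    default_args=default_args,
    schedule_interval="0 */3 * * *",    # every 3 hours
    start_date=datetime(2019, 12, 1),
    catchup=False,
)

transform = SparkSubmitOperator(
    task_id="transform_new_files",
    application="gs://example-bucket/jobs/transform_to_hive.py",  # hypothetical Spark job
    conn_id="spark_default",
    dag=dag,
)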

Technology Stack -

  • Node.js-based data collection services (on Google Compute Engine)
  • Google Cloud Storage as the Data Lake (storing raw data coming from the data collection services)
  • Apache Airflow (Configuration & Scheduling of Data Pipeline which runs Spark Transformation Jobs)
  • Apache Spark on Cloud DataProc (Transforming Raw Data to Structured Data)
  • Hive Data Warehouse on Cloud DataProc
  • Play Framework in Scala (REST API)
  • Python-based SDKs

Monday, 23 December 2019


AWS Big Data Pipeline on Cloud and On-Premises

What are Data Pipelines?

A data pipeline is a general term for the movement of data from one location to another. The location where the flow of data starts is known as the data source, and the destination is called the data sink.
What makes ETL so important is that in the modern age there are numerous data sources as well as data sinks.
The data sources can be data stored in locations such as a database, data files, or a data warehouse. Such pipelines are called batch data pipelines, as the data is already defined and is transferred in discrete batches.
Other data sources, such as log files or streaming data from games or real-time applications, are not well defined and may vary in structure as well. Such pipelines are called streaming data pipelines. Streaming data requires a special kind of solution, as we have to account for late data records caused by network latency or inconsistent data velocity.
We may also want to perform operations or transformations on the data while it moves from the data source to the data sink. Such data pipelines have been given special names -
ETL — Extract Transform Load
ELT — Extract Load Transform
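
To make the batch case concrete, here is a minimal sketch of an extract-transform-load script in Python; the file, table, and column names are hypothetical.

import csv
import sqlite3

def extract(path):
    # Extract: read rows from a CSV data source.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: cast the amount column and drop records without one.
    for row in rows:
        if row.get("amount"):
            row["amount"] = float(row["amount"])
            yield row

def load(rows, db_path="sink.db"):
    # Load: write the cleaned rows into a SQLite data sink.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (id, amount) VALUES (?, ?)",
        [(r["id"], r["amount"]) for r in rows],
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))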

Batch Data Pipeline Solutions

What is AWS Glue?

AWS Glue is a serverless ETL job service. While using this, we don’t have to worry about setting up and managing the underlying infrastructure for running the ETL job.

How AWS Glue Works?

AWS Glue has three main components -

Data Catalog

The Glue Data Catalog contains references to the data stores that are used as data sources and data sinks in the extract, transform, load (ETL) jobs that we run via AWS Glue. When we define a catalog, we need to run a crawler, which in turn runs classifiers and infers the schema of the data source and data sink. Glue provides built-in classifiers for data stores and formats such as databases, CSV, XML, JSON, etc. We can also add custom classifiers according to our requirements. Crawlers store the inferred schemas as tables in the catalog's metadata store so that they can be reused across jobs.
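
A minimal sketch of defining and running a crawler with boto3; the crawler name, IAM role, database, and S3 path below are hypothetical.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans an S3 prefix and writes the inferred
# table definitions into a Data Catalog database.
glue.create_crawler(
    Name="raw-orders-crawler",                       # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # hypothetical IAM role
    DatabaseName="raw_data",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
)

# Run the crawler; the discovered schemas appear as tables in raw_data.
glue.start_crawler(Name="raw-orders-crawler")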

ETL Engine

The ETL engine is the heart of AWS Glue. It performs the most critical task of generating and running the ETL job.
In the job generation part, the ETL engine provides a convenient GUI in which we can select any of the data stores in the Data Catalog and define the source and sink of the ETL job. Once the source and sink are selected, we choose the transformations we need to apply to the data; Glue provides some built-in transformations as well. After everything is set, the ETL engine generates the corresponding PySpark or Scala code, which we can edit and customize.
Moving on to the job running part, the ETL engine is responsible for running the generated code for us. It manages all the infrastructure (launching the infrastructure, the underlying execution engine for the code, on-demand job runs, and cleaning up after the job run). The default execution engine is Apache Spark.
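
Below is a stripped-down sketch of the kind of PySpark script Glue generates; the database, table, and output path names are hypothetical.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table that a crawler registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_data", table_name="orders"
)

# Built-in transformation: rename and retype columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the result to the data sink (Parquet files on S3).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)
job.commit()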

Glue Scheduler

The Glue scheduler is more or less a cron on steroids. We can schedule jobs periodically, run them on demand, trigger them from external events, or trigger them via AWS Lambda functions.
A typical AWS Glue workflow looks something like this -

Step 1

The first step in getting started with Glue is setting up the Data Catalog. Once the catalog has been set up, we run crawlers to scrape the metadata from the underlying data stores. The metadata is stored as tables in the catalog and is used when running the AWS Glue job.

Step 2

After the Data Catalog has been set up, it is time to create the ETL job. Glue provides an interactive web GUI (graphical user interface) in which we can create the job: we select the source, the destination, and the transformations we want to apply to the data. AWS Glue provides some useful built-in transformations. Glue automatically generates the code for the ETL job, in PySpark or Scala based on our choice, according to the selected source, sink, and transformations. We are also free to edit the script and add our own custom transformations.

Step 3

This is the last step. Now that everything is in place, it is time to run the ETL job. AWS Glue provides a job scheduler with which we can define when to run the job and the triggers upon which it will be started. The Glue scheduler is a flexible and mature scheduling service.
Under the hood, Glue runs the jobs on AWS EMR (Elastic MapReduce) and picks resources from a pool of warm resources so that there is no startup delay when running the jobs. AWS Glue charges only for the resources consumed while the ETL jobs are running.
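
A minimal sketch of scheduling and running a Glue job with boto3; the trigger and job names are hypothetical.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Schedule a hypothetical job named "orders-etl" to run every 3 hours.
glue.create_trigger(
    Name="orders-etl-every-3h",
    Type="SCHEDULED",
    Schedule="cron(0 */3 * * ? *)",   # Glue uses cron-style schedule expressions
    Actions=[{"JobName": "orders-etl"}],
    StartOnCreation=True,
)

# Or start the job on demand.
glue.start_job_run(JobName="orders-etl")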

Why Adopting AWS Glue Matters?

  • We can reliably schedule data pipelines to run periodically.
  • Trigger-based pipeline runs.
  • All the AWS features, such as IAM and service roles, are supported.
  • Only minimal coding experience is required to get started with building data pipelines.
Data stores supported by AWS Glue include Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, and JDBC-accessible databases.

AWS Data Pipeline

What is the AWS Data Pipeline?

AWS Data Pipeline helps you sequence, schedule, run, and manage recurring data processing workloads reliably and cost-effectively. This service makes it easy for you to design extract-transform-load (ETL) activities using structured and unstructured data, both on-premises and in the cloud, based on your business logic.

How the Data Pipeline Works?

The main components of AWS Data Pipeline are -
  • Pipeline Definition
  • Task Runner
  • Pipeline Logging

Pipeline Definition

A pipeline can be created in three ways -
  • Graphically, using the AWS console's pipeline Architect UI.
  • Textually, by writing a pipeline definition in JSON format.
  • Programmatically, using the AWS Data Pipeline SDK.
A pipeline can contain the following components (see the sketch after this list) -
  • Data Nodes — The location of the input data for a task or the location where output data is to be stored.
  • Activities — A definition of the work to perform on a schedule, using a computational resource and, typically, input and output data nodes.
  • Preconditions — A conditional statement that must be true before an activity can run.
  • Schedules — Define the timing of a planned event, such as when an activity runs.
  • Resources — The computational resource that performs the work that a pipeline defines.
  • Actions — An action that is triggered when specified conditions are met, such as the failure of an activity.
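
A minimal sketch of creating and activating a pipeline programmatically with boto3; the pipeline name, objects, and S3 path are hypothetical, and the definition is abbreviated (a real one also needs IAM role and resourceRole fields, log locations, and so on).

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline.
pipeline_id = dp.create_pipeline(name="daily-copy", uniqueId="daily-copy-001")["pipelineId"]

# A tiny definition: a schedule, an S3 input data node, an EC2 resource,
# and a shell-command activity that runs on it.
objects = [
    {"id": "Default", "name": "Default",
     "fields": [{"key": "scheduleType", "stringValue": "cron"},
                {"key": "schedule", "refValue": "DailySchedule"}]},
    {"id": "DailySchedule", "name": "DailySchedule",
     "fields": [{"key": "type", "stringValue": "Schedule"},
                {"key": "period", "stringValue": "1 day"},
                {"key": "startDateTime", "stringValue": "2019-12-24T00:00:00"}]},
    {"id": "InputData", "name": "InputData",
     "fields": [{"key": "type", "stringValue": "S3DataNode"},
                {"key": "directoryPath", "stringValue": "s3://example-bucket/input/"}]},
    {"id": "Worker", "name": "Worker",
     "fields": [{"key": "type", "stringValue": "Ec2Resource"},
                {"key": "instanceType", "stringValue": "t2.micro"}]},
    {"id": "CopyJob", "name": "CopyJob",
     "fields": [{"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "echo processing input"},
                {"key": "input", "refValue": "InputData"},
                {"key": "runsOn", "refValue": "Worker"}]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)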

Task Runner

The Task Runner is responsible for actually running the tasks in the pipeline definition. It regularly polls the pipeline for new tasks and executes them on the resources defined; it is also capable of retrying tasks if they fail during execution.

Pipeline Logging

Logging is an essential part of data pipelines, as it provides insight into the internal workings of the pipeline. Pipeline activity is logged to AWS CloudTrail, where the logs can be inspected.
The AWS Data Pipeline service leverages the following storage and compute services -
  • Amazon DynamoDB — Fully managed NoSQL database with fast performance.
  • Amazon RDS — A fully managed relational database that can accommodate large datasets. It offers several database engine options, e.g., Amazon Aurora, PostgreSQL, Microsoft SQL Server, and MariaDB.
  • Amazon Redshift — Fully managed petabyte-scale Data Warehouse.
  • Amazon S3 — Low-cost highly-scalable object storage.

Compute Services

  • Amazon EC2 — Service for scalable virtual servers in AWS data centers; can be used to build various types of software services.
  • Amazon EMR — Service for distributed storage and compute over big data, using frameworks such as Hadoop and Apache Spark.

Why Enabling Data Pipeline Matters?

  • Tight integration with and support for existing AWS services.
  • We can create complex pipelines within a brief period.
  • The pipeline can be monitored with AWS CloudWatch.
  • Supports a wide range of data sources and sinks.

Continue Reading: XenonStack/Blogs

Friday, 20 December 2019


Hyper-Converged Infrastructure Benefits and Tools

What is Hyper-Converged Infrastructure?

Hyper-converged infrastructure is a software-centric architecture that tightly integrates compute, networking, and storage resources into a single system. Hyper-converged systems require at least three hardware nodes for high availability and can be expanded later by adding more nodes as required.

How HCI Architecture Works?

A hyper-converged platform integrates compute, storage, and networking with a software-defined layer that governs the operational aspects of the infrastructure. The traditional choice for orchestration is a hypervisor, which provisions resources such as storage, compute, and network; Kubernetes is also used for provisioning resources.
Kubernetes is an open-source container-orchestration system for automating deployment, scaling, and management of containerized applications. It sits between the application and the infrastructure and lets application workflows dynamically control storage, compute, and networking resources. Software-defined storage such as Rook (Ceph) is used for elastic storage (see the sketch after this paragraph).
Ceph is scalable storage that can grow to an exabyte and is expanded as required by adding more nodes. Kubernetes and Ceph are both open-source solutions that run on commodity hardware, avoiding vendor lock-in. Hardware vendors also provide several software tools, depending on the server chosen.
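
A minimal sketch of how an application requests Ceph-backed storage through Kubernetes, using the official Python client; it assumes Rook is installed and exposes a "rook-ceph-block" StorageClass, and the claim name and size are hypothetical.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

# PersistentVolumeClaim against the Rook/Ceph storage class; the claim is
# satisfied by Ceph running on the cluster's commodity nodes.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="app-data"),    # hypothetical claim name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="rook-ceph-block",         # assumed to be provided by Rook
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)

core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)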

Benefits of Hyper-Converged Infrastructure

  • Better utilization of resources.
  • Reduced maintenance cost.
  • Horizontal Scaling.
  • Reduced data center footprint.
  • Reduces administrator burden.
  • Cost-efficient.
  • Optimize the Health of Private Cloud.
  • Continuous Real-Time Workload Decisions.
  • Right Storage for the Right Workload.
  • Plan Quickly and Scale Easily.
  • Automatable Workload Placement.
  • Software-Defined Storage.
  • Data Protection.
  • Deploys Virtual Desktop Infrastructure.
  • Consolidating Data Center.
  • Remote Management.
  • No Downtime with Software-Centric Approach.
  • Manages Complex Infrastructure.
  • Simplified Vendor Management.
  • Continuous, Portable and Flexible Protection.
  • Enhanced Governance.
  • Ease of Termination.
  • Encryption of Virtual Machines.

Why Hyper-Converged Infrastructure (HCI) Matters?

Hyper-converged infrastructure consumes less space, reducing the data center footprint. Using computation and storage from the same node ensures better utilization of resources. The cost of ownership and maintenance drops because fewer nodes are needed, the administrators' burden shrinks with the reduced node count, and heating and electricity costs decrease since only the required number of nodes are running. It provides horizontal scaling, so more nodes can be added to the cluster as required, which lets users start with a smaller cluster. In recent years there has been increasing adoption of the hyper-converged strategy in data centers and in edge locations such as local offices. It is also easier for small and midsize organizations to adopt hyper-converged infrastructure.
  • Lower Cost
  • Smarter, More Efficient Staff
  • Greater Gains Through Automation
  • Simplified Procurement and Support
  • Increased Data Protection
  • Improve Performance
  • Scalability
  • Flexible
  • Software-Defined Storage
  • Agility
  • Workload Consolidation

Read More: XenonStack/Insights

Wednesday, 18 December 2019


Smart Manufacturing and IoT Solutions

Introduction to Smart Manufacturing

Smart manufacturing is a technology-driven approach that uses Internet-connected machinery to observe the production process. Its purpose is to identify opportunities for automating processes and to use data analytics to enhance manufacturing performance.
  • Machines have threshold values for parameters such as temperature, oil pressure, and amperage that should not be crossed.
  • Implementing the Internet of Things (IoT) enables proactive maintenance breaks on the machines, a key step toward smart manufacturing. Collecting the data in real time is an excellent solution.
  • Smart manufacturing helps industries reduce these risks.

Challenge of Building the IoT Platform

Build a scalable solution for smart manufacturing that manages all the machines from a single central point and handles sizeable real-time streaming data from the sensors, so that alerts can be raised and the motor turned off in minimal time.

Solution Offered: Real-Time Data and IoT Platform

Install sensors to collect each value (temperature, oil pressure, amperage) in real time, using Google Cloud Platform.
Use Google Cloud IoT Core to ingest data from all sensors attached to the different machines. The data collected from the sensors contains the machine identification number to differentiate the received data.
Google Cloud IoT Core publishes the collected data to Google Cloud Pub/Sub. This data stream is routed to multiple destinations: the raw data is stored in BigQuery and also delivered to a Google Cloud Function.
The function detects whether the collected values are higher than the threshold values, with Google Pub/Sub acting as its trigger. Google Cloud Functions is an event-driven serverless compute platform for deploying Functions as a Service that are auto-scalable, highly available, and fault-tolerant.
If the values exceed the threshold, the Cloud Function pushes a configuration change through Google Cloud IoT Core to the device that controls the machine's motor.
Google Cloud IoT Core sends the trigger to the device to turn off the machine for a maintenance break (a sketch of such a function follows).
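
A minimal sketch of such a Pub/Sub-triggered Cloud Function (first-generation Python runtime); the project, registry, device naming, payload fields, and threshold are hypothetical, and the device is assumed to act on the pushed configuration.

import base64
import json

from google.cloud import iot_v1

TEMPERATURE_THRESHOLD = 90.0  # hypothetical threshold, degrees Celsius

def check_thresholds(event, context):
    """Triggered by a Pub/Sub message published via Cloud IoT Core."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    machine_id = payload["machine_id"]
    temperature = float(payload["temperature"])

    if temperature <= TEMPERATURE_THRESHOLD:
        return  # reading is within limits; nothing to do

    # Push a config update to the device that controls the machine's motor.
    client = iot_v1.DeviceManagerClient()
    device_path = client.device_path(
        "example-project", "us-central1", "machines-registry", machine_id
    )
    command = json.dumps({"motor": "off", "reason": "temperature_threshold"})
    client.modify_cloud_to_device_config(
        request={"name": device_path, "binary_data": command.encode("utf-8")}
    )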

Technology Stack

  • Google Pub/Sub
  • Google IoT Core
  • Google Cloud Function
  • Google Cloud DataFlow

Source: 
XenonStack/Use-Cases