
Monday 23 December 2019

AWS Big Data Pipeline on Cloud and On-Premises

What are Data Pipelines?

Data pipeline is a general term for the movement of data from one location to another. The location where the flow of data starts is known as the data source, and the destination is called the data sink.
What makes ETL so important is that in the modern age there are numerous data sources as well as data sinks.
A data source can be data stored in any location, such as a database, data files, or a data warehouse. Pipelines over such sources are called batch data pipelines, because the data is already well defined and we transfer it in discrete batches.
Other data sources, such as log files or streaming data from games and real-time applications, are not well defined and may vary in structure as well. Pipelines over such sources are called streaming data pipelines. Streaming data requires a special kind of solution, as we have to account for late-arriving records caused by network latency or inconsistent data velocity.
We may also want to perform some operations/transformations on the data while it travels from the data source to the data sink. Such data pipelines have been given special names -
ETL — Extract Transform Load
ELT — Extract Load Transform
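
To make the distinction concrete, here is a minimal batch ETL sketch in PySpark. The bucket paths, column names, and aggregation are hypothetical placeholders for whatever your actual source and sink look like.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-etl-sketch").getOrCreate()

# Extract: read a batch of records from the data source (path is a placeholder)
orders = spark.read.csv("s3://my-bucket/raw/orders/", header=True, inferSchema=True)

# Transform: filter and aggregate before loading
daily_totals = (orders
                .filter(F.col("status") == "COMPLETED")
                .groupBy("order_date")
                .agg(F.sum("amount").alias("total_amount")))

# Load: write the result to the data sink
daily_totals.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_totals/")
```

In an ELT pipeline, the raw data would instead be loaded into the sink (for example a data warehouse) first, and the transformation would run there.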

Batch Data Pipeline Solutions

What is AWS Glue?

AWS Glue is a serverless ETL service. When using it, we do not have to worry about setting up and managing the underlying infrastructure that runs the ETL jobs.

How Does AWS Glue Work?

AWS Glue has three main components -

Data Catalog

The Glue Data Catalog contains references to the data stores that are used as data sources and data sinks in the extract, transform, load (ETL) jobs we run via AWS Glue. After we define a catalog, we need to run a crawler, which in turn runs a classifier and infers the schema of the data source and data sink. Glue provides built-in classifiers for formats and stores such as relational databases, CSV, XML, JSON, etc. We can also add custom classifiers according to our requirements. Crawlers store the inferred metadata as tables in the Data Catalog so that it can be reused again and again.
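
As a rough sketch of how this looks in practice, the boto3 snippet below registers and starts a crawler over an S3 prefix; the crawler name, IAM role, database name, and path are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register a crawler that scans an S3 prefix and writes the inferred
# schema into a Data Catalog database (all names and paths are placeholders)
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/orders/"}]},
)

# Run the crawler; Glue's built-in classifiers infer the format and schema
glue.start_crawler(Name="orders-crawler")
```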

ETL Engine

The ETL engine is the heart of AWS Glue. It performs the most critical task of generating and running the ETL job.
In the job-generation part, the ETL engine provides a convenient GUI in which we select data stores from the Data Catalog to define the source and sink of the ETL job. Once the source and sink are selected, we choose the transformations to apply to the data; Glue provides several built-in transformations as well. After everything is set, the ETL engine generates the corresponding PySpark / Scala code, which we can edit and customize as needed.
Moving on to the job-running part, the ETL engine is responsible for executing the generated code for us. It manages all the infrastructure (launching the infrastructure, the underlying execution engine for the code, on-demand job runs, and cleaning up after the job run). The default execution engine is Apache Spark.
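
The generated script typically follows a read-transform-write pattern. The sketch below shows what such a PySpark job might look like; the catalog database, table name, column mappings, and output path are assumptions for illustration.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table that the crawler registered in the Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Apply a built-in transformation: rename/cast columns
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])

# Write the transformed data to the sink as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")

job.commit()
```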

Glue Scheduler

The Glue scheduler is more or less a CRON on steroids. We can schedule jobs to run periodically, run them on demand, fire them from external triggers, or trigger them via AWS Lambda functions.
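
For example, a scheduled trigger for an existing job can be created with boto3 roughly as follows; the trigger name, cron expression, and job name are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Run the (hypothetical) "orders-etl" job every day at 02:00 UTC
glue.create_trigger(
    Name="orders-etl-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "orders-etl"}],
    StartOnCreation=True,
)
```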
A typical AWS Glue workflow looks something like this -

Step 1

The first step in getting started with Glue is setting up the Data Catalog. After the catalog has been set up, we run crawlers against the registered data stores to infer their schemas. The resulting metadata is stored in catalog tables that are used when running the AWS Glue job.
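
Once a crawler has run, the inferred schema can be inspected in the catalog, for instance with a boto3 call like the sketch below (database and table names are hypothetical).

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Look up the table the crawler created and print its inferred columns
table = glue.get_table(DatabaseName="sales_db", Name="raw_orders")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```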

Step 2

After the Data Catalog has been set up, it's time to create the ETL job. Glue provides an interactive web GUI (graphical user interface) for building the job: we select the source, the destination, and the transformations we want to apply to the data. AWS Glue offers some useful built-in transformations. Glue then automatically generates the ETL job code for the selected source, sink, and transformations, in PySpark or Scala based on our choice. We are also free to edit the script and add our own custom transformations.
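
The same step can also be automated. The sketch below assumes the (generated or hand-edited) script has already been uploaded to S3; the job name, role, and script location are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register an ETL job that runs a PySpark script stored in S3
glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
)
```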

Step 3

This is the last step. Now that everything is in place, it is time to start the job. AWS Glue provides a job scheduler with which we can define when the ETL job runs and the triggers that start it. The Glue scheduler is a very flexible and mature scheduling service.
Under the hood, Glue runs jobs on AWS EMR (Elastic MapReduce) and draws on a pool of warm resources so that there is no downtime while running the jobs. AWS Glue only charges for the resources consumed while the ETL jobs are running.
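
A job can also be started on demand and its status polled until completion; the job name below is a placeholder.

```python
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the job and wait for it to reach a terminal state
run_id = glue.start_job_run(JobName="orders-etl")["JobRunId"]
while True:
    run = glue.get_job_run(JobName="orders-etl", RunId=run_id)["JobRun"]
    if run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Job finished with state:", run["JobRunState"])
        break
    time.sleep(30)
```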

Why Does Adopting AWS Glue Matter?

  • We can reliably schedule data pipelines to run periodically.
  • Pipeline runs can be trigger-based.
  • All the usual AWS features are supported, such as IAM, service roles, etc.
  • Minimal coding experience is required to get started with building data pipelines.
Data stores supported by AWS Glue include -
Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, and JDBC-accessible databases.

AWS Data Pipeline

What is the AWS Data Pipeline?

AWS Data Pipeline helps you sequence, schedule, run, and manage recurring data processing workloads reliably and cost-effectively. This service makes it easy for you to design extract-transform-load (ETL) activities using structured and unstructured data, both on-premises and in the cloud, based on your business logic.

How Does AWS Data Pipeline Work?

The main components of AWS Data Pipeline are -
  • Pipeline Definition
  • Task Runner
  • Pipeline Logging

Pipeline Definition

A pipeline can be created in three ways -
  • Graphically, using the AWS console or the AWS Data Pipeline Architect UI.
  • Textually, by writing a pipeline definition file in JSON.
  • Programmatically, using the AWS Data Pipeline SDK.
A pipeline can contain the following components (a minimal definition sketch follows the list) -
  • Data Nodes — The location of a task's input data or the location where its output data is to be stored.
  • Activities — A definition of the work to perform on a schedule, using a computational resource and, typically, input and output data nodes.
  • Preconditions — A conditional statement that must be true before an activity can run.
  • Schedules — Define the timing of a scheduled event, such as when an activity runs.
  • Resources — The computational resources that perform the work that a pipeline defines.
  • Actions — Actions that are triggered when specified conditions are met, such as the failure of an activity.
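
As a rough illustration of the programmatic route, the sketch below creates a pipeline and attaches a minimal definition with boto3. Every name, role, schedule, and path is a placeholder, and a real definition would add data nodes, activities, preconditions, and resources in the same pattern.

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline, then attach a definition built from the
# components described above (all values are placeholders)
pipeline_id = dp.create_pipeline(name="daily-copy", uniqueId="daily-copy-001")["pipelineId"]

definition = [
    {"id": "Default", "name": "Default",
     "fields": [{"key": "scheduleType", "stringValue": "cron"},
                {"key": "schedule", "refValue": "DailySchedule"},
                {"key": "pipelineLogUri", "stringValue": "s3://my-bucket/logs/"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"}]},
    {"id": "DailySchedule", "name": "DailySchedule",
     "fields": [{"key": "type", "stringValue": "Schedule"},
                {"key": "period", "stringValue": "1 day"},
                {"key": "startDateTime", "stringValue": "2019-12-23T00:00:00"}]},
    # ... data nodes, activities, preconditions, and resources follow the same pattern
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=definition)
dp.activate_pipeline(pipelineId=pipeline_id)
```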

Task Runner

It is responsible for actually running the tasks in the pipeline definition file. The task runner regularly polls the pipeline for new tasks and executes them on the defined resources; it can also retry tasks if they fail during execution.
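
Conceptually, a task runner's polling loop can be sketched with the Data Pipeline API as below; the worker group, hostname, and the run_task helper are hypothetical.

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

def run_task(task):
    # Placeholder: a real runner would inspect task["objects"] and execute
    # the corresponding activity (e.g. a copy operation or a shell command)
    print("Running task", task["taskId"])

# Long-poll the service for work assigned to this worker group
while True:
    response = dp.poll_for_task(workerGroup="my-worker-group", hostname="worker-01")
    task = response.get("taskObject")
    if not task:
        continue  # no work returned by the long poll; ask again
    try:
        run_task(task)
        dp.set_task_status(taskId=task["taskId"], taskStatus="FINISHED")
    except Exception as err:
        dp.set_task_status(taskId=task["taskId"], taskStatus="FAILED",
                           errorMessage=str(err))
```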

Pipeline Logging

Logging is an essential part of data pipelines, as it provides insight into the pipeline's internal workings. Pipeline logs can be written to an Amazon S3 location, and the service's API activity is recorded in AWS CloudTrail.
The AWS Data Pipeline service leverages the following storage and compute services -

Storage Services

  • Amazon DynamoDB — Fully managed NoSQL database with fast performance.
  • Amazon RDS — A fully managed relational database service that can accommodate large datasets, with numerous engine options such as Amazon Aurora, PostgreSQL, SQL Server, and MariaDB.
  • Amazon Redshift — Fully managed petabyte-scale Data Warehouse.
  • Amazon S3 — Low-cost highly-scalable object storage.

Compute Services

  • Amazon EC2 — Service providing resizable virtual servers in AWS data centers, which can be used to build various types of software services.
  • Amazon EMR — Service for distributed storage and compute over big data, using frameworks such as Hadoop and Apache Spark.

Why Does Enabling Data Pipeline Matter?

  • Tight integration with and support for existing AWS services.
  • Complex pipelines can be created within a brief period.
  • Pipelines can be monitored with Amazon CloudWatch.
  • Supports a wide range of data sources and sinks.
