ADF: On-Premise Data Loading & Transformation

Prabhat Rawat
9 min read · Jan 14, 2022


Overview :

This blog covers the implementation of an ETL process that uses an on-premise server (or any firewall-secured private system) as the source and stores the data in Azure SQL after applying the required transformations through ADF Data Flows. It walks through the installation of the Self-hosted Integration Runtime, the connections, dataflows, and pipelines for the defined use case.

The challenge we are working on is to grab file data from an on-premise system (a system secured behind a firewall) and load it into an Azure SQL table as the final destination, using ADLS Gen2 storage as the staging area. Before loading the data into Azure SQL we also need to apply a few simple transformations through Mapping Data Flows.

Let’s begin…

Design we are using

Azure Data Factory :

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.

Azure Data Factory does not store any data itself. It allows you to create data-driven workflows to orchestrate the movement of data between supported data stores and to process data using compute services in other regions or in an on-premise environment. It also allows you to monitor and manage workflows using both programmatic and UI mechanisms.

Getting Started with Self-Hosted Integration Runtimes :

A self-hosted integration runtime is used to perform data activities between a cloud data store and a data store in a private network (such as an on-premise system). We treat our data source as an on-premise data source behind a firewall and use the self-hosted integration runtime to connect the Azure service to it. Installing a self-hosted integration runtime requires an on-premise machine or a virtual machine inside the private network.
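
If you prefer to script this part, the runtime resource itself can also be created with the azure-mgmt-datafactory Python SDK. The snippet below is only a minimal sketch under stated assumptions: it assumes that package plus azure-identity are installed, that a data factory already exists, and that the subscription, resource group, factory, and runtime names shown are placeholders you replace with your own. The portal steps in the next section achieve the same result.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder values
RESOURCE_GROUP = "demo-rg"
FACTORY_NAME = "demo-adf"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Register the self-hosted integration runtime definition inside the factory.
# The runtime node itself still has to be installed on the on-premise machine,
# which is exactly what the next section walks through.
ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="IR for the on-premise file share")
)
adf.integration_runtimes.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "OnPremIR", ir)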

Installation & Setup :

Launching a Self-hosted Integration Runtime is quite simple; we just have to perform the steps mentioned below :

1. First, we move to the Connections tab, open the Integration runtimes section, and choose +New to create a new self-hosted runtime.

Azure Data Factory Portal

2. We then download the created self-hosted runtime by choosing the manual setup option, which provides the authentication keys required during registration, as shown below (a scripted way to fetch these keys is sketched after these steps).

Note : We can also use the Express Setup, but it sometimes fails midway, so the manual setup is the better fit.

Self-Hosted IR Configuration

3. After downloading the MSI, we simply follow the usual installation procedure (Next… Next… Next… then OK). At last, we just need to enter the authentication key to make the connection active, as shown below.

Registering Self-Hosted IR

4. After this we are ready to use our self-hosted IR to perform data activities with our on-premise system as the data source.
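
The authentication keys shown in the portal can also be fetched programmatically, which is handy if the node registration is scripted. Below is a small sketch under the same assumptions and placeholder names as before; it relies on the SDK’s list_auth_keys operation, so double-check the call against the SDK version you have installed.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder values
RESOURCE_GROUP = "demo-rg"
FACTORY_NAME = "demo-adf"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Fetch the two authentication keys of the runtime created earlier.
# Either key can be pasted into the installed runtime to register the node.
keys = adf.integration_runtimes.list_auth_keys(RESOURCE_GROUP, FACTORY_NAME, "OnPremIR")
print(keys.auth_key1)
print(keys.auth_key2)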

Next, we will move to our linked services.

Setting up the Linked Services :

Linked services define the connection information that Data Factory needs to connect to external resources. A linked service defines the connection to the data store, while a dataset represents the structure of the data inside it. A linked service basically acts as the key and address for source and sink activities.

1. To create a new linked service, we choose +New in the Connections tab and then select the linked service type according to our activity requirement.

Adding Linked Service

2. Then we create the linked service by selecting the required integration runtime and filling in the credentials (don’t forget to test the connection).

Configuring Linked Service

Using the above steps we can create any number of linked services, varying according to our sources and sinks.
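
For reference, the same linked services can also be defined in code. The sketch below uses the azure-mgmt-datafactory Python SDK with the placeholder names from the earlier snippets and dummy credentials; the exact model names and accepted credential types can differ slightly between SDK versions, so treat it as an outline rather than a drop-in script.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    FileServerLinkedService,
    AzureBlobStorageLinkedService,
    AzureSqlDatabaseLinkedService,
    IntegrationRuntimeReference,
    SecureString,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, FACTORY = "demo-rg", "demo-adf"   # placeholder values

# On-premise file share, reached through the self-hosted IR.
file_ls = LinkedServiceResource(properties=FileServerLinkedService(
    host="\\\\ONPREM-MACHINE\\data",
    user_id="DOMAIN\\user",
    password=SecureString(value="<password>"),
    connect_via=IntegrationRuntimeReference(
        type="IntegrationRuntimeReference", reference_name="OnPremIR"),
))
adf.linked_services.create_or_update(RG, FACTORY, "OnPremFileSystem", file_ls)

# Blob storage used as the staging area.
blob_ls = LinkedServiceResource(properties=AzureBlobStorageLinkedService(
    connection_string="<blob-connection-string>"))
adf.linked_services.create_or_update(RG, FACTORY, "StagingBlob", blob_ls)

# Azure SQL as the final destination.
sql_ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
    connection_string=SecureString(value="<sql-connection-string>")))
adf.linked_services.create_or_update(RG, FACTORY, "AzureSqlDb", sql_ls)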

Next, we will go for creating datasets.

Creating Datasets :

Datasets basically identify data within the linked data stores, such as SQL tables, files, folders, and documents. Before you create a dataset, create a linked service to link your data store to the data factory. An activity can take zero or more input datasets, and produce one or more output datasets. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output for the activity.

Procedure :

1. Inside Factory Resources we can see Datasets, where we select New dataset and choose the same service type as the linked service we want to connect our dataset to.

Dataset Configuration

2. Next we select the dataset format for the source and the sink. Note that the Excel format is available only as a source format, not as a sink format (maybe it will be in the future). Hence, we can only use datasets in the sink activity that don’t have Excel as the data format.

3. Next, we select the desired runtime and the actual path to the source or sink data store. In this way we can create any number of datasets as per our needs (a code sketch of both datasets follows below).
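
As a rough code sketch of the same step, the two datasets for our scenario (the Excel source on the file share and the delimited-text sink in Blob storage) could be defined like this with the Python SDK. The dataset and location model names below reflect my reading of recent SDK versions, so verify them against the version you use.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource,
    ExcelDataset,
    DelimitedTextDataset,
    FileServerLocation,
    AzureBlobStorageLocation,
    LinkedServiceReference,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, FACTORY = "demo-rg", "demo-adf"   # placeholder values

# Source: the .xlsx file sitting on the on-premise share.
source_ds = DatasetResource(properties=ExcelDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="OnPremFileSystem"),
    location=FileServerLocation(folder_path="exports", file_name="input.xlsx"),
    sheet_name="Sheet1",
    first_row_as_header=True,
))
adf.datasets.create_or_update(RG, FACTORY, "SourceExcel", source_ds)

# Sink: a .csv file in the staging container (Excel is not supported as a sink format).
sink_ds = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="StagingBlob"),
    location=AzureBlobStorageLocation(container="staging", file_name="input.csv"),
    column_delimiter=",",
    first_row_as_header=True,
))
adf.datasets.create_or_update(RG, FACTORY, "StagingCsv", sink_ds)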

Next, we will look at creating dataflows and pipelines for our task.

Our Use-Case

We have a source Excel file (refer to Fig 1) on our on-premise file system that we are first going to lift and shift into Blob storage (as staging).

After this we need to transform the input data coming from Blob storage using Mapping Data Flows, and finally we’ll sink the transformed data into an Azure SQL table (refer to Fig 2).

We have already described how to set up the Self-Hosted IR and the respective linked services, so now we can focus on building the pipeline and dataflow.

So, let’s start with the implementation.

Creating Dataflows and Pipelines :

A “pipeline” is a series of connected components that form a chain: a chain of different activities that are executed in order, from top to bottom.

A data flow is an activity for designing the flow of data, covering extraction, loading, and transformation. Each component in a dataflow has data ports, typically one input port and one or two output ports; data enters a component through the input port and exits through the output port.

All these dataflows are executed as pipeline activities.

In our case the source dataset is in Excel (.xlsx) format, coming from our private on-premise system, and the sink dataset is in comma-separated values (.csv) format, landing in Blob storage.

NOTE : We have used the .csv format for the sink because, as mentioned above, the .xlsx format is not available as a sink format.

Fig 1 : Sample Input Data (Source Data)
Fig 2 : Expected SQL Table (Expected after Transformation)

In our case we split our task into 4 major steps :

  • Installing the self-hosted Integration Runtime on our on-premise system.
  • Moving the data (shown in Fig 1) from on-premise to Azure Blob Storage using a Data Factory pipeline.
  • Collecting the data from Blob storage and performing some simple transformations using a Dataflow.
  • Sinking the data into Azure SQL in the transformed (desired) format, shown in Fig 2.

Now we will start implementing the above steps.

Copy Data Pipeline

Copy Data is a pipeline activity used for copying data from a source to a sink in Azure Data Factory. For a successful copy we must attach the correct datasets to the source and the sink. Here the source dataset is the one connected through the Self-Hosted IR, and the sink dataset is the one connected to the Azure Blob container.

Next we will move our .xlsx file from the private on-premise system to Azure Blob storage with the Copy Data activity, using the datasets we designed for accurate source and sink connections.

To set up the Copy Data pipeline we just need to open the Pipelines section and then drag and drop the Copy Data activity, as shown below :

Copy Data Pipeline in ADF Author

After specifying the correct source and sink, we run our pipeline by clicking the Debug button shown in the figure below :
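
The same Copy Data step can also be sketched with the Python SDK. The snippet below assumes the placeholder dataset names from the earlier sketches and simply creates a pipeline with one copy activity, then triggers a run; pressing Debug in the portal executes the unpublished draft, while create_run executes the deployed definition, but both run the same copy.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    ExcelSource,
    DelimitedTextSink,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, FACTORY = "demo-rg", "demo-adf"   # placeholder values

# One copy activity: Excel on the on-premise share -> CSV in the staging container.
copy_activity = CopyActivity(
    name="CopyFromOnPrem",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceExcel")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagingCsv")],
    source=ExcelSource(),
    sink=DelimitedTextSink(),
)
adf.pipelines.create_or_update(
    RG, FACTORY, "CopyOnPremToBlob", PipelineResource(activities=[copy_activity]))

# Trigger a run of the deployed pipeline and keep the run id for monitoring.
run = adf.pipelines.create_run(RG, FACTORY, "CopyOnPremToBlob")
print(run.run_id)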

Next we will create a Dataflow for the remaining steps.

Creating Data flows

Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that run on scaled-out Apache Spark clusters. In this section we will execute the 3rd step of our strategy, i.e. “Collect data from blob and then transform it using Dataflows.”

Mapping data flow has a unique authoring canvas designed to make building transformation logic easy. The data flow canvas is separated into three parts: the top bar, the graph, and the configuration panel. We can conveniently use this canvas to build our desired dataflow, as shown :

ADF Mapping Dataflows

The graph displays the transformation stream. It shows the lineage of source data as it flows into one or more sinks. To add a new source, select Add source. To add a new transformation, select the plus sign on the lower right of an existing transformation.

Dataflow Transformations

For our problem statement we have to use the Unpivot transformation, as shown above.

Unpivoting converts the data into our desired format, as shown in Fig 2. Finally, we add a Sink transformation to our dataflow design to sink the transformed data into Azure SQL.
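
The figures are not reproduced here, but the reshaping itself is easy to illustrate outside ADF. The small pandas sketch below uses made-up column names (the real ones come from the source file) and performs the same wide-to-long unpivot that the Unpivot transformation applies inside the data flow:

import pandas as pd

# Made-up wide-format rows standing in for the staged CSV of Fig 1:
# one identifier column plus one column per month.
wide = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Jan":  [100, 200],
    "Feb":  [110, 210],
    "Mar":  [120, 220],
})

# melt() is the pandas equivalent of the Unpivot transformation:
# the month columns collapse into key/value pairs, one row each.
long_format = wide.melt(id_vars=["Name"], var_name="Month", value_name="Value")
print(long_format)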

Here’s what our final Data flow looks like :

Dataflow Design

NOTE : We have to turn on Data flow debug for better tracking of activities.

Next, we will discuss creating the final pipeline by integrating the steps mentioned above.

Complete Pipeline

In this section we will implement the final step, i.e. “Sinking the data into Azure SQL in the transformed (desired) format, shown in Fig 2.”

For this step we have to connect both the Copy Data activity and the designed Dataflow, so we again move to the Pipelines section and drag and drop the Data flow activity. Inside it we select our desired dataflow, as shown :

Dataflow Activity ADF Author

Finally, we connect both activities (Copy Data + Dataflow) together to automate the complete process of receiving the data from the on-premise system, transforming it, and then sinking it into Azure SQL. Once the two are chained together, Data Factory takes care of the whole process.
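
In code, this chaining is expressed through an activity dependency: the data flow activity runs only after the copy activity has succeeded. Below is a minimal sketch under the same placeholder names as before, and it assumes the mapping data flow has already been created in the factory under the hypothetical name "UnpivotToSql".

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    ExcelSource,
    DelimitedTextSink,
    ExecuteDataFlowActivity,
    DataFlowReference,
    ActivityDependency,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, FACTORY = "demo-rg", "demo-adf"   # placeholder values

# Same copy step as in the earlier sketch: on-premise Excel -> staging CSV.
copy_step = CopyActivity(
    name="CopyFromOnPrem",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceExcel")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagingCsv")],
    source=ExcelSource(),
    sink=DelimitedTextSink(),
)

# Data flow step: runs the unpivot data flow, but only once the copy has succeeded.
transform_step = ExecuteDataFlowActivity(
    name="UnpivotAndLoad",
    data_flow=DataFlowReference(type="DataFlowReference", reference_name="UnpivotToSql"),
    depends_on=[ActivityDependency(activity="CopyFromOnPrem",
                                   dependency_conditions=["Succeeded"])],
)

adf.pipelines.create_or_update(
    RG, FACTORY, "OnPremToAzureSql",
    PipelineResource(activities=[copy_step, transform_step]))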

Here’s what our final pipeline looks like :

Final Pipeline

Finally, to save our resources and pipelines we must publish them so they can run in production.

The Publish button shown at the top of the canvas is the most convenient way to publish our resources.

Publishing Activities

Hence, we have successfully completed our task of “Moving and Transforming On-premise data to Azure SQL”.

I hope you enjoyed this scenario. I’ll come up with more Azure Data Factory scenarios in future posts, and for any further queries or suggestions you can reach me via mail at : ragvenderrawat@gmail.com

Written by Prabhat Rawat

Big Data Engineer || ETL Developer || Deep Learning Enthusiast || Machine Learning Enthusiast || Python Programmer
