Azure Data Factory (Part-2)— Moving Bulk Files Using Metadata

Prabhat Rawat
6 min read · May 30, 2021


In Part 1 we covered the basics of Azure Data Factory and the Copy Data pipeline, which handles data movement between multiple sources and destinations. Azure Data Factory provides a convenient drag-and-drop environment for managing different activities for data movement, transformations and dataflow ingestion.

In this post, we will take a look at the Get Metadata activity. How does it work? Where can we use it? And how do we use it with a ForEach loop to iterate over multiple files in a directory?

Now, with the help of the Get Metadata activity, our task is to move all the files from a specified directory in Azure storage to our destination store. The Get Metadata activity plays a major role here by dynamically providing the file names, and a ForEach loop then executes a common Copy Data activity on each of them to move the data files.

Let’s start. First, we will discuss what we have:

We have our ADF instance, a storage account with ADLS Gen2 containing an “input” folder with multiple files of different formats, and an Azure SQL database with a table named “Sink_SQL” whose structure matches the CSV data. All of this sits in a resource group named “RG-Test”, and I’m using my Student subscription for this demonstration.

Resource Group “RG-Test”

Problem Statement:

We need to migrate the data of only the CSV files inside the input folder to Azure SQL, loading all of them into a single SQL table named “Sink_SQL”.

Solution:

According to our proposed solution, we first need to get all the files with the .csv extension from our specified “Input” folder. Next, we need to move the data from those files into our desired SQL table by appending them iteratively.

Let’s move on to implementing the solution:

Step 1: First we need to create the required linked services. A linked service acts as the address-and-key pair (connection details plus credentials) for our source and destination. In our case we need just two: one for Azure SQL and one for ADLS Gen2, as shown below:

Linked Service for ADLS Gen2

Similarly, we can create a linked service for Azure SQL by entering the right credentials and using SQL Server authentication.
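Under the hood each linked service is just a JSON definition that you can inspect in the ADF code view. The sketches below show roughly what the two definitions can look like; the names (LS_ADLSGen2, LS_AzureSQL) and the angle-bracket placeholders are illustrative assumptions, not values from this demo.

```json
{
    "name": "LS_ADLSGen2",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": { "type": "SecureString", "value": "<account-key>" }
        }
    }
}
```

```json
{
    "name": "LS_AzureSQL",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=True;"
        }
    }
}
```

In practice you would usually keep the account key and the SQL password in Azure Key Vault rather than inline, but the inline form keeps the sketch simple.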

NOTE: Don’t forget to test the connection and confirm it succeeds, as shown above.

Step 2: For our sink dataset we need to create a table in Azure SQL to receive the migrated data. Although the sink dataset can auto-create the table, that is not recommended as a best practice.

Table Structure for Sink Dataset

Step 3: Next we need to create the appropriate datasets according to our source and sink requirements. In our case we will choose DelimitedText for the source CSVs and an Azure SQL table for the destination.

Dataset for Source

Similarly, we can create a dataset for Azure SQL by filling in the right fields and selecting the right table to sink into.
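As with the linked services, the datasets are JSON definitions. A rough sketch of both is shown below; note that the source dataset is given a FileName parameter so that the file to read can be supplied at runtime, which the ForEach loop will rely on later. The dataset names, the container placeholder and the parameter name are assumptions made for illustration.

```json
{
    "name": "DS_SourceCsv",
    "properties": {
        "linkedServiceName": { "referenceName": "LS_ADLSGen2", "type": "LinkedServiceReference" },
        "parameters": { "FileName": { "type": "string" } },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "<container>",
                "folderPath": "input",
                "fileName": { "value": "@dataset().FileName", "type": "Expression" }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```

```json
{
    "name": "DS_SinkSql",
    "properties": {
        "linkedServiceName": { "referenceName": "LS_AzureSQL", "type": "LinkedServiceReference" },
        "type": "AzureSqlTable",
        "typeProperties": { "schema": "dbo", "table": "Sink_SQL" }
    }
}
```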

Step 4: Next we will start creating our pipeline for the data migration. First we need to list all the files in our “Input” directory in ADLS, so we set up the Get Metadata activity as shown below.

Get Metadata Activity

NOTE: As shown in the output of the above activity, we need to access the “childItems” array, which holds an entry for every file in the folder, including all our CSV files.
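In JSON form the Get Metadata activity only needs a dataset pointing at the “input” folder and a field list containing childItems. The sketch below assumes a folder-level dataset named DS_InputFolder (a DelimitedText dataset whose location is the folder itself, with no file name); that dataset name is an assumption for illustration.

```json
{
    "name": "GetFileList",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "DS_InputFolder", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
    }
}
```

Each entry in the resulting childItems array has a name and a type (File or Folder); the name is what we will feed into the Copy Data activity.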

Next, we need to process these n CSV files by iterating over them with a ForEach activity, inside which we actually move the data of each and every CSV file to our Azure SQL database.

Step 5: Now we will add a ForEach activity to the pipeline, connected after the Get Metadata activity, and dynamically pass the file names into it as shown below.

For-each activity (connected with Get Metadata)
Dynamically parameterizing the files

Here we feed the dynamically parameterized array of files into the ForEach activity.

Basically, we pass the ‘childItems’ array into the ForEach activity, which iterates over the array so that every single file is processed.
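Under the hood, the Items setting of the ForEach activity is simply an expression over the Get Metadata output. A minimal sketch, assuming the Get Metadata activity is named GetFileList as above, looks like this (the inner activities list is left empty here and gets filled in Step 6). Since our folder also contains non-CSV files, one option not shown in the screenshots is to wrap the inner activity in an If Condition that checks @endswith(item().name, '.csv').

```json
{
    "name": "ForEachFile",
    "type": "ForEach",
    "dependsOn": [
        { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('GetFileList').output.childItems",
            "type": "Expression"
        },
        "activities": []
    }
}
```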

Step 6: Now we need to design the activity that runs for each file inside the ForEach loop and moves it from ADLS to the Azure SQL table. Hence, we need to add a Copy Data activity inside the loop, as shown below:

Activity inside For-Each

Under the Activities section we need to create a new Copy Data activity.

This activity will be used to process each file.

Copy data activity inside for-each

Inside this Copy Data activity we will select our required source and sink datasets.
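Put together, the Copy Data activity inside the loop can look roughly like the sketch below: it reads the current file by passing @item().name into the source dataset’s FileName parameter and writes into the Sink_SQL table. The activity and dataset names follow the earlier illustrative sketches and are assumptions, not the exact names used in the screenshots.

```json
{
    "name": "CopyCsvToSql",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "DS_SourceCsv",
            "type": "DatasetReference",
            "parameters": { "FileName": "@item().name" }
        }
    ],
    "outputs": [
        { "referenceName": "DS_SinkSql", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}
```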

Next, we can perform mapping if required (and in our case it is required).

If you need the SQL table to be refreshed with each new file rather than accumulate data, you can add a TRUNCATE statement in the pre-copy script section, so that the table is emptied before the data is inserted. In our case we are not doing this, because we need to append the data from each file to the table, not replace it.
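For completeness, the pre-copy script lives on the sink side of the Copy Data activity’s type properties; a sketch of that configuration, which we deliberately do not use in this scenario, would be the following. Note that inside a ForEach loop the script runs once per file, so with a truncate only the last file’s data would survive.

```json
{
    "source": { "type": "DelimitedTextSource" },
    "sink": {
        "type": "AzureSqlSink",
        "preCopyScript": "TRUNCATE TABLE dbo.Sink_SQL"
    }
}
```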

Now we are done with all the connections and the pipeline, so next we move on to execution. We will execute the pipeline using the Debug button at the top of the canvas, as shown below:

Debug Pipeline Execution

And now we can see the successful execution of our pipeline (Get Metadata + ForEach), and you can even check the migrated data inside the “Sink_SQL” table in the Azure SQL database.

Complete Execution from Source to Destination

In the image above we can finally see the complete execution of our data movement pipeline.

By following the same approach we can build any data migration using different linked services, datasets and pipelines. For transformations we need Data Flows, which we will explore in future scenarios.

NOTE: Data ingestion can become expensive because of several factors. For example, the infrastructure needed to support the various data sources, and proprietary tools, can be very costly to maintain in the long run.

I hope you enjoyed this scenario. I’ll come up with more Azure Data Factory scenarios in future posts, and for any further queries and suggestions you can reach me by email at ragvenderrawat@gmail.com.


Written by Prabhat Rawat

Big Data Engineer || ETL Developer || Deep Learning Enthusiast || Machine Learning Enthusiast || Python Programmer