Azure Data Factory (Part 1) — Getting Started
Microsoft Azure is a wonderful cloud computing platform and online portal that lets you access and manage all kinds of cloud services and resources provided by Microsoft. These services and resources include storage, transformation, load balancing, auto-scaling, and more, depending on your requirements. To get access to these resources and services, all you need is an active internet connection and an Azure subscription.
For learning purposes you can use the free trial subscription, which gives you $200 of credit for one month. The services included in the free trial depend on resource and region availability. If you are using another subscription, make sure Azure Data Factory is available inside it.
What is Azure Data Factory?
Azure Data Factory (ADF) is an amazing service provided by Microsoft Azure for ingesting data flows from and to various data sources. It is a cloud service that makes ETL (Extract, Transform & Load) and data integration easier and orchestrates data flows and transformations.
Components of ADF:
- Integration Runtimes: The compute infrastructure used by Azure Data Factory to run and supervise data flows and pipelines across different network environments. The default Azure integration runtime is used to integrate the various cloud (Azure) services, while for on-premises or firewall-secured networks you can simply go with a Self-Hosted Integration Runtime.
- Linked Services: Linked services act as the key and address for our data flows; they define where we collect the data from and where we have to dump it. A linked service holds the connection information Data Factory needs to connect to external resources such as Azure Blob Storage, Azure SQL Database, or even file systems.
- Datasets: These are something like data structures that define the structure and format of the data we are going to ingest with our data flows. We can create any number of datasets according to our requirements, but try to create as few as possible (we want to promote reusability).
- Pipelines: These are the actual delivery agents, used to collect data from multiple data sources and then sink it to the desired locations with the help of the designed linked services.
- Data Flows: Data flows are used when the data needs to be transformed before sinking. They allow you to create data transformation logic without writing a single line of code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. (Each of these components can also be created from code; see the SDK setup sketch right after this list.)
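Everything we will do through the ADF user interface in the next section can also be scripted. As a point of reference, here is a minimal setup sketch using the azure-mgmt-datafactory Python SDK; the subscription id, resource group, and factory name are placeholders you would replace with your own, and the later sketches in this post reuse the adf_client, rg_name, and df_name defined here.

```python
# Minimal setup sketch for scripting ADF with the Python SDK.
# Assumes: pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"  # placeholder
rg_name = "demo-rg"                         # placeholder resource group
df_name = "demo-adf"                        # placeholder data factory name

# DefaultAzureCredential picks up an Azure CLI login, environment
# variables, or a managed identity, whichever is available.
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)
```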
Creating a Simple Azure Pipeline
So far we have learned about the different components required in ADF; now we are going to create all the resources discussed in the section above.
We will start by creating the Azure Data Factory:
ADF: Creating a simple Azure Data Factory Service on Azure Portal.
ADF: Select the Author & Monitor tile to start the Azure Data Factory user interface (UI) application in a separate browser tab.
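If you would rather script this first step than click through the portal, the factory itself can be created with the client from the earlier sketch. This is only a sketch: the region is an example value, and rg_name, df_name, and adf_client are the placeholders defined above.

```python
from azure.mgmt.datafactory.models import Factory

# Create (or update) the data factory itself in a region of your choice.
df_resource = Factory(location="westeurope")  # example region
df = adf_client.factories.create_or_update(rg_name, df_name, df_resource)
print(df.name, df.provisioning_state)
```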
Resource 2.1: To create a new linked service, choose + New in the Connections tab, then select the linked service type according to the requirements of your activity.
Resource 2.2: Then create the linked service by selecting the required integration runtime and the connection credentials (don't forget to test the connection).
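For reference, the same step in code looks roughly like the sketch below, which registers an Azure Storage linked service; the connection string is a placeholder you would supply from your own storage account, and ls_name is just an example name.

```python
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

ls_name = "BlobStorageLinkedService"  # example name
storage_string = SecureString(
    value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"  # placeholder
)

# The linked service holds the connection info ADF needs to reach the store.
ls_properties = AzureStorageLinkedService(connection_string=storage_string)
adf_client.linked_services.create_or_update(
    rg_name, df_name, ls_name, LinkedServiceResource(properties=ls_properties)
)
```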
Resource 3: Inside Factory Resources you can see Datasets; select New dataset and choose the same service as the linked service you want to connect the dataset to.
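In code, a dataset simply points at a location inside a linked service. A sketch of a blob dataset is shown below; the container, folder, and file names are placeholders, and it reuses the ls_name from the previous sketch.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    DatasetResource,
    LinkedServiceReference,
)

ds_name = "InputBlobDataset"  # example dataset name
ls_ref = LinkedServiceReference(
    type="LinkedServiceReference", reference_name=ls_name
)

# The dataset describes *what* the pipeline reads: container, folder, file.
ds_properties = AzureBlobDataset(
    linked_service_name=ls_ref,
    folder_path="input-container/raw",  # placeholder path
    file_name="data.csv",               # placeholder file
)
adf_client.datasets.create_or_update(
    rg_name, df_name, ds_name, DatasetResource(properties=ds_properties)
)
```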
Resource 4: Copy Data pipeline. To set up the Copy Data pipeline, go to the Pipelines section and drag and drop the Copy data activity onto the canvas, as shown below:
You need to specify the Source and Sink attributes inside the pipeline settings.
After setting the correct source and sink, we run our pipeline by clicking the Debug button shown in the figure below:
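For comparison, here is roughly what the same Copy Data pipeline looks like through the SDK. It is a sketch only: it reuses the InputBlobDataset from above, assumes an analogous OutputBlobDataset already exists for the sink, and triggering a run with create_run is the scripted counterpart of pressing Debug.

```python
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

p_name = "CopyBlobPipeline"  # example pipeline name

# Source and sink reference the datasets; "OutputBlobDataset" is assumed
# to exist and mirrors the input dataset created earlier.
source_ref = DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")
sink_ref = DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")

copy_activity = CopyActivity(
    name="CopyFromBlobToBlob",
    inputs=[source_ref],
    outputs=[sink_ref],
    source=BlobSource(),
    sink=BlobSink(),
)

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, p_name, pipeline)

# Start a run: the scripted counterpart of pressing Debug in the canvas.
run = adf_client.pipelines.create_run(rg_name, df_name, p_name, parameters={})
print("Run id:", run.run_id)
```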
Publishing our resources
So far we have created the pipelines and other resources, and now we have to execute them. To run our final pipeline we use the Debug button provided in the canvas, and the rest is taken care of by Data Factory.
Finally, to save our resources and pipelines for future use, we must publish them. The Publish button shown at the top of the canvas is the appropriate way to save our resources.
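One note if you went the SDK route: resources created through the API are saved directly in the Data Factory service, so there is no separate Publish step; you can simply poll the run you started earlier to see whether it succeeded. A minimal sketch, reusing the run id from the previous snippet:

```python
import time

# Poll the pipeline run started above until it reaches a terminal state.
while True:
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print("Pipeline run finished with status:", pipeline_run.status)
```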
Hurray! We have successfully created our Azure Data Factory and the resources associated with it. I hope you now have a good idea of what ADF is and how it works.
Final Words
Azure Data Factory is seriously amazing, and that's not all: there is a lot more in ADF to explore, and we are going to see those applications in upcoming blogs. Till then, keep learning…
In case you need any help, you can reach me via mail at ragvenderrawat@gmail.com.
Thanks For Your Time, Have a nice Day :-)