This project shows how various Azure services and tools fit together to implement an end-to-end data project.
- Move on-premises data to the cloud using Azure Data Factory with a self-hosted integration runtime (IR).
- Integrate Data Factory with GitHub for version control.
- Use Azure Data Lake Storage Gen2 (ADLS) as a central data repository.
- Improve reliability with Azure Logic Apps for email alerts and Azure Monitor for detailed tracking.
- Perform data cleaning and transformations using Azure Databricks.
- Implement security with Azure Key Vault + Databricks Secret Scope to protect sensitive secrets.
- Analyze data with Azure Synapse Analytics using SQL.
- Connect ADLS to Power BI for visualizations.
The sales deals dataset (from Maven Analytics) is located on-premises. It contains CSV files, including accounts, products, sales teams, and sales opportunities, describing the B2B sales pipeline of a fictitious company that sells computer hardware.
- Create an Azure resource group.
- Create an Azure storage account (ADLS Gen2).
- Create an Azure Data Factory (ADF) instance.
- Create two containers: raw-data and transformed-data.
- Link a GitHub repository to Azure Data Factory for version control.
- Create a self-hosted integration runtime (IR).
- Download and run the IR installer on the on-premises machine.
- Create a linked service (File System type) to connect the on-premises source to the cloud, provide the path to the on-premises data, and test the connection.
- Create a pipeline in ADF and configure the source (the on-premises path, using a wildcard file path *.csv) and the sink (the raw-data container in the storage account).
- Create a Logic App. Add the trigger "When a HTTP request is received" and paste the following JSON schema into the Request Body section:
{
  "type": "object",
  "properties": {
    "pipelinename": { "type": "string" },
    "status": { "type": "string" },
    "pipelineid": { "type": "string" },
    "time": { "type": "string" }
  }
}
- Click "Add an action", add "Send an email", select the Body and Subject parameters, and use the trigger's dynamic content so the email contains the following details:
Pipeline Details:
Name: @pipelinename
Status: @status
ID: @pipelineid
Time: @time UTC
- In Azure Monitor, under Metrics, add the Failed pipeline runs and Succeeded pipeline runs (Count) metrics to track the factory.
- In the pipeline, add a Web activity for success and one for failure, provide the HTTP POST URL from the Logic App, and configure the body as follows (set "status" to "Succeeded" or "Failed" to match the activity):
{ "pipelinename": "@{pipeline().Pipeline}", "status": "Failed", "pipelineid": "@{pipeline().RunId}", "time": "@{utcNow()}" }
- Validate the pipeline and run it by triggering manually or on a schedule.
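Before wiring the Web activities into the pipeline, it can help to test the Logic App trigger by hand with a payload that matches the schema above. A minimal Python sketch (the pipeline name and run ID values here are placeholders, and the trigger URL in the comment is illustrative, not a real endpoint):

```python
import json

def build_alert_payload(pipeline_name, status, run_id, time_utc):
    """Build the JSON body expected by the Logic App's HTTP trigger schema."""
    return json.dumps({
        "pipelinename": pipeline_name,
        "status": status,
        "pipelineid": run_id,
        "time": time_utc,
    })

# Placeholder values standing in for ADF's @{pipeline().Pipeline},
# @{pipeline().RunId} and @{utcNow()} system variables.
payload = build_alert_payload("CopyOnPremToRaw", "Failed",
                              "0001-run-id", "2024-01-01T00:00:00Z")

# To actually fire the alert email, POST this payload to the trigger URL
# copied from the Logic App designer, e.g. with urllib.request:
# req = urllib.request.Request(TRIGGER_URL, data=payload.encode(),
#                              headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```

Posting a hand-built payload like this confirms the email formatting before any pipeline run depends on it.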
- The on-premises data should now be moved to the raw-data container in storage.
- Create an Azure Key Vault -> under Objects -> Secrets, create a secret whose value is the storage account access key (storage account -> Security + networking -> Access keys -> key1), then click Create.
- Create Databricks workspace.
- Create a new notebook in the workspace, then create a secret scope to connect the storage account and Databricks: copy the workspace URL and append #secrets/createScope, e.g. https://adb-1025563089158539.19.azuredatabricks.net/?o=1025563089158539#secrets/createScope
- In Key Vault, open Properties, copy the Vault URI and Resource ID, and paste them into the DNS Name and Resource ID fields of the Create Secret Scope page.
- Create the compute cluster first, since it takes time to start.
- Mount the raw-data container to Databricks. Because the secret stores the storage account access key (not a SAS token), use the fs.azure.account.key configuration:
dbutils.fs.mount(
    source = "wasbs://raw-data@azureprojectspractice.blob.core.windows.net",
    mount_point = "/mnt/raw-data",
    extra_configs = {"fs.azure.account.key.azureprojectspractice.blob.core.windows.net":
        dbutils.secrets.get(scope = "databricks scope1", key = "azuresecret1")})
- Verify the mount using: dbutils.fs.ls("/mnt/raw-data")
- Repeat the same to mount the transformed-data container, replacing raw-data with transformed-data:
dbutils.fs.mount(
    source = "wasbs://transformed-data@azureprojectspractice.blob.core.windows.net",
    mount_point = "/mnt/transformed-data",
    extra_configs = {"fs.azure.account.key.azureprojectspractice.blob.core.windows.net":
        dbutils.secrets.get(scope = "databricks scope1", key = "azuresecret1")})
- Check the notebook for more transformation details.
- Write the cleaned data back to the transformed-data container in ADLS.
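The kind of cleaning done in the notebook can be sketched locally without a cluster. A minimal Python example using the standard library (the column names and sample rows are illustrative, not the exact Maven Analytics schema):

```python
import csv
import io

# Sample rows standing in for one of the raw CSVs (illustrative columns).
raw_csv = io.StringIO(
    "account,sales_agent,close_value\n"
    "Acme Corp , alice,1200\n"
    "Hooli,bob,\n"
)

cleaned = []
for row in csv.DictReader(raw_csv):
    cleaned.append({
        "account": row["account"].strip(),            # trim stray whitespace
        "sales_agent": row["sales_agent"].strip(),
        "close_value": int(row["close_value"] or 0),  # fill missing values with 0
    })

# On Databricks the cleaned rows would be written back to the mount instead, e.g.
# spark.createDataFrame(cleaned).write.csv("/mnt/transformed-data/accounts")
```

The same pattern (normalize text fields, fill missing numerics) is what the Databricks notebook applies at scale with Spark.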
- Terminate the cluster to avoid charges.
- Create a Synapse Analytics workspace.
- Create a SQL database and then create views (virtual tables; the underlying data is not stored in Synapse).
- Create the queries needed for reporting; check the SQL query set for details.
- Open Power BI and connect it to ADLS Gen2.
- Provide the URL of the transformed-data container.
- Change blob to dfs in the URL to use the Data Lake endpoint.
- Authenticate, transform the data, select the needed files, and start building visualizations.
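The blob-to-dfs endpoint swap is a simple string substitution; a small sketch using the storage account name from this project:

```python
def to_dfs_url(blob_url: str) -> str:
    """Rewrite an Azure Blob endpoint URL to the ADLS Gen2 (dfs) endpoint."""
    return blob_url.replace(".blob.core.windows.net", ".dfs.core.windows.net")

url = to_dfs_url("https://azureprojectspractice.blob.core.windows.net/transformed-data")
# url -> "https://azureprojectspractice.dfs.core.windows.net/transformed-data"
```

Power BI's ADLS Gen2 connector expects the dfs endpoint, which is why the step above rewrites the URL copied from the container's properties.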
