lavanchukka/Sales-Deals-End2End-Azure-Project


Sales deals Azure End2End Project

Tech stack: Azure Data Factory, Azure Data Lake Storage Gen2, Azure Logic Apps, Azure Synapse, PySpark, Databricks, Power BI, Git

This project demonstrates how various Azure services and tools fit together to implement an end-to-end data project.

In this project we will perform the following tasks:

  1. Move on-premises data to the cloud using Azure Data Factory with a Self-Hosted Integration Runtime (IR).
  2. Integrate Data Factory with GitHub for version control.
  3. Use Azure Data Lake Storage Gen2 (ADLS) as a central data repository.
  4. Improve reliability with Azure Logic Apps for email alerts and Azure Monitor for detailed tracking.
  5. Perform data cleaning and transformations using Azure Databricks.
  6. Implement security with Azure Key Vault and a Databricks secret scope to protect sensitive credentials.
  7. Analyze data with Azure Synapse Analytics using SQL.
  8. Connect ADLS to Power BI for visualizations.

Dataset

The sales deals dataset (from Maven Analytics) lives on-premises. It contains 5 CSV files, including accounts, products, sales teams, and sales opportunities, covering B2B sales pipeline data from a fictitious company that sells computer hardware.

Architecture

architecture

Project Implementation steps

i) Azure Storage Account

  1. Create an Azure resource group.
  2. Create an Azure storage account (ADLS Gen2).
  3. Create two containers: raw-data and transformed-data.

ii) Azure Data Factory

  1. Create an Azure Data Factory (ADF) instance.

iii) GitHub

  1. Link the GitHub repository to Azure Data Factory for version control.

iv) Integration Runtime

  1. Create a self-hosted integration runtime (IR) in ADF.
  2. Download, install, and register the IR on the on-premises machine.

v) Linked Services

  1. Create a linked service (File System type) to connect the on-premises source to the cloud; provide the path to the on-premises data and test the connection.
  2. Create a pipeline in ADF with a Copy activity: configure the source (on-premises path, with a wildcard file path of *.csv) and the sink (the raw-data container in the storage account).
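As a quick illustration of what the wildcard file path does, ADF's `*.csv` behaves like a glob pattern over the source folder — only matching files are copied. A minimal sketch with Python's stdlib `fnmatch` (the file names below are assumptions based on the dataset description, plus a hypothetical non-CSV file):

```python
from fnmatch import fnmatch

# Hypothetical contents of the on-premises source folder.
files = ["accounts.csv", "products.csv", "sales_teams.csv",
         "sales_opportunities.csv", "notes.txt"]

# The copy activity's wildcard file path "*.csv" selects only the CSVs.
matched = [f for f in files if fnmatch(f, "*.csv")]
print(matched)
```

Only the four CSV files are picked up; `notes.txt` is ignored by the copy activity.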

vi) Azure Logic Apps

  1. Create a Logic App and add the trigger "When an HTTP request is received"; paste the following JSON schema into the Request Body:

{
  "type": "object",
  "properties": {
    "pipelinename": {
      "type": "string"
    },
    "status": {
      "type": "string"
    },
    "pipelineid": {
      "type": "string"
    },
    "time": {
      "type": "string"
    }
  }
}

  2. Add a "Send an email" action, fill in the Subject and Body parameters, and compose the body so the email reports:

    Pipeline Details:
      Name: @{triggerBody()?['pipelinename']}
      Status: @{triggerBody()?['status']}
      ID: @{triggerBody()?['pipelineid']}
      Time: @{triggerBody()?['time']} (UTC)
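The trigger schema above declares four optional string properties. A minimal stdlib-only sketch of the check that schema implies (the function name and sample values are ours, not a Logic Apps API):

```python
# Properties declared in the trigger's JSON schema; each, when present,
# must be a string for the body to match the schema.
SCHEMA_PROPS = ("pipelinename", "status", "pipelineid", "time")

def matches_schema(body) -> bool:
    """Return True if body is an object whose declared properties are strings."""
    if not isinstance(body, dict):
        return False
    return all(isinstance(body[k], str) for k in SCHEMA_PROPS if k in body)

# Hypothetical alert body such as the pipeline's Web activity would post.
sample = {
    "pipelinename": "copy-onprem-to-raw",
    "status": "Succeeded",
    "pipelineid": "0001-aaaa",
    "time": "2024-01-01T00:00:00Z",
}
print(matches_schema(sample))
```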

vii) Azure Monitor

  1. In Azure Monitor, under Metrics, add the failed pipeline runs and succeeded pipeline runs (Count) metrics for the data factory.
  2. In the pipeline, add two Web activities (one on success, one on failure), paste the HTTP POST URL from the Logic App, and configure the body as below, setting "status" to "Succeeded" or "Failed" to match the activity:

      {
        "pipelinename": "@{pipeline().Pipeline}",
        "status": "Failed",
        "pipelineid": "@{pipeline().RunId}",
        "time": "@{utcNow()}"
      }

  3. Validate the pipeline and run it, either by triggering it manually or on a schedule.
  4. The on-premises data should now land in the raw-data container in storage.
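At run time ADF resolves the `@{...}` expressions before posting. A sketch of the resolved body in plain Python (the function name and the placeholder pipeline name and run id are ours, standing in for `@{pipeline().Pipeline}` and `@{pipeline().RunId}`):

```python
import json
from datetime import datetime, timezone

def alert_body(pipeline_name: str, run_id: str, status: str) -> str:
    """Build the JSON body the Web activity posts to the Logic App.

    status should be "Succeeded" or "Failed", depending on which
    dependency path (success or failure) fired the Web activity.
    """
    return json.dumps({
        "pipelinename": pipeline_name,
        "status": status,
        "pipelineid": run_id,
        # @{utcNow()} in ADF resolves to an ISO-8601 UTC timestamp.
        "time": datetime.now(timezone.utc).isoformat(),
    })

# Placeholder values for illustration only.
body = alert_body("copy-onprem-to-raw", "run-0001", "Failed")
print(body)
```

Note that the keys match the Logic App trigger schema, so the email action can pull each field from the parsed body.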

viii) Azure Key Vault

  1. Create an Azure Key Vault. Under Objects -> Secrets, create a secret whose value is the storage account access key (storage account -> Security + networking -> Access keys -> key1).

ix) Databricks

  1. Create a Databricks workspace.
  2. Create a new notebook in the workspace, then create a secret scope to connect the storage account and Databricks: copy the workspace URL and append #secrets/createScope, e.g. https://adb-1025563089158539.19.azuredatabricks.net/?o=1025563089158539#secrets/createScope
  3. In the Key Vault's Properties, copy the Vault URI and Resource ID and paste them into the scope's DNS Name and Resource ID fields.
  4. Create the compute cluster first, since it takes time to start.
  5. Mount the raw-data container to Databricks. Because the secret holds the storage account access key, use the account-key configuration (a fs.azure.sas.* key, as originally written, would expect a SAS token instead):

dbutils.fs.mount(
    source = "wasbs://raw-data@azureprojectspractice.blob.core.windows.net",
    mount_point = "/mnt/raw-data",
    extra_configs = {"fs.azure.account.key.azureprojectspractice.blob.core.windows.net":
                     dbutils.secrets.get(scope = "databricks scope1", key = "azuresecret1")})

  6. Verify the mount using: dbutils.fs.ls("/mnt/raw-data")
  7. Repeat the same for the transformed-data container, replacing raw-data with transformed-data:

dbutils.fs.mount(
    source = "wasbs://transformed-data@azureprojectspractice.blob.core.windows.net",
    mount_point = "/mnt/transformed-data",
    extra_configs = {"fs.azure.account.key.azureprojectspractice.blob.core.windows.net":
                     dbutils.secrets.get(scope = "databricks scope1", key = "azuresecret1")})

  8. See the notebook for the full transformation details.
  9. Write the cleaned data back to the transformed-data container in ADLS.
  10. Terminate the cluster to avoid charges.
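The actual cleaning runs in the Databricks notebook with PySpark; as a stdlib-only illustration of the kind of cleanup such a step typically performs — normalizing column headers and stripping stray whitespace from cells — here is a sketch over a made-up sample row (the data and function name are ours, not from the notebook):

```python
import csv
import io

# Made-up sample standing in for one of the raw CSVs.
raw_csv = "Account Name , Sector \n Acme Corp ,technology\n Globex , retail \n"

def clean_rows(text: str) -> list:
    """Normalize headers to snake_case and strip whitespace from every cell."""
    reader = csv.DictReader(io.StringIO(text))
    header_map = {h: h.strip().lower().replace(" ", "_") for h in reader.fieldnames}
    return [
        {header_map[k]: v.strip() for k, v in row.items()}
        for row in reader
    ]

rows = clean_rows(raw_csv)
print(rows[0])  # {'account_name': 'Acme Corp', 'sector': 'technology'}
```

In the notebook the equivalent work would be done on a Spark DataFrame before writing to /mnt/transformed-data.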

x) Synapse Analytics

  1. Create a Synapse Analytics workspace.
  2. Create a SQL database, then create views over the data in ADLS (views are virtual queries; the data itself is not stored in Synapse).
  3. Write the queries needed for reporting; see the SQL queryset for details.

xi) Power BI

  1. Open Power BI and connect it to ADLS Gen2.
  2. Provide the URL of the transformed-data container.
  3. Change "blob" to "dfs" in the URL.
  4. Authenticate, choose Transform Data, select the files, and start building visualizations.
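Step 3 just swaps the Blob endpoint for the Data Lake (dfs) endpoint, since the Power BI ADLS Gen2 connector expects the latter. A one-line sketch, using the storage account name from the mount examples and assuming the transformed-data container:

```python
# Power BI's ADLS Gen2 connector uses the dfs endpoint, so replace
# ".blob.core.windows.net" with ".dfs.core.windows.net" in the container URL.
blob_url = "https://azureprojectspractice.blob.core.windows.net/transformed-data"
dfs_url = blob_url.replace(".blob.core.windows.net", ".dfs.core.windows.net")
print(dfs_url)  # https://azureprojectspractice.dfs.core.windows.net/transformed-data
```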
