Cloud Data Engineering & Analytics Pipeline

End-to-end data engineering and analytics project built on Google Cloud Platform and Databricks. It covers the full pipeline: data ingestion, querying, distributed analysis, machine learning, and visualization.

Objective

Design and implement a scalable cloud-based data pipeline to process, analyze, and visualize data using modern data engineering and analytics tools.

Tools & Technologies

  • Google Cloud Platform (GCS, BigQuery, Cloud Shell)
  • Databricks
  • Apache Spark (Spark SQL, DataFrames)
  • Spark MLlib
  • Looker Studio
  • SQL, Python (Jupyter Notebooks)

Workflow

  • Cloud Setup
    Created a Google Cloud Storage (GCS) bucket and configured project resources.

  • Data Ingestion
    Downloaded the dataset, uploaded it to GCS, and verified data integrity using Cloud Shell.

  • Data Manipulation & Querying
    Imported data into BigQuery and executed analytical queries using:

    • BigQuery Web Console
    • Jupyter notebooks

  • Distributed Data Analysis
    Loaded data into Spark DataFrames on Databricks and replicated analytical queries using:

    • Spark SQL
    • DataFrame operations

  • Data Enrichment
    Applied a machine learning model using Spark MLlib to enhance the analysis.

  • Data Visualization
    Built an interactive dashboard in Looker Studio to present insights (with optional visualization in Databricks).
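
The Data Ingestion step mentions verifying data integrity after the upload, but the README does not say how this was done. One common approach is to compare checksums: GCS reports an object's MD5 as a base64-encoded digest, so the same value can be computed for the local file and compared. The sketch below assumes that approach; the function names `gcs_style_md5` and `matches_remote_hash` are hypothetical and not taken from the repository.

```python
import base64
import hashlib


def gcs_style_md5(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a local file.

    GCS reports object MD5 hashes in this base64 form, so the result can be
    compared directly with the hash shown for the uploaded object.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large dataset does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_remote_hash(path, remote_md5_b64):
    """Return True if the local file's hash equals the hash reported by GCS.

    `remote_md5_b64` is assumed to be copied from the object's metadata
    (hypothetical input; this sketch does not call any GCS API).
    """
    return gcs_style_md5(path) == remote_md5_b64.strip()
```

The remote value would be copied from the object's metadata, for example from a Cloud Shell listing, and passed to `matches_remote_hash` together with the path of the downloaded file. Composite objects in GCS carry only a CRC32C checksum, so this MD5 comparison applies to objects uploaded in a single piece.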

Outcome

The project demonstrates how cloud storage, distributed computing, machine learning, and visualization tools can be integrated into a unified data pipeline for real-world analytics use cases.

Key Takeaway

This repository highlights practical experience in building and managing cloud-based data pipelines, combining data engineering and data analysis skills in a scalable environment.
