☁️ CloudAutomationGNN

AI-Powered Cloud Infrastructure Anomaly Detection & Automated Remediation
A full-stack, cloud-native platform that uses a Graph Neural Network (GNN) to detect anomalies across distributed AWS infrastructure in real time, explain why an anomaly occurred using XAI, and trigger automated remediation.

🌐 Project Overview

CloudAutomationGNN monitors live AWS cloud resources (EC2, Lambda, RDS, ELB, S3) by continuously ingesting CloudWatch metrics. These metrics are modelled as a graph in which each cloud resource is a node and each service dependency is an edge, and a trained GraphSAGE GNN classifies every node as normal or anomalous.

When an anomaly is detected:

The system identifies why (SHAP + GNNExplainer XAI).
It publishes a real-time alert to the frontend via WebSocket.
Automated remediation actions are triggered and logged.
A natural-language explanation is generated by Google Gemini (via LangChain RAG).

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         USER BROWSER (React)                            │
│  Dashboard │ Alerts │ XAI Panel │ Automation Logs │ AI Chat             │
└───────────────────────┬─────────────────────────────────────────────────┘
                        │  REST API + WebSocket (WSS)
                        ▼
┌─────────────────────────────────────────────────────────────────────────┐
│              Node.js Backend  (AWS Lambda via Serverless)               │
│                                                                         │
│   ┌─────────────┐  ┌──────────────────┐  ┌──────────────────────────┐  │
│   │ REST API    │  │ Event Processor  │  │  WebSocket Handlers      │  │
│   │ (Express)   │  │ (EventBridge/SQS)│  │  (connect/disconnect/    │  │
│   │             │  │                  │  │   SNS→WS subscriber)     │  │
│   └──────┬──────┘  └────────┬─────────┘  └──────────────────────────┘  │
└──────────┼──────────────────┼──────────────────────────────────────────┘
           │                  │
           │ HTTP             │ HTTP
           ▼                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                  Python FastAPI Backend (Docker / ECS)                  │
│                                                                         │
│   ┌────────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │
│   │ GNN Inference  │  │ XAI Service  │  │  LLM / RAG Service       │   │
│   │ (GraphSAGE)    │  │ (SHAP +      │  │  (Gemini + LangChain +   │   │
│   │                │  │  GNNExplainer│  │   Pinecone + HuggingFace)│   │
│   └────────────────┘  └──────────────┘  └──────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
           │                  │
           ▼                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        AWS Cloud Services                               │
│                                                                         │
│  CloudWatch → EventBridge → SQS → Lambda (eventProcessor)              │
│  DynamoDB (anomalies + WS connections)                                  │
│  SNS → Lambda (snsToWebSocket) → API Gateway WebSocket → Browser       │
│  S3 (GNN model artifacts + frontend hosting)                            │
│  MongoDB Atlas (users, events, anomalies, conversations)                │
└─────────────────────────────────────────────────────────────────────────┘

🛠 Tech Stack

Layer	Technology
Frontend	React (Vite), Recharts, D3/Force Graph, CSS, Socket.IO
Node Backend	Node.js, Express, Serverless Framework, AWS Lambda
Python Backend	FastAPI, PyTorch Geometric, SHAP, LangChain, Docker
ML Model	GraphSAGE (PyTorch Geometric)
XAI	GNNExplainer, SHAP KernelExplainer
LLM / RAG	Google Gemini 2.5, LangChain (LCEL), HuggingFace Embeddings
Vector DB	Pinecone
App Database	MongoDB Atlas (Mongoose ODM)
AWS Services	Lambda, API Gateway, EventBridge, SQS, SNS, DynamoDB, CloudWatch, S3
IaC	Terraform (AWS provider ~5.70), Serverless Framework v3+
Auth	JWT (Access + Refresh tokens), HTTP-only cookies

🧠 ML/DL Models

1. GraphSAGE — Anomaly Detector (Primary Model)

File: Python-Backend/app/services/gnn_model.py | Trained: ML/train_gnn.py

GraphSAGE (Graph Sample and Aggregate) is a graph neural network that learns node representations by aggregating features from a node's local neighbourhood. It suits this use case because cloud resources are inherently relational: an EC2 instance's anomalous behaviour is often caused by, or propagated from, upstream Lambda functions or RDS databases.

Architecture:
  Input (5 node features)
    → SAGEConv Layer 1  (64 units) → ReLU → Dropout(0.3)
    → SAGEConv Layer 2  (32 units) → ReLU → Dropout(0.3)
    → SAGEConv Layer 3  ( 1 unit ) → Sigmoid
  Output: per-node anomaly probability ∈ [0, 1]
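
To make the neighbourhood aggregation concrete, here is a minimal pure-Python sketch of the mean-aggregation step that a GraphSAGE layer builds on. It is an illustration only, not the code in gnn_model.py: the function name `mean_aggregate`, the three-node graph, and the two-feature vectors are all hypothetical. A SAGEConv layer with mean aggregation applies learned weights to a node's own features and to the mean of its neighbours' features; the concatenated vector below is the input those weights would act on.

```python
from typing import Dict, List


def mean_aggregate(features: Dict[str, List[float]],
                   neighbours: Dict[str, List[str]]) -> Dict[str, List[float]]:
    """Return, per node, its own features followed by the mean of its neighbours' features."""
    out: Dict[str, List[float]] = {}
    for node, own in features.items():
        nbrs = neighbours.get(node, [])
        if nbrs:
            mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                    for i in range(len(own))]
        else:
            mean = [0.0] * len(own)  # isolated node: no neighbour signal
        out[node] = own + mean
    return out


# Hypothetical graph: an EC2 instance that depends on a Lambda function and an RDS database.
features = {
    "ec2-1": [0.90, 0.40],
    "lambda-1": [0.20, 0.10],
    "rds-1": [0.60, 0.30],
}
neighbours = {
    "ec2-1": ["lambda-1", "rds-1"],
    "lambda-1": ["ec2-1"],
    "rds-1": ["ec2-1"],
}
aggregated = mean_aggregate(features, neighbours)
```

Stacking three such layers, as in the architecture above, lets a node's score depend on resources up to three hops away.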

Node features (5 per node):

Feature	Description
`cpu`	CPU utilization (%)
`memory`	Memory utilization (%)
`latency`	Request latency (ms)
`error_rate`	Fraction of failed requests
`request_count`	Requests per minute

Training details:

Dataset: Synthetic cloud infrastructure graphs (ML/data_generator.py)
Loss: Binary Cross-Entropy (BCELoss) with positive class weighting for imbalanced anomaly labels
Optimizer: Adam (lr=0.01, weight_decay=5e-4)
Epochs: 200
Threshold: 0.5 → ANOMALY
Metrics: Accuracy + F1 Score
Saved model: ML/models/gnn_model.pt (uploaded to S3 for inference)
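
The effect of positive class weighting can be shown with a small stdlib-only sketch of weighted binary cross-entropy. The exact weighting used in ML/train_gnn.py is not shown here, so the ratio `negatives / positives` below is an assumption, as are the function name `weighted_bce` and the toy labels and probabilities.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: float, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy where anomalous (label 1) samples count pos_weight times."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        weight = pos_weight if y == 1 else 1.0
        total += -weight * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Toy batch: three normal nodes and one anomalous node (hypothetical values).
labels = [0, 0, 0, 1]
probs = [0.1, 0.2, 0.1, 0.3]

# Assumed weighting scheme: ratio of negatives to positives.
pos_weight = labels.count(0) / labels.count(1)
loss = weighted_bce(probs, labels, pos_weight)
```

With the weight applied, a missed anomaly contributes more to the loss than a misclassified normal node, which counteracts the label imbalance.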

Severity thresholds (score → label):

Score Range	Severity
≥ 0.85	CRITICAL
≥ 0.65	HIGH
≥ 0.35	MEDIUM
< 0.35	LOW
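
The table above translates directly into a small lookup function. This is a sketch under the thresholds listed; the name `severity_for` and the range check are illustrative and not taken from the repository.

```python
def severity_for(score: float) -> str:
    """Map an anomaly score in [0, 1] to the severity label from the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score >= 0.85:
        return "CRITICAL"
    if score >= 0.65:
        return "HIGH"
    if score >= 0.35:
        return "MEDIUM"
    return "LOW"
```

Checking the thresholds from highest to lowest keeps each branch a single comparison.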

2. GNNExplainer — Subgraph Explainability

File: Python-Backend/app/services/xai_service.py

GNNExplainer (from torch_geometric.explain) runs 200 optimisation epochs to identify which features and edges (neighbouring nodes) most contributed to the GNN's anomaly prediction for a specific node.

Outputs a feature mask [N, F] — importance weight per feature per node
Outputs an edge mask [E] — importance weight per edge
Edges with mask value ≥ 0.5 form the important subgraph shown in the cascade path
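
As a rough illustration of the last point, the sketch below filters an edge list by its mask values and collects the nodes that remain. The function name `important_subgraph` and the sample edges are hypothetical; only the 0.5 threshold comes from the description above.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def important_subgraph(edges: Sequence[Edge], edge_mask: Sequence[float],
                       threshold: float = 0.5) -> Tuple[List[Edge], List[str]]:
    """Keep edges whose mask value is >= threshold and list the nodes they touch."""
    if len(edges) != len(edge_mask):
        raise ValueError("edges and edge_mask must have the same length")
    kept = [edge for edge, weight in zip(edges, edge_mask) if weight >= threshold]
    nodes = sorted({node for edge in kept for node in edge})
    return kept, nodes


# Hypothetical edges with the mask values an explainer might assign to them.
edges = [("lambda-1", "ec2-1"), ("rds-1", "ec2-1"), ("s3-1", "lambda-1")]
edge_mask = [0.82, 0.50, 0.12]
kept_edges, kept_nodes = important_subgraph(edges, edge_mask)
```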

3. SHAP KernelExplainer — Feature Attribution

File: Python-Backend/app/services/xai_service.py

SHAP (SHapley Additive exPlanations) KernelExplainer treats the GNN as a black-box function and estimates each feature's marginal contribution to the anomaly score.

Background baseline: mean feature vector of all graph nodes (used as a proxy for a healthy node)
Samples: 64 perturbations per explanation
Fallback: If the model outputs are degenerate (near-zero variance), attribution reverts to the |z-score| deviation from the training baseline, which keeps per-feature values distinct instead of collapsing to identical scores
Final importance: sqrt(SHAP × GNNExplainer mask), the geometric mean of the two methods' scores
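
The fallback and the blend can be sketched together in a few lines of stdlib Python. This is not the code in xai_service.py. The function names and sample numbers are hypothetical, the 1e-6 variance cut-off is the one quoted in the XAI workflow, and taking the absolute SHAP value before the square root is an assumption made so that negative attributions stay defined.

```python
import math
from statistics import pvariance
from typing import List, Sequence


def zscore_fallback(values: Sequence[float], mean: Sequence[float],
                    std: Sequence[float]) -> List[float]:
    """Absolute z-score of each feature against a training baseline."""
    return [abs(v - m) / s if s > 0 else 0.0 for v, m, s in zip(values, mean, std)]


def blend_importance(shap_values: Sequence[float], gnn_mask: Sequence[float],
                     model_outputs: Sequence[float], node_features: Sequence[float],
                     baseline_mean: Sequence[float], baseline_std: Sequence[float],
                     var_eps: float = 1e-6) -> List[float]:
    """Geometric mean of per-feature attribution and GNNExplainer mask."""
    attribution = [abs(s) for s in shap_values]  # assumption: magnitude only
    if pvariance(model_outputs) < var_eps:
        # Degenerate model output: SHAP carries no signal, use z-scores instead.
        attribution = zscore_fallback(node_features, baseline_mean, baseline_std)
    return [math.sqrt(a * m) for a, m in zip(attribution, gnn_mask)]


# Hypothetical two-feature example (cpu, memory).
healthy = blend_importance([0.4, -0.1], [0.9, 0.4], [0.2, 0.8, 0.5],
                           [95.0, 40.0], [50.0, 50.0], [15.0, 10.0])
degenerate = blend_importance([0.0, 0.0], [1.0, 0.25], [0.5, 0.5, 0.5],
                              [95.0, 40.0], [50.0, 50.0], [15.0, 10.0])
```

A geometric mean rewards features that both methods agree on: if either score is near zero, the blended importance is near zero as well.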

4. Google Gemini (LLM) + LangChain RAG — Natural Language Explanations

File: Python-Backend/app/services/llm_service.py

When a user requests an explanation of an anomaly, the SHAP scores and GNN output are fed into a LangChain RAG pipeline:

Retrieval: HuggingFace sentence-transformers/all-MiniLM-L6-v2 embeds the query and retrieves relevant cloud runbook documents from Pinecone vector store
Generation: Google Gemini 2.5 generates a natural-language explanation using the retrieved context + anomaly details
Pipeline routing: LCEL (LangChain Expression Language) routes queries to Smart_Chat, Document_Analysis, Analytical_Insights, or General_Conversation pipelines based on feature_mode

☁️ Cloud Services Used

AWS Services

Service	Purpose
AWS Lambda	Serverless compute for REST API, Event Processor, and WebSocket handlers
API Gateway (HTTP)	Exposes all REST endpoints (`/anomalies`, `/events`, `/graph`, etc.)
API Gateway (WebSocket)	Real-time bidirectional communication channel to browser clients
EventBridge	Rule-based event bus — triggers on `CloudWatch Alarm State Change` and periodic metric-poll schedule
SQS	Decoupled event queue between EventBridge and the `eventProcessor` Lambda (batch window: 30s, batch size: 10)
SQS Dead Letter Queue	Captures failed events (max 3 receive attempts, retained 14 days)
SNS	Publishes anomaly alerts; the `snsToWebSocket` Lambda subscribes and fans out to all WebSocket connections
DynamoDB	Stores: (1) GNN anomaly results keyed by `partitionKey/sortKey`, (2) WebSocket `connectionId` records with TTL
CloudWatch	Source of alarm state-change events; `DescribeAlarms` API polled directly for live alerts
S3	(1) Model artifact bucket (versioned, AES-256 encrypted, private) — hosts `gnn_model.pt`; (2) Frontend static hosting bucket (React SPA)
IAM	Fine-grained Lambda execution roles for SNS publish, DynamoDB CRUD, EventBridge emit, SQS consume, CloudWatch read, API Gateway connections

External Cloud Services

Service	Purpose
MongoDB Atlas	Primary application database — stores Users, Events, Anomalies, Conversations, Documents
Pinecone	Vector database for RAG document retrieval (index: `cksfinbot`, dynamic namespace per uploaded document)
Google AI (Gemini)	`gemini-2.5-flash-lite` LLM for natural language anomaly explanations
HuggingFace	`all-MiniLM-L6-v2` embedding model (runs locally in Python container)

🔧 Microservices Breakdown

The system is composed of 5 Lambda functions and 1 Python FastAPI service:

Lambda 1 — `app` (REST API Handler)

Handler: lambda.js → src/app.js
Trigger: API Gateway HTTP (ANY *)
Memory: 512 MB | Timeout: 29s

Express.js application exposed as a Lambda function via serverless-http. Handles all client-facing REST operations.

Routes & Controllers:

Route	Controller	Purpose
`POST /users/register`	`user.controller`	Create user account
`POST /users/login`	`user.controller`	Authenticate, issue JWT cookies
`GET /anomalies`	`anomaly.controller`	Paginated anomaly list (DB + live CloudWatch merge)
`GET /anomalies/:id/explain`	`anomaly.controller`	GNN + SHAP explanation for a specific anomaly
`PATCH /anomalies/:id/resolve`	`anomaly.controller`	Mark anomaly as resolved
`GET /anomalies/stats`	`anomaly.controller`	Severity distribution stats
`GET /events`	`events.controller`	Paginated cloud metric events
`GET /graph`	`graph.controller`	Resource graph topology for D3 visualisation
`POST /automation/trigger`	`automation.controller`	Trigger GNN-recommended remediation
`GET /automation/logs`	`automation.controller`	Automation action history
`POST /conversation`	`conversation.controller`	Initiate RAG chat with LLM
`POST /document/upload`	`document.controller`	Upload runbook to S3 + index in Pinecone
`GET /s3/signed-url`	`s3.controller`	Generate pre-signed S3 URL for file access

Lambda 2 — `eventProcessor` (GNN Metric Ingestion)

Handler: src/eventProcessor.js
Triggers: EventBridge (cloud-metrics-rule) + SQS batch
Memory: 256 MB | Timeout: 60s

The core real-time data pipeline. Each time an EventBridge event or SQS message arrives:

Parse the raw event (CloudWatch Alarm or custom CloudMetricEvent)
Persist the metric event to MongoDB (hydrates the frontend graph)
Forward to Python GNN service (POST /predict/single)
If anomaly detected → persist to DynamoDB + MongoDB
For critical/high → publish to SNS
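
The branching in these steps can be summarised as a routing decision. The real handler is Node.js (src/eventProcessor.js). The sketch below is in Python only to keep the examples in one language, and the sink labels, the `severity` field, and the function name `sinks_for` are hypothetical.

```python
from typing import Dict, List

# Severities that trigger a real-time alert, per the steps above.
ALERT_SEVERITIES = {"critical", "high"}


def sinks_for(prediction: Dict) -> List[str]:
    """Return the destinations a processed metric event is written to."""
    sinks = ["mongodb:events"]  # the metric event itself is always persisted
    if prediction.get("is_anomaly"):
        sinks += ["dynamodb:anomalies", "mongodb:anomalies"]
        if str(prediction.get("severity", "")).lower() in ALERT_SEVERITIES:
            sinks.append("sns:alert")
    return sinks
```

For example, a prediction such as `{"is_anomaly": True, "severity": "HIGH"}` would reach all four sinks, while a normal prediction would only be stored as an event.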

Lambda 3 — `wsConnect`

Handler: src/Handlers/websocketHandlers.connect
Trigger: WebSocket $connect route

Stores the new connectionId in the cloud-automation-ws-connections-dev DynamoDB table with a TTL so stale connections are automatically cleaned up.
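
DynamoDB TTL expects a Number attribute holding a Unix epoch time in seconds, so the stored record might look like the sketch below. The attribute names `connectedAt` and `ttl`, the helper name, and the two-hour lifetime are assumptions rather than values read from the handler; two hours matches the maximum duration API Gateway allows for a WebSocket connection.

```python
import time
from typing import Dict, Optional


def connection_item(connection_id: str, ttl_seconds: int = 2 * 60 * 60,
                    now: Optional[float] = None) -> Dict[str, object]:
    """Build a connection record whose `ttl` is an epoch-seconds expiry time."""
    issued_at = int(time.time() if now is None else now)
    return {
        "connectionId": connection_id,
        "connectedAt": issued_at,
        "ttl": issued_at + ttl_seconds,  # DynamoDB deletes the item some time after this
    }
```

TTL deletion is not instantaneous, so readers of the table should still tolerate records for connections that have already closed.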

Lambda 4 — `wsDisconnect`

Handler: src/Handlers/websocketHandlers.disconnect
Trigger: WebSocket $disconnect route

Removes the connectionId record from DynamoDB when a browser disconnects.

Lambda 5 — `snsToWebSocket` (Real-Time Alert Fan-out)

Handler: src/Handlers/snsSubscriber.handler
Trigger: SNS topic subscription

When SNS receives a critical or high-severity anomaly notification:

Scans DynamoDB for all active WebSocket connection IDs
Uses @aws-sdk/client-apigatewaymanagementapi to POST the anomaly payload to every connected browser simultaneously
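
A fan-out loop of this kind usually also has to deal with connections that have already closed, for which the API Gateway Management API reports a "gone" (HTTP 410) error. The sketch below shows that pattern with the transport injected as a callable, so it runs without AWS credentials. `GoneError`, `fan_out`, and the `post` signature are hypothetical stand-ins, and the loop posts sequentially rather than concurrently.

```python
from typing import Callable, Iterable, List


class GoneError(Exception):
    """Stand-in for the 'connection no longer exists' (HTTP 410) error."""


def fan_out(connection_ids: Iterable[str], payload: bytes,
            post: Callable[[str, bytes], None]) -> List[str]:
    """Send payload to every connection and return the IDs that turned out to be stale."""
    stale: List[str] = []
    for connection_id in connection_ids:
        try:
            post(connection_id, payload)
        except GoneError:
            # The browser has gone away; the caller can delete this record.
            stale.append(connection_id)
    return stale
```

Returning the stale IDs lets the caller remove them from the connections table instead of waiting for the TTL to expire.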

Python FastAPI Service

Entry: Python-Backend/app/main.py
Deployment: Docker container (Dockerfile included)
Base URL: configured via PYTHON_SERVICE_URL env var

Endpoint	Description
`POST /predict/single`	Run GNN inference on one cloud node
`POST /predict/graph`	Run GNN inference on a full resource graph
`POST /explain`	Generate SHAP + GNNExplainer XAI explanation
`POST /chat`	Invoke LangChain RAG with Gemini LLM
`POST /document/process`	Chunk and embed uploaded documents into Pinecone

Internal services:

File	Responsibility
`gnn_model.py`	GraphSAGE model definition + `load_model()` helper
`graph_builder.py`	Convert raw node/edge lists → PyTorch Geometric `Data` object
`gnn_inference.py`	Load model from S3/disk + run forward pass
`xai_service.py`	GNNExplainer + SHAP KernelExplainer orchestration
`explanation_builder.py`	Format XAI outputs into frontend-ready JSON
`llm_service.py`	LangChain LCEL pipelines (Smart Chat, Document Analysis, etc.)
`document_processor.py`	PDF/text chunking and embedding for Pinecone indexing
`s3_service.py`	Upload/download model artifacts from S3
`pinecone_service.py`	Pinecone client initialisation

🔄 System Workflow (End-to-End)

Workflow 1 — Real-Time Anomaly Detection (Happy Path)

1. AWS CloudWatch detects a metric threshold breach (e.g., EC2 CPU > 90%)
   └─► Emits CloudWatch Alarm State Change event

2. EventBridge rule (cloud-metrics-rule) matches the event
   └─► Routes to SQS event queue  (OR directly triggers Lambda 2)

3. Lambda 2 (eventProcessor) processes the SQS batch
   ├─► Parses & normalises the CloudMetricEvent payload
   ├─► Saves Event record to MongoDB (graph hydration)
   └─► Calls Python FastAPI:  POST /predict/single
         └─► graph_builder.py builds PyG Data object (normalised features + self-loop edges)
         └─► GraphSAGE.forward() → anomaly_score ∈ [0,1]
         └─► Returns { is_anomaly: true, anomaly_score: 0.91, ... }

4. If is_anomaly == true:
   ├─► Persist anomaly to DynamoDB (7-day TTL)
   ├─► Persist anomaly to MongoDB (UI stats)
   └─► If severity ∈ {critical, high}:
         └─► SNS.publish(anomaly alert)

5. SNS triggers Lambda 5 (snsToWebSocket)
   ├─► Scan DynamoDB for all active WS connectionIds
   └─► POST anomaly payload to each connected browser via the API Gateway Management API

6. Browser receives WebSocket message → React updates Dashboard in real time
   ├─► ResourceGraph re-renders with anomalous node highlighted (red)
   ├─► AlertCard appears in Alerts page
   └─► MetricsChart spikes visible
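
Step 3 mentions normalised features and self-loop edges. The stdlib sketch below shows one way such a graph could be assembled before it is wrapped in a PyTorch Geometric `Data` object. It is not the code in graph_builder.py: the scaling ranges in `FEATURE_MAX` and the function name `build_graph` are assumptions. The `[2, E]` layout of `edge_index` follows the PyTorch Geometric convention.

```python
from typing import Dict, List, Tuple

FEATURES = ["cpu", "memory", "latency", "error_rate", "request_count"]

# Hypothetical upper bounds used to scale each feature into [0, 1].
FEATURE_MAX = {"cpu": 100.0, "memory": 100.0, "latency": 1000.0,
               "error_rate": 1.0, "request_count": 10000.0}


def build_graph(nodes: List[Dict], edges: List[Tuple[str, str]]):
    """Return (x, edge_index): scaled node features and a [2, E] edge list with self-loops."""
    index = {node["id"]: i for i, node in enumerate(nodes)}
    x = [[min(float(node[f]) / FEATURE_MAX[f], 1.0) for f in FEATURES]
         for node in nodes]
    pairs = [(index[src], index[dst]) for src, dst in edges]
    pairs += [(i, i) for i in range(len(nodes))]  # one self-loop per node
    edge_index = [[src for src, _ in pairs], [dst for _, dst in pairs]]
    return x, edge_index


# Single-node request, as POST /predict/single would receive (hypothetical values).
x, edge_index = build_graph(
    [{"id": "ec2-1", "cpu": 95, "memory": 40, "latency": 250,
      "error_rate": 0.02, "request_count": 1200}],
    edges=[],
)
```

The self-loop gives a single-node graph at least one edge, so the same graph layers can run on it as on a full resource graph.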

Workflow 2 — XAI Explanation Request

1. User clicks "Explain" on an alert card in the browser

2. Browser:  GET /anomalies/:id/explain  →  Node.js Lambda 1

3. Node.js Lambda:
   ├─► If alarm ID starts with "aws-alarm-":
   │     └─► Derive attack-specific metrics from alarm name pattern
   │     └─► Compute local metric-deviation SHAP (baseline comparison)
   └─► POST /explain to Python FastAPI with node metrics + edges

4. Python FastAPI /explain:
   ├─► graph_builder.py → builds PyG Data from request payload
   ├─► gnn_model.load_model() → loads GraphSAGE from disk
   │
   ├─► Step 1: GNNExplainer (200 epochs of mask optimisation)
   │     └─► Outputs feature mask [N, F] + edge mask [E]
   │     └─► Identifies which neighbouring nodes matter (important subgraph)
   │
   ├─► Step 2: SHAP KernelExplainer (64 perturbation samples)
   │     ├─► model_predict() isolates target node feature, runs full graph GNN
   │     ├─► Detects degenerate output (variance < 1e-6) → falls back to |z-score|
   │     └─► Blends SHAP × GNNExplainer mask: importance = sqrt(shap * gnn_mask)
   │
   └─► Returns { feature_importance, important_nodes, important_edges }

5. Node.js Lambda:
   ├─► Checks if Python SHAP is degenerate (all equal values) → uses local SHAP
   ├─► Formats response: { shapValues, cascadePath, nlExplanation, actionStatus }
   └─► Caches explanation to MongoDB anomaly record

6. Browser XAI Panel displays:
   ├─► SHAP bar chart (feature importances)
   ├─► Cascade path (important subgraph nodes)
   └─► Natural language explanation

Workflow 3 — LLM Chat / RAG Query

1. User types a question in the AI Chat panel

2. Browser: POST /conversation → Node.js → Python FastAPI POST /chat

3. Python FastAPI:
   ├─► HuggingFace all-MiniLM-L6-v2 embeds the question
   ├─► Pinecone retrieves top-k relevant runbook chunks (dynamic k = 15% of namespace size)
   ├─► LangChain LCEL chain: context → PromptTemplate → Gemini 2.5 → StrOutputParser
   └─► Returns natural language answer

4. Browser: streams response into chat UI
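
The dynamic k in step 3 can be sketched as a small helper. Only the 15% figure comes from the workflow above; rounding up, the lower and upper bounds, and the name `dynamic_top_k` are assumptions. Integer arithmetic is used to avoid floating-point rounding surprises.

```python
def dynamic_top_k(namespace_size: int, percent: int = 15,
                  min_k: int = 1, max_k: int = 50) -> int:
    """Number of chunks to retrieve: `percent` of the namespace, rounded up and bounded."""
    if namespace_size <= 0:
        return 0
    k = (namespace_size * percent + 99) // 100  # integer ceiling of size * percent / 100
    return max(min_k, min(k, max_k, namespace_size))
```

Scaling k with the namespace keeps retrieval proportionate for short runbooks, while an upper bound keeps the prompt from growing without limit on large ones.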

Workflow 4 — WebSocket Connection Lifecycle

1. Browser connects over wss:// (API Gateway WebSocket URL)
   └─► Lambda 3 (wsConnect): DynamoDB.PutItem(connectionId, TTL)

2. Browser stays connected — receives real-time anomaly pushes (Workflow 1, step 5)

3. Browser disconnects (page close / network loss)
   └─► Lambda 4 (wsDisconnect): DynamoDB.DeleteItem(connectionId)

Workflow 5 — Model Training & Deployment

1. Generate synthetic cloud graph data
   └─► python ML/data_generator.py  →  ML/data/graph_data.pt

2. Train GraphSAGE
   └─► python ML/train_gnn.py
         ├─► 200 epochs, Adam, BCELoss with class weighting
         ├─► Saves best checkpoint:  ML/models/gnn_model.pt
         └─► Saves loss curve:      ML/reports/training_loss.png

3. Upload model artifact to S3
   └─► python ML/upload_model.py  →  s3://cloud-automation-gnn-model-dev/gnn_model.pt

4. Python API loads model at startup from S3 (or local disk for dev)

🖥 Frontend Pages & Components

Pages

Page	Route	Description
`LandingPage`	`/`	Marketing landing page with animated background
`LoginPage`	`/login`	JWT auth login form
`SignupPage`	`/signup`	User registration with email verification
`DashboardPage`	`/dashboard`	Main monitoring hub — graph + metrics
`AlertsPage`	`/alerts`	Paginated anomaly alert list with severity tabs

Key Components

Component	Description
`ResourceGraph`	D3 force-directed graph of cloud nodes with live anomaly highlighting
`MetricsChart`	Recharts time-series chart for CPU/Memory/Latency/Error Rate
`XAIPanel`	SHAP bar chart + cascade path + NL explanation panel
`AlertCard`	Individual anomaly card with severity badge + Explain button
`AutomationLog`	Real-time log of automated remediation actions
`WelcomeScreen`	AI-powered welcome after login
`NodeDetailModal`	Click-through modal with per-node live metrics
`ChatInput` / `Message`	AI chat interface with Markdown rendering
`EnhancedFileUpload`	Drag-and-drop document upload for RAG indexing

Real-Time Layer (Frontend)

Primary: WebSocket connection to API Gateway WSS endpoint (src/services/socket.js)
Fallback: HTTP polling every 30s if WebSocket is unavailable
Events multiplexed: anomaly:new, graph:update, automation:log, metrics:update

📡 API Endpoints

Node.js REST API (via API Gateway)

Auth
  POST   /users/register          Register new user
  POST   /users/login             Login (returns JWT cookies)
  POST   /users/logout            Clear auth cookies
  GET    /users/me                Get current user profile

Anomalies
  GET    /anomalies               List anomalies (paginated, filtered)
  GET    /anomalies/stats         Severity distribution stats
  GET    /anomalies/:id           Single anomaly detail
  GET    /anomalies/:id/explain   GNN + SHAP + NL explanation
  PATCH  /anomalies/:id/resolve   Mark anomaly as resolved

Events (Cloud Metric Events)
  GET    /events                  Paginated cloud metric event log

Graph
  GET    /graph                   Resource graph topology (nodes + edges)

Automation
  POST   /automation/trigger      Trigger automated remediation
  GET    /automation/logs         Remediation action log

Conversations (RAG Chat)
  POST   /conversation            Start / continue AI chat session
  GET    /conversation/:id        Retrieve chat history

Documents (RAG Indexing)
  POST   /document/upload         Upload + embed document in Pinecone
  GET    /s3/signed-url           Generate S3 pre-signed download URL

Python FastAPI

  POST   /predict/single          GNN inference (1 node)
  POST   /predict/graph           GNN inference (full graph)
  POST   /explain                 XAI: SHAP + GNNExplainer
  POST   /chat                    LangChain RAG + Gemini
  POST   /document/process        Chunk + embed document → Pinecone

🏗 Infrastructure as Code

Terraform (`infra/`)

Manages core, long-lived AWS resources:

Resource	Description
`aws_s3_bucket.model_bucket`	GNN model artifacts (versioned + AES-256 encrypted)
`aws_s3_bucket.frontend_bucket`	React SPA static hosting
`aws_sqs_queue.event_queue`	Main event queue (long-polling, 1-day retention)
`aws_sqs_queue.event_dlq`	Dead-letter queue (14-day retention, max 3 retries)
`aws_cloudwatch_event_rule.cloudwatch_alarm_rule`	EventBridge rule: CloudWatch ALARM state changes
`aws_cloudwatch_event_rule.metric_poll_rule`	Periodic metric poll schedule
`aws_cloudwatch_event_target.alarm_to_sqs`	Route EventBridge → SQS
`aws_dynamodb_table.anomaly_table`	Anomaly results (GSI on resourceId + severity)
`aws_dynamodb_table.automation_log_table`	Automation action history
`aws_cloudwatch_log_group`	Lambda log groups with retention policies

Serverless Framework (`Node-Backend/serverless.yml`)

Manages Lambda functions and ephemeral AWS resources:

Resource	Description
`WsConnectionsTable`	DynamoDB: active WebSocket connection IDs (PAY_PER_REQUEST, TTL-enabled)
`CloudAutomationSnsTopic`	SNS topic for anomaly alerts
`GnnDataTable`	DynamoDB: GNN result store (composite key)

Plugins: serverless-esbuild (tree-shaking + source maps), serverless-offline (local dev on port 5000)

📁 Directory Structure

CloudAutomationGNN/
│
├── Frontend/                   # React (Vite) SPA
│   └── src/
│       ├── JSX/
│       │   ├── Pages/          # LandingPage, Dashboard, Alerts, Login, Signup
│       │   └── Components/     # ResourceGraph, MetricsChart, XAIPanel, AlertCard...
│       ├── services/           # socket.js (WebSocket + polling client)
│       └── Config/             # API base URL config
│
├── Node-Backend/              # Node.js Lambda (Serverless Framework)
│   ├── src/
│   │   ├── Controllers/        # anomaly, automation, graph, events, user, s3...
│   │   ├── Routes/             # Express route definitions
│   │   ├── Models/             # Mongoose schemas (User, Event, Anomaly, Conversation)
│   │   ├── Handlers/           # WebSocket handlers + SNS subscriber Lambda
│   │   ├── Middlewares/        # JWT auth middleware
│   │   ├── Utils/              # ApiError, ApiResponse, AsyncHandler
│   │   ├── db/                 # MongoDB Atlas connection
│   │   ├── eventProcessor.js   # Lambda 2: EventBridge/SQS → GNN → DynamoDB/SNS
│   │   └── app.js              # Express app factory
│   ├── serverless.yml          # 5 Lambda function definitions + IAM + resources
│   └── lambda.js               # serverless-http adapter
│
├── Python-Backend/            # FastAPI GNN + XAI + LLM service
│   ├── app/
│   │   ├── api/                # FastAPI route handlers
│   │   ├── core/               # Config, model loader (singleton)
│   │   ├── schemas/            # Pydantic request/response models
│   │   ├── services/
│   │   │   ├── gnn_model.py    # GraphSAGE architecture
│   │   │   ├── graph_builder.py# Raw metrics → PyG Data object
│   │   │   ├── gnn_inference.py# Model loading + forward pass
│   │   │   ├── xai_service.py  # GNNExplainer + SHAP orchestration
│   │   │   ├── llm_service.py  # LangChain LCEL pipelines (Gemini)
│   │   │   ├── document_processor.py # PDF chunking + Pinecone indexing
│   │   │   └── s3_service.py   # S3 model artifact I/O
│   │   ├── prompts.py          # LangChain prompt templates
│   │   └── main.py             # FastAPI app entry point
│   └── Dockerfile              # Container definition for Python service
│
├── ML/                        # Offline training pipeline
│   ├── data_generator.py       # Synthetic cloud graph data generator
│   ├── train_gnn.py            # GraphSAGE training script
│   ├── evaluate.py             # Model evaluation utilities
│   ├── upload_model.py         # Upload gnn_model.pt to S3
│   ├── generate_fake_metrics.py# Fake metric generation for testing
│   ├── data/                   # graph_data.pt (generated)
│   ├── models/                 # gnn_model.pt (trained checkpoint)
│   └── reports/                # training_loss.png
│
├── infra/                     # Terraform IaC
│   ├── main.tf                 # Core AWS resources (S3, SQS, DynamoDB, EventBridge)
│   ├── lambda.tf               # Lambda-specific Terraform resources
│   └── variables.tf            # Input variable definitions
│
└── Database/                  # MongoDB schema reference / seed scripts

🚀 Getting Started

Prerequisites

Node.js ≥ 20, Python ≥ 3.10, Docker
AWS CLI configured (aws configure)
Terraform ≥ 1.6
Serverless Framework (npm i -g serverless)
MongoDB Atlas connection URI
Pinecone API key, Google AI API key

1. Provision Infrastructure (Terraform)

cd infra
terraform init
terraform apply

2. Train & Upload the GNN Model

cd ML
pip install -r requirements.txt
python data_generator.py
python train_gnn.py
python upload_model.py

3. Start Python Backend

cd Python-Backend
pip install -r requirements.txt
uvicorn app.main:app --host 0.0.0.0 --port 8000
# Or with Docker:
docker build -t cloud-gnn-python .
docker run -p 8000:8000 --env-file .env cloud-gnn-python

4. Start Node.js Backend (local dev)

cd Node-Backend
cp .env.example .env   # fill in your values
npm install
npx serverless offline
# API available at http://localhost:5000

5. Start Frontend

cd Frontend
npm install
npm run dev
# App available at http://localhost:5173

6. Deploy to AWS

# Deploy Node.js Lambdas
cd Node-Backend
npx serverless deploy --stage dev

# Deploy Frontend to S3
# (see .agents/workflows/deploy-frontend.md)

👥 Team

CloudAutomationGNN — 6th Semester Cloud Computing Project

Built with PyTorch Geometric, AWS Serverless, and a healthy obsession with graph theory.
