KrishChothani/CloudAutomationGNN


☁️ CloudAutomationGNN

AI-Powered Cloud Infrastructure Anomaly Detection & Automated Remediation
A full-stack, cloud-native platform that uses a Graph Neural Network (GNN) to detect anomalies across distributed AWS infrastructure in real time, explain why an anomaly occurred using XAI, and trigger automated remediation.


🌐 Project Overview

CloudAutomationGNN monitors live AWS cloud resources (EC2, Lambda, RDS, ELB, S3) by continuously ingesting CloudWatch metrics. These metrics are modelled as a graph in which each cloud resource is a node and service dependencies are the edges, and a trained GraphSAGE GNN classifies every node as normal or anomalous.

When an anomaly is detected:

  1. The system identifies why (SHAP + GNNExplainer XAI).
  2. It publishes a real-time alert to the frontend via WebSocket.
  3. Automated remediation actions are triggered and logged.
  4. A natural-language explanation is generated by Google Gemini (via LangChain RAG).

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         USER BROWSER (React)                            │
│  Dashboard │ Alerts │ XAI Panel │ Automation Logs │ AI Chat             │
└───────────────────────┬─────────────────────────────────────────────────┘
                        │  REST API + WebSocket (WSS)
                        ▼
┌─────────────────────────────────────────────────────────────────────────┐
│              Node.js Backend  (AWS Lambda via Serverless)               │
│                                                                         │
│   ┌─────────────┐  ┌──────────────────┐  ┌──────────────────────────┐   │
│   │ REST API    │  │ Event Processor  │  │  WebSocket Handlers      │   │
│   │ (Express)   │  │ (EventBridge/SQS)│  │  (connect/disconnect/    │   │
│   │             │  │                  │  │   SNS→WS subscriber)     │   │
│   └──────┬──────┘  └────────┬─────────┘  └──────────────────────────┘   │
└──────────┼──────────────────┼───────────────────────────────────────────┘
           │                  │
           │ HTTP             │ HTTP
           ▼                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                  Python FastAPI Backend (Docker / ECS)                  │
│                                                                         │
│   ┌────────────────┐  ┌──────────────┐  ┌──────────────────────────┐    │
│   │ GNN Inference  │  │ XAI Service  │  │  LLM / RAG Service       │    │
│   │ (GraphSAGE)    │  │ (SHAP +      │  │  (Gemini + LangChain +   │    │
│   │                │  │ GNNExplainer)│  │   Pinecone + HuggingFace)│    │
│   └────────────────┘  └──────────────┘  └──────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
           │                  │
           ▼                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        AWS Cloud Services                               │
│                                                                         │
│  CloudWatch → EventBridge → SQS → Lambda (eventProcessor)               │
│  DynamoDB (anomalies + WS connections)                                  │
│  SNS → Lambda (snsToWebSocket) → API Gateway WebSocket → Browser        │
│  S3 (GNN model artifacts + frontend hosting)                            │
│  MongoDB Atlas (users, events, anomalies, conversations)                │
└─────────────────────────────────────────────────────────────────────────┘

🛠 Tech Stack

Layer            Technology
Frontend         React (Vite), Recharts, D3/Force Graph, CSS, Socket.IO
Node Backend     Node.js, Express, Serverless Framework, AWS Lambda
Python Backend   FastAPI, PyTorch Geometric, SHAP, LangChain, Docker
ML Model         GraphSAGE (PyTorch Geometric)
XAI              GNNExplainer, SHAP KernelExplainer
LLM / RAG        Google Gemini 2.5, LangChain (LCEL), HuggingFace Embeddings
Vector DB        Pinecone
App Database     MongoDB Atlas (Mongoose ODM)
AWS Services     Lambda, API Gateway, EventBridge, SQS, SNS, DynamoDB, CloudWatch, S3
IaC              Terraform (AWS provider ~5.70), Serverless Framework v3+
Auth             JWT (Access + Refresh tokens), HTTP-only cookies

🧠 ML/DL Models

1. GraphSAGE - Anomaly Detector (Primary Model)

File: Python-Backend/app/services/gnn_model.py | Trained: ML/train_gnn.py

GraphSAGE (Graph Sample and Aggregate) is a graph neural network that learns node representations by aggregating features from a node's local neighbourhood. It suits this use case because cloud resources are inherently relational: an EC2 instance's anomalous behaviour is often caused by, or propagated from, upstream Lambda functions or RDS databases.
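
To make the neighbourhood aggregation concrete, here is a minimal pure-Python sketch of a single GraphSAGE layer with a mean aggregator followed by ReLU. It is illustrative only: the function name sage_mean_layer, the identity weights, and the three-node, two-feature graph are assumptions made for this example, and the project itself relies on PyTorch Geometric's SAGEConv rather than hand-written code.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def sage_mean_layer(
    features: Dict[str, Vector],
    neighbours: Dict[str, List[str]],
    w_self: Matrix,
    w_neigh: Matrix,
) -> Dict[str, Vector]:
    """h_v = ReLU(W_self * x_v + W_neigh * mean(x_u for u in N(v)))."""
    out: Dict[str, Vector] = {}
    for node, x in features.items():
        nbrs = neighbours.get(node, [])
        if nbrs:
            mean = [sum(features[u][i] for u in nbrs) / len(nbrs) for i in range(len(x))]
        else:
            mean = [0.0] * len(x)  # isolated node: no neighbour signal
        combined = [a + b for a, b in zip(matvec(w_self, x), matvec(w_neigh, mean))]
        out[node] = [max(0.0, v) for v in combined]  # ReLU
    return out


# Hypothetical dependency graph: the EC2 node depends on a Lambda and an RDS node.
features = {"ec2": [0.9, 0.1], "lambda": [0.2, 0.8], "rds": [0.4, 0.4]}
neighbours = {"ec2": ["lambda", "rds"], "lambda": ["ec2"], "rds": ["ec2"]}
identity = [[1.0, 0.0], [0.0, 1.0]]

hidden = sage_mean_layer(features, neighbours, identity, identity)
```

With identity weights, each node's new representation is simply its own features plus the mean of its neighbours' features, which shows how a spike on an upstream node leaks into the representation of the nodes that depend on it.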

Architecture:
  Input (5 node features)
    → SAGEConv Layer 1  (64 units) → ReLU → Dropout(0.3)
    → SAGEConv Layer 2  (32 units) → ReLU → Dropout(0.3)
    → SAGEConv Layer 3  ( 1 unit ) → Sigmoid
  Output: per-node anomaly probability ∈ [0, 1]

Node features (5 per node):

Feature          Description
cpu              CPU utilization (%)
memory           Memory utilization (%)
latency          Request latency (ms)
error_rate       Fraction of failed requests
request_count    Requests per minute

Training details:

  • Dataset: Synthetic cloud infrastructure graphs (ML/data_generator.py)
  • Loss: Binary Cross-Entropy (BCELoss) with positive class weighting for imbalanced anomaly labels
  • Optimizer: Adam (lr=0.01, weight_decay=5e-4)
  • Epochs: 200
  • Decision threshold: 0.5 (scores above it are labelled ANOMALY)
  • Metrics: Accuracy + F1 Score
  • Saved model: ML/models/gnn_model.pt (uploaded to S3 for inference)
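
As a rough illustration of what positive class weighting does to the loss, the sketch below computes a mean binary cross-entropy in which anomalous samples count more than normal ones. The function name weighted_bce and the n_neg / n_pos weighting heuristic are assumptions for this example; the actual weighting used in ML/train_gnn.py is not stated here.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], labels: Sequence[int], pos_weight: float) -> float:
    """Mean binary cross-entropy where positive samples are scaled by pos_weight."""
    eps = 1e-7  # clamp probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


labels = [0, 0, 0, 1]  # one anomaly among four nodes
pos_weight = labels.count(0) / labels.count(1)  # assumed heuristic: n_neg / n_pos
probs = [0.1, 0.2, 0.1, 0.6]

loss = weighted_bce(probs, labels, pos_weight)
unweighted = weighted_bce(probs, labels, 1.0)
```

Because the single anomalous node is under-predicted (0.6), the weighted loss is larger than the unweighted one, so the optimizer is pushed harder to fix missed anomalies than to polish already-correct normal nodes.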

Severity thresholds (score → label):

Score Range    Severity
≥ 0.85         CRITICAL
≥ 0.65         HIGH
≥ 0.35         MEDIUM
< 0.35         LOW
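
The threshold table above can be expressed as a small lookup. This is a sketch, not the project's code: the function name severity_for and the input validation are assumptions, while the cut-offs are taken directly from the table.

```python
def severity_for(score: float) -> str:
    """Map an anomaly score in [0, 1] to a severity label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score >= 0.85:
        return "CRITICAL"
    if score >= 0.65:
        return "HIGH"
    if score >= 0.35:
        return "MEDIUM"
    return "LOW"


labels = [severity_for(s) for s in (0.91, 0.70, 0.50, 0.10)]
```

If severity is derived from the same score that is compared against the 0.5 decision threshold, any node flagged as an anomaly would be at least MEDIUM under this mapping.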

2. GNNExplainer - Subgraph Explainability

File: Python-Backend/app/services/xai_service.py

GNNExplainer (from torch_geometric.explain) runs 200 optimisation epochs to identify which features and edges (neighbouring nodes) most contributed to the GNN's anomaly prediction for a specific node.

  • Outputs a feature mask [N, F]: one importance weight per feature per node
  • Outputs an edge mask [E]: one importance weight per edge
  • Edges with mask value ≥ 0.5 form the important subgraph shown in the cascade path
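
The edge-mask thresholding described above can be sketched as follows. The helper name important_subgraph, the resource IDs, and the mask values are hypothetical; only the 0.5 cut-off comes from the text.

```python
from typing import List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def important_subgraph(
    edges: Sequence[Edge],
    edge_mask: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[List[Edge], Set[str]]:
    """Keep edges whose mask weight reaches the threshold, plus their endpoints."""
    if len(edges) != len(edge_mask):
        raise ValueError("edge_mask must contain one weight per edge")
    kept = [edge for edge, weight in zip(edges, edge_mask) if weight >= threshold]
    nodes = {node for edge in kept for node in edge}
    return kept, nodes


# Hypothetical explanation for an anomalous EC2 instance.
edges = [("lambda-1", "ec2-1"), ("rds-1", "ec2-1"), ("s3-1", "lambda-1")]
edge_mask = [0.82, 0.50, 0.12]

kept_edges, kept_nodes = important_subgraph(edges, edge_mask)
```

Here the S3 edge falls below the cut-off, so the cascade path would contain only the Lambda and RDS links into the EC2 node.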

3. SHAP KernelExplainer - Feature Attribution

File: Python-Backend/app/services/xai_service.py

SHAP (SHapley Additive exPlanations) KernelExplainer treats the GNN as a black-box function and estimates each feature's marginal contribution to the anomaly score.

  • Background baseline: mean feature vector of all graph nodes (representing a healthy node)
  • Samples: 64 perturbations per explanation
  • Fallback: If model outputs are degenerate (near-zero variance), attribution reverts to the |z-score| deviation from the training baseline, so each feature still receives a distinct, interpretable value
  • Final importance: sqrt(SHAP × GNNExplainer mask), the geometric mean of the two scores
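
The fallback and blending rules above can be sketched in a few lines. This is an interpretation under stated assumptions, not the code in xai_service.py: the function name blended_importance is hypothetical, SHAP values are taken in absolute value before the square root, the 1e-6 variance floor is applied to the model outputs over the perturbed samples, and the blend is applied to whichever attribution (SHAP or z-score) ends up being used.

```python
import math
from statistics import pvariance
from typing import Dict, Sequence


def blended_importance(
    shap_values: Dict[str, float],
    gnn_mask: Dict[str, float],
    model_outputs: Sequence[float],
    node_values: Dict[str, float],
    train_mean: Dict[str, float],
    train_std: Dict[str, float],
    var_floor: float = 1e-6,
) -> Dict[str, float]:
    """Geometric mean of |attribution| and GNNExplainer mask, with z-score fallback."""
    if pvariance(model_outputs) < var_floor:
        # Degenerate model output: SHAP carries no signal, use |z-score| instead.
        attribution = {
            f: abs(node_values[f] - train_mean[f]) / max(train_std[f], 1e-9)
            for f in node_values
        }
    else:
        attribution = {f: abs(v) for f, v in shap_values.items()}
    return {f: math.sqrt(attribution[f] * gnn_mask[f]) for f in attribution}


mask = {"cpu": 0.81, "latency": 0.25}
values = {"cpu": 95.0, "latency": 120.0}
mean = {"cpu": 45.0, "latency": 100.0}
std = {"cpu": 10.0, "latency": 20.0}

normal = blended_importance({"cpu": 0.36, "latency": -0.04}, mask, [0.2, 0.7, 0.9], values, mean, std)
fallback = blended_importance({"cpu": 0.0, "latency": 0.0}, mask, [0.5, 0.5, 0.5], values, mean, std)
```

In the first call the model output varies across perturbations, so SHAP is used; in the second it is constant, so the z-score path takes over and CPU (five standard deviations from the baseline) still ranks above latency.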

4. Google Gemini (LLM) + LangChain RAG - Natural Language Explanations

File: Python-Backend/app/services/llm_service.py

When a user requests an explanation of an anomaly, the SHAP scores and GNN output are fed into a LangChain RAG pipeline:

  1. Retrieval: HuggingFace sentence-transformers/all-MiniLM-L6-v2 embeds the query and retrieves relevant cloud runbook documents from Pinecone vector store
  2. Generation: Google Gemini 2.5 generates a natural-language explanation using the retrieved context + anomaly details
  3. Pipeline routing: LCEL (LangChain Expression Language) routes queries to Smart_Chat, Document_Analysis, Analytical_Insights, or General_Conversation pipelines based on feature_mode

☁️ Cloud Services Used

AWS Services

Service                    Purpose
AWS Lambda                 Serverless compute for REST API, Event Processor, and WebSocket handlers
API Gateway (HTTP)         Exposes all REST endpoints (/anomalies, /events, /graph, etc.)
API Gateway (WebSocket)    Real-time bidirectional communication channel to browser clients
EventBridge                Rule-based event bus; triggers on CloudWatch Alarm State Change events and on a periodic metric-poll schedule
SQS                        Decoupled event queue between EventBridge and the eventProcessor Lambda (batch window: 30s, batch size: 10)
SQS Dead Letter Queue      Captures failed events (max 3 receive attempts, retained 14 days)
SNS                        Publishes anomaly alerts; the snsToWebSocket Lambda subscribes and fans out to all WebSocket connections
DynamoDB                   Stores: (1) GNN anomaly results keyed by partitionKey/sortKey, (2) WebSocket connectionId records with TTL
CloudWatch                 Source of alarm state-change events; DescribeAlarms API polled directly for live alerts
S3                         (1) Model artifact bucket (versioned, AES-256 encrypted, private) that hosts gnn_model.pt; (2) Frontend static hosting bucket (React SPA)
IAM                        Fine-grained Lambda execution roles for SNS publish, DynamoDB CRUD, EventBridge emit, SQS consume, CloudWatch read, API Gateway connections

External Cloud Services

Service               Purpose
MongoDB Atlas         Primary application database; stores Users, Events, Anomalies, Conversations, Documents
Pinecone              Vector database for RAG document retrieval (index: cksfinbot, dynamic namespace per uploaded document)
Google AI (Gemini)    gemini-2.5-flash-lite LLM for natural language anomaly explanations
HuggingFace           all-MiniLM-L6-v2 embedding model (runs locally in Python container)

🔧 Microservices Breakdown

The system is composed of 5 Lambda functions and 1 Python FastAPI service:


Lambda 1 - app (REST API Handler)

Handler: lambda.js → src/app.js
Trigger: API Gateway HTTP (ANY *)
Memory: 512 MB | Timeout: 29s

Express.js application exposed as a Lambda function via serverless-http. Handles all client-facing REST operations.

Routes & Controllers:

Route                          Controller                Purpose
POST /users/register           user.controller           Create user account
POST /users/login              user.controller           Authenticate, issue JWT cookies
GET /anomalies                 anomaly.controller        Paginated anomaly list (DB + live CloudWatch merge)
GET /anomalies/:id/explain     anomaly.controller        GNN + SHAP explanation for a specific anomaly
PATCH /anomalies/:id/resolve   anomaly.controller        Mark anomaly as resolved
GET /anomalies/stats           anomaly.controller        Severity distribution stats
GET /events                    events.controller         Paginated cloud metric events
GET /graph                     graph.controller          Resource graph topology for D3 visualisation
POST /automation/trigger       automation.controller     Trigger GNN-recommended remediation
GET /automation/logs           automation.controller     Automation action history
POST /conversation             conversation.controller   Initiate RAG chat with LLM
POST /document/upload          document.controller       Upload runbook to S3 + index in Pinecone
GET /s3/signed-url             s3.controller             Generate pre-signed S3 URL for file access

Lambda 2 - eventProcessor (GNN Metric Ingestion)

Handler: src/eventProcessor.js
Triggers: EventBridge (cloud-metrics-rule) + SQS batch
Memory: 256 MB | Timeout: 60s

The core real-time data pipeline. Each time an EventBridge event or SQS message arrives:

  1. Parse the raw event (CloudWatch Alarm or custom CloudMetricEvent)
  2. Persist the metric event to MongoDB (hydrates the frontend graph)
  3. Forward to Python GNN service (POST /predict/single)
  4. If anomaly detected → persist to DynamoDB + MongoDB
  5. For critical/high → publish to SNS

Lambda 3 β€” wsConnect

Handler: src/Handlers/websocketHandlers.connect
Trigger: WebSocket $connect route

Stores the new connectionId in the cloud-automation-ws-connections-dev DynamoDB table with a TTL so stale connections are automatically cleaned up.
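
DynamoDB's TTL feature expects a numeric attribute holding a Unix epoch timestamp in seconds, so the stored record could look like the sketch below. The attribute names connectedAt and ttl, the two-hour lifetime, and the helper name build_connection_item are assumptions; the actual handler is written in Node.js.

```python
import time
from typing import Dict, Optional


def build_connection_item(
    connection_id: str,
    ttl_seconds: int = 2 * 60 * 60,
    now: Optional[float] = None,
) -> Dict[str, object]:
    """Build a connection record whose 'ttl' is an epoch-seconds expiry time."""
    issued = int(time.time() if now is None else now)
    return {
        "connectionId": connection_id,
        "connectedAt": issued,
        "ttl": issued + ttl_seconds,  # must be a Number in epoch seconds
    }


item = build_connection_item("abc123=", ttl_seconds=7200, now=1_700_000_000)
```

DynamoDB removes expired items in the background rather than at the exact expiry instant, so consumers that scan the table should still tolerate stale connection IDs.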


Lambda 4 β€” wsDisconnect

Handler: src/Handlers/websocketHandlers.disconnect
Trigger: WebSocket $disconnect route

Removes the connectionId record from DynamoDB when a browser disconnects.


Lambda 5 - snsToWebSocket (Real-Time Alert Fan-out)

Handler: src/Handlers/snsSubscriber.handler
Trigger: SNS topic subscription

When SNS receives a critical- or high-severity anomaly notification:

  1. Scans DynamoDB for all active WebSocket connection IDs
  2. Uses @aws-sdk/client-apigatewaymanagementapi to POST the anomaly payload to every connected browser simultaneously

Python FastAPI Service

Entry: Python-Backend/app/main.py
Deployment: Docker container (Dockerfile included)
Base URL: configured via PYTHON_SERVICE_URL env var

Endpoint                  Description
POST /predict/single      Run GNN inference on one cloud node
POST /predict/graph       Run GNN inference on a full resource graph
POST /explain             Generate SHAP + GNNExplainer XAI explanation
POST /chat                Invoke LangChain RAG with Gemini LLM
POST /document/process    Chunk and embed uploaded documents into Pinecone

Internal services:

File                      Responsibility
gnn_model.py              GraphSAGE model definition + load_model() helper
graph_builder.py          Convert raw node/edge lists → PyTorch Geometric Data object
gnn_inference.py          Load model from S3/disk + run forward pass
xai_service.py            GNNExplainer + SHAP KernelExplainer orchestration
explanation_builder.py    Format XAI outputs into frontend-ready JSON
llm_service.py            LangChain LCEL pipelines (Smart Chat, Document Analysis, etc.)
document_processor.py     PDF/text chunking and embedding for Pinecone indexing
s3_service.py             Upload/download model artifacts from S3
pinecone_service.py       Pinecone client initialisation

🔄 System Workflow (End-to-End)

Workflow 1 - Real-Time Anomaly Detection (Happy Path)

1. AWS CloudWatch detects a metric threshold breach (e.g., EC2 CPU > 90%)
   └─► Emits CloudWatch Alarm State Change event

2. EventBridge rule (cloud-metrics-rule) matches the event
   └─► Routes to SQS event queue  (OR directly triggers Lambda 2)

3. Lambda 2 (eventProcessor) processes the SQS batch
   ├─► Parses & normalises the CloudMetricEvent payload
   ├─► Saves Event record to MongoDB (graph hydration)
   └─► Calls Python FastAPI:  POST /predict/single
         └─► graph_builder.py builds PyG Data object (normalised features + self-loop edges)
         └─► GraphSAGE.forward() → anomaly_score ∈ [0,1]
         └─► Returns { is_anomaly: true, anomaly_score: 0.91, ... }

4. If is_anomaly == true:
   ├─► Persist anomaly to DynamoDB (7-day TTL)
   ├─► Persist anomaly to MongoDB (UI stats)
   └─► If severity ∈ {critical, high}:
         └─► SNS.publish(anomaly alert)

5. SNS triggers Lambda 5 (snsToWebSocket)
   ├─► Scan DynamoDB for all active WS connectionIds
   └─► POST anomaly payload to each connected browser via APIGW Execution API

6. Browser receives WebSocket message → React updates Dashboard in real time
   ├─► ResourceGraph re-renders with anomalous node highlighted (red)
   ├─► AlertCard appears in Alerts page
   └─► MetricsChart spikes visible
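
Step 3 mentions that a single-node request is turned into a graph with normalised features and a self-loop edge. A minimal sketch of that conversion is shown below using plain lists in place of a PyTorch Geometric Data object. The scaling bounds in FEATURE_MAX, the divide-and-clip normalisation, and the function name build_single_node_graph are all assumptions; graph_builder.py may normalise differently.

```python
from typing import Dict, List, Tuple

# Feature order follows the node feature table; bounds are hypothetical.
FEATURE_ORDER = ["cpu", "memory", "latency", "error_rate", "request_count"]
FEATURE_MAX = {
    "cpu": 100.0,
    "memory": 100.0,
    "latency": 5000.0,
    "error_rate": 1.0,
    "request_count": 10000.0,
}


def build_single_node_graph(metrics: Dict[str, float]) -> Tuple[List[List[float]], List[List[int]]]:
    """Return (x, edge_index) for a one-node graph with a single self-loop."""
    row = []
    for name in FEATURE_ORDER:
        scaled = float(metrics.get(name, 0.0)) / FEATURE_MAX[name]
        row.append(min(max(scaled, 0.0), 1.0))  # clip into [0, 1]
    edge_index = [[0], [0]]  # COO layout: source row, target row (0 -> 0)
    return [row], edge_index


x, edge_index = build_single_node_graph(
    {"cpu": 95, "memory": 50, "latency": 250, "error_rate": 0.02, "request_count": 2000}
)
```

The self-loop gives the aggregation step something to average over when no neighbours are supplied, so the same model can score an isolated node and a full graph.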

Workflow 2 - XAI Explanation Request

1. User clicks "Explain" on an alert card in the browser

2. Browser:  GET /anomalies/:id/explain  →  Node.js Lambda 1

3. Node.js Lambda:
   ├─► If alarm ID starts with "aws-alarm-":
   │     └─► Derive attack-specific metrics from alarm name pattern
   │     └─► Compute local metric-deviation SHAP (baseline comparison)
   └─► POST /explain to Python FastAPI with node metrics + edges

4. Python FastAPI /explain:
   ├─► graph_builder.py → builds PyG Data from request payload
   ├─► gnn_model.load_model() → loads GraphSAGE from disk
   │
   ├─► Step 1: GNNExplainer (200 epochs of mask optimisation)
   │     └─► Outputs feature mask [N, F] + edge mask [E]
   │     └─► Identifies which neighbouring nodes matter (important subgraph)
   │
   ├─► Step 2: SHAP KernelExplainer (64 perturbation samples)
   │     ├─► model_predict() isolates target node feature, runs full graph GNN
   │     ├─► Detects degenerate output (variance < 1e-6) → falls back to |z-score|
   │     └─► Blends SHAP × GNNExplainer mask: importance = sqrt(shap * gnn_mask)
   │
   └─► Returns { feature_importance, important_nodes, important_edges }

5. Node.js Lambda:
   ├─► Checks if Python SHAP is degenerate (all equal values) → uses local SHAP
   ├─► Formats response: { shapValues, cascadePath, nlExplanation, actionStatus }
   └─► Caches explanation to MongoDB anomaly record

6. Browser XAI Panel displays:
   ├─► SHAP bar chart (feature importances)
   ├─► Cascade path (important subgraph nodes)
   └─► Natural language explanation

Workflow 3 - LLM Chat / RAG Query

1. User types a question in the AI Chat panel

2. Browser: POST /conversation → Node.js → Python FastAPI POST /chat

3. Python FastAPI:
   ├─► HuggingFace all-MiniLM-L6-v2 embeds the question
   ├─► Pinecone retrieves top-k relevant runbook chunks (dynamic k = 15% of namespace size)
   ├─► LangChain LCEL chain: context → PromptTemplate → Gemini 2.5 → StrOutputParser
   └─► Returns natural language answer

4. Browser: streams response into chat UI
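
The "dynamic k = 15% of namespace size" rule in step 3 could be computed as shown below. Only the 15% figure comes from the text; rounding up, the lower bound of 1, the upper cap of 50, and the function name dynamic_top_k are assumptions added so that very small and very large namespaces behave sensibly.

```python
def dynamic_top_k(namespace_size: int, percent: int = 15, min_k: int = 1, max_k: int = 50) -> int:
    """Pick top-k as a percentage of the chunks stored in a namespace."""
    if namespace_size <= 0:
        return 0  # nothing indexed, nothing to retrieve
    k = (namespace_size * percent + 99) // 100  # integer ceiling of size * percent / 100
    return max(min_k, min(k, max_k, namespace_size))


small = dynamic_top_k(3)      # tiny runbook
medium = dynamic_top_k(40)    # typical document
large = dynamic_top_k(1000)   # large corpus, capped
```

Integer arithmetic is used for the ceiling so that the result does not depend on floating-point rounding of 0.15.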

Workflow 4 - WebSocket Connection Lifecycle

1. Browser connects to WSS://  (API Gateway WebSocket URL)
   └─► Lambda 3 (wsConnect): DynamoDB.PutItem(connectionId, TTL)

2. Browser stays connected and receives real-time anomaly pushes (Workflow 1, step 5)

3. Browser disconnects (page close / network loss)
   └─► Lambda 4 (wsDisconnect): DynamoDB.DeleteItem(connectionId)

Workflow 5 - Model Training & Deployment

1. Generate synthetic cloud graph data
   └─► python ML/data_generator.py  →  ML/data/graph_data.pt

2. Train GraphSAGE
   └─► python ML/train_gnn.py
         ├─► 200 epochs, Adam, BCELoss with class weighting
         ├─► Saves best checkpoint:  ML/models/gnn_model.pt
         └─► Saves loss curve:      ML/reports/training_loss.png

3. Upload model artifact to S3
   └─► python ML/upload_model.py  →  s3://cloud-automation-gnn-model-dev/gnn_model.pt

4. Python API loads model at startup from S3 (or local disk for dev)

🖥 Frontend Pages & Components

Pages

Page             Route         Description
LandingPage      /             Marketing landing page with animated background
LoginPage        /login        JWT auth login form
SignupPage       /signup       User registration with email verification
DashboardPage    /dashboard    Main monitoring hub: graph + metrics
AlertsPage       /alerts       Paginated anomaly alert list with severity tabs

Key Components

Component              Description
ResourceGraph          D3 force-directed graph of cloud nodes with live anomaly highlighting
MetricsChart           Recharts time-series chart for CPU/Memory/Latency/Error Rate
XAIPanel               SHAP bar chart + cascade path + NL explanation panel
AlertCard              Individual anomaly card with severity badge + Explain button
AutomationLog          Real-time log of automated remediation actions
WelcomeScreen          AI-powered welcome after login
NodeDetailModal        Click-through modal with per-node live metrics
ChatInput / Message    AI chat interface with Markdown rendering
EnhancedFileUpload     Drag-and-drop document upload for RAG indexing

Real-Time Layer (Frontend)

  • Primary: WebSocket connection to API Gateway WSS endpoint (src/services/socket.js)
  • Fallback: HTTP polling every 30s if WebSocket is unavailable
  • Events multiplexed: anomaly:new, graph:update, automation:log, metrics:update

📡 API Endpoints

Node.js REST API (via API Gateway)

Auth
  POST   /users/register          Register new user
  POST   /users/login             Login (returns JWT cookies)
  POST   /users/logout            Clear auth cookies
  GET    /users/me                Get current user profile

Anomalies
  GET    /anomalies               List anomalies (paginated, filtered)
  GET    /anomalies/stats         Severity distribution stats
  GET    /anomalies/:id           Single anomaly detail
  GET    /anomalies/:id/explain   GNN + SHAP + NL explanation
  PATCH  /anomalies/:id/resolve   Mark anomaly as resolved

Events (Cloud Metric Events)
  GET    /events                  Paginated cloud metric event log

Graph
  GET    /graph                   Resource graph topology (nodes + edges)

Automation
  POST   /automation/trigger      Trigger automated remediation
  GET    /automation/logs         Remediation action log

Conversations (RAG Chat)
  POST   /conversation            Start / continue AI chat session
  GET    /conversation/:id        Retrieve chat history

Documents (RAG Indexing)
  POST   /document/upload         Upload + embed document in Pinecone
  GET    /s3/signed-url           Generate S3 pre-signed download URL

Python FastAPI

  POST   /predict/single          GNN inference (1 node)
  POST   /predict/graph           GNN inference (full graph)
  POST   /explain                 XAI: SHAP + GNNExplainer
  POST   /chat                    LangChain RAG + Gemini
  POST   /document/process        Chunk + embed document → Pinecone

🏗 Infrastructure as Code

Terraform (infra/)

Manages core, long-lived AWS resources:

Resource                                           Description
aws_s3_bucket.model_bucket                         GNN model artifacts (versioned + AES-256 encrypted)
aws_s3_bucket.frontend_bucket                      React SPA static hosting
aws_sqs_queue.event_queue                          Main event queue (long-polling, 1-day retention)
aws_sqs_queue.event_dlq                            Dead-letter queue (14-day retention, max 3 retries)
aws_cloudwatch_event_rule.cloudwatch_alarm_rule    EventBridge rule: CloudWatch ALARM state changes
aws_cloudwatch_event_rule.metric_poll_rule         Periodic metric poll schedule
aws_cloudwatch_event_target.alarm_to_sqs           Route EventBridge → SQS
aws_dynamodb_table.anomaly_table                   Anomaly results (GSI on resourceId + severity)
aws_dynamodb_table.automation_log_table            Automation action history
aws_cloudwatch_log_group                           Lambda log groups with retention policies

Serverless Framework (Node-Backend/serverless.yml)

Manages Lambda functions and ephemeral AWS resources:

Resource                   Description
WsConnectionsTable         DynamoDB: active WebSocket connection IDs (PAY_PER_REQUEST, TTL-enabled)
CloudAutomationSnsTopic    SNS topic for anomaly alerts
GnnDataTable               DynamoDB: GNN result store (composite key)

Plugins: serverless-esbuild (tree-shaking + source maps), serverless-offline (local dev on port 5000)


📁 Directory Structure

CloudAutomationGNN/
│
├── Frontend/                   # React (Vite) SPA
│   └── src/
│       ├── JSX/
│       │   ├── Pages/          # LandingPage, Dashboard, Alerts, Login, Signup
│       │   └── Components/     # ResourceGraph, MetricsChart, XAIPanel, AlertCard...
│       ├── services/           # socket.js (WebSocket + polling client)
│       └── Config/             # API base URL config
│
├── Node-Backend/              # Node.js Lambda (Serverless Framework)
│   ├── src/
│   │   ├── Controllers/        # anomaly, automation, graph, events, user, s3...
│   │   ├── Routes/             # Express route definitions
│   │   ├── Models/             # Mongoose schemas (User, Event, Anomaly, Conversation)
│   │   ├── Handlers/           # WebSocket handlers + SNS subscriber Lambda
│   │   ├── Middlewares/        # JWT auth middleware
│   │   ├── Utils/              # ApiError, ApiResponse, AsyncHandler
│   │   ├── db/                 # MongoDB Atlas connection
│   │   ├── eventProcessor.js   # Lambda 2: EventBridge/SQS → GNN → DynamoDB/SNS
│   │   └── app.js              # Express app factory
│   ├── serverless.yml          # 5 Lambda function definitions + IAM + resources
│   └── lambda.js               # serverless-http adapter
│
├── Python-Backend/            # FastAPI GNN + XAI + LLM service
│   ├── app/
│   │   ├── api/                # FastAPI route handlers
│   │   ├── core/               # Config, model loader (singleton)
│   │   ├── schemas/            # Pydantic request/response models
│   │   ├── services/
│   │   │   ├── gnn_model.py    # GraphSAGE architecture
│   │   │   ├── graph_builder.py# Raw metrics → PyG Data object
│   │   │   ├── gnn_inference.py# Model loading + forward pass
│   │   │   ├── xai_service.py  # GNNExplainer + SHAP orchestration
│   │   │   ├── llm_service.py  # LangChain LCEL pipelines (Gemini)
│   │   │   ├── document_processor.py # PDF chunking + Pinecone indexing
│   │   │   └── s3_service.py   # S3 model artifact I/O
│   │   ├── prompts.py          # LangChain prompt templates
│   │   └── main.py             # FastAPI app entry point
│   └── Dockerfile              # Container definition for Python service
│
├── ML/                        # Offline training pipeline
│   ├── data_generator.py       # Synthetic cloud graph data generator
│   ├── train_gnn.py            # GraphSAGE training script
│   ├── evaluate.py             # Model evaluation utilities
│   ├── upload_model.py         # Upload gnn_model.pt to S3
│   ├── generate_fake_metrics.py# Fake metric generation for testing
│   ├── data/                   # graph_data.pt (generated)
│   ├── models/                 # gnn_model.pt (trained checkpoint)
│   └── reports/                # training_loss.png
│
├── infra/                     # Terraform IaC
│   ├── main.tf                 # Core AWS resources (S3, SQS, DynamoDB, EventBridge)
│   ├── lambda.tf               # Lambda-specific Terraform resources
│   └── variables.tf            # Input variable definitions
│
└── Database/                  # MongoDB schema reference / seed scripts

🚀 Getting Started

Prerequisites

  • Node.js ≥ 20, Python ≥ 3.10, Docker
  • AWS CLI configured (aws configure)
  • Terraform ≥ 1.6
  • Serverless Framework (npm i -g serverless)
  • MongoDB Atlas connection URI
  • Pinecone API key, Google AI API key

1. Provision Infrastructure (Terraform)

cd infra
terraform init
terraform apply

2. Train & Upload the GNN Model

cd ML
pip install -r requirements.txt
python data_generator.py
python train_gnn.py
python upload_model.py

3. Start Python Backend

cd Python-Backend
pip install -r requirements.txt
uvicorn app.main:app --host 0.0.0.0 --port 8000
# Or with Docker:
docker build -t cloud-gnn-python .
docker run -p 8000:8000 --env-file .env cloud-gnn-python

4. Start Node.js Backend (local dev)

cd Node-Backend
cp .env.example .env   # fill in your values
npm install
npx serverless offline
# API available at http://localhost:5000

5. Start Frontend

cd Frontend
npm install
npm run dev
# App available at http://localhost:5173

6. Deploy to AWS

# Deploy Node.js Lambdas
cd Node-Backend
npx serverless deploy --stage dev

# Deploy Frontend to S3
# (see .agents/workflows/deploy-frontend.md)

👥 Team

CloudAutomationGNN: 6th Semester Cloud Computing Project

Built with PyTorch Geometric, AWS Serverless, and a healthy obsession with graph theory.
