Decision Optimization · Business Operations · AIOps — Unified Intelligence
EntropyOPStack is a next-generation AI-Driven Operations Stack that unifies three critical operational domains into a single intelligent platform:
| Layer | Domain | Focus |
|---|---|---|
| Decision Optimization | Operations Research | Resource allocation, scheduling, capacity planning |
| Business Operations | Business Ops | KPIs, user journey, conversion, business impact |
| AIOps | IT Operations | Infrastructure monitoring, diagnostics, automation |
By leveraging a hierarchical multi-agent collaboration system, EntropyOPStack bridges the gap between business strategy, operational execution, and technical infrastructure.
Unlike traditional infrastructure monitoring tools, EntropyOPStack provides:
- Business-centric perspective - Understanding how technical systems impact business outcomes
- Decision optimization - AI-driven resource allocation, scheduling, and capacity planning
- End-to-end visibility - From user behavior to infrastructure components across 5 architectural layers
- Intelligent automation - Self-healing capabilities with human-in-the-loop oversight
Decision Optimization: intelligent resource allocation, scheduling, and capacity planning.
Business Operations: business metrics monitoring, user journey analysis, and impact assessment.
AIOps: infrastructure monitoring, intelligent diagnostics, and automated remediation.
- Hierarchical Agent Collaboration - Multi-level agent system with Global Supervisors, Team Supervisors, and specialized Workers for complex problem-solving
- 5-Layer Topology Visualization - Business Scenario → Business Flow → Application → Middleware → Infrastructure
- Business-Tech Correlation - Link technical metrics to business KPIs and quantify impact
- AI-Powered Discovery - Automated infrastructure discovery from Kubernetes, Cloud, Prometheus, and distributed tracing
- Intelligent Diagnostics - Real-time collaborative analysis with streaming AI thought processes
- Decision Optimization - Resource allocation, capacity planning, and cost optimization recommendations
- Report Generation - Automated diagnostic and business impact reports with customizable templates
┌─────────────────────────────────────────────────────────┐
│ Global Supervisor │
│ (Orchestrates overall analysis) │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│Team Supervisor│ │Team Supervisor│ │Team Supervisor│
│ (Database) │ │ (Service) │ │ (Gateway) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
▼ ▼ ▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Worker│ │Worker│ │Worker│ │Worker│ │Worker│ │Worker│
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘
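The hierarchy in the diagram can be read as a roll-up: Workers report findings, each Team Supervisor totals its team, and the Global Supervisor totals the teams. The following is a minimal sketch of that roll-up, not the project's implementation. `AgentLite`, `TeamLite`, `sumFindings`, and `rollUp` are hypothetical names that borrow only the `findings` shape from the `Agent` type in this README.

```typescript
// Minimal stand-ins for the README's Agent/Team shapes (only the fields used here).
type Findings = { warnings: number; critical: number };

interface AgentLite {
  id: string;
  role: 'Global Supervisor' | 'Team Supervisor' | 'Worker';
  findings: Findings;
}

interface TeamLite {
  supervisor: AgentLite;
  members: AgentLite[];
}

// Hypothetical helper: total the findings reported by a list of agents.
function sumFindings(agents: AgentLite[]): Findings {
  let warnings = 0;
  let critical = 0;
  for (const agent of agents) {
    warnings += agent.findings.warnings;
    critical += agent.findings.critical;
  }
  return { warnings, critical };
}

// Hypothetical roll-up: workers -> team supervisor -> global total.
function rollUp(teams: TeamLite[]): { perTeam: Record<string, Findings>; overall: Findings } {
  const perTeam: Record<string, Findings> = {};
  const overall: Findings = { warnings: 0, critical: 0 };
  for (const team of teams) {
    const teamTotal = sumFindings(team.members);
    perTeam[team.supervisor.id] = teamTotal;
    overall.warnings += teamTotal.warnings;
    overall.critical += teamTotal.critical;
  }
  return { perTeam, overall };
}

// Illustrative data only.
const worker = (id: string, warnings: number, critical: number): AgentLite => ({
  id,
  role: 'Worker',
  findings: { warnings, critical },
});
const supervisor = (id: string): AgentLite => ({
  id,
  role: 'Team Supervisor',
  findings: { warnings: 0, critical: 0 },
});

const summary = rollUp([
  { supervisor: supervisor('database'), members: [worker('db-1', 2, 0), worker('db-2', 1, 1)] },
  { supervisor: supervisor('gateway'), members: [worker('gw-1', 0, 1)] },
]);
console.log(summary.overall);
```

Keeping the per-team totals alongside the overall total mirrors the diagram: a reader can see which team contributed the critical findings before drilling into individual workers.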
| Layer | Description | Examples |
|---|---|---|
| Business Scenario | End-user facing scenarios | Web Storefront, Mobile App |
| Business Flow | Traffic routing & orchestration | API Gateway, CDN, Load Balancer |
| Business Application | Core business services | Auth Service, Payment API, Order Service |
| Middleware | Supporting infrastructure | Redis Cache, Kafka, RabbitMQ |
| Infrastructure | Foundational resources | PostgreSQL, MongoDB, K8s Cluster |
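Because the layers are ordered top to bottom, nodes can be arranged row by row. The feature list mentions a barycenter-based layout; the sketch below shows one downward pass of the standard barycenter heuristic, where each node is placed at the mean position of its neighbours in the row above. It is an illustration under assumptions, not the code in `TopologyGraph.tsx`: `LayoutNode`, `LayoutLink`, and `barycenterSweep` are hypothetical names, and the layer keys are taken from the `layer` field of `TopologyNode`.

```typescript
type Layer = 'scenario' | 'flow' | 'application' | 'middleware' | 'infrastructure';

// Top-to-bottom order of the five layers.
const LAYER_ORDER: Layer[] = ['scenario', 'flow', 'application', 'middleware', 'infrastructure'];

interface LayoutNode { id: string; layer: Layer }
interface LayoutLink { source: string; target: string }

// One downward barycenter pass: reorder each row by the average index
// of each node's neighbours in the row directly above it.
function barycenterSweep(nodes: LayoutNode[], links: LayoutLink[]): Record<Layer, string[]> {
  const rows = {} as Record<Layer, string[]>;
  for (const layer of LAYER_ORDER) {
    rows[layer] = nodes.filter((n) => n.layer === layer).map((n) => n.id);
  }

  for (let i = 1; i < LAYER_ORDER.length; i++) {
    const above = rows[LAYER_ORDER[i - 1]];
    const current = rows[LAYER_ORDER[i]];

    // Barycenter of a node; nodes with no neighbour above keep their current index.
    const score = (id: string, fallback: number): number => {
      const positions: number[] = [];
      for (const link of links) {
        const other =
          link.source === id ? link.target : link.target === id ? link.source : undefined;
        if (other === undefined) continue;
        const pos = above.indexOf(other);
        if (pos >= 0) positions.push(pos);
      }
      if (positions.length === 0) return fallback;
      return positions.reduce((a, b) => a + b, 0) / positions.length;
    };

    rows[LAYER_ORDER[i]] = current
      .map((id, idx) => ({ id, idx, key: score(id, idx) }))
      .sort((a, b) => a.key - b.key || a.idx - b.idx) // ties keep input order
      .map((entry) => entry.id);
  }
  return rows;
}

// Illustrative data using examples from the table above.
const rows = barycenterSweep(
  [
    { id: 'web-storefront', layer: 'scenario' },
    { id: 'mobile-app', layer: 'scenario' },
    { id: 'cdn', layer: 'flow' },
    { id: 'api-gateway', layer: 'flow' },
    { id: 'auth-service', layer: 'application' },
  ],
  [
    { source: 'mobile-app', target: 'cdn' },
    { source: 'web-storefront', target: 'api-gateway' },
    { source: 'api-gateway', target: 'auth-service' },
  ],
);
console.log(rows.flow);
```

In this example the flow row is reordered so that `api-gateway` sits under `web-storefront` and `cdn` under `mobile-app`, which removes the edge crossing present in the input order. Practical layouts usually alternate downward and upward passes several times.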
- System health overview with real-time metrics
- Agent activity monitoring
- Quick access to recent diagnostics
- Interactive graph visualization with D3.js
- Drag-and-drop node positioning with layout caching
- 5-layer visual separation with color coding
- Link creation between nodes
- Barycenter-based automatic layout algorithm
- Detailed resource views with metadata editing
- Associated topology tracking
- Agent team assignment
- Analysis history with session replay
- Agent configuration (model, temperature, system instructions)
- Worker deployment from specialized templates
- Real-time status monitoring
- Findings aggregation (warnings/critical issues)
- Connectors: K8s, Cloud, Prometheus, Trace sources
- Inbox: Approval workflow for discovered nodes/links
- Scanner: AI-powered infrastructure exploration
- Hierarchical task delegation
- Real-time log streaming with agent attribution
- Click-to-focus: Navigate to agent messages in log stream
- Abort/resume capabilities
- AI-powered report creation
- Multiple report types: Diagnosis, Audit, Performance, Security
- Markdown content with Mermaid diagram support
- PDF export capability
- Context-aware AI assistant
- Resource & topology attachments
- Standalone mode (accessible via the `?view=chat` URL parameter)
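The standalone chat mode is selected by a query-string parameter. A minimal sketch of how such a check might look, assuming the app inspects the query string at startup; `isStandaloneChat` is a hypothetical helper and the real app may read the parameter differently.

```typescript
// Hypothetical helper: true when the query string requests the standalone chat view.
// URLSearchParams accepts the string with or without the leading "?".
function isStandaloneChat(search: string): boolean {
  return new URLSearchParams(search).get('view') === 'chat';
}

// In a browser this would typically be called with window.location.search.
console.log(isStandaloneChat('?view=chat'));
```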
| Category | Technologies |
|---|---|
| Frontend Framework | React 18.2, TypeScript 5.8 |
| Build Tool | Vite 6.2 |
| Styling | Tailwind CSS |
| Visualization | D3.js 7.9, Recharts 2.12, Mermaid 10.9 |
| AI Integration | Google Gemini AI (@google/genai) |
| Icons | Lucide React |
| Markdown | react-markdown, remark-gfm |
| PDF Export | html2pdf.js |
| Testing | Playwright |
├── App.tsx # Main application component (990 lines)
├── types.ts # TypeScript type definitions (213 lines)
├── index.html # Entry HTML
├── index.css # Global styles
├── components/
│ ├── TopologyGraph.tsx # D3-based topology visualization (1538 lines)
│ ├── ResourceDetailView.tsx # Resource detail page (1121 lines)
│ ├── SubGraphManagement.tsx # Topology list management (621 lines)
│ ├── GlobalChat.tsx # AI chat interface (541 lines)
│ ├── TopologiesManagement.tsx # Topology CRUD (552 lines)
│ ├── ReportDetailView.tsx # Report viewing/editing (502 lines)
│ ├── AgentManagement.tsx # Agent configuration (454 lines)
│ ├── DiscoveryInbox.tsx # Discovery approval queue (409 lines)
│ ├── ResourceManagement.tsx # Resource list (346 lines)
│ ├── DiscoveryManagement.tsx # Discovery connectors (311 lines)
│ ├── ReportManagement.tsx # Report list (296 lines)
│ ├── ScannerView.tsx # AI scanner interface (279 lines)
│ ├── Dashboard.tsx # Main dashboard (278 lines)
│ ├── PromptManagement.tsx # Prompt templates (276 lines)
│ ├── ReportTemplateManagement.tsx # Report templates (272 lines)
│ ├── SubGraphCanvas.tsx # Topology canvas (257 lines)
│ ├── ModelManagement.tsx # AI model config (257 lines)
│ ├── ToolManagement.tsx # Agent tools (226 lines)
│ ├── AuthPage.tsx # Authentication (210 lines)
│ ├── AgentConfigModal.tsx # Agent config modal (167 lines)
│ ├── AgentHierarchy.tsx # Agent tree view (146 lines)
│ ├── SettingsModal.tsx # App settings (115 lines)
│ └── LogStream.tsx # Real-time log display (105 lines)
├── services/
│ ├── mockData.ts # Mock data & initial state (2048 lines)
│ └── geminiService.ts # Gemini AI integration (445 lines)
└── public/ # Static assets
Total Source Code: ~13,000 lines of TypeScript/React
// Agent System
interface Agent {
id: string;
name: string;
role: 'Global Supervisor' | 'Team Supervisor' | 'Worker' | 'Scouter';
status: 'IDLE' | 'THINKING' | 'WORKING' | 'COMPLETED' | 'WAITING' | 'ERROR';
specialty?: string;
findings: { warnings: number; critical: number };
config?: AgentConfig;
}
interface Team {
id: string;
resourceId: string;
name: string;
supervisor: Agent;
members: Agent[];
}
// Topology System
interface TopologyNode {
id: string;
label: string;
type: 'Database' | 'Service' | 'Gateway' | 'Cache' | 'Infrastructure';
layer?: 'scenario' | 'flow' | 'application' | 'middleware' | 'infrastructure';
properties?: Record<string, string>;
}
interface TopologyLink {
source: string;
target: string;
type?: 'call' | 'deployment' | 'dependency' | 'inferred';
confidence?: number;
}
// Discovery System
interface DiscoverySource {
id: string;
name: string;
type: 'K8s' | 'Cloud' | 'Prometheus' | 'Trace';
endpoint: string;
status: 'Connected' | 'Error' | 'Scanning';
}
- Node.js 18+
- npm or yarn
- Gemini API Key (for AI features)
# Clone the repository
git clone https://github.com/your-org/entropyops.git
cd entropyops
# Install dependencies
npm install
# Configure environment
cp .env.example .env.local
# Edit .env.local and set GEMINI_API_KEY
# Start development server
npm run dev
# Build for production
npm run build
# Preview production build
npm run preview

| Variable | Description | Required |
|---|---|---|
| `GEMINI_API_KEY` | Google Gemini API key for AI features | Yes |
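Since the key is required, failing early with a clear message is usually friendlier than a failed API call later. The guard below is a sketch only: `requireGeminiKey` is a hypothetical helper, and `env` stands in for whatever object exposes the loaded variables in your setup.

```typescript
// Hypothetical guard: return the trimmed key or throw a descriptive error.
function requireGeminiKey(env: Record<string, string | undefined>): string {
  const key = env.GEMINI_API_KEY?.trim();
  if (!key) {
    throw new Error('GEMINI_API_KEY is not set; AI features need it (see .env.local).');
  }
  return key;
}

// Illustrative call with a placeholder value.
console.log(requireGeminiKey({ GEMINI_API_KEY: 'placeholder-key' }).length > 0);
```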
- Navigate to Topologies and select or create a topology
- Click Diagnose Topology to enter the diagnosis view
- Enter your diagnostic query (e.g., "Analyze system health and identify bottlenecks")
- Click EXECUTE to start the hierarchical agent analysis
- Watch real-time collaboration in the log stream
- Click on any agent in the left hierarchy to jump to their messages
- Generate a report when analysis completes
- Go to Resources to view all infrastructure nodes
- Click a resource to see details, associated topologies, and agent teams
- Edit metadata or add workers to the assigned team
- View analysis history and replay previous sessions
- Configure Connectors (K8s, Cloud, Prometheus, Trace)
- Run scans to discover new infrastructure
- Review discoveries in the Inbox
- Approve or reject discovered nodes and links
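When reviewing the Inbox, the optional `type` and `confidence` fields on `TopologyLink` can help decide which discoveries to look at first. The sketch below sorts links into two piles for a human reviewer. The README does not say how confidence is used, so the 0.8 threshold, the treatment of missing confidence as 0, and the `triageLinks` helper are all assumptions; nothing here approves a link automatically.

```typescript
// Mirrors the TopologyLink fields defined in this README.
interface DiscoveredLink {
  source: string;
  target: string;
  type?: 'call' | 'deployment' | 'dependency' | 'inferred';
  confidence?: number;
}

// Hypothetical triage: inferred links below the threshold go to closer review.
function triageLinks(
  links: DiscoveredLink[],
  threshold = 0.8, // assumed cut-off, not a project default
): { suggestApprove: DiscoveredLink[]; needsReview: DiscoveredLink[] } {
  const suggestApprove: DiscoveredLink[] = [];
  const needsReview: DiscoveredLink[] = [];
  for (const link of links) {
    const confident = link.type !== 'inferred' || (link.confidence ?? 0) >= threshold;
    (confident ? suggestApprove : needsReview).push(link);
  }
  return { suggestApprove, needsReview };
}

// Illustrative data only.
const triaged = triageLinks([
  { source: 'api-gateway', target: 'auth-service', type: 'call' },
  { source: 'order-service', target: 'kafka', type: 'inferred', confidence: 0.92 },
  { source: 'order-service', target: 'redis', type: 'inferred', confidence: 0.4 },
  { source: 'payment-api', target: 'postgresql', type: 'inferred' },
]);
console.log(triaged.needsReview.map((l) => l.target));
```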
EntropyOPStack is evolving into an integrated AI platform spanning Operations Research, Business Operations, and IT Operations.
┌─────────────────────────────────────────────────────────────────────────┐
│ AI-Driven Unified Operations Platform │
├─────────────────────┬─────────────────────┬─────────────────────────────┤
│ Decision Optim. │ Business Ops │ AIOps │
│ Resource Planning │ Business Metrics │ IT Operations │
├─────────────────────┴─────────────────────┴─────────────────────────────┤
│ AI Decision Engine │
├─────────────────────────────────────────────────────────────────────────┤
│ Unified Data Platform │
└─────────────────────────────────────────────────────────────────────────┘
- Infrastructure topology visualization (5-layer model)
- Multi-agent collaborative diagnostics
- Resource discovery & management
- AI-powered report generation
- Global chat assistant
| Feature | Description | Status |
|---|---|---|
| Alert Management | Alert aggregation, noise reduction, correlation, storm suppression | 🔲 Planned |
| Anomaly Detection | AI-based anomaly detection for metrics, logs, and traces | 🔲 Planned |
| Root Cause Analysis | Fault propagation analysis, automatic root cause identification | 🔲 Planned |
| Change Risk Assessment | Pre-change impact analysis, risk scoring, rollback suggestions | 🔲 Planned |
| Capacity Forecasting | Resource usage trend prediction, scaling recommendations | 🔲 Planned |
| SLO/SLA Management | Service level objectives, error budget tracking | 🔲 Planned |
| Incident Management | Incident lifecycle, on-call scheduling, escalation policies | 🔲 Planned |
| Knowledge Base | Fault case library, solution recommendations, similar issue matching | 🔲 Planned |
| Feature | Description | Status |
|---|---|---|
| Business Metrics Dashboard | Real-time KPI monitoring (GMV, conversion rate, user activity) | 🔲 Planned |
| User Journey Analysis | End-to-end behavior paths, conversion funnels, churn analysis | 🔲 Planned |
| Business-Tech Correlation | Causal relationship between business and technical metrics | 🔲 Planned |
| Business Impact Assessment | Quantify technical failures' business impact (revenue loss, affected users) | 🔲 Planned |
| A/B Experiment Platform | Experiment design, traffic allocation, effect analysis | 🔲 Planned |
| Business Health Score | Multi-dimensional business health scoring and early warning | 🔲 Planned |
| Cost Allocation | Cloud resource cost allocation by business line/product | 🔲 Planned |
| Operations Calendar | Promotions, events scheduling linked with system protection | 🔲 Planned |
| Feature | Description | Status |
|---|---|---|
| Intelligent Scheduling | Optimal scheduling strategies for tasks, resources, and traffic | 🔲 Planned |
| Resource Optimization | Cloud resource configuration optimization, cost-performance balance | 🔲 Planned |
| Predictive Auto-scaling | Elastic scaling decisions based on business forecasting | 🔲 Planned |
| Multi-objective Optimization | Balance cost, performance, and availability trade-offs | 🔲 Planned |
| Simulation & What-if Analysis | Architecture change simulation, scenario analysis | 🔲 Planned |
| Resource Planning | Mid-to-long term resource procurement and configuration planning | 🔲 Planned |
| On-call Optimization | Optimal on-call and duty scheduling | 🔲 Planned |
| Feature | Description | Status |
|---|---|---|
| Self-healing System | Automated fault detection, decision, and remediation | 🔲 Planned |
| Continuous Optimization | Ongoing system tuning based on feedback loops | 🔲 Planned |
| Knowledge Accumulation | Learning from incidents and building organizational knowledge | 🔲 Planned |
┌─────────────────────────────────────────────────────────────────────────┐
│ Unified Data Platform │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Metrics │ │ Logs │ │ Traces │ │ Business │ │
│ │ │ │ │ │ │ │ Events │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ └─────────────┴─────────────┴─────────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Unified Data Model │ │
│ │ (Entity-Relation │ │
│ │ Knowledge Graph)│ │
│ └──────────┬──────────┘ │
└───────────────────────────┼─────────────────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────────────────┐
│ AI Decision Engine │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Anomaly │ │ Root Cause │ │ Predictive │ │
│ │ Detection │ │ Analysis │ │ Warning │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Optimization │ │ Simulation │ │ Automated │ │
│ │ Suggestions │ │ & What-if │ │ Decisions │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└───────────────────────────┬─────────────────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────────────────┐
│ Automation Execution Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Self-healing │ │ Ticket │ │ Change │ │
│ │ Actions │ │ Workflow │ │ Execution │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
├── Data Layer
│ ├── Collectors
│ │ ├── MetricCollector # Prometheus/InfluxDB integration
│ │ ├── LogCollector # ELK/Loki integration
│ │ ├── TraceCollector # Jaeger/Zipkin integration
│ │ └── BusinessEventCollector
│ ├── Storage
│ │ ├── TimeSeriesDB
│ │ ├── GraphDB # Knowledge graph
│ │ └── VectorDB # Semantic search
│ └── Governance
│
├── Intelligence Layer
│ ├── Detection Engine
│ │ ├── AnomalyDetector
│ │ ├── PatternMatcher
│ │ └── ThresholdManager
│ ├── Analysis Engine
│ │ ├── RootCauseAnalyzer
│ │ ├── ImpactAnalyzer
│ │ └── CorrelationEngine
│ ├── Prediction Engine
│ │ ├── CapacityForecaster
│ │ ├── TrendPredictor
│ │ └── RiskScorer
│ └── Optimization Engine
│ ├── ResourceOptimizer
│ ├── CostOptimizer
│ └── ScheduleOptimizer
│
├── Decision Layer
│ ├── Policy Engine
│ ├── Approval Workflow
│ └── Human-in-the-loop
│
└── Execution Layer
├── Orchestration
├── Runbook Execution
└── Change Management
Contributions are welcome! Please read our contributing guidelines before submitting PRs.
This project is licensed under the MIT License - see the LICENSE file for details.
AI-Driven Operations Stack
From Decision Optimization to Business Operations and AIOps


