feat: Scale support for 4500+ nodes per NICo controller #1864

@deepak-poornachandra

Is this a new feature, an enhancement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

An upcoming deployment requires a single NICo controller domain to manage 4,500+ endpoints (compute trays). To operate reliably in production at this density, NICo needs to be optimized for the increased concurrency in three key areas:

  1. Ingestion Throughput
  2. Core provisioning pipelines (including DHCP/PXE allocation, firmware updates, and OS installation)
  3. Control plane efficiency, so that the controller handles 4,500+ active endpoints under peak load

Feature Description

  1. Use Case 1: Bulk Endpoint Ingestion & Registration
    Scenario: A site operator initiates a full-site bring-up or power cycle, causing 4,500+ nodes (compute trays and ancillary devices) to come onto the network and check in concurrently.

    Success Criteria: 100% of endpoints are discovered, validated, and registered in the database with no dropped connections, discovery timeouts, or deadlocks. The operator does not need to stagger or pace rack power sequences.
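As an illustration of how the Use Case 1 criterion could be exercised in a test harness, here is a minimal sketch that simulates 4,500 simultaneous check-ins and verifies that every one ends up registered. It is not NICo code: `register_endpoint`, `bulk_check_in`, and the `MAX_IN_FLIGHT` cap are hypothetical names and values chosen for the example, and the in-memory dictionary stands in for the real database.

```python
import asyncio

TOTAL_ENDPOINTS = 4500  # target from this issue
MAX_IN_FLIGHT = 256     # assumed cap on concurrent registrations (not a NICo setting)


async def register_endpoint(endpoint_id: int, registry: dict) -> None:
    """Hypothetical stand-in for discovery, validation, and a database write."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    registry[endpoint_id] = "registered"


async def bulk_check_in(total: int, max_in_flight: int) -> dict:
    """Accept every check-in at once, but bound how many are processed concurrently."""
    registry: dict = {}
    gate = asyncio.Semaphore(max_in_flight)

    async def handle(endpoint_id: int) -> None:
        async with gate:
            await register_endpoint(endpoint_id, registry)

    # All endpoints arrive together; none are dropped, they only wait for a slot.
    await asyncio.gather(*(handle(i) for i in range(total)))
    return registry


def run_simulation(total: int = TOTAL_ENDPOINTS, max_in_flight: int = MAX_IN_FLIGHT) -> int:
    """Return the number of endpoints that ended up registered."""
    return len(asyncio.run(bulk_check_in(total, max_in_flight)))
```

The point of the semaphore is that back-pressure is applied inside the controller rather than by the operator staggering rack power sequences. A real test would replace the stub with actual check-in traffic and compare the registered count against the number of endpoints powered on.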

  2. Use Case 2: Coordinated Fleet-Wide Provisioning Workflows
    Scenario: The site operator triggers a synchronized lifecycle operation (such as an OS provisioning cycle, firmware flash, or tenant sanitization) across all 4,500+ managed hosts.

    Success Criteria: The parallel provisioning pipelines run reliably across the entire fleet and complete predictably within the site's scheduled maintenance window, with no hosts stuck in intermediate lifecycle states and no manual retries required.
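To make the maintenance-window criterion in Use Case 2 concrete, the following back-of-the-envelope sketch estimates how long a fleet-wide operation takes if hosts are processed in fixed-size waves. The wave model, the function names, and every number in the usage note are assumptions for illustration; the issue does not specify a parallelism level, a per-host duration, or a window length.

```python
import math


def estimated_duration_minutes(hosts: int, parallelism: int, minutes_per_host: float) -> float:
    """Estimate wall-clock time when hosts are processed in waves of `parallelism`."""
    if hosts < 0 or parallelism <= 0 or minutes_per_host < 0:
        raise ValueError("hosts and minutes_per_host must be >= 0, parallelism must be > 0")
    waves = math.ceil(hosts / parallelism)  # the last wave may be partially filled
    return waves * minutes_per_host


def fits_window(hosts: int, parallelism: int, minutes_per_host: float,
                window_minutes: float) -> bool:
    """True if the estimated duration fits inside the maintenance window."""
    return estimated_duration_minutes(hosts, parallelism, minutes_per_host) <= window_minutes
```

For example, with 4,500 hosts, an assumed 500 hosts in flight at a time, and an assumed 20 minutes per host, the estimate is 9 waves of 20 minutes, or 180 minutes. Dropping the assumed parallelism to 100 gives 45 waves and 900 minutes, which shows how directly the achievable concurrency determines whether a window is met.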

  3. Use Case 3: API Responsiveness Under Peak Telemetry Load
    Scenario: The data center is running at full capacity, with 4,500+ active nodes continuously streaming high-frequency health metrics, heartbeats, and sensor data back to the controller domain.

    Success Criteria: The north-bound API and CLI remain fully responsive. External automation and orchestration tools can query the inventory or run audits without command lag, API timeouts, or database write-lock contention.
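One way to turn "fully responsive" in Use Case 3 into something measurable is to collect API latency samples while telemetry load is applied and check a high percentile against a target. The sketch below uses the nearest-rank percentile method. The function names, the choice of the 99th percentile, and the 500 ms default target are assumptions for the example only and are not requirements stated in this issue.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def meets_latency_target(samples_ms: Sequence[float], pct: float = 99.0,
                         target_ms: float = 500.0) -> bool:
    """True if the chosen percentile of observed latencies is within the target."""
    return percentile(samples_ms, pct) <= target_ms
```

A percentile is used rather than the mean because a small number of stalled requests, for example ones blocked behind a database write lock, would be hidden by an average but show up clearly in the tail.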

Describe your ideal solution

No response

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow NCX Infra Controller's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Metadata

Assignees

No one assigned

Labels

feature: Feature (deprecated - use issue type, but it's needed for reporting now)
No fields configured for Enhancement.

Projects

Status

Triage

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
