feat: Scale support for 4500+ nodes per NICo controller

### Is this a new feature, an enhancement, or a change to existing functionality?

New Feature

### How would you describe the priority of this feature request

High

### Please provide a clear description of problem this feature solves

An upcoming deployment introduces a target requirement of 4,500+ endpoints (compute trays) managed under a single NICo controller domain. To keep production operations reliable at this density, NICo needs to be optimized for the increased concurrency in three key areas:
1. Ingestion throughput
2. Core provisioning pipelines (including DHCP/PXE allocation, firmware updates, and OS installation)
3. Control plane efficiency, so that the controller handles 4,500 active endpoints under peak load


### Feature Description

1. Use Case 1: Bulk Endpoint Ingestion & Registration
Scenario: A site operator initiates a full-site bring-up or power cycle, causing 4,500+ nodes (compute trays and ancillary devices) to hit the network and check in concurrently.

Success Criteria: 100% of endpoints are successfully discovered, validated, and registered in the database without dropped connections, discovery timeouts, or system deadlocks. The operator does not need to stagger rack power sequences to protect the controller.
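One way to make this criterion testable is a small load harness that fires every check-in at once and counts registrations. The sketch below is a simulation only, not NICo code: `register_endpoint`, the in-memory `registry`, the `tray-NNNN` naming, and the `max_in_flight` limit are all hypothetical stand-ins for the real discovery path. The semaphore models the controller admitting a bounded amount of work internally while callers arrive unstaggered.

```python
from __future__ import annotations

import asyncio


async def register_endpoint(
    endpoint_id: str, registry: set[str], limit: asyncio.Semaphore
) -> bool:
    """Hypothetical stand-in for a single endpoint check-in."""
    async with limit:
        # Yield to the event loop, standing in for network/database I/O.
        await asyncio.sleep(0)
        registry.add(endpoint_id)
        return True


async def simulate_bring_up(node_count: int, max_in_flight: int = 256) -> tuple[int, int]:
    """Launch all check-ins concurrently; return (registered, failures)."""
    registry: set[str] = set()
    limit = asyncio.Semaphore(max_in_flight)
    results = await asyncio.gather(
        *(register_endpoint(f"tray-{i:04d}", registry, limit) for i in range(node_count)),
        return_exceptions=True,  # collect errors instead of aborting the batch
    )
    failures = sum(1 for r in results if r is not True)
    return len(registry), failures


def run_bring_up(node_count: int = 4500) -> tuple[int, int]:
    """Synchronous entry point for the simulation."""
    return asyncio.run(simulate_bring_up(node_count))
```

A real harness would replace the body of `register_endpoint` with actual discovery traffic and would also record timeouts and dropped connections, which this sketch does not model.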

2. Use Case 2: Coordinated Fleet-Wide Provisioning Workflows
Scenario: The site operator triggers a synchronized lifecycle operation (such as an OS provisioning cycle, firmware flash, or tenant sanitization) across all 4,500+ managed hosts.

Success Criteria: The parallel provisioning pipelines execute reliably across the entire high-density footprint, completing deterministically within the site's scheduled maintenance window without getting stuck in intermediate lifecycle states or requiring manual retries.
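The "stuck in intermediate lifecycle states" condition can be expressed as a dwell-time check: any host that has sat in a non-terminal state longer than an allowed budget is flagged. The following is a minimal sketch under assumptions; the state names, the `HostStatus` record, and the dwell budget are hypothetical and do not come from NICo.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical terminal states; a real controller defines its own lifecycle.
TERMINAL_STATES = {"ready", "failed"}


@dataclass(frozen=True)
class HostStatus:
    host_id: str
    state: str
    entered_at: float  # seconds (any monotonic clock) when the state was entered


def find_stuck_hosts(
    hosts: list[HostStatus], now: float, max_dwell_seconds: float
) -> list[str]:
    """Return sorted IDs of hosts that exceeded the dwell budget in a non-terminal state."""
    return sorted(
        h.host_id
        for h in hosts
        if h.state not in TERMINAL_STATES and now - h.entered_at > max_dwell_seconds
    )
```

For example, with a 900-second budget, a host that entered `firmware_flash` at t=0 is reported at t=1000, while a host already in `ready` is never reported regardless of how long it has been there.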

3. Use Case 3: API Responsiveness Under Peak Telemetry Load
Scenario: The data center is running at full capacity, with 4,500+ active nodes continuously streaming high-frequency health metrics, heartbeats, and sensor data back to the controller domain.

Success Criteria: The north-bound API and CLI remain fully responsive. External automation and orchestration tools can query the inventory or execute audits without experiencing command lag, API timeouts, or database write-lock contention.
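"Responsive" becomes measurable once it is tied to a latency percentile and a budget. The sketch below uses the nearest-rank method to compute a percentile over recorded API latencies and compares it with a budget. The budget value and the choice of p99 are assumptions for illustration; this request does not specify a numeric target.

```python
from __future__ import annotations

import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_budget(samples_ms: list[float], budget_ms: float, pct: float = 99.0) -> bool:
    """True if the chosen percentile of observed latencies is within the budget."""
    return percentile(samples_ms, pct) <= budget_ms
```

Collecting `samples_ms` while the simulated fleet streams telemetry, then asserting `meets_latency_budget(...)`, would turn this use case into a pass/fail check.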

### Describe your ideal solution

_No response_

### Describe any alternatives you have considered

_No response_

### Additional context

_No response_

### Code of Conduct

- [x] I agree to follow NCX Infra Controller's Code of Conduct
- [x] I have searched the [open feature requests](https://github.com/NVIDIA/ncx-infra-controller-core/issues?q=is%3Aopen+is%3Aissue+type:Enhancement,Epic) and have found no duplicates for this feature request

feat: Scale support for 4500+ nodes per NICo controller #1864
