Commit 9c6d00f

ambled and claude committed
Remove outdated files, trim CHANGELOG and CLAUDE.md (v0.5.8)
Removed from repo:
- docs/archive/: all historical planning, completion, and design docs
- antctl_*.stdout/stderr: debug capture files
- *.out, scratch*, wnm-cron.log: runtime output and scratch files
- .coverage: should not be committed
- .gitignore updated to exclude these going forward

CHANGELOG: collapsed verbose root-cause/line-number entries to one-liners; recent v0.5.x entries kept readable. 1150 -> 403 lines.

CLAUDE.md: removed concurrent ops examples, dead file references, anm migration detail, key config defaults table. 297 -> 99 lines.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 68af32c commit 9c6d00f

45 files changed

Lines changed: 149 additions & 9150 deletions


.coverage

-52 KB
Binary file not shown.

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,9 @@ alembic
 *.pyc
 colony.db
 __pycache__
+.pytest_cache
 wnm.egg-info
 src/wnm.egg-info/
 alembic/versions/__pycache__/
+.coverage
+save

CHANGELOG.md

Lines changed: 92 additions & 834 deletions
Large diffs are not rendered by default.

CLAUDE.md

Lines changed: 53 additions & 252 deletions
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 ## Project Overview

-Weave Node Manager (wnm) is a Python application for managing Autonomi nodes on Linux and macOS systems. It's an Alpha-stage Python port of the `anm` (autonomic node manager) tool. The system automatically manages node lifecycle: creating, starting, stopping, upgrading, and removing nodes based on system resource thresholds (CPU, memory, disk, network I/O, load average).
+Weave Node Manager (wnm) is a Python application for managing Autonomi nodes on Linux and macOS systems. The system automatically manages node lifecycle: creating, starting, stopping, upgrading, and removing nodes based on system resource thresholds (CPU, memory, disk, network I/O, load average).

 **Platforms**:
 - **Linux**: systemd or setsid for process management, UFW for firewall (root or user-level)
@@ -13,286 +13,87 @@ Weave Node Manager (wnm) is a Python application for managing Autonomi nodes on

 ## Development Environment

-### macOS Development (Native)
-
-**On macOS, you can run and test natively** using launchd for process management:
+### macOS (Native)

 ```bash
-# Run tests natively on macOS
 ./scripts/test-macos.sh
-
-# Or run tests directly
+# or
 pytest tests/ -v -m "not linux_only"
-
-# Run application in dry-run mode
 python3 -m wnm --dry_run
-
-# Initialize with rewards address
-python3 -m wnm --init --rewards_address 0xYourEthereumAddress
 ```

-**macOS Notes**:
-- Uses `~/Library/Application Support/autonomi/` for data
-- Uses `~/Library/Logs/autonomi/` for logs
-- Nodes managed via launchd (`~/Library/LaunchAgents/`)
-- No root/sudo required
+- Data: `~/Library/Application Support/autonomi/`
+- Logs: `~/Library/Logs/autonomi/`
+- Nodes: `~/Library/LaunchAgents/`
 - Some tests marked `@pytest.mark.linux_only` will be skipped

-### Linux Development (Docker)
-
-**On Linux, use Docker for systemd/UFW testing**:
+### Linux (Docker)

 ```bash
-# Run tests in Docker container
-./scripts/test.sh
-
-# Interactive development shell in Docker
-./scripts/dev.sh
-
-# Inside the container, you can run:
-pytest tests/ -v # Run all tests
-python3 -m wnm --dry_run # Run application in dry-run mode
+./scripts/test.sh # run tests
+./scripts/dev.sh # interactive shell
 ```

-See `DOCKER-DEV.md` for complete Docker development environment documentation.
-
 ## Development Commands

-### Setup Development Environment
 ```bash
-# Create virtual environment
-python3 -m venv .venv
-. .venv/bin/activate
+# Setup
+python3 -m venv .venv && . .venv/bin/activate
+pip3 install -r requirements.txt -r requirements-dev.txt

-# Install dependencies
-pip3 install -r requirements.txt
+# Format
+black src/ && isort src/

-# Install development dependencies
-pip3 install -r requirements-dev.txt
-```
-
-### Code Formatting
-```bash
-# Format code with black
-black src/
-
-# Sort imports with isort
-isort src/
-```
-
-### Build and Package
-```bash
-# Build the package
+# Build
 python3 -m build
-
-# Upload to TestPyPi (maintainers only)
-twine upload --verbose --repository testpypi dist/*
-# Upload to PyPI (maintainers only)
 twine upload dist/*
 ```

-### Running the Application
-```bash
-# Run directly from source
-python3 -m wnm
-
-# With command-line options
-python3 -m wnm --dry_run --init --migrate_anm
-
-# Entry point after installation
-wnm
-```
-
 ## Architecture

 ### Core Flow (`__main__.py`)
-The application runs as a single-execution cycle (typically invoked via cron every minute):
+Single-execution cycle, invoked via cron each minute:

-1. **Locking**: Creates platform-specific lock file to prevent concurrent runs
-   - macOS: `~/Library/Application Support/autonomi/wnm_active`
-   - Linux (root): `/var/antctl/wnm_active`
-   - Linux (user): `~/.local/share/autonomi/wnm_active`
-2. **Configuration**: Loads machine config from SQLite database or initializes from `anm` migration
-3. **Metrics Collection**: Gathers system metrics (CPU, memory, disk, I/O, load average) and node statuses
-4. **Decision Engine**: `choose_action()` determines what actions to take based on thresholds and concurrency limits
-5. **Action Execution**: Performs operations (add/remove/upgrade/restart nodes, or idle) via ProcessManager, respecting concurrent operation limits
+1. **Locking**: Platform-specific lock file prevents concurrent runs
+2. **Configuration**: Loads machine config from SQLite (`colony.db`)
+3. **Metrics Collection**: CPU, memory, disk, I/O, load average + node statuses
+4. **Decision Engine**: Plans actions based on thresholds and concurrency limits
+5. **Action Execution**: Performs operations via ProcessManager
 6. **Cleanup**: Removes lock file and exits

-### Database Models (`models.py`)
-Two SQLAlchemy ORM models backed by SQLite (`colony.db`):
-
-- **Machine**: Single row (id=1) storing cluster configuration (thresholds, ports, paths, addresses)
-- **Node**: One row per Autonomi node with status, version, ports, metrics, timestamps
-
-### Configuration System (`config.py`)
-Multi-layer configuration priority (highest to lowest):
-1. Command-line arguments (via `configargparse`)
-2. Environment variables from `.env` or `/var/antctl/config`
-3. Config files (`~/.local/share/wnm/config`, `~/wnm/config`)
-4. Database-stored machine config
-5. Defaults
-
-Configuration loading happens at module import, creating global `options`, `machine_config`, and database session factory `S`.
-
-### Process Management (`process_managers/`)
-Platform-specific process managers handle node lifecycle via factory pattern:
-
-- **SystemdManager** (`systemd_manager.py`): Linux root-level, uses systemd services
-- **LaunchdManager** (`launchd_manager.py`): macOS, uses launchd agents
-- **SetsidManager** (`setsid_manager.py`): Linux user-level, background processes
-
-All managers implement the `ProcessManager` base class with these methods:
-- `create_node()`: Creates directories, copies binary, starts node
-- `start_node()`, `stop_node()`, `restart_node()`: Controls node lifecycle
-- `get_status()`: Returns node process status
-- `remove_node()`: Stops node and cleans up files
-- `survey_nodes()`: Discovers existing nodes
-
-### Firewall Management (`firewall/`)
-Platform-specific firewall managers:
-
-- **UfwManager** (`ufw_manager.py`): Linux, manages UFW firewall rules
-- **NullFirewallManager** (`null_manager.py`): macOS and fallback, no-op implementation
-
-### Node Management (`utils.py`)
-Legacy helper functions (being phased out in favor of ProcessManager abstraction):
-
-- **Metrics**: `read_node_metrics()`, `read_node_metadata()` - Polls node HTTP endpoints
-- **Binary**: `get_latest_binary_version()` - Checks for new antnode versions
+### Key Modules
+- **`models.py`**: SQLAlchemy ORM — `Machine` (single row, cluster config) and `Node` (one row per node)
+- **`config.py`**: Multi-layer config — CLI args → env vars → config files → DB → defaults; global `options`, `machine_config`, session factory `S` created at import
+- **`decision_engine.py`**: `DecisionEngine` class; `_compute_features()` + `plan_actions()`
+- **`executor.py`**: `ActionExecutor` class; executes planned actions and all `--force_action` variants
+- **`utils.py`**: Metrics polling (`read_node_metrics()`, `read_node_metadata()`), counter/state updates
+- **`process_managers/`**: Factory pattern — `SystemdManager`, `LaunchdManager`, `SetsidManager`, `AntctlManager`, `AntctlZenManager`, `DockerManager` all implement `ProcessManager` base
+- **`firewall/`**: `UfwManager` (Linux) and `NullFirewallManager` (macOS/fallback)

 ### Node States (`common.py`)
-Nodes transition through states tracked in the database:
-- `RUNNING`: Node responding to metrics port
-- `STOPPED`: Node not responding
-- `UPGRADING`: In upgrade delay period
-- `RESTARTING`: In restart delay period
-- `REMOVING`: In removal delay period before deletion
-- `DEAD`: Node with missing root directory, marked for immediate removal
-- `DISABLED`: Excluded from management
-
-### Decision Engine Logic
-The `choose_action()` function implements a priority-based decision tree:
-
-1. **System Reboot Detection**: If system start time changed, resurvey all nodes
-2. **Dead Node Cleanup**: Remove nodes with missing directories immediately
-3. **Version Updates**: Update version field for nodes missing it
-4. **Delay Expiration**: Wait for in-progress operations (RESTARTING, UPGRADING)
-5. **Resource Pressure Removal**: Remove youngest nodes if CPU/Mem/HD/IO/Load exceed removal thresholds
-6. **Upgrades**: Upgrade oldest running nodes with outdated versions (only when not removing)
-7. **Node Addition**: Start stopped nodes or create new nodes when under capacity and resource thresholds allow
-8. **Idle Survey**: Update all node metrics when no action needed
-
-## Migration from anm
-
-When `--init --migrate_anm` flags are used:
-1. Disables anm by removing `/etc/cron.d/anm`
-2. Reads `/var/antctl/config` and `/usr/bin/anms.sh` for configuration
-3. Scans `/etc/systemd/system/antnode*.service` files
-4. Imports discovered nodes into SQLite database
-5. Takes over management from anm
-
-## Port Assignment Scheme
-
-- **Node Ports**: `{PortStart} * 1000 + {node_id}` (default: 55000 + id)
-- **Metrics Ports**: `13000 + {node_id}`
-
-Port ranges cannot be changed after initialization.
-
-## Key Configuration Parameters
-
-Resource thresholds control when nodes are added/removed (use snake_case on command line):
-- `--cpu_less_than` / `--cpu_remove`: CPU percentage thresholds for add/remove decisions (default: 70% / 80%)
-- `--mem_less_than` / `--mem_remove`: Memory percentage thresholds (default: 70% / 80%)
-- `--hd_less_than` / `--hd_remove`: Disk usage percentage thresholds (default: 70% / 80%)
-- `--desired_load_average` / `--max_load_average_allowed`: Load average thresholds
-- `--node_cap`: Maximum number of nodes allowed (default: 50)
-- `--delay_start` / `--delay_restart` / `--delay_upgrade` / `--delay_remove`: Minutes to wait in transitional states
-- `--node_storage`: Root directory for node data (platform-specific defaults)
-- `--rewards_address`: Ethereum address for node rewards (required)
-
-**Platform-Specific Default Paths**:
-- macOS: `~/Library/Application Support/autonomi/node/`
-- Linux (root): `/var/antctl/services/`
-- Linux (user): `~/.local/share/autonomi/node/`
-
-## Concurrent Operations
-
-WNM supports running multiple node operations simultaneously to better utilize powerful hardware. This feature allows aggressive scaling on machines with high capacity.
-
-### Configuration Parameters
-
-**Per-Operation Limits:**
-- `--max_concurrent_upgrades` (default: 1): Maximum nodes upgrading simultaneously
-- `--max_concurrent_starts` (default: 1): Maximum nodes starting/restarting simultaneously
-- `--max_concurrent_removals` (default: 1): Maximum nodes being removed simultaneously
-
-**Global Limit:**
-- `--max_concurrent_operations` (default: 1): Total concurrent operations across all types
-
-The effective limit is MIN(per_operation_limit, remaining_global_capacity).
-
-### Examples
-
-**Conservative (default):**
-```bash
-wnm --max_concurrent_upgrades 1 \
-    --max_concurrent_starts 1 \
-    --max_concurrent_operations 1
-```
-
-**Aggressive (powerful machine):**
-```bash
-wnm --max_concurrent_upgrades 4 \
-    --max_concurrent_starts 4 \
-    --max_concurrent_removals 2 \
-    --max_concurrent_operations 8
-```
-
-**Very aggressive (high-end server):**
-```bash
-wnm --max_concurrent_upgrades 10 \
-    --max_concurrent_starts 10 \
-    --max_concurrent_removals 5 \
-    --max_concurrent_operations 20
-```
-
-### Behavior
-
-WNM will **aggressively scale to capacity** each cycle:
-- If upgrade limit is 4 and 2 nodes are upgrading, WNM will start 2 more upgrades immediately
-- Operations respect both per-type limits AND global limit
-- Dead node removals always take priority and ignore limits
-- Each action selects a different node (no duplicate operations on same node)
-
-### Capacity Constraints
-
-Operations are limited by actual node availability:
-- **Upgrades**: Limited by nodes needing upgrade
-- **Starts**: Limited by stopped nodes available
-- **Adds**: Limited by node cap - total nodes
-- **Removes**: Limited by stopped/running nodes available
-
-Example: If `max_concurrent_starts=4` but only 2 stopped nodes exist, WNM will:
-1. Start 2 stopped nodes
-2. Add 2 new nodes (if under node cap)
+`RUNNING` `STOPPED` `RESTARTING` `UPGRADING` `REMOVING` → deleted; `DEAD` (missing dir, immediate removal); `DISABLED` (excluded from management)
+
+### Decision Engine Priority
+1. Reboot detection → resurvey all nodes
+2. Dead node cleanup (immediate)
+3. Version field updates
+4. Delay expiration for transitional states
+5. Resource pressure removal (CPU/Mem/HD/IO/Load)
+6. Upgrades (only when `--enable_upgrade` passed; blocked during removals)
+7. Node addition (stopped nodes first, then create new)
+8. Idle survey
+
+### Port Assignment
+- Node ports: `port_start * 1000 + node_id` (default: 55000+)
+- Metrics ports: `metrics_port_start * 1000 + node_id` (default: 13000+)
+- Cannot be changed after `--init`
+
+### Concurrent Operations
+Per-type limits (`--max_concurrent_upgrades/starts/removals`, default 1) combined with a global cap (`--max_concurrent_operations`, default 1). Effective limit = MIN(per-type, remaining global). See `docs/USER-GUIDE-PART3.md` for configuration examples.

 ## Important Constraints
-
-- Concurrent operations respect configured limits (defaults: 1 operation per cycle, configurable for powerful machines)
-- Nodes are added/removed based on the "youngest" (most recent `age` timestamp)
-- Upgrades only proceed when no removals are pending
-- Database has single Machine row (id=1); updates apply to entire cluster
-- Lock file prevents concurrent execution
-- **Platform-specific requirements**:
-  - Linux (root): Requires sudo for systemd and ufw
-  - Linux (user): No sudo required, uses setsid
-  - macOS: No sudo required, uses launchd
-
-## Platform Support
-
-See `MACOS-SUPPORT-PLAN.md` for detailed macOS implementation roadmap.
-See `PLATFORM-SUPPORT.md` for platform-specific details on:
-- Process management (systemd, launchd, setsid)
-- Firewall management (UFW, null)
-- Path conventions
-- Binary management and upgrades
+- Single `Machine` row (id=1); all config updates apply cluster-wide
+- Nodes selected for removal by "youngest" (`age` timestamp)
+- Upgrades skipped unless `--enable_upgrade` is passed (antnode self-upgrades by default)
+- Linux root mode requires sudo for systemd and UFW
+- `--port_start`, `--metrics_port_start`, and `--process_manager` are immutable after `--init`
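
Reviewer note: the port and concurrency rules stated in the trimmed CLAUDE.md can be sketched in Python as below. This is a minimal illustration and not code from the wnm repository. The function names are hypothetical, the defaults `port_start=55` and `metrics_port_start=13` are inferred from the documented `55000+` and `13000+` ranges, and treating "remaining global" as the global cap minus operations already in progress is an assumption.

```python
# Hypothetical helpers illustrating the rules documented in CLAUDE.md.
# None of these names come from the wnm codebase.


def node_port(node_id: int, port_start: int = 55) -> int:
    """Node port = port_start * 1000 + node_id (default range 55000+)."""
    return port_start * 1000 + node_id


def metrics_port(node_id: int, metrics_port_start: int = 13) -> int:
    """Metrics port = metrics_port_start * 1000 + node_id (default range 13000+)."""
    return metrics_port_start * 1000 + node_id


def effective_limit(per_type_limit: int, global_limit: int, in_progress: int) -> int:
    """Effective limit = MIN(per-type, remaining global).

    Assumption: remaining global capacity is the global cap minus the
    number of operations already in progress, floored at zero.
    """
    remaining_global = max(0, global_limit - in_progress)
    return min(per_type_limit, remaining_global)


if __name__ == "__main__":
    print(node_port(7))               # 55007
    print(metrics_port(7))            # 13007
    print(effective_limit(4, 8, 6))   # 2: the global cap is the binding constraint
    print(effective_limit(1, 1, 0))   # 1: documented defaults allow one operation
```

With the documented defaults (every limit set to 1), the sketch yields at most one operation per cycle, which matches the conservative behavior described in the removed "Concurrent Operations" section.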

antctl_add.stderr

Lines changed: 0 additions & 27 deletions
This file was deleted.
