Commit c66055e: "fix timing"

1 parent 3f5bb84

3 files changed: 113 additions & 4 deletions

CLAUDE.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What is RHNode

RHNode is a Python library for deploying deep learning models as REST endpoints. It handles job queuing, resource allocation (GPU/CPU/memory), file transfers, caching, and inter-node dependencies. Used at CAAI (Clinical AI) at Rigshospitalet, Copenhagen.
## Common Commands

### Running Tests
```bash
# Start the test cluster (in the tests/ directory)
cd tests && docker compose up --build

# In another terminal, run all tests
pytest

# Run a specific test file
pytest test_docker.py

# Run a specific test
pytest test_docker.py::test_finish_and_caching
```
### Running a Node Locally (Development)
```bash
# Start a single node server
uvicorn add:app --port 8010

# Access at http://localhost:8010/add
```
### CLI Tool
```bash
# Run a job via CLI
rhjob <node_name> input_key=value input_file=/path/to/file.nii.gz

# Get help for a specific node
rhjob <node_name> -h
```
### Docker
```bash
# Build and run with docker compose
docker compose up --build

# Build and push to DockerHub
docker compose build --push
```
## Architecture

### Core Components

**RHNode** ([rhnode/rhnode.py](rhnode/rhnode.py)) - Base class for creating node servers. Subclass this to create a new node:
- Define `input_spec` and `output_spec` as Pydantic BaseModel classes
- Override the `process(inputs, job)` static method with inference logic
- Set resource requirements: `required_gb_gpu_memory`, `required_num_threads`, `required_gb_memory`
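The bullets above can be sketched as a minimal subclass. Everything here is illustrative: `AddNode`, `AddInputs`/`AddOutputs`, and the stubbed `RHNode` base are stand-ins so the sketch runs on its own; the real base class lives in [rhnode/rhnode.py](rhnode/rhnode.py), and real spec classes are Pydantic `BaseModel`s with `FilePath` fields rather than plain dataclasses.

```python
from dataclasses import dataclass
from pathlib import Path


class RHNode:
    """Stand-in for rhnode.RHNode so this sketch is self-contained."""


@dataclass
class AddInputs:  # real nodes: pydantic.BaseModel, with FilePath for files
    scalar: int


@dataclass
class AddOutputs:
    out_file: Path


class AddNode(RHNode):
    input_spec = AddInputs
    output_spec = AddOutputs
    # Resource requirements consulted by the manager
    required_gb_gpu_memory = 1
    required_num_threads = 1
    required_gb_memory = 2

    @staticmethod
    def process(inputs, job):
        # GPU work must target job.device only, and every output file
        # must live under job.directory so cleanup can find it.
        out_path = Path(job.directory) / "result.txt"
        out_path.write_text(str(inputs.scalar + 1))
        return AddOutputs(out_file=out_path)
```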
**RHJob** ([rhnode/rhjob.py](rhnode/rhjob.py)) - Client for submitting jobs to nodes. Handles file uploads, polling for completion, and downloading results.

**RHProcess** ([rhnode/rhprocess.py](rhnode/rhprocess.py)) - Server-side job execution. Manages the job lifecycle: file uploads → queue → run process → cleanup.

**RHManager** ([nodes/manager/manager.py](nodes/manager/manager.py)) - Resource queue manager. Allocates GPU/CPU/memory across jobs using a priority queue. Multiple RHNode clusters can link together via `RH_OTHER_ADDRESSES`.

**Common types** ([rhnode/common.py](rhnode/common.py)) - Shared Pydantic models: `JobMetaData`, `JobStatus`, `QueueRequest`, etc.
### Data Flow
70+
71+
1. Client creates `RHJob` with inputs → POST to node creates `RHProcess`
72+
2. File inputs uploaded one by one to `.inputs/<job_id>/`
73+
3. Job enters resource queue via manager (unless `resources_included=True`)
74+
4. When resources available, `process()` runs in subprocess with allocated `job.device`
75+
5. Outputs saved to `.outputs/<job_id>/`, optionally cached to `.cache/`
76+
6. Client downloads file outputs, job auto-deleted after 10 minutes
77+
### Key Patterns

- **FilePath fields**: Input/output Pydantic models use `FilePath` for files that need to be transferred. Non-serializable data (numpy arrays, NIfTI images) must be saved to disk.
- **job.directory**: All output files must be saved within `job.directory` for proper cleanup.
- **job.device**: The allocated CUDA device ID. Only use this device for GPU operations.
- **Child jobs**: Use `RHJob.from_parent_job()` to spawn dependent jobs that inherit priority and resource settings.
- **Caching**: Results are cached by input hash. Use `check_cache=False` during development.
### Environment Variables

For the manager node:
- `RH_NAME`: Host identifier
- `RH_GPU_MEM`: GPU memory per device, comma-separated for multiple GPUs (e.g., "8,8,12")
- `RH_NUM_THREADS`: Available CPU threads
- `RH_MEMORY`: Available RAM in GB
- `RH_OTHER_ADDRESSES`: Comma-separated addresses of other managers (e.g., "titan6:9050,peyo:9050")

For nodes:
- `RH_EMAIL_ON_ERROR`: Email recipient for error notifications
- `RH_MODE`: Set to "dev" for development mode
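The comma-separated formats above can be illustrated with a short parsing sketch (the values are the examples from this file; the parsing code is illustrative, not the actual manager implementation):

```python
import os

# Example manager configuration, using the sample values from above
os.environ["RH_GPU_MEM"] = "8,8,12"
os.environ["RH_OTHER_ADDRESSES"] = "titan6:9050,peyo:9050"

# Comma-separated GPU memory: one entry per CUDA device, in GB
gpu_mem_gb = [int(g) for g in os.environ["RH_GPU_MEM"].split(",")]
print(gpu_mem_gb)  # → [8, 8, 12]

# Comma-separated addresses of the other managers to link with
other_managers = os.environ["RH_OTHER_ADDRESSES"].split(",")
print(other_managers)  # → ['titan6:9050', 'peyo:9050']
```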
## Versioning

Version format: `major.minor.patch`
- Development branches: `dev/v1.X.0`
- Alpha releases: `v1.X.0-a.N`
- When tagging docker images for rh-library, omit the hyphen: `hdbet-v1.1.0_rhnode1.2.0a.1`
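The hyphen-omission rule for rh-library tags amounts to a string substitution; `hdbet-v1.1.0` is the example image name from above, and the snippet itself is just illustrative:

```python
rhnode_version = "1.2.0-a.1"   # an alpha release, v1.2.0-a.1
image = "hdbet-v1.1.0"         # model image being tagged for rh-library

# Omit the hyphen from the rhnode version suffix when building the tag:
tag = f"{image}_rhnode{rhnode_version.replace('-', '')}"
print(tag)  # → hdbet-v1.1.0_rhnode1.2.0a.1
```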
## Test Data

Tests require a NIfTI file at `tests/data/mr.nii.gz`. This can be any valid NIfTI file.

rhnode/rhjob.py

Lines changed: 4 additions & 3 deletions
@@ -35,6 +35,7 @@ def _finish_job(self, job):
             "job with inputs",
             str(job.input_data),
             "encountered an error or was cancelled, ignoring",
+            str(error),
         )
     else:
         raise
@@ -51,10 +52,10 @@ def _check_and_update_active_jobs(self):
             IDs_to_remove.append(ID)

         for ID in IDs_to_remove:
-            remaining_jobs = len(self.jobs) + len(self.started_jobs)
-            print(
-                "Finished job:", ID, f"completed:{remaining_jobs}/{self.n_total_jobs}"
+            completed_jobs = self.n_total_jobs - (
+                len(self.jobs) + len(self.started_jobs)
             )
+            print("Finished job:", ID, f"\n{completed_jobs}/{self.n_total_jobs}\n")
             del self.started_jobs[ID]

         while len(self.jobs) > 0 and len(self.started_jobs) <= self.queue_length:
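The arithmetic in this hunk is easy to check in isolation: before the fix, the progress counter printed the number of jobs still pending as if it were the completed count; after the fix, the completed count is derived from the total (the numbers below are illustrative):

```python
n_total_jobs = 10
queued = 2    # len(self.jobs): jobs still waiting
started = 1   # len(self.started_jobs): jobs currently running

# The old code printed this value, which is actually the pending count:
pending = queued + started
print(pending)  # → 3

# The fixed code reports jobs actually completed:
completed_jobs = n_total_jobs - (queued + started)
print(f"{completed_jobs}/{n_total_jobs}")  # → 7/10
```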

rhnode/rhprocess.py

Lines changed: 1 addition & 1 deletion
@@ -219,7 +219,7 @@ def get_runtime_str(self):
         if self.time_finished is None:
             dt2 = datetime.fromtimestamp(time.time())
         else:
-            dt2 = datetime.fromtimestamp(self.time_started)
+            dt2 = datetime.fromtimestamp(self.time_finished)

         delta = dt2 - dt1
         return str(delta)[:-7]
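This one-line fix makes a finished job's runtime freeze at `time_finished`; previously the method used `time_started` for `dt2`, so finished jobs reported a near-zero runtime (assuming `dt1` is built from the start time earlier in the method, which the hunk does not show). A standalone sketch of the corrected logic, with hypothetical timestamps:

```python
import time
from datetime import datetime


def runtime_str(time_started, time_finished=None):
    # Corrected logic: a running job measures up to "now", a finished
    # job measures up to time_finished (not time_started).
    dt1 = datetime.fromtimestamp(time_started)
    if time_finished is None:
        dt2 = datetime.fromtimestamp(time.time())
    else:
        dt2 = datetime.fromtimestamp(time_finished)
    delta = dt2 - dt1
    return str(delta)[:-7]  # drop microseconds, e.g. "0:01:05"


print(runtime_str(1000.0, 1065.5))  # → 0:01:05
```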
