# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What is RHNode

RHNode is a Python library for deploying deep learning models as REST endpoints. It handles job queuing, resource allocation (GPU/CPU/memory), file transfers, caching, and inter-node dependencies. It is used at CAAI (Clinical AI) at Rigshospitalet, Copenhagen.

## Common Commands

### Running Tests
```bash
# Start the test cluster (in tests/ directory)
cd tests && docker compose up --build

# In another terminal, run all tests
pytest

# Run a specific test file
pytest test_docker.py

# Run a specific test
pytest test_docker.py::test_finish_and_caching
```

### Running a Node Locally (Development)
```bash
# Start a single node server
uvicorn add:app --port 8010

# Access at http://localhost:8010/add
```

### CLI Tool
```bash
# Run a job via CLI
rhjob <node_name> input_key=value input_file=/path/to/file.nii.gz

# Get help for a specific node
rhjob <node_name> -h
```

### Docker
```bash
# Build and run with docker compose
docker compose up --build

# Build and push to DockerHub
docker compose build --push
```

## Architecture

### Core Components

**RHNode** ([rhnode/rhnode.py](rhnode/rhnode.py)) - Base class for creating node servers. Subclass this to create a new node:
- Define `input_spec` and `output_spec` as Pydantic BaseModel classes
- Override the `process(inputs, job)` static method with inference logic
- Set resource requirements: `required_gb_gpu_memory`, `required_num_threads`, `required_gb_memory`
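
The subclass pattern can be sketched schematically. This is a self-contained illustration only: plain dataclasses stand in for the real pydantic models and the `rhnode.RHNode` base class, and the names `Inputs`, `Outputs`, and `AddNode` are hypothetical.

```python
# Schematic sketch of the node subclass pattern described above.
# NOTE: plain dataclasses stand in for the real pydantic models and the
# rhnode.RHNode base class; Inputs, Outputs, and AddNode are hypothetical.
from dataclasses import dataclass


@dataclass
class Inputs:  # stands in for a pydantic input_spec model
    a: int
    b: int


@dataclass
class Outputs:  # stands in for a pydantic output_spec model
    result: int


class AddNode:  # in the real library this would subclass rhnode.RHNode
    input_spec = Inputs
    output_spec = Outputs
    required_gb_gpu_memory = 0
    required_num_threads = 1
    required_gb_memory = 1

    @staticmethod
    def process(inputs, job):
        # Inference logic goes here; GPU code should use job.device only.
        return Outputs(result=inputs.a + inputs.b)
```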

**RHJob** ([rhnode/rhjob.py](rhnode/rhjob.py)) - Client for submitting jobs to nodes. Handles file uploads, polling for completion, and downloading results.

**RHProcess** ([rhnode/rhprocess.py](rhnode/rhprocess.py)) - Server-side job execution. Manages job lifecycle: file uploads → queue → run process → cleanup.

**RHManager** ([nodes/manager/manager.py](nodes/manager/manager.py)) - Resource queue manager. Allocates GPU/CPU/memory across jobs using a priority queue. Multiple RHNode clusters can link together via `RH_OTHER_ADDRESSES`.
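
Priority-based allocation can be sketched with a heap. This is a deliberate simplification, reduced to a single GPU-memory dimension; the real manager also tracks threads and RAM across linked hosts, and `drain_queue` is a hypothetical helper.

```python
# Simplified sketch of priority-based resource queuing, reduced to a single
# GPU-memory dimension; the real manager also tracks threads and RAM across
# linked hosts. drain_queue is a hypothetical helper, not manager code.
import heapq


def drain_queue(jobs, free_gb_gpu):
    """jobs: list of (priority, job_id, required_gb_gpu); lower number wins."""
    heapq.heapify(jobs)
    started = []
    # Start jobs in priority order while the head of the queue fits.
    while jobs and jobs[0][2] <= free_gb_gpu:
        _, job_id, need = heapq.heappop(jobs)
        free_gb_gpu -= need
        started.append(job_id)
    return started, free_gb_gpu
```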

**Common types** ([rhnode/common.py](rhnode/common.py)) - Shared Pydantic models: `JobMetaData`, `JobStatus`, `QueueRequest`, etc.

### Data Flow

1. Client creates `RHJob` with inputs → POST to node creates `RHProcess`
2. File inputs uploaded one by one to `.inputs/<job_id>/`
3. Job enters resource queue via manager (unless `resources_included=True`)
4. When resources available, `process()` runs in subprocess with allocated `job.device`
5. Outputs saved to `.outputs/<job_id>/`, optionally cached to `.cache/`
6. Client downloads file outputs, job auto-deleted after 10 minutes
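
The client side of steps 4–6 is a poll-until-terminal loop. The sketch below uses a `fetch_status` callable as a stand-in for the HTTP calls `RHJob` makes; the status strings and `wait_for_job` helper are illustrative assumptions.

```python
# Sketch of client-side polling: submit, poll until the job reaches a
# terminal state, then download outputs. fetch_status stands in for the
# HTTP calls RHJob makes; the status strings are illustrative.
import time


def wait_for_job(fetch_status, poll_interval=0.0, timeout=5.0):
    """Poll fetch_status() until it returns 'finished' or 'error'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("finished", "error"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a terminal state in time")


# Usage with a faked status sequence:
states = iter(["queued", "running", "finished"])
```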

### Key Patterns

- **FilePath fields**: Input/output Pydantic models use `FilePath` for files that need to be transferred. Non-serializable data (numpy arrays, nifti images) must be saved to disk.
- **job.directory**: All output files must be saved within `job.directory` for proper cleanup.
- **job.device**: The allocated CUDA device ID. Only use this device for GPU operations.
- **Child jobs**: Use `RHJob.from_parent_job()` to spawn dependent jobs that inherit priority and resource settings.
- **Caching**: Results are cached by input hash. Use `check_cache=False` during development.
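
Caching by input hash can be sketched as follows: derive a key from the scalar inputs plus the contents of each file input. The exact key scheme below is an assumption for illustration, not the library's actual implementation.

```python
# Sketch of caching by input hash: derive a cache key from scalar inputs
# plus file contents. The key scheme here is an assumption, not the
# library's actual implementation.
import hashlib
import json


def cache_key(scalars: dict, file_bytes: dict) -> str:
    h = hashlib.sha256()
    # Scalars hashed via a deterministic (sorted-key) JSON encoding.
    h.update(json.dumps(scalars, sort_keys=True).encode())
    # Files hashed by name and content, in a fixed order.
    for name in sorted(file_bytes):
        h.update(name.encode())
        h.update(file_bytes[name])
    return h.hexdigest()
```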

### Environment Variables

For the manager node:
- `RH_NAME`: Host identifier
- `RH_GPU_MEM`: GPU memory per device, comma-separated for multiple GPUs (e.g., "8,8,12")
- `RH_NUM_THREADS`: Available CPU threads
- `RH_MEMORY`: Available RAM in GB
- `RH_OTHER_ADDRESSES`: Comma-separated addresses of other managers (e.g., "titan6:9050,peyo:9050")

For nodes:
- `RH_EMAIL_ON_ERROR`: Email recipient for error notifications
- `RH_MODE`: Set to "dev" for development mode
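
The comma-separated variables parse straightforwardly. The helper names below are hypothetical, not taken from the manager code:

```python
# Sketch of parsing the comma-separated manager variables above; the helper
# names are hypothetical, not taken from the manager code.
import os


def parse_gpu_mem(value: str) -> list:
    """'8,8,12' -> [8, 8, 12] (GB per GPU device)."""
    return [int(x) for x in value.split(",") if x.strip()]


def parse_addresses(value: str) -> list:
    """'titan6:9050,peyo:9050' -> [('titan6', 9050), ('peyo', 9050)]."""
    pairs = []
    for entry in filter(None, value.split(",")):
        host, port = entry.rsplit(":", 1)
        pairs.append((host.strip(), int(port)))
    return pairs


# At startup a manager might read these with defaults:
gpu_mem = parse_gpu_mem(os.environ.get("RH_GPU_MEM", "8"))
others = parse_addresses(os.environ.get("RH_OTHER_ADDRESSES", ""))
```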

## Versioning

Version format: `major.minor.patch`
- Development branches: `dev/v1.X.0`
- Alpha releases: `v1.X.0-a.N`
- When tagging docker images for rh-library, omit the hyphen in the alpha suffix: `hdbet-v1.1.0_rhnode1.2.0a.1`
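
The hyphen-dropping convention amounts to a small string rewrite; `docker_alpha_tag` below is a hypothetical helper, not a script in this repository:

```python
# Sketch of the hyphen-dropping convention above: convert an alpha release
# tag into its docker-tag form. docker_alpha_tag is a hypothetical helper.
def docker_alpha_tag(version: str) -> str:
    """'v1.2.0-a.1' -> 'v1.2.0a.1' (drop the hyphen before the alpha suffix)."""
    return version.replace("-a.", "a.")
```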

## Test Data

Tests require a NIfTI file at `tests/data/mr.nii.gz`. Any valid NIfTI file will do.