Issue with GPU #18

@rfcasseb

Dear developers,
Thank you for the great work.

I followed the steps provided at https://aid-hs.readthedocs.io/en/latest/index.html to install AID-HS v1.1.0 on a Linux machine (Ubuntu 22.04.1) with 32 GB of RAM, an i7-14700K CPU, and an NVIDIA RTX 3060 GPU. However, I get an error when I try to run MELD AID-HS with the following command:

sudo DOCKER_USER="$(id -u):$(id -g)" docker compose run aidhs python scripts/new_patient_pipeline/new_patient_pipeline.py -ids list_subjects.txt -demos demographics_file.csv --parallelise

These are the last lines from the terminal:

...
[Sun Jun  7 11:49:32 2026]
rule run_inference:
    input: work/sub-063NJBLV/anat/sub-063NJBLV_hemi-R_space-corobl_desc-preproc_T1w.nii.gz, /opt/hippunfold_cache/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar
    output: work/sub-063NJBLV/anat/sub-063NJBLV_hemi-R_space-corobl_desc-nnunet_dseg.nii.gz
    log: logs/sub-063NJBLV/sub-063NJBLV_hemi-R_space-corobl_nnunet.txt
    jobid: 25024
    reason: Missing output files: work/sub-063NJBLV/anat/sub-063NJBLV_hemi-R_space-corobl_desc-nnunet_dseg.nii.gz; Input files updated by another job: work/sub-063NJBLV/anat/sub-063NJBLV_hemi-R_space-corobl_desc-preproc_T1w.nii.gz
    wildcards: subject=063NJBLV, hemi=R
    threads: 10
    resources: tmpdir=/tmp, gpus=0, mem_mb=16000, time=60

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2026-06-07T113934.169884.snakemake.log

ERROR: One step of the pipeline has failed. Process has been aborted for one subject
SCRIPT 1: Segmentation and feature extraction has failed at least for one subject. See log at /data/logs/AIDHS_pipeline_2026-06-07-11-38-52.log. Consider fixing errors or excluding these subjects before re-running the pipeline. Segmentation will be skipped for subjects already processed

I checked the Snakemake log file but could not identify the error. The part that caught my attention is similar to the terminal output:

[Sun Jun  7 11:39:58 2026]
rule import_t1:
    input: /data/output/bids_outputs/sub-045LVSFV/anat/sub-045LVSFV_T1w.nii.gz
    output: work/sub-045LVSFV/anat/sub-045LVSFV_T1w.nii.gz
    jobid: 26901
    reason: Missing output files: work/sub-045LVSFV/anat/sub-045LVSFV_T1w.nii.gz
    wildcards: subject=045LVSFV
    resources: tmpdir=/tmp
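
A minimal sketch for narrowing down which rule actually failed, in case it helps with triage. It assumes the Snakemake log marks failures with a line of the form "Error in rule <name>:" (the usual Snakemake convention, not confirmed for this particular log), and the helper name `find_failed_rules` is hypothetical, not part of AID-HS or Snakemake.

```python
import re

# Assumed failure marker; Snakemake normally prints "Error in rule <name>:".
ERROR_RE = re.compile(r"^Error in rule (?P<rule>\w+):")


def find_failed_rules(log_text, context=8):
    """Return (rule_name, following_lines) for every failure marker found.

    `context` is how many lines after the marker to keep, since the
    jobid, output and per-rule log path are usually listed right below it.
    """
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        match = ERROR_RE.match(line.strip())
        if match:
            hits.append((match.group("rule"), lines[i:i + 1 + context]))
    return hits


# Example use (the path is the one reported in the terminal output above):
#   from pathlib import Path
#   text = Path(".snakemake/log/2026-06-07T113934.169884.snakemake.log").read_text()
#   for rule, block in find_failed_rules(text):
#       print(rule)
#       print("\n".join(block))
```

If a hit points at run_inference, the per-rule log named in the terminal output (logs/sub-063NJBLV/sub-063NJBLV_hemi-R_space-corobl_nnunet.txt) is the next place to look.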

In the terminal output I noticed "gpus=0", although I had set it to "all" in compose.yml.
I posted this error to Google Gemini, which believes it is caused by an incompatibility between the GPU architecture and the code. It says:

Because you are using the MELD Docker setup as distributed out-of-the-box via docker compose, the container is internally using its bundled, outdated PyTorch installation. When Snakemake attempts to process sub-063NJBLV on your RTX 3060, the internal PyTorch build hits a limitation: it doesn't recognize the Ampere architecture (sm_86) and halts. Furthermore, notice that Snakemake explicitly sets gpus=0 for that rule in the log snippet, meaning it is defaulting back to a CPU context but the containerized environment is failing to handle the fallback seamlessly.
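
One way to test this claim directly would be to ask the PyTorch build inside the container which GPU architectures it was compiled for. The sketch below is a rough check under stated assumptions, not part of AID-HS: it assumes PyTorch is importable in the container's Python, the names `build_supports` and `gpu_report` are hypothetical, and it only looks at the compiled `sm_XY` entries (it ignores PTX `compute_XY` entries, which can also allow newer GPUs to run).

```python
import re

# Matches compiled architectures such as "sm_70", "sm_86" or "sm_90a".
_SM_RE = re.compile(r"^sm_(\d+)(\d)[a-z]?$")


def build_supports(arch_list, capability):
    """True if some compiled sm_XY has the same major version as the device
    and a minor version not above it (CUDA binaries are compatible within a
    major version for equal or higher minor versions)."""
    major, minor = capability
    for arch in arch_list:
        match = _SM_RE.match(arch)
        if match and int(match.group(1)) == major and int(match.group(2)) <= minor:
            return True
    return False


def gpu_report():
    """Collect what the installed PyTorch build can see, if it is installed."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        capability = torch.cuda.get_device_capability(0)  # (8, 6) on an RTX 3060
        arch_list = torch.cuda.get_arch_list()
        report["device_name"] = torch.cuda.get_device_name(0)
        report["capability"] = capability
        report["arch_list"] = arch_list
        report["build_supports_device"] = build_supports(arch_list, capability)
    return report


print(gpu_report())
```

If "cuda_available" comes back False, the container is not seeing the GPU at all, which would point at the Docker GPU configuration rather than at the PyTorch build.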

Is this indeed the problem? Is there a workaround?
I would appreciate any help.
Kind regards,
Raph
