
Commit 5fc9a3f

tsalo and mattcieslak authored
Add --scalar-columns parameter to NIfTI and MIF CLIs (#34)
Co-authored-by: Matt Cieslak <mattcieslak@gmail.com>
1 parent 7502f6e commit 5fc9a3f

25 files changed: 1142 additions & 272 deletions

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -14,4 +14,5 @@
    installation
    auto_examples/index
    usage
+   outputs
    api

docs/outputs.rst

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
#######
Outputs
#######

This page describes what each CLI command writes, how files are named, and what data
is stored inside each output artifact.


*****************
Commands Overview
*****************

The commands fall into two groups:

- ``*-to-h5`` commands: convert input neuroimaging data into either:

  - one or more HDF5 files (``--backend hdf5``), or
  - one or more TileDB directories (``--backend tiledb``).

- ``h5-to-*`` commands: convert analysis results stored in an HDF5 file into image files.


*********************
nifti-to-h5 (volumes)
*********************

Default output name (HDF5 backend): ``voxelarray.h5``.

HDF5 output contents (a read-back sketch follows this list):

- ``voxels`` dataset:

  - transposed voxel table with rows for ``voxel_id``, ``i``, ``j``, ``k``.
  - attribute ``column_names = ['voxel_id', 'i', 'j', 'k']``.

- Per scalar:

  - ``scalars/<scalar_name>/values`` with shape ``(n_subjects, n_voxels)``.
  - ``scalars/<scalar_name>/column_names`` listing source file names.
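
A minimal sketch of reading this layout back with ``h5py`` (the file name and the
scalar name ``fa`` are placeholders, not fixed by the package):

.. code-block:: python

    import h5py

    with h5py.File('voxelarray.h5', 'r') as h5_file:
        # Transposed coordinate table: one row per field (voxel_id, i, j, k).
        voxels = h5_file['voxels'][()]
        coord_names = h5_file['voxels'].attrs['column_names']

        # One matrix per scalar, subjects on rows, voxels on columns.
        values = h5_file['scalars/fa/values'][()]         # (n_subjects, n_voxels)
        sources = h5_file['scalars/fa/column_names'][()]  # source file names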

TileDB output contents:

- Per scalar dense array at ``scalars/<scalar_name>/values`` with shape
  ``(n_subjects, n_voxels)``.
- Column names are stored in array metadata (``column_names``).

When ``--scalar-columns`` is provided:

- Output is split by scalar column name.
- Example: ``--scalar-columns alpha beta --output voxelarray.h5`` writes:

  - ``alpha_voxelarray.h5``
  - ``beta_voxelarray.h5``

- The same prefix rule also applies to TileDB output paths (see the sketch below).
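
The splitting is handled by a ``cli_utils.prefixed_output_path`` helper in the CLI
code (visible in the diff further down); a minimal sketch of the naming rule as
described above, assuming the scalar name is simply joined to the file name with an
underscore:

.. code-block:: python

    from pathlib import Path

    def prefixed_output_path(output: Path, scalar_name: str) -> Path:
        # Assumed behavior: prepend '<scalar_name>_' to the final path component.
        return output.with_name(f'{scalar_name}_{output.name}')

    prefixed_output_path(Path('voxelarray.h5'), 'alpha')  # Path('alpha_voxelarray.h5')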


*******************
cifti-to-h5 (CIFTI)
*******************

Default output name (HDF5 backend): ``greyordinatearray.h5``.

HDF5 output contents (a read-back sketch follows this list):

- ``greyordinates`` dataset:

  - transposed table with rows for ``vertex_id`` and ``structure_id``.
  - attribute ``column_names = ['vertex_id', 'structure_id']``.
  - attribute ``structure_names`` listing CIFTI brain structures.

- Per scalar:

  - ``scalars/<scalar_name>/values`` with shape ``(n_subjects, n_greyordinates)``.
  - ``scalars/<scalar_name>/column_names`` listing source file names.
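
As with the volume layout, the file can be inspected with ``h5py``; the extra piece
here is the ``structure_names`` attribute (the scalar name ``thickness`` is a
placeholder):

.. code-block:: python

    import h5py

    with h5py.File('greyordinatearray.h5', 'r') as h5_file:
        grey = h5_file['greyordinates']
        print(grey.attrs['column_names'])     # ['vertex_id', 'structure_id']
        print(grey.attrs['structure_names'])  # CIFTI brain structure names

        values = h5_file['scalars/thickness/values'][()]  # (n_subjects, n_greyordinates)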

TileDB output contents (an access sketch follows this list):

- Per scalar dense array at ``scalars/<scalar_name>/values`` with shape
  ``(n_subjects, n_greyordinates)``.
- Column names metadata is written on each scalar matrix.
- An explicit TileDB array is also written at ``scalars/<scalar_name>/column_names``.
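
A sketch of opening one scalar array with the ``tiledb`` Python package (the URI and
scalar name are placeholders, and how the column names are encoded in the metadata
is an assumption):

.. code-block:: python

    import tiledb

    uri = 'greyordinatearray.tdb/scalars/thickness/values'
    with tiledb.open(uri) as array:
        print(array.meta['column_names'])  # source names; encoding may vary
        result = array[:]  # dense reads return a dict keyed by attribute name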

When ``--scalar-columns`` is provided:

- Output is split by scalar column name.
- Example: ``--scalar-columns alpha beta --output greyordinatearray.h5`` writes:

  - ``alpha_greyordinatearray.h5``
  - ``beta_greyordinatearray.h5``

- The same prefix rule also applies to TileDB output paths.


******************
mif-to-h5 (fixels)
******************

Default output name (HDF5 backend): ``fixelarray.h5``.

HDF5 output contents (a shape-convention example follows this list):

- ``fixels`` dataset:

  - transposed fixel table (``fixel_id``, coordinates/directions metadata from the input fixel DB).
  - attribute ``column_names`` containing table column names.

- ``voxels`` dataset:

  - transposed voxel table with ``voxel_id``, ``i``, ``j``, ``k``.
  - attribute ``column_names`` containing table column names.

- Per scalar:

  - ``scalars/<scalar_name>/values`` with shape ``(n_subjects, n_fixels)``.
  - ``scalars/<scalar_name>/column_names`` listing source file names.
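
The ``(n_subjects, n_fixels)`` orientation puts subjects on rows, so per-fixel group
statistics are column-wise reductions (the scalar name ``fd`` is a placeholder):

.. code-block:: python

    import h5py
    import numpy as np

    with h5py.File('fixelarray.h5', 'r') as h5_file:
        values = h5_file['scalars/fd/values'][()]  # (n_subjects, n_fixels)

    group_mean = values.mean(axis=0)  # one value per fixel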

TileDB output contents:

- Per scalar dense array at ``scalars/<scalar_name>/values`` with shape
  ``(n_subjects, n_fixels)``.
- Column names are stored in array metadata (``column_names``).

When ``--scalar-columns`` is provided:

- Output is split by scalar column name.
- Example: ``--scalar-columns alpha beta --output fixelarray.h5`` writes:

  - ``alpha_fixelarray.h5``
  - ``beta_fixelarray.h5``

- The same prefix rule also applies to TileDB output paths.


***********************************
h5-to-* commands (result exporters)
***********************************

These commands read statistical results from:

- ``results/<analysis_name>/results_matrix`` (shape: ``(n_results, n_elements)``).

Result names are read in this order (a lookup sketch follows below):

- ``results_matrix.attrs['colnames']`` (if present),
- ``results/<analysis_name>/column_names`` dataset,
- ``results/<analysis_name>/results_matrix/column_names`` dataset,
- fallback names: ``component001``, ``component002``, ...

Any spaces or ``/`` in result names are replaced with ``_`` in filenames.
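
A sketch of that lookup order plus the filename sanitization, written against plain
``h5py`` (the helper names here are illustrative, not part of the package API):

.. code-block:: python

    import h5py

    def result_names(h5_file: h5py.File, analysis_name: str) -> list[str]:
        group = h5_file[f'results/{analysis_name}']
        matrix = group['results_matrix']
        if 'colnames' in matrix.attrs:
            names = matrix.attrs['colnames']
        elif 'column_names' in group:
            names = group['column_names'][()]
        elif 'results_matrix/column_names' in group:
            names = group['results_matrix/column_names'][()]
        else:
            return [f'component{i + 1:03d}' for i in range(matrix.shape[0])]
        return [n.decode() if isinstance(n, bytes) else str(n) for n in names]

    def sanitize(name: str) -> str:
        # Spaces and '/' become '_' in output filenames.
        return name.replace(' ', '_').replace('/', '_')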


h5-to-nifti
===========

Writes one file per result to ``--output-dir``:

- ``<analysis_name>_<result_name><output_ext>`` (default extension ``.nii.gz``).
- If a result name contains ``p.value``, an additional file is written:
  ``<analysis_name>_<result_name_with_1m.p.value><output_ext>``,
  containing ``1 - p.value``.

Each output volume uses ``--group-mask-file`` to map vectorized results back into 3D
space, as sketched below.
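
A sketch of that back-projection with ``nibabel``, including the ``1 - p.value``
companion volume (file names are placeholders; the CLI's actual wiring may differ):

.. code-block:: python

    import nibabel as nb
    import numpy as np

    mask_img = nb.load('group_mask.nii.gz')
    mask = mask_img.get_fdata() > 0
    row = np.random.rand(int(mask.sum()))  # stand-in for one results_matrix row

    volume = np.zeros(mask.shape, dtype=np.float32)
    volume[mask] = row
    nb.Nifti1Image(volume, mask_img.affine).to_filename('analysis_p.value.nii.gz')

    # p.value results also get a companion file named with '1m.p.value'.
    volume[mask] = 1.0 - row
    nb.Nifti1Image(volume, mask_img.affine).to_filename('analysis_1m.p.value.nii.gz')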


h5-to-cifti
===========

Writes one CIFTI dscalar file per result to ``--output-dir``:

- ``<analysis_name>_<result_name>.dscalar.nii``.
- If a result name contains ``p.value``, also writes the ``1 - p.value`` companion file
  with ``1m.p.value`` in its name.

The header is taken from ``--example-cifti`` (or from the first cohort ``source_file`` if
``--cohort-file`` is used instead).


h5-to-mif
=========

Writes one MIF file per result to ``--output-dir``:

- ``<analysis_name>_<result_name>.mif``.
- If a result name contains ``p.value``, also writes the ``1 - p.value`` companion file
  with ``1m.p.value`` in its name.

Also copies these files into ``--output-dir``:

- ``--index-file``
- ``--directions-file``

The output MIF geometry/header template is taken from ``--example-mif`` (or from the first
cohort ``source_file`` if ``--cohort-file`` is used instead).

src/modelarrayio/cli/cifti_to_h5.py

Lines changed: 50 additions & 25 deletions
@@ -4,7 +4,6 @@
 
 import argparse
 import logging
-import os
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
 
@@ -13,14 +12,13 @@
 from tqdm import tqdm
 
 from modelarrayio.cli import utils as cli_utils
-from modelarrayio.cli.parser_utils import add_scalar_columns_arg, add_to_modelarray_args
+from modelarrayio.cli.parser_utils import add_to_modelarray_args
 from modelarrayio.utils.cifti import (
-    _build_scalar_sources,
-    _cohort_to_long_dataframe,
-    _load_cohort_cifti,
     brain_names_to_dataframe,
     extract_cifti_scalar_data,
+    load_cohort_cifti,
 )
+from modelarrayio.utils.misc import build_scalar_sources, cohort_to_long_dataframe
 
 logger = logging.getLogger(__name__)
 
@@ -35,7 +33,7 @@ def cifti_to_h5(
     shuffle=True,
     chunk_voxels=0,
     target_chunk_mb=2.0,
-    workers=None,
+    workers=1,
     s3_workers=1,
     scalar_columns=None,
 ):
@@ -47,7 +45,7 @@ def cifti_to_h5(
         Path to a csv with demographic info and paths to data
     backend : :obj:`str`
         Backend to use for storage (``'hdf5'`` or ``'tiledb'``)
-    output : :obj:`str`
+    output : :obj:`pathlib.Path`
         Output path. For the hdf5 backend, path to an .h5 file;
         for the tiledb backend, path to a .tdb directory.
     storage_dtype : :obj:`str`
@@ -64,7 +62,7 @@ def cifti_to_h5(
     target_chunk_mb : :obj:`float`
         Target chunk/tile size in MiB when auto-computing the spatial axis length
     workers : :obj:`int`
-        Maximum number of parallel TileDB write workers (``None`` = auto).
+        Maximum number of parallel TileDB write workers. Default 1.
         Has no effect when ``backend='hdf5'``.
     s3_workers : :obj:`int`
         Number of workers for parallel S3 downloads
@@ -77,19 +75,49 @@ def cifti_to_h5(
         0 if successful, 1 if failed.
     """
     cohort_df = pd.read_csv(cohort_file)
-    cohort_long = _cohort_to_long_dataframe(cohort_df, scalar_columns=scalar_columns)
-    output_path = Path(output)
+    cohort_long = cohort_to_long_dataframe(cohort_df, scalar_columns=scalar_columns)
     if cohort_long.empty:
         raise ValueError('Cohort file does not contain any scalar entries after normalization.')
-    scalar_sources = _build_scalar_sources(cohort_long)
+    scalar_sources = build_scalar_sources(cohort_long)
     if not scalar_sources:
         raise ValueError('Unable to derive scalar sources from cohort file.')
+    scalar_names = list(scalar_sources.keys())
+    split_scalar_outputs = bool(scalar_columns)
 
     if backend == 'hdf5':
-        scalars, last_brain_names = _load_cohort_cifti(cohort_long, s3_workers)
+        if split_scalar_outputs:
+            scalars, last_brain_names = load_cohort_cifti(cohort_long, s3_workers)
+            greyordinate_table, structure_names = brain_names_to_dataframe(last_brain_names)
+            outputs: list[Path] = []
+            for scalar_name in scalar_names:
+                scalar_output = cli_utils.prepare_output_parent(
+                    cli_utils.prefixed_output_path(output, scalar_name)
+                )
+                with h5py.File(scalar_output, 'w') as h5_file:
+                    cli_utils.write_table_dataset(
+                        h5_file,
+                        'greyordinates',
+                        greyordinate_table,
+                        extra_attrs={'structure_names': structure_names},
+                    )
+                    cli_utils.write_hdf5_scalar_matrices(
+                        h5_file,
+                        {scalar_name: scalars[scalar_name]},
+                        {scalar_name: scalar_sources[scalar_name]},
+                        storage_dtype=storage_dtype,
+                        compression=compression,
+                        compression_level=compression_level,
+                        shuffle=shuffle,
+                        chunk_voxels=chunk_voxels,
+                        target_chunk_mb=target_chunk_mb,
+                    )
+                outputs.append(scalar_output)
+            return int(not all(path.exists() for path in outputs))
+
+        scalars, last_brain_names = load_cohort_cifti(cohort_long, s3_workers)
         greyordinate_table, structure_names = brain_names_to_dataframe(last_brain_names)
-        output_path = cli_utils.prepare_output_parent(output_path)
-        with h5py.File(output_path, 'w') as h5_file:
+        output = cli_utils.prepare_output_parent(output)
+        with h5py.File(output, 'w') as h5_file:
             cli_utils.write_table_dataset(
                 h5_file,
                 'greyordinates',
@@ -107,9 +135,8 @@ def cifti_to_h5(
                 chunk_voxels=chunk_voxels,
                 target_chunk_mb=target_chunk_mb,
             )
-        return int(not output_path.exists())
+        return int(not output.exists())
 
-    output_path.mkdir(parents=True, exist_ok=True)
     if not scalar_sources:
         return 0
 
@@ -126,8 +153,13 @@ def _process_scalar_job(scalar_name, source_files):
             rows.append(cifti_data)
 
         if rows:
+            scalar_output = (
+                cli_utils.prefixed_output_path(output, scalar_name)
+                if split_scalar_outputs
+                else output
+            )
             cli_utils.write_tiledb_scalar_matrices(
-                output_path,
+                scalar_output,
                 {scalar_name: rows},
                 {scalar_name: source_files},
                 storage_dtype=storage_dtype,
@@ -140,13 +172,7 @@ def _process_scalar_job(scalar_name, source_files):
             )
         return scalar_name
 
-    scalar_names = list(scalar_sources.keys())
-    worker_count = workers if isinstance(workers, int) and workers > 0 else None
-    if worker_count is None:
-        cpu_count = os.cpu_count() or 1
-        worker_count = min(len(scalar_names), max(1, cpu_count))
-    else:
-        worker_count = min(len(scalar_names), worker_count)
+    worker_count = min(len(scalar_names), workers)
 
     if worker_count <= 1:
         for scalar_name in scalar_names:
@@ -178,5 +204,4 @@ def _parse_cifti_to_h5():
         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
     )
     add_to_modelarray_args(parser, default_output='greyordinatearray.h5')
-    add_scalar_columns_arg(parser)
     return parser

src/modelarrayio/cli/h5_to_mif.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 
 from modelarrayio.cli import utils as cli_utils
 from modelarrayio.cli.parser_utils import _is_file, add_from_modelarray_args, add_log_level_arg
-from modelarrayio.utils.fixels import mif_to_nifti2, nifti2_to_mif
+from modelarrayio.utils.mif import mif_to_nifti2, nifti2_to_mif
 
 logger = logging.getLogger(__name__)
 