
Commit 25f3c65

Merge pull request graphnet-team#852 from RasmusOrsoe/lmdb_pr
Add `lmdb` as alternative file format
2 parents ad36dc6 + 41d48c7 commit 25f3c65

26 files changed: 2,121 additions & 84 deletions

docs/source/data_conversion/data_conversion.rst

Lines changed: 2 additions & 2 deletions

@@ -265,8 +265,8 @@ In this example, the writer will save the entire set of extractor outputs - a di
 
 
 
-Two writers are implemented in GraphNeT; the :code:`SQLiteWriter` and :code:`ParquetWriter`, each of which output files that are directly used for
-training by :code:`ParquetDataset` and :code:`SQLiteDataset`.
+Three writers are implemented in GraphNeT: the :code:`SQLiteWriter`, :code:`ParquetWriter`, and :code:`LMDBWriter`, each of which outputs files that are directly used for
+training by :code:`SQLiteDataset`, :code:`ParquetDataset`, and :code:`LMDBDataset`, respectively.
 
 
 
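To make the new backend concrete, here is a minimal end-to-end sketch based only on the :code:`I3ToLMDBConverter` usage visible in the example-script diff further down; all paths are placeholders:

    from graphnet.data.extractors.icecube import (
        I3FeatureExtractorIceCube86,
        I3TruthExtractor,
    )
    from graphnet.data.pre_configured.dataconverters import I3ToLMDBConverter

    # Placeholder paths; point these at your own I3 files and GCD file.
    converter = I3ToLMDBConverter(
        extractors=[
            I3FeatureExtractorIceCube86("SRTInIcePulses"),
            I3TruthExtractor(),
        ],
        outdir="/path/to/output",
        gcd_rescue="/path/to/GeoCalibDetectorStatus.i3.gz",
        num_workers=1,
    )
    converter(["/path/to/i3_files"])  # run the conversion
    converter.merge_files()  # merge worker outputs into a single LMDB database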

docs/source/datasets/datasets.rst

Lines changed: 45 additions & 9 deletions

@@ -155,18 +155,19 @@ It looks like so:
 </details>
 
 
-:code:`SQLiteDataset` & :code:`ParquetDataset`
-----------------------------------------------
+:code:`SQLiteDataset`, :code:`ParquetDataset` & :code:`LMDBDataset`
+--------------------------------------------------------------------
 
-The two specific implementations of :code:`Dataset` exists :
+The three specific implementations of :code:`Dataset` exist:
 
 - `ParquetDataset <https://graphnet-team.github.io/graphnet/api/graphnet.data.parquet.parquet_dataset.html>`_ : Constructs :code:`Dataset` from files created by :code:`ParquetWriter`.
 - `SQLiteDataset <https://graphnet-team.github.io/graphnet/api/graphnet.data.sqlite.sqlite_dataset.html>`_ : Constructs :code:`Dataset` from files created by :code:`SQLiteWriter`.
+- `LMDBDataset <https://graphnet-team.github.io/graphnet/api/graphnet.data.dataset.lmdb.lmdb_dataset.html>`_ : Constructs :code:`Dataset` from files created by :code:`LMDBWriter`.
 
 
 To instantiate a :code:`Dataset` from your files, you must specify at least the following:
 
-- :code:`pulsemaps`: These are named fields in your Parquet files, or tables in your SQLite databases, which store one or more pulse series from which you would like to create a dataset. A pulse series represents the detector response, in the form of a series of PMT hits or pulses, in some time window, usually triggered by a single neutrino or atmospheric muon interaction. This is the data that will be served as input to the `Model`.
+- :code:`pulsemaps`: These are named fields in your Parquet files, or tables in your SQLite or LMDB databases, which store one or more pulse series from which you would like to create a dataset. A pulse series represents the detector response, in the form of a series of PMT hits or pulses, in some time window, usually triggered by a single neutrino or atmospheric muon interaction. This is the data that will be served as input to the `Model`.
 - :code:`truth_table`: The name of a table/array that contains the truth-level information associated with the pulse series, and should contain the truth labels that you would like to reconstruct or classify. Often this table will contain the true physical attributes of the primary particle — such as its true direction, energy, PID, etc. — and is therefore graph- or event-level (as opposed to the pulse series tables, which are node- or hit-level) truth information.
 - :code:`features`: The names of the columns in your pulse series table(s) that you would like to include for training; they typically constitute the per-node/-hit features such as xyz-position of sensors, charge, and photon arrival times.
 - :code:`truth`: The columns in your truth table/array that you would like to include in the dataset.
@@ -225,6 +226,32 @@ Or similarly for Parquet files:
 
     graph = dataset[0] # torch_geometric.data.Data
 
+Or similarly for LMDB files:
+
+.. code-block:: python
+
+    from graphnet.data.dataset.lmdb.lmdb_dataset import LMDBDataset
+    from graphnet.models.detector.prometheus import Prometheus
+    from graphnet.models.graphs import KNNGraph
+    from graphnet.models.graphs.nodes import NodesAsPulses
+
+    graph_definition = KNNGraph(
+        detector=Prometheus(),
+        node_definition=NodesAsPulses(),
+        nb_nearest_neighbours=8,
+    )
+
+    dataset = LMDBDataset(
+        path="data/examples/lmdb/prometheus/prometheus-events.lmdb",
+        pulsemaps="total",
+        truth_table="mc_truth",
+        features=["sensor_pos_x", "sensor_pos_y", "sensor_pos_z", "t", ...],
+        truth=["injection_energy", "injection_zenith", ...],
+        graph_definition=graph_definition,
+    )
+
+    graph = dataset[0] # torch_geometric.data.Data
+
 It's then straightforward to create a :code:`DataLoader` for training, which will take care of batching, shuffling, and such:
 
 .. code-block:: python
@@ -250,10 +277,10 @@ By default, the following fields will be available in a graph built by :code:`Da
 - :code:`graph[truth_label] for truth_label in truth`: For each truth label in the :code:`truth` argument, the corresponding data is stored as a :code:`[num_rows, 1]` dimensional tensor. E.g., :code:`graph["energy"] = torch.tensor(26, dtype=torch.float)`
 - :code:`graph[feature] for feature in features`: For each feature given in the :code:`features` argument, the corresponding data is stored as a :code:`[num_rows, 1]` dimensional tensor. E.g., :code:`graph["sensor_x"] = torch.tensor([100, -200, -300, 200], dtype=torch.float)`
 
-:code:`SQLiteDataset` vs. :code:`ParquetDataset`
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+:code:`SQLiteDataset` vs. :code:`ParquetDataset` vs. :code:`LMDBDataset`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Besides working on different file formats, :code:`SQLiteDataset` and :code:`ParquetDataset` have significant differences,
+Besides working on different file formats, :code:`SQLiteDataset`, :code:`ParquetDataset`, and :code:`LMDBDataset` have significant differences,
 which may lead you to choose one over the others, depending on the problem at hand.
 
 :SQLiteDataset: SQLite provides fast random access to all events inside it. This makes plotting and subsampling your dataset particularly easy,
@@ -265,13 +292,20 @@ which may lead you to choose one over the other, depending on the problem at han
                  This means that the subsampling of your dataset needs to happen prior to the conversion to :code:`parquet`, unlike `SQLiteDataset` which allows for subsampling after conversion, due to its fast random access.
                  Conversion of files to :code:`parquet` is significantly faster than its :code:`SQLite` counterpart.
 
+:LMDBDataset: LMDB databases produced by :code:`LMDBWriter` store events as key-value pairs with configurable serialization methods (pickle, json, msgpack, dill).
+              :code:`LMDBDataset` supports two modes: reading raw tables and computing data representations in real time (similar to :code:`SQLiteDataset`), or reading pre-computed data representations directly from the database for faster access.
+              LMDB provides fast random access similar to SQLite, while also supporting efficient storage of pre-computed graph representations.
+              LMDB takes up roughly half the space of SQLite, and is therefore a good compromise between SQLite and Parquet.
+
 
 .. note::
 
     :code:`ParquetDataset` is scalable to ultra large datasets, but is more difficult to work with and has a higher memory consumption.
 
     :code:`SQLiteDataset` does not scale to very large datasets, but is easy to work with and has minimal memory consumption.
 
+    :code:`LMDBDataset` provides a balance between SQLite and Parquet, offering fast random access and support for pre-computed representations, making it well-suited for workflows where data representations are computed once and reused many times.
+
 
 Choosing a subset of events using `selection`
 ----------------------------------------------
@@ -297,7 +331,7 @@ would produce a :code:`Dataset` with only those five events.
 
 .. note::
 
-    For :code:`SQLiteDatase`, the :code:`selection` argument specifies individual events chosen for the dataset,
+    For :code:`SQLiteDataset` and :code:`LMDBDataset`, the :code:`selection` argument specifies individual events chosen for the dataset,
     whereas for :code:`ParquetDataset`, the :code:`selection` argument specifies which batches are used in the dataset.
 
 
@@ -347,12 +381,14 @@ You can combine multiple instances of :code:`Dataset` from GraphNeT into a singl
     from graphnet.data import EnsembleDataset
     from graphnet.data.parquet import ParquetDataset
     from graphnet.data.sqlite import SQLiteDataset
+    from graphnet.data.dataset.lmdb.lmdb_dataset import LMDBDataset
 
     dataset_1 = SQLiteDataset(...)
     dataset_2 = SQLiteDataset(...)
     dataset_3 = ParquetDataset(...)
+    dataset_4 = LMDBDataset(...)
 
-    ensemble_dataset = EnsembleDataset([dataset_1, dataset_2, dataset_3])
+    ensemble_dataset = EnsembleDataset([dataset_1, dataset_2, dataset_3, dataset_4])
 
 You can find a detailed example `here <https://github.com/graphnet-team/graphnet/blob/main/examples/02_data/04_ensemble_dataset.py>`_ .
 
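All three datasets yield :code:`torch_geometric.data.Data` objects, so the loader step mentioned in the docs above is identical for each. A minimal sketch using the stock :code:`torch_geometric` loader, where :code:`dataset` is, e.g., the :code:`LMDBDataset` constructed above (batch size and worker count are illustrative choices):

    from torch_geometric.loader import DataLoader

    # `dataset` can be any SQLiteDataset, ParquetDataset, or LMDBDataset.
    dataloader = DataLoader(
        dataset,
        batch_size=128,  # illustrative value
        shuffle=True,
        num_workers=4,  # illustrative value
    )

    for batch in dataloader:
        ...  # each `batch` is a torch_geometric.data.Batch, ready for a Model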

Lines changed: 136 additions & 35 deletions

@@ -1,8 +1,15 @@
-"""Example of converting I3-files to SQLite and Parquet."""
+"""Example of converting I3-files to SQLite, Parquet, and LMDB.
 
-from glob import glob
+When using the LMDB backend, the ``--precompute-representation`` flag can be
+used to pre-compute a DataRepresentation and store it alongside the raw
+data. Pre-computed representations can later be loaded directly,
+avoiding the cost of real-time DataRepresentation construction during training.
+"""
 
+from glob import glob
+from typing import Any, Dict
 from graphnet.constants import EXAMPLE_OUTPUT_DIR, TEST_DATA_DIR
+from graphnet.data.constants import FEATURES, TRUTH
 from graphnet.data.extractors.icecube import (
     I3FeatureExtractorIceCubeUpgrade,
     I3FeatureExtractorIceCube86,
@@ -12,6 +19,9 @@
 from graphnet.data.dataconverter import DataConverter
 from graphnet.data.parquet import ParquetDataConverter
 from graphnet.data.sqlite import SQLiteDataConverter
+from graphnet.data.pre_configured.dataconverters import I3ToLMDBConverter
+from graphnet.models.detector.icecube import IceCube86, IceCubeUpgrade
+from graphnet.models.graphs import KNNGraph
 from graphnet.utilities.argparse import ArgumentParser
 from graphnet.utilities.imports import has_icecube_package
 from graphnet.utilities.logging import Logger
@@ -29,14 +39,18 @@
 )
 
 CONVERTER_CLASS = {
+    "lmdb": I3ToLMDBConverter,
     "sqlite": SQLiteDataConverter,
     "parquet": ParquetDataConverter,
 }
 
 
-def main_icecube86(backend: str) -> None:
+def main_icecube86(
+    backend: str,
+    precompute_representation: bool = False,
+    num_workers: int = 1,
+) -> None:
     """Convert IceCube-86 I3 files to intermediate `backend` format."""
-    # Check(s)
     assert backend in CONVERTER_CLASS
 
     inputs = [f"{TEST_DATA_DIR}/i3/oscNext_genie_level7_v02"]
@@ -45,45 +59,99 @@ def main_icecube86(backend: str) -> None:
         f"{TEST_DATA_DIR}/i3/oscNext_genie_level7_v02/*GeoCalib*"
     )[0]
 
-    converter = CONVERTER_CLASS[backend](
-        extractors=[
-            I3FeatureExtractorIceCube86("SRTInIcePulses"),
-            I3TruthExtractor(),
-        ],
-        outdir=outdir,
-        gcd_rescue=gcd_rescue,
-        workers=1,
-    )
+    extractors = [
+        I3FeatureExtractorIceCube86("SRTInIcePulses"),
+        I3TruthExtractor(),
+    ]
+
+    if backend == "lmdb":
+        lmdb_kwargs: Dict[str, Any] = {}
+        if precompute_representation:
+            # Could be any DataRepresentation, not just KNNGraph
+            data_representation = KNNGraph(
+                detector=IceCube86(),
+                nb_nearest_neighbours=8,
+                input_feature_names=FEATURES.ICECUBE86,
+            )
+            lmdb_kwargs.update(
+                data_representation=data_representation,
+                pulsemap_extractor_name="SRTInIcePulses",
+                truth_extractor_name="truth",
+                truth_label_names=TRUTH.ICECUBE86,
+            )
+        converter: DataConverter = I3ToLMDBConverter(
+            extractors=extractors,
+            outdir=outdir,
+            gcd_rescue=gcd_rescue,
+            num_workers=num_workers,
+            **lmdb_kwargs,
+        )
+    else:
+        converter = CONVERTER_CLASS[backend](
+            extractors=extractors,
+            outdir=outdir,
+            gcd_rescue=gcd_rescue,
+            workers=num_workers,
+        )
+
     converter(inputs)
-    if backend == "sqlite":
+    if backend in ["sqlite", "lmdb"]:
         converter.merge_files()
 
 
-def main_icecube_upgrade(backend: str) -> None:
+def main_icecube_upgrade(
+    backend: str,
+    precompute_representation: bool = False,
+    num_workers: int = 1,
+) -> None:
     """Convert IceCube-Upgrade I3 files to intermediate `backend` format."""
-    # Check(s)
     assert backend in CONVERTER_CLASS
 
     inputs = [f"{TEST_DATA_DIR}/i3/upgrade_genie_step4_140028_000998"]
     outdir = f"{EXAMPLE_OUTPUT_DIR}/convert_i3_files/upgrade"
     gcd_rescue = glob(
         f"{TEST_DATA_DIR}/i3/upgrade_genie_step4_140028_000998/*GeoCalib*"
     )[0]
-    workers = 1
-
-    converter: DataConverter = CONVERTER_CLASS[backend](
-        extractors=[
-            I3TruthExtractor(),
-            I3RetroExtractor(),
-            I3FeatureExtractorIceCubeUpgrade("I3RecoPulseSeriesMap_mDOM"),
-            I3FeatureExtractorIceCubeUpgrade("I3RecoPulseSeriesMap_DEgg"),
-        ],
-        outdir=outdir,
-        workers=workers,
-        gcd_rescue=gcd_rescue,
-    )
+
+    pulsemap = "I3RecoPulseSeriesMap_mDOM"
+    extractors = [
+        I3TruthExtractor(),
+        I3RetroExtractor(),
+        I3FeatureExtractorIceCubeUpgrade(pulsemap),
+        I3FeatureExtractorIceCubeUpgrade("I3RecoPulseSeriesMap_DEgg"),
+    ]
+
+    if backend == "lmdb":
+        lmdb_kwargs: Dict[str, Any] = {}
+        if precompute_representation:
+            data_representation = KNNGraph(
+                detector=IceCubeUpgrade(),
+                nb_nearest_neighbours=8,
+                input_feature_names=FEATURES.UPGRADE,
+            )
+            lmdb_kwargs.update(
+                data_representation=data_representation,
+                pulsemap_extractor_name=pulsemap,
+                truth_extractor_name="truth",
+                truth_label_names=TRUTH.UPGRADE,
+            )
+        converter: DataConverter = I3ToLMDBConverter(
+            extractors=extractors,
+            outdir=outdir,
+            gcd_rescue=gcd_rescue,
+            num_workers=num_workers,
+            **lmdb_kwargs,
+        )
+    else:
+        converter = CONVERTER_CLASS[backend](
+            extractors=extractors,
+            outdir=outdir,
+            gcd_rescue=gcd_rescue,
+            workers=num_workers,
+        )
+
     converter(inputs)
-    if backend == "sqlite":
+    if backend in ["sqlite", "lmdb"]:
         converter.merge_files()
 
 
@@ -92,22 +160,55 @@ def main_icecube_upgrade(backend: str) -> None:
     if not has_icecube_package():
         Logger(log_folder=None).error(ERROR_MESSAGE_MISSING_ICETRAY)
     else:
-        # Parse command-line arguments
         parser = ArgumentParser(
             description="""
 Convert I3 files to an intermediate format.
 """
         )
 
-        parser.add_argument("backend", choices=["sqlite", "parquet"])
+        parser.add_argument(
+            "backend",
+            nargs="?",
+            choices=["lmdb", "sqlite", "parquet"],
+            default="lmdb",
+            help="Backend format to convert to (default: %(default)s)",
+        )
         parser.add_argument(
             "detector", choices=["icecube-86", "icecube-upgrade"]
         )
+        parser.add_argument(
+            "--precompute-representation",
+            action="store_true",
+            default=False,
+            help="Pre-compute a KNN graph representation and store it in "
+            "the LMDB database. Only supported with the lmdb backend.",
+        )
+        parser.add_argument(
+            "--workers",
+            type=int,
+            default=1,
+            help="Number of worker processes for parallel conversion "
+            "(default: %(default)s).",
+        )
 
         args, unknown = parser.parse_known_args()
 
-        # Run example script
+        if args.precompute_representation and args.backend != "lmdb":
+            Logger(log_folder=None).warning(
+                "--precompute-representation is only supported with the lmdb "
+                "backend. Ignoring."
+            )
+            args.precompute_representation = False
+
         if args.detector == "icecube-86":
-            main_icecube86(args.backend)
+            main_icecube86(
+                args.backend,
+                args.precompute_representation,
+                args.workers,
+            )
         else:
-            main_icecube_upgrade(args.backend)
+            main_icecube_upgrade(
+                args.backend,
+                args.precompute_representation,
+                args.workers,
+            )
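Given the parser above, the updated example can be run as, e.g., ``python <script.py> lmdb icecube-86 --precompute-representation --workers 4`` (the script's path is not shown in this excerpt); since ``backend`` is now declared with ``nargs="?"`` and defaults to ``lmdb``, it may be omitted entirely.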

setup.py

Lines changed: 1 addition & 0 deletions

@@ -27,6 +27,7 @@
     "polars >=0.19",
     "torchscale==0.2.0",
     "h5py>= 3.7.0",
+    "lmdb>=1.4.1",
 ]
 
 EXTRAS_REQUIRE = {
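The new ``lmdb`` dependency is the standard LMDB binding for Python. A toy sketch of the key-value model it exposes, which ``LMDBWriter`` builds on (the key scheme and pickle serialization here are illustrative assumptions, not GraphNeT's actual storage layout):

    import lmdb
    import pickle

    # Open (or create) a memory-mapped LMDB environment.
    env = lmdb.open("example.lmdb", map_size=2**30)  # 1 GiB ceiling

    # Write: serialize an event dict and store it under a byte-string key.
    with env.begin(write=True) as txn:
        event = {"sensor_pos_x": [1.0, 2.0], "t": [10.5, 11.2]}
        txn.put(b"event_0", pickle.dumps(event))

    # Read: random access by key, then deserialize.
    with env.begin() as txn:
        restored = pickle.loads(txn.get(b"event_0"))

    env.close()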

src/graphnet/data/dataclasses.py

Lines changed: 6 additions & 0 deletions

@@ -10,6 +10,12 @@ class I3FileSet:  # noqa: D101
     gcd_file: str
 
 
+@dataclass
+class SQLiteFileSet:  # noqa: D101
+    db_path: str
+    event_nos: List[int]
+
+
 @dataclass
 class Settings:
     """Dataclass for workers in I3Deployer."""
