Three writers are implemented in GraphNeT: the :code:`SQLiteWriter`, :code:`ParquetWriter`, and :code:`LMDBWriter`, each of which outputs files that can be used directly for training by :code:`SQLiteDataset`, :code:`ParquetDataset`, and :code:`LMDBDataset`, respectively.

Three specific implementations of :code:`Dataset` exist:

- `ParquetDataset <https://graphnet-team.github.io/graphnet/api/graphnet.data.parquet.parquet_dataset.html>`_ : Constructs :code:`Dataset` from files created by :code:`ParquetWriter`.
- `SQLiteDataset <https://graphnet-team.github.io/graphnet/api/graphnet.data.sqlite.sqlite_dataset.html>`_ : Constructs :code:`Dataset` from files created by :code:`SQLiteWriter`.
- `LMDBDataset <https://graphnet-team.github.io/graphnet/api/graphnet.data.dataset.lmdb.lmdb_dataset.html>`_ : Constructs :code:`Dataset` from files created by :code:`LMDBWriter`.

To instantiate a :code:`Dataset` from your files, you must specify at least the following:

- :code:`pulsemaps`: These are named fields in your Parquet files, or tables in your SQLite or LMDB databases, which store one or more pulse series from which you would like to create a dataset. A pulse series represents the detector response, in the form of a series of PMT hits or pulses, in some time window, usually triggered by a single neutrino or atmospheric muon interaction. This is the data that will be served as input to the :code:`Model`.
- :code:`truth_table`: The name of a table/array that contains the truth-level information associated with the pulse series, and should contain the truth labels that you would like to reconstruct or classify. Often this table will contain the true physical attributes of the primary particle — such as its true direction, energy, PID, etc. — and is therefore graph- or event-level (as opposed to the pulse series tables, which are node- or hit-level) truth information.
- :code:`features`: The names of the columns in your pulse series table(s) that you would like to include for training; they typically constitute the per-node/-hit features such as xyz-position of sensors, charge, and photon arrival times.
- :code:`truth`: The columns in your truth table/array that you would like to include in the dataset.

Or similarly for Parquet files:

.. code-block:: python

    # ... construct `dataset` with ParquetDataset, analogously to the example above ...
    graph = dataset[0]  # torch_geometric.data.Data

Or similarly for LMDB files:

.. code-block:: python

    from graphnet.data.dataset.lmdb.lmdb_dataset import LMDBDataset
    from graphnet.models.detector.prometheus import Prometheus
    from graphnet.models.graphs import KNNGraph
    from graphnet.models.graphs.nodes import NodesAsPulses

    # Construct the dataset analogously to the SQLite/Parquet examples above;
    # the file path is illustrative.
    graph_definition = KNNGraph(detector=Prometheus(), node_definition=NodesAsPulses())
    dataset = LMDBDataset(
        path="data.lmdb",
        graph_definition=graph_definition,
        pulsemaps=pulsemaps,
        features=features,
        truth=truth,
        truth_table=truth_table,
    )

It's then straightforward to create a :code:`DataLoader` for training, which will take care of batching, shuffling, and such:

.. code-block:: python

    from graphnet.data.dataloader import DataLoader

    # Batch size and worker count are illustrative values.
    dataloader = DataLoader(dataset, batch_size=128, num_workers=10)

By default, the following fields will be available in a graph built by :code:`Dataset`:

- :code:`graph[truth_label] for truth_label in truth`: For each truth label in the :code:`truth` argument, the corresponding data is stored as a :code:`[num_rows, 1]` dimensional tensor. E.g., :code:`graph["energy"] = torch.tensor(26, dtype=torch.float)`
- :code:`graph[feature] for feature in features`: For each feature given in the :code:`features` argument, the corresponding data is stored as a :code:`[num_rows, 1]` dimensional tensor. E.g., :code:`graph["sensor_x"] = torch.tensor([100, -200, -300, 200], dtype=torch.float)`

:code:`SQLiteDataset` vs. :code:`ParquetDataset` vs. :code:`LMDBDataset`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Besides working on different file formats, :code:`SQLiteDataset`, :code:`ParquetDataset`, and :code:`LMDBDataset` have significant differences,
which may lead you to choose one over the other, depending on the problem at hand.

:SQLiteDataset: SQLite provides fast random access to all events inside it. This makes plotting and subsampling your dataset particularly easy,

This means that the subsampling of your dataset needs to happen prior to the conversion to :code:`parquet`, unlike :code:`SQLiteDataset`, which allows for subsampling after conversion, due to its fast random access.
Conversion of files to :code:`parquet` is significantly faster than its :code:`SQLite` counterpart.

:LMDBDataset: LMDB databases produced by :code:`LMDBWriter` store events as key-value pairs with configurable serialization methods (pickle, json, msgpack, dill).
    :code:`LMDBDataset` supports two modes: reading raw tables and computing data representations in real time (similar to :code:`SQLiteDataset`), or reading pre-computed data representations directly from the database for faster access.
    LMDB provides fast random access similar to SQLite, while also supporting efficient storage of pre-computed graph representations, making it suitable for scenarios where you want to pre-compute and cache data representations.
    LMDB takes up roughly half the space of SQLite, and is therefore a good compromise between SQLite and Parquet.
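
The key-value storage pattern described above can be sketched in plain Python. This is an illustrative sketch only, not GraphNeT's implementation: a :code:`dict` stands in for the LMDB environment, and only the pickle and json serializers are shown.

.. code-block:: python

    import json
    import pickle

    # Map serializer names to (dumps, loads) pairs producing/consuming bytes,
    # analogous to the configurable serialization methods of LMDBWriter.
    SERIALIZERS = {
        "pickle": (pickle.dumps, pickle.loads),
        "json": (
            lambda obj: json.dumps(obj).encode("utf-8"),
            lambda raw: json.loads(raw.decode("utf-8")),
        ),
    }

    def write_event(db, key, event, method="pickle"):
        """Serialize `event` and store it under a bytes key."""
        dumps, _ = SERIALIZERS[method]
        db[key.encode("utf-8")] = dumps(event)

    def read_event(db, key, method="pickle"):
        """Fetch and deserialize the event stored under `key`."""
        _, loads = SERIALIZERS[method]
        return loads(db[key.encode("utf-8")])

    db = {}  # stand-in for an LMDB environment/transaction
    write_event(db, "event_0", {"energy": 26.0, "zenith": 1.2}, method="json")
    event = read_event(db, "event_0", method="json")

Reading an event is then a single key lookup plus one deserialization, which is what makes per-event random access cheap in this storage layout.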

.. note::

    :code:`ParquetDataset` is scalable to ultra large datasets, but is more difficult to work with and has a higher memory consumption.

    :code:`SQLiteDataset` does not scale to very large datasets, but is easy to work with and has minimal memory consumption.

    :code:`LMDBDataset` provides a balance between SQLite and Parquet, offering fast random access and support for pre-computed representations, making it well-suited for scenarios where data representations are computed once and reused multiple times.

Choosing a subset of events using `selection`
----------------------------------------------

Passing, e.g., :code:`selection=[0, 1, 2, 3, 4]` would produce a :code:`Dataset` with only those five events.

.. note::

    For :code:`SQLiteDataset` and :code:`LMDBDataset`, the :code:`selection` argument specifies individual events chosen for the dataset, whereas for :code:`ParquetDataset`, the :code:`selection` argument specifies which batches are used in the dataset.
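
The distinction between per-event and per-batch selection can be illustrated with plain Python lists. This is a sketch only; the batch size of five is an arbitrary assumption, not a GraphNeT default.

.. code-block:: python

    # Ten events, stored either individually (SQLite/LMDB)
    # or in batches of five (Parquet).
    events = list(range(10))
    batch_size = 5
    batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

    # SQLiteDataset / LMDBDataset: `selection` picks individual events.
    event_selection = [0, 2, 4]
    selected_events = [events[i] for i in event_selection]

    # ParquetDataset: `selection` picks whole batches of events.
    batch_selection = [1]
    selected_batches = [event for b in batch_selection for event in batches[b]]

With batched storage, choosing one batch index pulls in every event inside that batch, which is why fine-grained subsampling must happen before conversion to Parquet.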

You can combine multiple instances of :code:`Dataset` from GraphNeT into a single :code:`EnsembleDataset`:

.. code-block:: python

    from graphnet.data import EnsembleDataset
    from graphnet.data.parquet import ParquetDataset
    from graphnet.data.sqlite import SQLiteDataset
    from graphnet.data.dataset.lmdb.lmdb_dataset import LMDBDataset

    # Combine datasets of different backends into one training dataset
    # (the individual instances are constructed as in the examples above).
    ensemble_dataset = EnsembleDataset([sqlite_dataset, parquet_dataset, lmdb_dataset])