Commit 5694c18

Merge pull request #248 from markspec/241-nonreg
Support for Seismic Data Ingestion Without Dimension-Specific Indexing
2 parents (2b57c70 + 3e401e2), commit 5694c18

9 files changed: 1938 additions & 1297 deletions

File tree

poetry.lock

Lines changed: 1329 additions & 1193 deletions (generated file; diff not rendered by default)

src/mdio/commands/segy.py

Lines changed: 54 additions & 8 deletions
@@ -39,7 +39,7 @@
     \b
     --header-names inline,crossline
     --header-locations 189,193
-    --header-lengths 4,4
+    --header-types int32,int32

     \b
     Our recommended chunk sizes are:
@@ -212,8 +212,8 @@ def segy_import(
     index byte locations (starts from 1) are the minimum amount
     of information needed to index the file. However, we suggest
     giving names to the index dimensions, and if needed providing
-    the header lengths if they are not standard. By default, all header
-    entries are assumed to be 4-byte long.
+    the header types if they are not standard. By default, all header
+    entries are assumed to be 4 bytes long (int32).

     The chunk size depends on the data type, however, it can be
     chosen to accommodate any workflow's access patterns. See examples
@@ -267,7 +267,7 @@ def segy_import(
     Chunks: 8 shots x 2 cables x 256 channels x 512 samples
     --header-locations 9,213,13
     --header-names shot,cable,chan
-    --header-lengths 4,2,4
+    --header-types int32,int16,int32
     --chunk-size 8,2,256,512

     We can override the dataset grid by the `grid_overrides` parameter.
@@ -278,8 +278,8 @@ def segy_import(
     a cable number and channel numbers are sequential (i.e. each cable
     doesn't start with channel number 1); we can tell MDIO to ingest
     this with the correct geometry by calculating cable numbers and
-    wrapped channel numbers. Note the missing byte location and word
-    length for the "cable" index.
+    wrapped channel numbers. Note the missing byte location and type
+    for the "cable" index.


     Usage:
@@ -289,7 +289,7 @@ def segy_import(
     Chunks: 8 shots x 2 cables x 256 channels x 512 samples
     --header-locations 9,None,13
     --header-names shot,cable,chan
-    --header-lengths 4,None,4
+    --header-types int32,None,int32
     --chunk-size 8,2,256,512
     --grid-overrides '{"ChannelWrap": True, "ChannelsPerCable": 800,
                        "CalculateCable": True}'
@@ -299,7 +299,7 @@ def segy_import(
     sequential (aka. unwrapped), we can still ingest it like this.
     --header-locations 9,213,13
     --header-names shot,cable,chan
-    --header-lengths 4,2,4
+    --header-types int32,int16,int32
     --chunk-size 8,2,256,512
     --grid-overrides '{"ChannelWrap": True, "ChannelsPerCable": 800}'
     \b
@@ -309,6 +309,52 @@ def segy_import(
     In cases where the user does not know if the input has unwrapped
     channels but desires to store with wrapped channel index use:
     --grid-overrides '{"AutoChannelWrap": True}'
+
+    \b
+    For cases with no well-defined trace header for indexing, a NonBinned
+    grid override is provided. This creates the index and assigns an
+    incrementing integer to each trace, first in, first out. For example,
+    a CDP- and offset-keyed file might carry real-world offsets in the
+    offset header, which would produce a very sparsely populated index.
+    Instead, the following override creates a new index from 1 to N, where
+    N is the number of offsets within a CDP ensemble. The auto-generated
+    index is called "trace". Note the required "chunksize" parameter in
+    the grid override: the non-binned dimension's chunk size is
+    independent of the index dimensions' chunk sizes, so it must be
+    specified in the grid override itself. Note also the lack of offset:
+    we index only CDP, provide the CDP header type, and give chunk sizes
+    only for the CDP and sample dimensions. The configuration below
+    yields 1MB chunks.
+    \b
+    --header-locations 21
+    --header-names cdp
+    --header-types int32
+    --chunk-size 4,1024
+    --grid-overrides '{"NonBinned": True, "chunksize": 64}'
+
+    \b
+    A more complicated case: a 5D dataset that is not binned in the
+    offset and azimuth directions can be ingested as below. The offset
+    and azimuth dimensions will be combined into a "trace" dimension.
+    The configuration below yields 1MB chunks.
+    \b
+    --header-locations 189,193
+    --header-names inline,crossline
+    --header-types int32,int32
+    --chunk-size 4,4,1024
+    --grid-overrides '{"NonBinned": True, "chunksize": 16}'
+
+    \b
+    For datasets with expected duplicate traces, we have the following
+    parameterization. This uses the same logic as NonBinned with a fixed
+    chunksize of 1. The other keys are still important. The example below
+    allows multiple traces per receiver (i.e. a reshoot).
+    \b
+    --header-locations 9,213,13
+    --header-names shot,cable,chan
+    --header-types int32,int16,int32
+    --chunk-size 8,2,256,512
+    --grid-overrides '{"HasDuplicates": True}'
     """
     mdio.segy_to_mdio(
         segy_path=input_segy_path,
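The 1MB figures quoted in the docstring are easy to verify: a chunk's byte size is the product of the per-dimension chunk lengths (including the grid-override "chunksize" for the non-binned "trace" dimension) times the bytes per sample. A minimal sketch, assuming 4-byte samples; the helper name is illustrative, not part of MDIO:

```python
from math import prod


def chunk_nbytes(chunks: tuple[int, ...], itemsize: int = 4) -> int:
    """Bytes per chunk: product of chunk lengths times bytes per sample."""
    return prod(chunks) * itemsize


# NonBinned CDP example: trace chunksize 64 plus --chunk-size 4,1024
print(chunk_nbytes((64, 4, 1024)))  # 1048576 bytes = 1 MiB

# 5D NonBinned example: trace chunksize 16 plus --chunk-size 4,4,1024
print(chunk_nbytes((16, 4, 4, 1024)))  # 1048576 bytes = 1 MiB
```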

src/mdio/converters/segy.py

Lines changed: 82 additions & 24 deletions
@@ -258,22 +258,71 @@ def segy_to_mdio(
     channels but desires to store with wrapped channel index use:
     >>> grid_overrides={"AutoChannelWrap": True,
                         "AutoChannelTraceQC": 1000000}
-    """
-    num_index = len(index_bytes)
-
-    if chunksize is None:
-        if num_index == 1:
-            chunksize = (512,) * 2
-
-        elif num_index == 2:
-            chunksize = (64,) * 3
-
-        else:
-            msg = (
-                f"Default chunking for {num_index + 1}-D seismic data is "
-                "not implemented yet. Please explicity define chunk sizes."
-            )
-            raise NotImplementedError(msg)
+
+    For cases with no well-defined trace header for indexing, a NonBinned
+    grid override is provided. This creates the index and assigns an
+    incrementing integer to each trace, first in, first out. For example,
+    a CDP- and offset-keyed file might carry real-world offsets in the
+    offset header, which would produce a very sparsely populated index.
+    Instead, the following override creates a new index from 1 to N, where
+    N is the number of offsets within a CDP ensemble. The auto-generated
+    index is called "trace". Note the required "chunksize" parameter in
+    the grid override: the non-binned dimension's chunk size is independent
+    of the index dimensions' chunk sizes, so it must be specified in the
+    grid override itself. Note also the lack of offset: we index only CDP,
+    provide the CDP header type, and give chunk sizes only for the CDP and
+    sample dimensions. The configuration below yields 1MB chunks:
+
+    >>> segy_to_mdio(
+    ...     segy_path="prefix/cdp_offset_file.segy",
+    ...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
+    ...     index_bytes=(21,),
+    ...     index_types=("int32",),
+    ...     index_names=("cdp",),
+    ...     chunksize=(4, 1024),
+    ...     grid_overrides={"NonBinned": True, "chunksize": 64},
+    ... )
+
+    A more complicated case: a 5D dataset that is not binned in the offset
+    and azimuth directions can be ingested as below. The offset and azimuth
+    dimensions will be combined into a "trace" dimension. The configuration
+    below yields 1MB chunks.
+
+    >>> segy_to_mdio(
+    ...     segy_path="prefix/cdp_offset_file.segy",
+    ...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
+    ...     index_bytes=(189, 193),
+    ...     index_types=("int32", "int32"),
+    ...     index_names=("inline", "crossline"),
+    ...     chunksize=(4, 4, 1024),
+    ...     grid_overrides={"NonBinned": True, "chunksize": 16},
+    ... )
+
+    For datasets with expected duplicate traces, we have the following
+    parameterization. This uses the same logic as NonBinned with a fixed
+    chunksize of 1. The other keys are still important. The example below
+    allows multiple traces per receiver (i.e. a reshoot).
+
+    >>> segy_to_mdio(
+    ...     segy_path="prefix/cdp_offset_file.segy",
+    ...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
+    ...     index_bytes=(9, 213, 13),
+    ...     index_types=("int32", "int16", "int32"),
+    ...     index_names=("shot", "cable", "chan"),
+    ...     chunksize=(8, 2, 256, 512),
+    ...     grid_overrides={"HasDuplicates": True},
+    ... )
+    """
+    num_index = len(index_bytes)
+
+    if chunksize is not None:
+        if len(chunksize) != len(index_bytes) + 1:
+            message = (
+                f"Length of chunks={len(chunksize)} must be "
+                f"equal to array dimensions={len(index_bytes) + 1}"
+            )
+            raise ValueError(message)

     if storage_options is None:
         storage_options = {}
@@ -296,14 +345,15 @@ def segy_to_mdio(

     index_types = parse_index_types(index_types, num_index)

-    dimensions, index_headers = get_grid_plan(
+    dimensions, chunksize, index_headers = get_grid_plan(
         segy_path=segy_path,
         segy_endian=endian,
         index_bytes=index_bytes,
         index_names=index_names,
         index_types=index_types,
         binary_header=binary_header,
         return_headers=True,
+        chunksize=chunksize,
         grid_overrides=grid_overrides,
     )

@@ -316,6 +366,10 @@ def segy_to_mdio(

     # Check grid validity by comparing trace numbers
     if np.sum(grid.live_mask) != num_traces:
+        for dim_name in grid.dim_names:
+            dim_min, dim_max = grid.get_min(dim_name), grid.get_max(dim_name)
+            logger.warning(f"{dim_name} min: {dim_min} max: {dim_max}")
+        logger.warning(f"Ingestion grid shape: {grid.shape}.")
         raise GridTraceCountError(np.sum(grid.live_mask), num_traces)

     zarr_root = create_zarr_hierarchy(
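The new warnings above fire when the built grid cannot account for every trace in the file. A toy sketch of the underlying check, with a NumPy boolean array standing in for `grid.live_mask` (the shape and counts here are made up for illustration):

```python
import numpy as np

# Toy live mask: True where a trace landed on the ingestion grid.
live_mask = np.zeros((4, 2, 10), dtype=bool)  # e.g. shot x cable x channel
live_mask[:, :, :9] = True  # 4 * 2 * 9 = 72 traces placed on the grid

num_traces = 80  # trace count implied by the SEG-Y file itself

if live_mask.sum() != num_traces:
    # The real converter logs each dimension's min/max and the grid shape,
    # then raises GridTraceCountError(np.sum(grid.live_mask), num_traces).
    print(f"grid holds {live_mask.sum()} traces, file has {num_traces}")
```

A mismatch like this usually means a wrong header byte location or a grid override that is needed but missing.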
@@ -358,20 +412,24 @@ def segy_to_mdio(
     )

     if chunksize is None:
-        suffix = [str(x) for x in range(len(index_bytes) + 1)]
-        suffix = "".join(suffix)
+        dim_count = len(index_headers) + 1
+        if dim_count == 2:
+            chunksize = (512,) * 2

-    else:
-        if len(chunksize) != len(index_bytes) + 1:
-            message = (
-                f"Length of chunks={len(chunksize)} must be ",
-                f"equal to array dimensions={len(index_bytes) + 1}",
+        elif dim_count == 3:
+            chunksize = (64,) * 3
+
+        else:
+            msg = (
+                f"Default chunking for {dim_count}-D seismic data is "
+                "not implemented yet. Please explicitly define chunk sizes."
             )
-            raise ValueError(message)
+            raise NotImplementedError(msg)

-        suffix = [
-            dim_chunksize if dim_chunksize > 0 else None for dim_chunksize in chunksize
-        ]
+        suffix = [str(x) for x in range(dim_count)]
+        suffix = "".join(suffix)
+    else:
+        suffix = [dim_chunks if dim_chunks > 0 else None for dim_chunks in chunksize]
         suffix = [str(idx) for idx, value in enumerate(suffix) if value is not None]
         suffix = "".join(suffix)
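The suffix juggling in this hunk encodes which dimensions are actually chunked into an access-pattern label: dimensions with a positive chunk size keep their index, and non-positive ones are dropped. A standalone sketch of that logic (the function name is illustrative, not MDIO's):

```python
def access_pattern_suffix(chunksize: tuple[int, ...]) -> str:
    """Indices of dimensions with chunk size > 0, concatenated into a label."""
    kept = [dim if dim > 0 else None for dim in chunksize]
    return "".join(str(idx) for idx, value in enumerate(kept) if value is not None)


print(access_pattern_suffix((8, 2, 256, 512)))  # "0123": all four dims chunked
print(access_pattern_suffix((8, 0, 256, 512)))  # "023": second dim dropped
```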
