Context
When working with muxed packet files generated from ground testing data, we often have to inefficiently parse thousands of files, filtering for the APID of interest. This is quite slow because create_dataset parses each file serially.
Driving Requirements
Support an optional n_workers kwarg to create_dataset to enable multiprocessing of input packet files.
Implementation Requirements
To avoid the overhead of creating many processes, multiprocessing inside create_dataset should spin up n_workers processes (or n_files processes if n_files < n_workers) and send roughly an equal number of packet files to each worker process.
This parallelization should apply only to the parsing loop itself, with all post-processing to numpy data types occurring afterwards in the main process.
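The two requirements above can be sketched as follows. This is a rough illustration, not the real implementation: parse_file is a hypothetical stand-in for the per-file parsing step, and the chunking strategy is an assumption.

```python
# Sketch of the proposed parallel parsing loop (assumed names, not real code).
import itertools
from multiprocessing import Pool


def parse_file(path):
    """Hypothetical stand-in for per-file packet parsing (APID filtering)."""
    # Real code would parse packets from `path` and return raw field values.
    return [{"source": path, "packet": i} for i in range(2)]


def parse_files_parallel(packet_files, n_workers=None):
    """Parse files in parallel; numpy post-processing happens after merging."""
    if not n_workers or n_workers <= 1:
        # Serial fallback preserves the current behavior.
        per_file_results = [parse_file(f) for f in packet_files]
    else:
        # Never spin up more processes than there are files to parse.
        n_procs = min(n_workers, len(packet_files))
        with Pool(processes=n_procs) as pool:
            # chunksize spreads the files roughly evenly across workers.
            chunksize = max(1, len(packet_files) // n_procs)
            per_file_results = pool.map(parse_file, packet_files,
                                        chunksize=chunksize)
    # Flatten per-file results back into one packet list, preserving file order,
    # so downstream numpy post-processing is unchanged.
    return list(itertools.chain.from_iterable(per_file_results))
```

Because the workers only run the parsing loop, the merged list can be handed to the existing numpy post-processing unchanged.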
Considerations
Make sure we aren't forgetting about any consistency checking that occurs in the packet parsing loop (e.g. duplicate detection or anything like that). I don't think this will be an issue, but it is good to think about: if each worker only sees its own subset of files, any check that spans files must run on the merged result.
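One way to keep such a check correct under parallelism is to run it after merging the per-worker results. The sketch below assumes duplicates are identified by an (APID, sequence count) pair; the actual key and packet structure would depend on what the parsing loop tracks today.

```python
# Sketch of a post-merge duplicate check (assumed key, not the real one).
def check_duplicates(packets):
    """Return (apid, seq_count) keys seen more than once in the merged list.

    When workers parse disjoint sets of files, a per-worker check would miss
    duplicates that span files, so this must run on the combined result.
    """
    seen = set()
    duplicates = []
    for pkt in packets:
        key = (pkt["apid"], pkt["seq_count"])
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

Running this once on the merged list keeps the semantics identical to the current serial loop, at the cost of a single extra pass over the packets.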