Support Multiprocessing in create_dataset #241

@medley56

Description

Context

When working with muxed packet files generated from ground testing data, we often must parse thousands of files while filtering for the APID of interest. This is slow because create_dataset parses each file serially.

Driving Requirements

Support an optional n_workers kwarg to create_dataset to enable multiprocessing of input packet files.

Implementation Requirements

To avoid the overhead of creating one process per file, multiprocessing inside create_dataset should probably spin up n_workers processes (or n_files processes if n_files < n_workers) and distribute roughly equal numbers of packet files to each worker process.

This parallelization should apply only to the parsing loop itself, with all the post-processing for numpy data types occurring afterward.
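The requirements above could be sketched roughly as follows. This is a minimal illustration, not the actual create_dataset implementation: `_parse_one_file` is a hypothetical stand-in for the existing per-file parsing loop, and the packet-dict structure is assumed for illustration.

```python
import multiprocessing
from typing import Any


def _parse_one_file(packet_file: str) -> list[dict[str, Any]]:
    # Hypothetical placeholder for the existing serial per-file parsing loop.
    # In create_dataset this would yield the parsed packets for one file.
    return [{"file": packet_file, "apid": 100}]


def parse_files_parallel(packet_files: list[str], n_workers: int = 1) -> list[dict[str, Any]]:
    """Parse packet files using at most n_workers processes.

    Spins up min(n_workers, n_files) processes so we never create
    more workers than there are files to parse.
    """
    n_procs = min(n_workers, len(packet_files))
    if n_procs <= 1:
        # Serial fallback; avoids process-creation overhead entirely.
        per_file_results = [_parse_one_file(f) for f in packet_files]
    else:
        # Pool.map distributes the file list roughly evenly across workers.
        with multiprocessing.Pool(processes=n_procs) as pool:
            per_file_results = pool.map(_parse_one_file, packet_files)
    # Flatten per-file results; the numpy dtype post-processing would run
    # here, serially, on the combined output.
    return [pkt for file_pkts in per_file_results for pkt in file_pkts]
```

Only the per-file parsing runs in worker processes; everything after the flatten is back in the parent process, matching the requirement that post-processing stay serial.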

Considerations

Make sure we aren't forgetting any consistency checking that occurs in the packet parsing loop (e.g. duplicate detection), since each worker would only see its own subset of files. I don't think this will be an issue, but it's worth thinking about.

Metadata

Labels: enhancement (New feature or request)