Next generation data storage requirements gathering
This page collects notes that will eventually turn into a spec for next-generation data storage within QCoDeS.
Arguments for starting the design of a new data store:
- Current QCoDeS database storage is slow because it writes to an SQLite DB file and then exports the data to the desired target format after the measurement ends.
- Current QCoDeS database storage is not optimized for storing large (multi-dimensional) arrays.
- Current QuTech core-tools database can only handle data stored as an n-dimensional array.
- Make data storing in QCoDeS more disk-space efficient by not storing raw data in the SQLite DB file (we are not aware of any users who want raw data inside the SQLite DB file; it is a legacy design decision with questionable justification).
Opportunities when designing a new data store:
- Enhance metadata and semantics for searching, presentation and interpretation of data, for example, extra metadata about the parameters of the measurement (e.g. this variable is a qubit state)
- easy processing of similar data in the store.
- standardized mechanism to add semantics (and users can define those as per their needs)
- Include a way for synchronization mechanisms that copy data to remote storage outside QCoDeS to know whether a dataset has been updated/modified.
- In particular, synchronization bookkeeping is needed to know whether data has been synchronized and whether there have been changes since the last sync. A fingerprint/hash code or a logging mechanism could be used.
- Synchronization could be "partitioned" to only update a part, e.g. metadata.
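One way to support such bookkeeping is to keep a fingerprint per partition (e.g. metadata versus data chunks), so a sync tool can detect which parts changed since the last sync and upload only those. A minimal stdlib sketch; all function names are illustrative, not a proposed API:

```python
import hashlib
import json


def section_fingerprints(metadata: dict, data_chunks: list) -> dict:
    """Fingerprint each section separately so synchronization can be partial."""
    meta_hash = hashlib.sha256(
        json.dumps(metadata, sort_keys=True).encode()
    ).hexdigest()
    data_hash = hashlib.sha256()
    for chunk in data_chunks:  # each chunk is a bytes object
        data_hash.update(chunk)
    return {"metadata": meta_hash, "data": data_hash.hexdigest()}


def changed_sections(old: dict, new: dict) -> set:
    """Return the set of sections that need re-syncing."""
    return {key for key in new if old.get(key) != new[key]}
```

A sync tool would store the fingerprints from the last successful sync and compare them against freshly computed ones, transferring only the sections whose fingerprint changed.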
Future proof, ready for evolution:
- It should be possible to add features without breaking compatibility.
- "Old" tools should still be able to read newer files, although they cannot use the new features.
- New tools should be able to read older files.
- Include version and option specification in stored data.
- The data storage could already support/implement features that are not yet available through the QCoDeS Parameter.
- Data presentation and processing software could check for supported / unsupported features.
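The version-and-options idea can be made concrete by storing a format version plus feature flags in the file, and having readers distinguish optional features they merely don't understand (ignore with a warning, keeping old tools able to read newer files) from required features they must not ignore. A hypothetical sketch; the header keys are assumptions, not a proposed schema:

```python
# Features this particular reader implementation understands (illustrative).
SUPPORTED_FEATURES = {"gridded", "flat", "enum-labels"}


def check_compat(header: dict) -> list:
    """Warn about unknown optional features; refuse only on unknown required ones.

    `header` is assumed to look like:
    {"format_version": [1, 2], "features": [...], "required_features": [...]}
    """
    unknown_required = set(header.get("required_features", [])) - SUPPORTED_FEATURES
    if unknown_required:
        raise ValueError(
            f"cannot read file, missing support for: {sorted(unknown_required)}"
        )
    unknown = set(header.get("features", [])) - SUPPORTED_FEATURES
    return [f"ignoring unsupported feature: {f}" for f in sorted(unknown)]
```

This split is what lets old tools read newer files: a new optional feature only produces a warning, while a feature that changes the meaning of the stored bytes is marked required and makes old readers fail loudly instead of misreading data.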
General:
- Data format should be based on a well-known standard and readable with known public tools (all data can be read without QCoDeS, but QCoDeS will likely have the best ways of reading it and making sense of it).
- Writing data should be efficient without significant delays in the measurement.
- efficient way of transferring data from the measurement to disk, e.g. avoiding unnecessary data copying in memory
- avoid writing redundant data/coordinates/etc
- prefer writing data in chunks, appending to already-written data, rather than rewriting the whole dataset from the beginning when a new chunk arrives
- Reading data should also be efficient.
- Opening a dataset with 1e8 values (10 variables, dim: 100x100x1000 points, ~1 GB) should take no more than a few seconds.
- Reliable storage: none or minimized data loss when measurement process is killed.
- e.g. it is acceptable that the current chunk is missing, but the already-written dataset must still be valid (NOT corrupted).
- It should be possible to load one variable without loading the whole dataset into memory first.
- The data storage layer should be independent of the ways the measurement can be implemented; however, it has to support the various flavours seen in scientific exploratory work.
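The chunked-append and crash-safety requirements above combine naturally in a log-style layout where each chunk is written with its own length prefix and checksum: a measurement process killed mid-write loses at most the final, incomplete chunk, and everything before it stays valid. A stdlib-only sketch of the idea (real backends such as zarr achieve the same property via independent per-chunk files):

```python
import struct
import zlib


def append_chunk(buf: bytearray, payload: bytes) -> None:
    """Append one self-delimiting chunk: [length][crc32][payload]."""
    buf += struct.pack("<II", len(payload), zlib.crc32(payload)) + payload


def read_chunks(buf: bytes) -> list:
    """Read back all complete, valid chunks; drop a truncated or corrupt tail."""
    chunks, offset = [], 0
    while offset + 8 <= len(buf):
        length, crc = struct.unpack_from("<II", buf, offset)
        payload = buf[offset + 8 : offset + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete/corrupt tail: earlier chunks stay valid
        chunks.append(payload)
        offset += 8 + length
    return chunks
```

Because each chunk validates independently, a reader never needs a "clean shutdown" marker: it simply stops at the first chunk that fails its checksum.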
Raw data requirements:
- Storage should support multiple data shapes and ordering:
- Storage format should allow efficient storage of n-dimensional arrays (gridded data) written in sequential order.
- Storage format should allow storage of "flat" data from a random sampled n-dimensional space.
- Storage format should have some flexibility in combining efficient sequential data and random sampling.
- Keep the design open to supporting non-orthogonal or non-standard grids (e.g. a grid that starts coarse and becomes more precise in a particular region).
- Multiple data types should be properly supported: float, complex, int, bool (in common sizes), and strings (also categorical).
- Secondary axis for the same coordinate should be supported.
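To make the gridded-versus-flat distinction above concrete: gridded data stores each coordinate axis once and the values as an n-D block, while flat data stores one coordinate tuple per sample. A small illustrative sketch that converts flat (x, y, value) samples to gridded form when, and only when, the samples actually fill a full grid; the function name is an assumption, not a proposed API:

```python
def flat_to_grid(samples: list):
    """Convert flat (x, y, value) samples to gridded form if they fill a grid.

    Returns (x_axis, y_axis, grid) or None when the samples are scattered.
    """
    xs = sorted({s[0] for s in samples})
    ys = sorted({s[1] for s in samples})
    lookup = {(x, y): v for x, y, v in samples}
    if len(lookup) != len(xs) * len(ys):
        return None  # random-sampled / scattered data: keep it flat
    # Each axis is stored once; values become a dense 2-D block.
    grid = [[lookup[(x, y)] for y in ys] for x in xs]
    return xs, ys, grid
```

The storage format needs both representations natively: forcing scattered samples onto a grid wastes space on NaN padding, while storing gridded data flat duplicates every coordinate value per point.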
Metadata requirements:
- Semantics and formats of metadata are well-documented such that it is easy to implement export/transformation into a different storage format if users so desire
- Structure of "flat" data should be specified (see Quantify). Is it a Grid, etc.?
- Every "axis"/setpoint could have a specification. There are many possibilities: TODO: collect examples from labs.
- Integer data can be labelled as a bitset (e.g. a qubit register).
- Integer data can be labelled as an enum with a specification of the enum mapping; this also allows storing strings.
- All coordinates have attributes (name, label, unit).
- Dimensions without coordinates have a range specification, which is expanded when loading the data. This is especially useful for a dimension that represents repetitions of the measurement, and for linear sweeps: it saves storage space and also adds semantics. Note: currently about 1/3 of the data in the core-tools database is just repetitions of values from 0 to 999, about 1/2 consists of linearly increasing setpoint values, and probably less than 1/6 of all stored data is a non-trivial sequence. Idea: investigate whether compression algorithms can help with these cases.
- Implementation should provide an easy interface to write measurement data and to read measurement data.
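The range-specification idea above can be sketched as follows: instead of materialising a repetition index 0…999 or a linear sweep, store only a small spec and expand it on load. The spec keys here are illustrative, not a proposed schema:

```python
def expand_coordinate(spec: dict) -> list:
    """Expand a compact coordinate specification into explicit values on load."""
    kind = spec["kind"]
    if kind == "range":  # e.g. a repetition index 0..n-1
        return list(range(spec["start"], spec["stop"], spec.get("step", 1)))
    if kind == "linspace":  # a linear sweep stored as just three numbers
        start, stop, num = spec["start"], spec["stop"], spec["num"]
        step = (stop - start) / (num - 1)
        return [start + i * step for i in range(num)]
    if kind == "explicit":  # non-trivial sequence: store the values as-is
        return list(spec["values"])
    raise ValueError(f"unknown coordinate kind: {kind}")
```

A 1000-point repetition axis then costs a handful of bytes instead of 8 kB of int64 values, and the spec itself carries the semantics ("this is a repetition", "this is a linear sweep") that a raw array loses.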
Semantics (metadata): For data presentation, processing, searching, filtering
- Standard list of quantity and unit names: rad, degrees, ... (like the standards for meteorological data)
- Qubit state awareness: raw sensor value, single shot qubit state, averaged qubit state.
- Qubit registers: qubit numbering, including ancilla qubits
- Qubit measurement: initialization, mid-circuit, readout, post-selection (mask)
- Charge state: raw sensor value
- Sensor scan: coulomb peaks, resonance frequency
- Qubit measurement types:
- characterization: T1, T2, RB, Ramsey, Rabi, ...
- calibration: ...
- Relations, dependencies between variables:
- Measured variable that can be used on x-axis, e.g. Temperature.
- Relation to other datasets
- Grouping/Series, labelling/categories
Scenarios:
- Support ad-hoc/simple measurements like qcodes' doNd (not only the elaborate cases)
- Executing a multi-qubit circuit with calibration steps on the fly, storing data with different dimensions.
Design hints (or why not use xarray):
- xarray uses dimensions and coordinates.
- xarray can store scattered / streamed variable data with coordinates using 1 "dim".
- xarray can store n-D data in an efficient way avoiding duplication of coordinate values.
- xarray allows dimensions without coordinate values.
- Ideally the design/implementation should consist of a measurement-aware interface layer and a well-defined storage layer. It should be possible to replace the storage layer with another implementation without impact on the interface layer. Multiple storage layers could be possible: network / web / REST API, file based, database, in memory, ...
- Metadata could be stored in JSON format in a way that allows easy extension.
- metadata could be spread over multiple (xarray) attributes.
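Combining the two points above: extensible JSON metadata can be spread over flat (xarray-style) attributes by round-tripping one JSON string per top-level namespace, which keeps attribute stores such as netCDF/zarr happy while allowing arbitrary nesting. A sketch under that assumption; the `meta:` prefix is illustrative:

```python
import json


def metadata_to_attrs(metadata: dict) -> dict:
    """Flatten nested metadata into one JSON string per top-level namespace."""
    return {
        f"meta:{ns}": json.dumps(value, sort_keys=True)
        for ns, value in metadata.items()
    }


def attrs_to_metadata(attrs: dict) -> dict:
    """Recover nested metadata, ignoring unrelated native attributes."""
    return {
        key[len("meta:"):]: json.loads(value)
        for key, value in attrs.items()
        if key.startswith("meta:")
    }
```

The prefix lets QCoDeS-specific metadata coexist with plain attributes (units, labels) that other tools expect to find directly, and new namespaces can be added without touching existing readers.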
Implementation notes:
- xarray's zarr storage allows writing slices along one dimension. When writing, some dimensions will have to be "flattened".
- Backend formats to discuss: database, HDF5, netCDF, zarr?
- [Jens] Start with interfaces mocked and a diagram of how things work together, decoupling as much as possible
- [Sander] Try out simple things with xarray and zarr, and learn how things work at the low level, what performance is, and how we can build on top of it
- Investigate/clarify how to approach storing non-gridded data (e.g. sampled measurement, or scattered data) and reading it back
- Investigate how storage space can be saved with compression and what performance hit it adds for reading/storing. Consider zarr, netcdf4, hdf5, parquet/arrow.
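As a first data point for the compression question (and for the earlier observation that much stored data is repetitions or linear sweeps): even general-purpose compression shrinks such trivial sequences substantially. A stdlib sketch; real backends would use per-chunk compressors such as Blosc in zarr or zlib filters in HDF5/netCDF4, and a proper investigation must also measure the read/write performance hit:

```python
import struct
import zlib


def compression_ratio(values: list) -> float:
    """Raw size divided by zlib-compressed size for a float64 sequence."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return len(raw) / len(zlib.compress(raw, 6))


# The two dominant patterns noted above for the core-tools database:
repetitions = [float(i % 1000) for i in range(100_000)]  # 0..999 repeated
sweep = [i * 0.001 for i in range(100_000)]              # linear setpoints
```

The repeated 0..999 pattern compresses extremely well because whole blocks recur within the compressor's window; a linear sweep compresses less (its mantissa bytes look random) but the predictable exponent bytes still yield a measurable gain, which is one argument for storing sweeps as range specifications rather than relying on compression alone.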
```mermaid
graph LR
    AA[doNd] -- Creates --> A
    AB[ManualCreate] -- Creates --> A
    AC[Other?] -- Creates --> A
    A[DataSetDefinition] -- "dd.execute()" --> D[Dataset]
    A -- "with dd.manual_measurement():" --> D
```
The new DataSetDefinition replaces the current run description machinery (RunDescription + InterDeps) within the dataset.
The definition must be flexible enough to support both gridded and non-gridded datasets and be extendable to other data types as needed.
Need to discuss whether the DataSetDefinition needs to embed logic for actions ...
- `with dd.manual_execute() as ds:` creates a dataset that is preallocated to the correct shape. If the shape is not known, allocate N points and grow *2 on each reallocation. This writes dataset metadata, GUID, etc.
- `ds.add_result({param: val})` adds data to a buffer. For each chunk of data of size n (configurable), data is flushed to disk; this could be a zarr chunked dir or similar. The backend must be configurable, and flushing should happen on a background thread.
- `__exit__` finalizes the dataset (set shapes, clean up extra allocation, zip, convert to another format, send signal, trigger entry point, etc.).
What should happen for interrupted measurements? Should the remaining data be written as NaN or excluded?
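The preallocate-and-double growth strategy with chunked flushing described above can be sketched as follows. Flushing here just records chunks in memory; a real backend would write a zarr region or similar on a background thread. All names are illustrative:

```python
class ResultBuffer:
    """Buffer results; flush every `chunk_size` points; grow capacity *2 when full."""

    def __init__(self, initial_capacity: int = 4, chunk_size: int = 2):
        self._data = [None] * initial_capacity  # preallocated storage
        self._len = 0
        self._unflushed = 0
        self.chunk_size = chunk_size
        self.flushed_chunks: list = []  # stand-in for the storage backend

    def add_result(self, result: dict) -> None:
        if self._len == len(self._data):  # shape not known up front: grow *2
            self._data.extend([None] * len(self._data))
        self._data[self._len] = result
        self._len += 1
        self._unflushed += 1
        if self._unflushed >= self.chunk_size:
            self.flush()

    def flush(self) -> None:
        """Hand the unflushed tail to the backend (would be async in practice)."""
        if self._unflushed:
            self.flushed_chunks.append(
                self._data[self._len - self._unflushed : self._len]
            )
            self._unflushed = 0

    def finalize(self) -> list:
        """Trim the over-allocation and flush the remaining partial chunk."""
        self.flush()
        self._data = self._data[: self._len]
        return self._data
```

Finalization is exactly the `__exit__` step above: the last partial chunk is flushed and the extra allocation from doubling is trimmed so the stored shape matches the data actually written.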