PointData Design Document
- Storage of different numerical data types (whole matrix same datatype)
- Metadata is stored separately from the numerical data matrix
- Data backend support:
  - Dense contiguous storage
  - Sparse storage (CSR)
  - Disk-backed chunked storage (TensorStore)
- Focus on simple and understandable API functions, while still allowing for highly performant code.
- Focus on row-based access (investigate whether we do more row-based access or column-based)
Put your requirement suggestions here.
Data library with interchangeable backends for different types of data (dense, sparse, chunked).
The responsibilities of the Points and PointData classes have grown beyond data storage and retrieval. They also (de)serialize to our specific data format, contain dimension picker actions, fire the proper events, set icons, deal with proxy datasets, provide an info action, resolve selections, and provide functions suggesting which selections are possible. To adhere to separation of concerns, the data storage should at least be a separate class, focused purely on what you can do with the data stored inside it. However, given our various requirements, a single monolithic class that covers all of them is likely not a good option, so we need multiple classes and libraries to support our goals. Hiding these behind a library interface of our own saves users of these classes from having to worry about most of the details. In addition, multiple plugins might want to use this kind of data storage system without necessarily wanting the whole plugin wrapper around it.
mdata::Frame / Table / Matrix / ?
mdata::View
mdata::Store
|-- mdata::DenseStore (contiguous, in-memory)
|-- mdata::SparseStore (sparse, in-memory)
|-- mdata::DenseChunkedStore (TensorStore, disk-backed, chunked)
Frame - Main representation of the matrix, defines the external API and contains a Store backend object.
Store - Abstract template for data backend (dense, sparse, chunked), defines the internal API.
- DenseStore - Stores data in a variant of vectors. This is contiguous dense data, stored in memory.
- SparseStore - Stores data in a CSR sparse data format. This consists of three separate lists of contiguous data, stored in memory.
- DenseChunkedStore - Forwards API functions to TensorStore functions.
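The layering of Frame and Store described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the member names (`numRows`, `numCols`, `valueAt`), the use of a virtual base class instead of a template, and the fixed `float` element type are all hypothetical and not part of the proposal. The chunked TensorStore backend is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

namespace mdata {

// Hypothetical internal API: every backend answers the same questions.
class Store {
public:
    virtual ~Store() = default;
    virtual std::size_t numRows() const = 0;
    virtual std::size_t numCols() const = 0;
    virtual float valueAt(std::size_t row, std::size_t col) const = 0;
};

// Dense, contiguous, row-major storage in memory (float only in this sketch).
class DenseStore final : public Store {
public:
    DenseStore(std::size_t rows, std::size_t cols, std::vector<float> values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(values_.size() == rows_ * cols_);
    }
    std::size_t numRows() const override { return rows_; }
    std::size_t numCols() const override { return cols_; }
    float valueAt(std::size_t row, std::size_t col) const override {
        assert(row < rows_ && col < cols_);
        return values_[row * cols_ + col];
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<float> values_;
};

// CSR storage: three contiguous lists (row offsets, column indices, values).
class SparseStore final : public Store {
public:
    SparseStore(std::size_t rows, std::size_t cols,
                std::vector<std::size_t> rowOffsets,
                std::vector<std::size_t> colIndices,
                std::vector<float> values)
        : rows_(rows), cols_(cols), rowOffsets_(std::move(rowOffsets)),
          colIndices_(std::move(colIndices)), values_(std::move(values)) {
        assert(rowOffsets_.size() == rows_ + 1);
        assert(colIndices_.size() == values_.size());
    }
    std::size_t numRows() const override { return rows_; }
    std::size_t numCols() const override { return cols_; }
    float valueAt(std::size_t row, std::size_t col) const override {
        assert(row < rows_ && col < cols_);
        // Scan the stored entries of this row; anything not stored is zero.
        for (std::size_t i = rowOffsets_[row]; i < rowOffsets_[row + 1]; ++i) {
            if (colIndices_[i] == col) return values_[i];
        }
        return 0.0f;
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<std::size_t> rowOffsets_;
    std::vector<std::size_t> colIndices_;
    std::vector<float> values_;
};

// Frame owns one backend and forwards the external API to it.
class Frame {
public:
    explicit Frame(std::unique_ptr<Store> store) : store_(std::move(store)) {}
    std::size_t numRows() const { return store_->numRows(); }
    std::size_t numCols() const { return store_->numCols(); }
    float valueAt(std::size_t row, std::size_t col) const {
        return store_->valueAt(row, col);
    }

private:
    std::unique_ptr<Store> store_;
};

} // namespace mdata
```

With this shape, code that holds a Frame does not need to know which backend is behind it; whether the real design should use runtime polymorphism or templates for that is left open here.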
View - Represents a block-view on a Frame matrix. Main user-facing access point, users can only interact with the data through such a view. Views can be unindexed (full data), row indexed (subset) and column indexed (feature subset). Whether the view represents the full data or a subset should be opaque to the user, and they should be able to interact with every view as if it is a matrix in itself. Views should therefore resolve all their indices internally to the actual data values, and users should never have to manually resolve these.
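One way a view could resolve its indices internally is sketched below. The `DenseMatrix` stand-in, the optional index lists, and the way nested selections are composed are assumptions made for this example, not the proposed implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Stand-in for the underlying matrix (dense, row-major, float only).
struct DenseMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> values;
    float valueAt(std::size_t r, std::size_t c) const { return values[r * cols + c]; }
};

// A view keeps optional index lists; an empty optional means "all rows/columns".
class View {
public:
    explicit View(const DenseMatrix& m) : data_(&m) {}

    std::size_t numRows() const { return rowIdx_ ? rowIdx_->size() : data_->rows; }
    std::size_t numCols() const { return colIdx_ ? colIdx_->size() : data_->cols; }

    // Callers use view-local coordinates; the view maps them to the matrix.
    float valueAt(std::size_t row, std::size_t col) const {
        assert(row < numRows() && col < numCols());
        const std::size_t r = rowIdx_ ? (*rowIdx_)[row] : row;
        const std::size_t c = colIdx_ ? (*colIdx_)[col] : col;
        return data_->valueAt(r, c);
    }

    // Selecting from a view translates the indices to matrix coordinates,
    // so nested views never need more than one lookup per access.
    View selectRows(std::vector<std::size_t> rows) const {
        for (std::size_t& r : rows) {
            assert(r < numRows());
            if (rowIdx_) r = (*rowIdx_)[r];
        }
        View v = *this;
        v.rowIdx_ = std::move(rows);
        return v;
    }

    View selectColumns(std::vector<std::size_t> cols) const {
        for (std::size_t& c : cols) {
            assert(c < numCols());
            if (colIdx_) c = (*colIdx_)[c];
        }
        View v = *this;
        v.colIdx_ = std::move(cols);
        return v;
    }

private:
    const DenseMatrix* data_;  // non-owning; the matrix must outlive the view
    std::optional<std::vector<std::size_t>> rowIdx_;
    std::optional<std::vector<std::size_t>> colIdx_;
};
```

A caller can treat `full.selectRows({2, 0}).selectColumns({1})` as a 2x1 matrix and index it from zero, without ever seeing the original row and column numbers.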
Retrieving the value at a specific position within a dataset is a fundamental operation that users will regularly want to perform. Ideally, on larger datasets this type of access is not used for retrieving all the values in the dataset. Also if a row or column is required, the functions specific to this will be more efficient.
float valueAt(size_t row, size_t col) const;
float x = dataset.valueAt(1, 3);
Obtaining a particular row from the dataset is one of the most common scenarios for users, for example for comparing feature vectors, computing similarity, etc.
std::vector<float> row(size_t row) const;
For the sake of efficiency it would be ideal if we could return a view on the row, rather than a copy. However, since we support multiple different data types, the question is how we would return a view on the row. One possibility is a std::span providing typed access; however, this requires knowing the stored data type, and dealing with all the possible data types in every class that calls this function. Another option is a row-view object that internally stores a std::span<> with the right datatype and provides a valueAt() function to retrieve elements from it on the fly. Since the caller typically doesn’t want to bother with what datatype is stored in there, valueAt() would likely return a float and perform a conversion, or it could be a template function to retrieve your datatype of choice, still with a conversion.
The question is then what the access pattern to this row would generally be like. If it is common to request a row but not access all elements within it, then returning a view of a row would be more efficient. However, if it is more common to request the row and use all values within (maybe multiple times), then it might be more efficient to return a float copy of the row, which only does the conversion once.
Given these two access patterns, it might be useful to provide both options. Both options could also be unified in a single RowData object, which either stores a std::span view of the row in case the dataset is made of floats (which is very common) or a std::vector copy of the row (in case the dataset is of a different data type). The downside of this approach is that it is not very transparent to the user what the lifetime of the returned data is, nor the efficiency of access to it.
After benchmarking, the following performance scenarios were observed:
1. Data is stored as float, and a float view is returned (Fastest)
2. Data is stored as float, and a copy of the row is returned (30% Slower)
3. Data is stored as another data type, and a copy of the row is returned (Similar to 2.)
4. Data is stored as another data type, and a view of the row is returned (30% Slower than 2/3)
In general, a tradeoff has to be made between the simplicity and the performance of the API. A good solution would therefore be to provide the simple copy version in the basic API and an in-place visit function in the advanced API, which allows power users to maximize performance because it is zero-copy. Hence, this row function always returns a float copy of the row.
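The function returns the vector by value rather than filling an output parameter passed by the caller. In modern C++ returning by value is no longer a performance concern: returning a prvalue is covered by guaranteed copy elision (C++17), and a named local variable is either elided through NRVO, which compilers commonly apply although it is not guaranteed, or moved. This results in a cleaner function signature.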
The datatype of the argument is size_t to support sparse data indexing and to be future-proof.
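The copy-returning row function could be implemented along these lines on top of a variant of vectors. The class name `DenseRowStore` and the particular set of supported element types are assumptions for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical dense store: one row-major buffer whose element type is
// chosen at runtime from a fixed set of alternatives.
class DenseRowStore {
public:
    using Buffer = std::variant<std::vector<float>,
                                std::vector<std::int32_t>,
                                std::vector<std::uint8_t>>;

    DenseRowStore(std::size_t rows, std::size_t cols, Buffer values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(std::visit([](const auto& v) { return v.size(); }, values_) == rows_ * cols_);
    }

    // Always returns a float copy; the conversion happens exactly once here.
    std::vector<float> row(std::size_t row) const {
        assert(row < rows_);
        return std::visit(
            [&](const auto& v) {
                const auto first = v.begin() + static_cast<std::ptrdiff_t>(row * cols_);
                const auto last = first + static_cast<std::ptrdiff_t>(cols_);
                // Returned as a prvalue, so no extra copy is made on return.
                return std::vector<float>(first, last);
            },
            values_);
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    Buffer values_;
};
```

The caller receives a plain `std::vector<float>` in every case, so the stored data type never leaks into the basic API.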
std::vector<float> column(size_t col) const;
If all of our data stores are optimized for row-based access, column access can be significantly slower. It is important that this is mentioned in the documentation of this function. Still, having a convenience function to access a column of the data is very useful in a lot of cases. The function signature is based on the design considerations of row-based access.
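To illustrate why column access tends to be slower on row-oriented storage, here is a minimal sketch of gathering a column from a row-major float buffer. The free function `gatherColumn` and its parameters are hypothetical and only stand in for what a store would do internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gathers one column from a row-major buffer. Consecutive reads are `cols`
// elements apart in memory, so each read may touch a different cache line,
// unlike a row copy, which reads one contiguous block.
std::vector<float> gatherColumn(const std::vector<float>& values,
                                std::size_t rows, std::size_t cols,
                                std::size_t col) {
    assert(values.size() == rows * cols);
    assert(col < cols);
    std::vector<float> out;
    out.reserve(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        out.push_back(values[r * cols + col]);
    }
    return out;
}
```

For a CSR store the cost would likely be higher still, since every row has to be searched for an entry in the requested column.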
View selectRows(std::vector<std::size_t> rows) const;
View selectColumns(std::vector<std::size_t> cols) const;
View select(std::vector<std::size_t> rows, std::vector<std::size_t> cols) const;
auto cells = dataset.selectRows(cellIndices);
auto genes = dataset.selectColumns(geneIndices);
auto subset = dataset.select(cellIndices, geneIndices);
This section defines the advanced API. This API is deliberately limited in the number of functions to avoid polluting the function list, while still keeping full flexibility. For now, we therefore propose only one advanced function, visitRow, which allows visiting every value in a row of the data store in place (no copies, no conversions) by providing a lambda function that acts directly on the data store.
template <typename F>
void visitRow(std::size_t row, F&& f) const;
TBD
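For the visitRow function proposed above, one possible shape is sketched here. The class name `VisitableStore`, the set of element types, and the callback signature `f(column, value)` are assumptions; the actual callback contract is not specified in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical dense store holding one row-major buffer of a runtime-chosen type.
class VisitableStore {
public:
    using Buffer = std::variant<std::vector<float>, std::vector<std::int32_t>>;

    VisitableStore(std::size_t rows, std::size_t cols, Buffer values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(std::visit([](const auto& v) { return v.size(); }, values_) == rows_ * cols_);
    }

    // Calls f(column, value) for every element of the row. The value is passed
    // in its stored type, so no row copy and no conversion takes place here.
    template <typename F>
    void visitRow(std::size_t row, F&& f) const {
        assert(row < rows_);
        std::visit(
            [&](const auto& v) {
                const std::size_t offset = row * cols_;
                for (std::size_t c = 0; c < cols_; ++c) {
                    f(c, v[offset + c]);
                }
            },
            values_);
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    Buffer values_;
};
```

Because the stored type is only known at runtime, the callback has to compile for every supported element type, for example as a generic lambda: `store.visitRow(1, [&](std::size_t, auto value) { sum += value; });`.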