PointData Design Document
Data backend support:
- Dense contiguous storage
- Sparse storage (CSR)
- Disk-backed chunked storage (TensorStore)
Focus on row-based access (investigate whether we do more row-based access or column-based)
Data library with interchangeable backends for different types of data (dense, sparse, chunked).
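As a rough illustration of what "interchangeable backends" could look like, the sketch below puts a dense and a CSR backend behind one interface. All names here (`Backend`, `DenseBackend`, `CsrBackend`, `numRows`, `numCols`) are hypothetical and not part of the existing code base; the chunked TensorStore backend is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical common interface shared by all storage backends.
class Backend {
public:
    virtual ~Backend() = default;
    virtual std::size_t numRows() const = 0;
    virtual std::size_t numCols() const = 0;
    virtual float valueAt(std::size_t row, std::size_t col) const = 0;
};

// Dense contiguous storage, assumed row-major to favour row-based access.
class DenseBackend : public Backend {
public:
    DenseBackend(std::size_t rows, std::size_t cols, std::vector<float> values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(values_.size() == rows_ * cols_);
    }
    std::size_t numRows() const override { return rows_; }
    std::size_t numCols() const override { return cols_; }
    float valueAt(std::size_t row, std::size_t col) const override {
        assert(row < rows_ && col < cols_);
        return values_[row * cols_ + col];
    }
private:
    std::size_t rows_, cols_;
    std::vector<float> values_;
};

// Sparse storage in CSR form: rowPtr[r]..rowPtr[r+1] delimits the
// non-zero entries of row r inside colIdx/values.
class CsrBackend : public Backend {
public:
    CsrBackend(std::size_t rows, std::size_t cols,
               std::vector<std::size_t> rowPtr,
               std::vector<std::size_t> colIdx,
               std::vector<float> values)
        : rows_(rows), cols_(cols), rowPtr_(std::move(rowPtr)),
          colIdx_(std::move(colIdx)), values_(std::move(values)) {
        assert(rowPtr_.size() == rows_ + 1);
        assert(colIdx_.size() == values_.size());
    }
    std::size_t numRows() const override { return rows_; }
    std::size_t numCols() const override { return cols_; }
    float valueAt(std::size_t row, std::size_t col) const override {
        assert(row < rows_ && col < cols_);
        // Linear scan over the non-zeros of this row; absent entries are zero.
        for (std::size_t k = rowPtr_[row]; k < rowPtr_[row + 1]; ++k) {
            if (colIdx_[k] == col) return values_[k];
        }
        return 0.0f;
    }
private:
    std::size_t rows_, cols_;
    std::vector<std::size_t> rowPtr_, colIdx_;
    std::vector<float> values_;
};
```

Callers that only hold a `const Backend&` do not need to know which layout is underneath, at the cost of a virtual call per access.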
The responsibilities of the Points and PointData classes have grown beyond data storage and retrieval. They also (de)serialize to our specific data format, contain dimension picker actions, fire the proper events, set icons, deal with proxy datasets, provide an info action, resolve selections, and provide functions suggesting which selections are possible. Hence, to adhere to separation of concerns, the data storage should at least be a separate class, purely focused on what you can do with the data that’s stored inside of it. However, given our various requirements, building one monolithic class that can do all of this is likely not a good option. Therefore, we need multiple classes and libraries to support our goals. Hiding these behind a library interface of our own saves users of these classes from having to worry about most of these details. In addition, multiple plugins might want to use this kind of data storage system without necessarily wanting the whole plugin wrapper that’s around it.
Retrieving the value at a specific position within a dataset is a fundamental operation that users will regularly want to perform. Ideally, on larger datasets this type of access is not used for retrieving all the values in the dataset. Also, if a whole row or column is required, the dedicated row and column functions will be more efficient.
    float valueAt(size_t row, size_t col) const;

    float x = dataset.valueAt(1, 3);

Obtaining a particular row from the dataset is one of the most common scenarios for users, for example for comparing feature vectors, computing similarity, etc.
    std::vector<float> row(size_t row) const;

For the sake of efficiency it would be ideal if we could return a view on the row, rather than a copy. However, since we support multiple different data types, the question is how we would return a view on the row. One possibility is a std::span providing typed access; however, this requires knowing the stored data type and dealing with all the possible data types in every class that calls this function. Another option is a row-view object that internally stores a std::span<> with the right data type and provides a valueAt() function to retrieve elements from it on the fly. Since the caller typically doesn’t want to bother with which data type is stored in there, valueAt() would likely return a float and perform a conversion, or it could be a template function that retrieves the data type of your choice, still with a conversion.
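The row-view idea could be sketched roughly as follows. This is an illustration only: `RowView` and `ElementType` are hypothetical names, the set of element types is assumed, and a type-erased pointer plus length stands in for the typed `std::span` so the sketch does not depend on C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed set of element types a dataset might store (hypothetical).
enum class ElementType { Float32, Int16, UInt8 };

// Non-owning view on one row; converts on the fly when an element is read.
class RowView {
public:
    RowView(const void* data, std::size_t size, ElementType type)
        : data_(data), size_(size), type_(type) {}

    std::size_t size() const { return size_; }

    // Returns element i converted to T (float unless the caller asks otherwise).
    template <typename T = float>
    T valueAt(std::size_t i) const {
        assert(i < size_);
        switch (type_) {
        case ElementType::Float32:
            return static_cast<T>(static_cast<const float*>(data_)[i]);
        case ElementType::Int16:
            return static_cast<T>(static_cast<const std::int16_t*>(data_)[i]);
        case ElementType::UInt8:
            return static_cast<T>(static_cast<const std::uint8_t*>(data_)[i]);
        }
        return T{};  // unreachable for valid ElementType values
    }

private:
    const void* data_;   // points into the dataset's own buffer
    std::size_t size_;   // number of elements in the row
    ElementType type_;   // how to interpret data_
};
```

With this shape, `view.valueAt(i)` yields a float and `view.valueAt<int>(i)` yields the caller's type of choice. Every read pays for a branch on the stored type, which is the per-element cost discussed in the next paragraphs.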
The question is then what the access pattern to this row would generally be like. If it is common to request a row but not access all elements within it, then returning a view of a row would be more efficient. However, if it is more common to request the row and use all values within (maybe multiple times), then it might be more efficient to return a float copy of the row, which only does the conversion once.
Given these two access patterns, it might be useful to provide both options. Both options could also be unified in a single RowData object, which either stores a std::span view of the row in case the dataset is made of floats (which is very common) or a std::vector copy of the row (in case the dataset is of a different data type). The downside of this approach is that it is not very transparent to the user what the lifetime of the returned data is, nor the efficiency of access to it.
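A minimal sketch of such a unified object is shown below, assuming a hypothetical `makeRow` helper that picks the representation. A pointer and length again stand in for `std::span`. The lifetime concern is visible in the code: nothing in the type tells the caller whether `data()` points into the dataset or into the object itself.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical RowData: either a non-owning view on float storage
// or an owning float copy of a row stored as another type.
class RowData {
public:
    // Non-owning: only valid while the dataset's buffer stays alive.
    static RowData view(const float* data, std::size_t size) {
        RowData r;
        r.view_ = data;
        r.size_ = size;
        return r;
    }
    // Owning: the converted values live inside the RowData itself.
    static RowData copy(std::vector<float> values) {
        RowData r;
        r.size_ = values.size();
        r.owned_ = std::move(values);
        r.owns_ = true;
        return r;
    }

    bool ownsData() const { return owns_; }
    std::size_t size() const { return size_; }
    const float* data() const { return owns_ ? owned_.data() : view_; }
    float operator[](std::size_t i) const {
        assert(i < size_);
        return data()[i];
    }

private:
    RowData() = default;
    const float* view_ = nullptr;
    std::vector<float> owned_;
    std::size_t size_ = 0;
    bool owns_ = false;
};

// Float storage: hand out a view, no conversion needed.
inline RowData makeRow(const float* stored, std::size_t size) {
    return RowData::view(stored, size);
}

// Any other element type: convert once into a float copy.
template <typename T>
RowData makeRow(const T* stored, std::size_t size) {
    return RowData::copy(std::vector<float>(stored, stored + size));
}
```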
After benchmarking, the following performance scenarios were observed:
1. Data is stored as float, and a float view is returned (fastest)
2. Data is stored as float, and a copy of the row is returned (30% slower)
3. Data is stored as another data type, and a copy of the row is returned (similar to 2)
4. Data is stored as another data type, and a view of the row is returned (30% slower than 2 and 3)
In general, a tradeoff has to be made between the simplicity and the performance of the API. Therefore, a good solution would be to provide the simple copy version in the basic API, and an in-place visit function in the advanced API that lets power users maximize performance by avoiding the copy. Hence, this row function always returns a float copy of the row.
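One possible shape for this split between a basic and an advanced API is sketched below. The `visitRow` name, its callback signature, and the two-type `Dataset` are assumptions made for illustration; the design above only states that an in-place visit function should exist.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical dense, row-major dataset storing either float or int16 values.
class Dataset {
public:
    Dataset(std::size_t rows, std::size_t cols, std::vector<float> values)
        : rows_(rows), cols_(cols), isFloat_(true), floats_(std::move(values)) {
        assert(floats_.size() == rows_ * cols_);
    }
    Dataset(std::size_t rows, std::size_t cols, std::vector<std::int16_t> values)
        : rows_(rows), cols_(cols), isFloat_(false), ints_(std::move(values)) {
        assert(ints_.size() == rows_ * cols_);
    }

    // Basic API: always returns a float copy of the row.
    std::vector<float> row(std::size_t r) const {
        std::vector<float> out;
        out.reserve(cols_);
        visitRow(r, [&out](const auto* data, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i) {
                out.push_back(static_cast<float>(data[i]));
            }
        });
        return out;
    }

    // Advanced API: calls fn(pointer, count) on the row in its stored type,
    // without copying. fn must accept both const float* and const int16_t*.
    template <typename Fn>
    void visitRow(std::size_t r, Fn&& fn) const {
        assert(r < rows_);
        if (isFloat_) {
            fn(floats_.data() + r * cols_, cols_);
        } else {
            fn(ints_.data() + r * cols_, cols_);
        }
    }

private:
    std::size_t rows_, cols_;
    bool isFloat_;
    std::vector<float> floats_;
    std::vector<std::int16_t> ints_;
};
```

A generic lambda works as the visitor, since it is instantiated once per stored type. The pointer passed to the visitor is only meant to be used for the duration of the call.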
    std::vector<float> column(size_t col) const;

    View selectRows(std::vector<std::size_t> rows) const;
    View selectColumns(std::vector<std::size_t> cols) const;
    View select(std::vector<std::size_t> rows, std::vector<std::size_t> cols) const;

    auto cells = dataset.selectRows(cellIndices);
    auto genes = dataset.selectColumns(geneIndices);
    auto subset = dataset.select(cellIndices, geneIndices);
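The selection functions above do not say whether a `View` copies data or refers back to the dataset. The sketch below assumes the latter: a non-owning `View` that stores index lists and forwards lookups to the source. The `Dataset` class, `allIndices`, `numRows` and `numCols` are hypothetical helpers for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

class Dataset;

// Assumed non-owning view: valid only while the source dataset is alive.
class View {
public:
    View(const Dataset& source, std::vector<std::size_t> rows,
         std::vector<std::size_t> cols)
        : source_(&source), rows_(std::move(rows)), cols_(std::move(cols)) {}
    std::size_t numRows() const { return rows_.size(); }
    std::size_t numCols() const { return cols_.size(); }
    // Indices are relative to the view, not to the source dataset.
    float valueAt(std::size_t row, std::size_t col) const;
private:
    const Dataset* source_;
    std::vector<std::size_t> rows_, cols_;
};

// Hypothetical dense, row-major float dataset.
class Dataset {
public:
    Dataset(std::size_t rows, std::size_t cols, std::vector<float> values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(values_.size() == rows_ * cols_);
    }
    float valueAt(std::size_t row, std::size_t col) const {
        assert(row < rows_ && col < cols_);
        return values_[row * cols_ + col];
    }
    View selectRows(std::vector<std::size_t> rows) const {
        return View(*this, std::move(rows), allIndices(cols_));
    }
    View selectColumns(std::vector<std::size_t> cols) const {
        return View(*this, allIndices(rows_), std::move(cols));
    }
    View select(std::vector<std::size_t> rows,
                std::vector<std::size_t> cols) const {
        return View(*this, std::move(rows), std::move(cols));
    }
private:
    // Builds the identity index list 0, 1, ..., n-1.
    static std::vector<std::size_t> allIndices(std::size_t n) {
        std::vector<std::size_t> idx(n);
        std::iota(idx.begin(), idx.end(), std::size_t{0});
        return idx;
    }
    std::size_t rows_, cols_;
    std::vector<float> values_;
};

inline float View::valueAt(std::size_t row, std::size_t col) const {
    assert(row < rows_.size() && col < cols_.size());
    return source_->valueAt(rows_[row], cols_[col]);
}
```

Storing indices keeps selection cheap, but every access through the view goes through an extra indirection, so a materializing copy may still be preferable for hot loops.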