PointData Design Document

Requirements

Functional

Data backend support:

Dense contiguous storage
Sparse storage (CSR)
Disk-backed chunked storage (TensorStore)

Non-functional

Focus on row-based access (investigate whether we do more row-based access or column-based)

Potential solution

Data library with interchangeable backends for different types of data (dense, sparse, chunked).
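One way to picture interchangeable backends is a small abstract interface with one implementation per storage type. The sketch below is only an illustration under assumptions: the names DataBackend, DenseBackend and CsrBackend, the row-major dense layout and the sorted column indices in the CSR arrays are all hypothetical and not part of the proposed API. A TensorStore-backed implementation is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical backend interface; all names are illustrative assumptions.
class DataBackend {
public:
    virtual ~DataBackend() = default;
    virtual std::size_t numRows() const = 0;
    virtual std::size_t numCols() const = 0;
    virtual float valueAt(std::size_t row, std::size_t col) const = 0;
};

// Dense contiguous storage, assumed to be row-major.
class DenseBackend : public DataBackend {
public:
    DenseBackend(std::size_t rows, std::size_t cols, std::vector<float> values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(values_.size() == rows_ * cols_);
    }
    std::size_t numRows() const override { return rows_; }
    std::size_t numCols() const override { return cols_; }
    float valueAt(std::size_t row, std::size_t col) const override {
        assert(row < rows_ && col < cols_);
        return values_[row * cols_ + col];
    }
private:
    std::size_t rows_, cols_;
    std::vector<float> values_;
};

// Sparse CSR storage; column indices are assumed sorted within each row.
class CsrBackend : public DataBackend {
public:
    CsrBackend(std::size_t rows, std::size_t cols,
               std::vector<std::size_t> rowOffsets,
               std::vector<std::size_t> colIndices,
               std::vector<float> values)
        : rows_(rows), cols_(cols), rowOffsets_(std::move(rowOffsets)),
          colIndices_(std::move(colIndices)), values_(std::move(values)) {
        assert(rowOffsets_.size() == rows_ + 1);
        assert(colIndices_.size() == values_.size());
    }
    std::size_t numRows() const override { return rows_; }
    std::size_t numCols() const override { return cols_; }
    float valueAt(std::size_t row, std::size_t col) const override {
        assert(row < rows_ && col < cols_);
        const std::size_t* first = colIndices_.data() + rowOffsets_[row];
        const std::size_t* last = colIndices_.data() + rowOffsets_[row + 1];
        // Binary search within the row; entries that are not stored are zero.
        const std::size_t* it = std::lower_bound(first, last, col);
        if (it == last || *it != col) return 0.0f;
        return values_[static_cast<std::size_t>(it - colIndices_.data())];
    }
private:
    std::size_t rows_, cols_;
    std::vector<std::size_t> rowOffsets_, colIndices_;
    std::vector<float> values_;
};
```

Code written against DataBackend does not need to know which storage sits behind it, at the cost of a virtual call per access.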

Why a library?

The responsibilities of the Points and PointData classes have grown beyond data storage and retrieval. They also (de)serialize to our specific data format, contain dimension picker actions, fire the proper events, set icons, deal with proxy datasets, provide an info action, resolve selections, and provide functions suggesting which selections are possible. Hence, to adhere to separation of concerns, the data storage should at least be a separate class, focused purely on what you can do with the data stored inside it. However, given our various requirements, building one monolithic class that does all of this is likely not a good option. We therefore need multiple classes and libraries to support our goals. Hiding these behind our own library interface saves users of these classes from having to worry about most of them. In addition, multiple plugins might want to use this kind of data storage system without necessarily wanting the whole plugin wrapper around it.

API Design

Basic API

Scalar access

Retrieving the value at a specific position within a dataset is a fundamental operation that users will regularly want to perform. Ideally, on larger datasets, this type of access is not used to retrieve all values in the dataset. If a whole row or column is required, the dedicated row and column functions will be more efficient.

Proposed signature

float valueAt(size_t row, size_t col) const;

Example usage

float x = dataset.valueAt(1, 3);

Row access

Obtaining a particular row from the dataset is one of the most common scenarios for users, for example for comparing feature vectors, computing similarity, etc.

Proposed signature

std::vector<float> row(size_t row) const;

Design considerations

For the sake of efficiency, it would be ideal to return a view on the row rather than a copy. However, since we support multiple data types, the question is how to return such a view. One possibility is a std::span providing typed access, but this requires knowing the stored data type and handling all possible data types in every class that calls this function. Another option is a row-view object that internally stores a std::span<> with the right data type and provides a valueAt() function to retrieve elements from it on the fly. Since the caller typically does not want to care about which data type is stored, valueAt() would likely return a float and perform a conversion, or it could be a template function that retrieves the data type of your choice, still with a conversion.

The question is then what the access pattern to this row would generally be like. If it is common to request a row but not access all elements within it, then returning a view of a row would be more efficient. However, if it is more common to request the row and use all values within (maybe multiple times), then it might be more efficient to return a float copy of the row, which only does the conversion once.

Given these two access patterns, it might be useful to provide both options. Both options could also be unified in a single RowData object, which either stores a std::span view of the row in case the dataset is made of floats (which is very common) or a std::vector copy of the row (in case the dataset is of a different data type). The downside of this approach is that it is not very transparent to the user what the lifetime of the returned data is, nor the efficiency of access to it.

After benchmarking the following performance scenarios were observed:

1. Data is stored as float, and a float view is returned (fastest)
2. Data is stored as float, and a copy of the row is returned (30% slower)
3. Data is stored as another data type, and a copy of the row is returned (similar to 2)
4. Data is stored as another data type, and a view of the row is returned (30% slower than 2 and 3)

In general, a tradeoff has to be made between the simplicity and the performance of the API. A good solution would therefore be to provide the simple copy version in the basic API and an in-place visit function in the advanced API, which lets power users maximize performance by avoiding the copy. Hence, this row function always returns a float copy of the row.
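The split between a copying row() and an in-place visit function could be sketched as below. This is an illustration under assumptions, not the proposed advanced API: the class DenseDataset, the function visitRow(), the visitor signature (pointer plus element count) and the two stored element types are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical dense, row-major dataset storing one of several element types.
class DenseDataset {
public:
    using Storage = std::variant<std::vector<float>, std::vector<std::int16_t>>;

    DenseDataset(std::size_t rows, std::size_t cols, Storage values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(std::visit([](const auto& v) { return v.size(); }, values_)
               == rows_ * cols_);
    }

    // Advanced API sketch: calls visitor(const T* data, std::size_t count)
    // directly on the stored row, without copying or converting.
    template <typename Visitor>
    void visitRow(std::size_t r, Visitor&& visitor) const {
        assert(r < rows_);
        std::visit([&](const auto& values) {
            visitor(values.data() + r * cols_, cols_);
        }, values_);
    }

    // Basic API: always returns a float copy, converting once if needed.
    std::vector<float> row(std::size_t r) const {
        std::vector<float> out;
        visitRow(r, [&out](const auto* data, std::size_t n) {
            out.assign(data, data + n);
        });
        return out;
    }

private:
    std::size_t rows_, cols_;
    Storage values_;
};
```

The visitor has to be generic (for example a lambda with an auto parameter) because it is instantiated for every stored element type, which is the complexity the basic API hides.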

Column access

Proposed signature

std::vector<float> column(size_t col) const;
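For a dense backend, column access amounts to a strided gather if the data is laid out row by row. The sketch below assumes a row-major float layout; the free function columnFromRowMajor() is a hypothetical illustration and not part of the proposed API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gathers one column from a row-major float matrix (assumed layout).
std::vector<float> columnFromRowMajor(const std::vector<float>& values,
                                      std::size_t numRows,
                                      std::size_t numCols,
                                      std::size_t col) {
    assert(values.size() == numRows * numCols);
    assert(col < numCols);
    std::vector<float> out(numRows);
    for (std::size_t r = 0; r < numRows; ++r) {
        // Consecutive column elements are numCols apart in memory.
        out[r] = values[r * numCols + col];
    }
    return out;
}
```

Under this layout a column can only be returned as a copy or through a strided view, since its elements are not contiguous.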

Subset creation

Proposed signature

View selectRows(std::vector<std::size_t> rows) const;
View selectColumns(std::vector<std::size_t> cols) const;
View select(std::vector<std::size_t> rows, std::vector<std::size_t> cols) const;

Example usage

auto cells = dataset.selectRows(cellIndices);
auto genes = dataset.selectColumns(geneIndices);
auto subset = dataset.select(cellIndices, geneIndices);
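One possible reading of View is a lightweight object that stores the selected indices and forwards each access to the source dataset. The sketch below assumes exactly that; the internals of View, the minimal Dataset class and the helper allIndices() are hypothetical, and the view is assumed not to outlive its dataset.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

class Dataset;

// Hypothetical view: stores index lists and reads through to the source.
class View {
public:
    View(const Dataset& source, std::vector<std::size_t> rows,
         std::vector<std::size_t> cols)
        : source_(&source), rows_(std::move(rows)), cols_(std::move(cols)) {}
    std::size_t numRows() const { return rows_.size(); }
    std::size_t numCols() const { return cols_.size(); }
    float valueAt(std::size_t row, std::size_t col) const;  // defined below
private:
    const Dataset* source_;  // non-owning: the dataset must outlive the view
    std::vector<std::size_t> rows_, cols_;
};

// Minimal dense, row-major dataset, just enough to demonstrate the view.
class Dataset {
public:
    Dataset(std::size_t rows, std::size_t cols, std::vector<float> values)
        : rows_(rows), cols_(cols), values_(std::move(values)) {
        assert(values_.size() == rows_ * cols_);
    }
    float valueAt(std::size_t row, std::size_t col) const {
        assert(row < rows_ && col < cols_);
        return values_[row * cols_ + col];
    }
    View select(std::vector<std::size_t> rows,
                std::vector<std::size_t> cols) const {
        return View(*this, std::move(rows), std::move(cols));
    }
    View selectRows(std::vector<std::size_t> rows) const {
        return select(std::move(rows), allIndices(cols_));
    }
    View selectColumns(std::vector<std::size_t> cols) const {
        return select(allIndices(rows_), std::move(cols));
    }
private:
    // Returns 0, 1, ..., n-1, used when one dimension is left unrestricted.
    static std::vector<std::size_t> allIndices(std::size_t n) {
        std::vector<std::size_t> idx(n);
        std::iota(idx.begin(), idx.end(), std::size_t{0});
        return idx;
    }
    std::size_t rows_, cols_;
    std::vector<float> values_;
};

inline float View::valueAt(std::size_t row, std::size_t col) const {
    assert(row < rows_.size() && col < cols_.size());
    // Translate view coordinates to source coordinates.
    return source_->valueAt(rows_[row], cols_[col]);
}
```

With this design, creating a subset costs only the index lists, while every access pays one extra indirection.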

Advanced API