</h1>

<p align="center">
    <strong>A GPU-accelerated tool for large-scale scRNA-seq pipelines.</strong>
</p>

<!-- <p align="center">
- Fast scRNA-seq pipeline including QC, normalization, batch-effect removal, and dimension reduction, in a ***similar syntax*** to `scanpy` and `rapids-singlecell`.
- Scales to datasets with more than ***10M cells*** on a ***single*** GPU (A100 80G).
- Chunks the data to avoid the ***`int32` limitation*** in `cupyx.scipy.sparse` used by `rapids-singlecell`, which disables computation for moderate-size datasets (~1.3M cells) without multi-GPU support.
- Reconciles the output at each step with ***`scanpy`*** to reproduce the ***same*** results as on the CPU end.
- Improves ***`harmonypy`*** to allow datasets with more than ***10M cells*** and more than ***1000 samples*** to run on a single GPU.
- Speeds up and optimizes the ***`NSForest`*** algorithm using the GPU for ***better*** marker gene identification.
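The idea behind the chunking can be sketched in a few lines. This is an illustrative sketch, not the ScaleSC implementation — the function name and the chunk-size rule are assumptions; the point is only that each chunk's nonzero count must fit within 32-bit sparse indices:

```python
# Illustrative sketch (not the ScaleSC implementation): pick a row-chunk size
# so that each chunk's nonzero count stays below the int32 index limit that
# 32-bit sparse index arrays impose.

INT32_MAX = 2**31 - 1

def chunk_rows(n_rows, nnz_per_row, nnz_limit=INT32_MAX):
    """Yield (start, end) row ranges whose total nonzeros fit in int32 indices."""
    rows_per_chunk = max(1, nnz_limit // max(1, nnz_per_row))
    for start in range(0, n_rows, rows_per_chunk):
        yield start, min(start + rows_per_chunk, n_rows)

# A 10M-cell matrix with ~2,000 nonzeros per cell exceeds int32 as a whole
# (2e10 nonzeros), but every chunk below stays under the limit.
chunks = list(chunk_rows(10_000_000, 2_000))
```

Processing chunk by chunk is what lets a single GPU handle matrices whose total nonzero count overflows `int32`.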
Requirements:

- [**RAPIDS**](https://rapids.ai/) from Nvidia.
- [**rapids-singlecell**](https://rapids-singlecell.readthedocs.io/en/latest/index.html), an alternative to *scanpy* that employs the GPU for acceleration.
- [**Conda**](https://docs.conda.io/projects/conda/en/latest/index.html), version >=22.11 is strongly encouraged, because *conda-libmamba-solver* is set as the default, which significantly speeds up solving dependencies.
- [**pip**](), a Python package installer.

Environment Setup:

1. Install [**RAPIDS**](https://rapids.ai/) through Conda. \
   Users have the flexibility to install it according to their systems by using this [online selector](https://docs.rapids.ai/install/?_gl=1*1em94gj*_ga*OTg5MDQyNDkyLjE3MjM0OTAyNjk.*_ga_RKXFW6CM42*MTczMDIxNzIzOS4yLjAuMTczMDIxNzIzOS42MC4wLjA.#selector). We highly recommend installing **RAPIDS** >= 24.12; it fixes a bug related to the Leiden algorithm that results in too many clusters.
2. Activate the conda env, \
   `conda activate scalesc`
3. Install [**rapids-singlecell**](https://rapids-singlecell.readthedocs.io/en/latest/index.html) using pip, \
Please cite [ScaleSC](https://doi.org/10.1101/2025.01.28.635256), and [Scanpy](h

## Updates:

- 2/26/2025:
    - Added a parameter `threshold` in function `adata_cluster_merge` to support cluster merging at various scales according to the user's specification. `threshold` is between 0 and 1; set to 0 by default.
    - Updated a few more examples of cluster merging in the tutorial.
    - Future work: add support for loading from large `.h5ad` files.
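How a threshold in [0, 1] can steer cluster merging is sketched below. This is a hypothetical illustration only — the similarity scores, the union-find scheme, and all names are assumptions, not the internals of `adata_cluster_merge`:

```python
# Hypothetical sketch (not the actual adata_cluster_merge internals): merge
# clusters whose pairwise similarity exceeds a user-chosen threshold in [0, 1].

def merge_clusters(similarity, threshold):
    """similarity: dict mapping (cluster_a, cluster_b) -> score in [0, 1].
    Returns a dict mapping each cluster to its merged-group representative."""
    parent = {}

    def find(c):  # union-find lookup with path compression
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), score in similarity.items():
        ra, rb = find(a), find(b)
        if score > threshold:  # a higher threshold makes merging more conservative
            parent[ra] = rb
    return {c: find(c) for c in list(parent)}

# Illustrative similarities between hypothetical clusters:
sim = {("T-cell-1", "T-cell-2"): 0.9, ("T-cell-2", "B-cell"): 0.2}
groups = merge_clusters(sim, threshold=0.5)
```

With `threshold=0.5`, the two highly similar T-cell clusters land in one group while the B-cell cluster stays separate; raising the threshold keeps more clusters apart.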
## <kbd>class</kbd> `ScaleSC`

ScaleSC integrated pipeline in a scanpy-like style.

It will automatically load the dataset in chunks (see `scalesc.util.AnnDataBatchReader` for details), and all methods in this class manipulate this chunked data.
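The chunked loading can be pictured with a minimal sketch. The real reader is `scalesc.util.AnnDataBatchReader`; the class below is an assumption-laden toy that only mirrors the idea of iterating over fixed-size cell batches:

```python
# Minimal illustration of chunked iteration (not the real
# scalesc.util.AnnDataBatchReader): yield cell-index ranges of at most
# max_cell_batch cells each, so no single batch exhausts GPU memory.

class ToyBatchReader:
    def __init__(self, n_cells, max_cell_batch=100_000):
        self.n_cells = n_cells
        self.max_cell_batch = max_cell_batch

    def __iter__(self):
        for start in range(0, self.n_cells, self.max_cell_batch):
            yield start, min(start + self.max_cell_batch, self.n_cells)

# A ~1.3M-cell dataset splits into batches of 100k cells:
batches = list(ToyBatchReader(n_cells=1_300_000))
```

Each pipeline step then runs per batch, which is why the class's methods operate on chunked data rather than one monolithic matrix.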
- <b>`max_cell_batch`</b> (`int`): Maximum number of cells in a single batch.
- <b>`Default`</b>: 100000.
- <b>`preload_on_cpu`</b> (`bool`): Whether to load the entire chunked data on the CPU. Default: `True`.
- <b>`preload_on_gpu`</b> (`bool`): Whether to load the entire chunked data on the GPU; `preload_on_cpu` will be overwritten to `True` when this is set to `True`. Default: `True`.
- <b>`save_raw_counts`</b> (`bool`): Whether to save `adata_X` to disk after QC filtering.
- <b>`Default`</b>: False.
- <b>`save_norm_counts`</b> (`bool`): Whether to save `adata_X` to disk after normalization.
#### <kbd>property</kbd> adata

`AnnData`: An `AnnData` object that is used to store all intermediate results, without the count matrix.

Note: This is always on the CPU.

---

#### <kbd>property</kbd> adata_X

`AnnData`: An `AnnData` object that is used to store all intermediate results, including the count matrix. Internally, all chunks should be merged on the CPU to avoid high GPU memory consumption; make sure to invoke `to_CPU()` before calling this object.
> Only `seurat_v3` is implemented. The raw count matrix is expected as input for `seurat_v3`. HVGs are set to `True` in `adata.var['highly_variable']`.
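The flagging behavior can be illustrated with a toy ranking. This is a sketch only — the real `seurat_v3` flavor fits a variance-stabilizing model to raw counts; the plain variance ranking and gene names below are assumptions for illustration:

```python
# Toy illustration of HVG flagging (not the real seurat_v3 fit): rank genes by
# variance across cells and mark the top-n, analogous to the boolean column
# written into adata.var['highly_variable'].

def flag_hvgs(counts_by_gene, n_top):
    """counts_by_gene: dict gene -> list of raw counts across cells.
    Returns dict gene -> bool."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    ranked = sorted(counts_by_gene,
                    key=lambda g: variance(counts_by_gene[g]),
                    reverse=True)
    top = set(ranked[:n_top])
    return {g: g in top for g in counts_by_gene}

# Highly variable genes (CD3D, MS4A1) are flagged; the flat housekeeping
# profile (ACTB) is not.
counts = {"CD3D": [0, 9, 1, 8], "ACTB": [5, 5, 5, 5], "MS4A1": [0, 0, 7, 0]}
hv = flag_hvgs(counts, n_top=2)
```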
**Args:**
Compute a neighborhood graph of observations using `rapids-singlecell`.
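What a neighborhood graph stores can be shown with a toy brute-force version. This is illustrative only — `rapids-singlecell` computes the graph on the GPU, typically over a PCA embedding, not with the naive loop below:

```python
# Toy brute-force k-nearest-neighbor graph: for each cell (a point), record the
# indices of its k closest cells by squared Euclidean distance.

def knn_graph(points, k):
    """points: list of coordinate tuples. Returns {i: [indices of k nearest]}."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist2(p, points[j]))
        graph[i] = others[:k]
    return graph

# Two well-separated pairs of points: each point's nearest neighbor is its twin.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
g = knn_graph(pts, k=1)
```

Downstream steps such as Leiden clustering operate on this graph structure rather than on the raw coordinates.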