|
| 1 | +Usage |
| 2 | +===== |
| 3 | + |
| 4 | +``seqchromloader`` provides two kinds of functions: ``writer`` and ``loader``. Use a ``writer`` to dump a dataset into webdataset-format files for later use, or call a ``loader`` directly to get tensors immediately. |
| 5 | + |
| 6 | +In general, ``seqchromloader`` produces four kinds of tensors: **[seq, chrom, target, label]** |
| 7 | + |
| 8 | +* **seq** is the one-hot encoded DNA sequence tensor of shape *[batch_size, 4, len]*, with channels in the order "ACGT" (that is, A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], T = [0,0,0,1]) |
| 9 | +* **chrom** is the chromatin track tensor of shape *[batch_size, # tracks, len]*; the chromatin track bigwig files are usually provided through the ``bigwig_filelist`` parameter |
| 10 | +* **target** is the tensor holding the number of sequencing reads in the region, taken from the bam file given by the ``target_bam`` parameter |
| 11 | +* **label** is the integer label of each sample. For bed file input it is read from the fourth column; for a pandas DataFrame input it is read from a column named *label* |
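The "ACGT" channel order described for **seq** can be illustrated with a small pure-Python sketch. This is not the library's implementation: the ``one_hot`` helper is hypothetical, it encodes a single sequence (shape *[4, len]*, without the batch dimension), and leaving bases outside "ACGT" (such as N) as all-zero columns is an assumption made here for illustration.

```python
# Illustrative only: a hypothetical helper, not seqchromloader's own encoder.
ALPHABET = "ACGT"  # channel order: A -> row 0, C -> row 1, G -> row 2, T -> row 3


def one_hot(seq):
    """Encode one DNA string as a 4 x len(seq) nested list.

    Assumption: bases outside ACGT (e.g. N) are left as all-zero columns.
    """
    seq = seq.upper()
    encoded = [[0] * len(seq) for _ in ALPHABET]
    for pos, base in enumerate(seq):
        channel = ALPHABET.find(base)  # -1 when the base is not in ACGT
        if channel >= 0:
            encoded[channel][pos] = 1
    return encoded


encoded = one_hot("ACGTN")
for base, row in zip(ALPHABET, encoded):
    print(base, row)
```

Each row corresponds to one base and each column to one position, so stacking several encoded regions of equal length gives the *[batch_size, 4, len]* layout described above.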
| 12 | + |
| 13 | +Writer |
| 14 | +------ |
| 15 | + |
| 16 | +Currently only the webdataset format is supported. You can write tensors into webdataset files like this: |
| 17 | + |
| 18 | +.. code-block:: python3 |
| 19 | +
|
| 20 | +    import pandas as pd |
| 21 | +    from seqchromloader import dump_data_webdataset |
| 22 | +
|
| 23 | +    coords = pd.DataFrame({ |
| 24 | +        "chrom": ["chr1", "chr10"], |
| 25 | +        "start": [1000, 5000], |
| 26 | +        "end": [1200, 5200], |
| 27 | +        "label": [0, 1] |
| 28 | +    }) |
| 29 | +    wds_file_lists = dump_data_webdataset(coords, |
| 30 | +                                          genome_fasta="mm10.fa", |
| 31 | +                                          bigwig_filelist=["h3k4me3.bw", "atacseq.bw"], |
| 32 | +                                          outdir="dataset/", |
| 33 | +                                          outprefix="test", |
| 34 | +                                          compress=True, |
| 35 | +                                          numProcessors=4, |
| 36 | +                                          transforms={"chrom": lambda x: x+1})  # add 1 to every chrom tensor |
| 37 | +
|
| 38 | +.. note:: |
| 39 | +    Every region must have the same length. In this example, each region is 200 bp long. |
| 40 | + |
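Because every region must have the same length, it can help to check the coordinates before writing. The sketch below is a minimal pure-Python check; ``check_equal_length`` is a hypothetical helper introduced here and is not part of ``seqchromloader``.

```python
# Illustrative only: a hypothetical pre-flight check, not part of seqchromloader.
def check_equal_length(starts, ends):
    """Return the shared region length, or raise ValueError if lengths differ."""
    lengths = {end - start for start, end in zip(starts, ends)}
    if len(lengths) > 1:
        raise ValueError(f"regions have different lengths: {sorted(lengths)}")
    # An empty input has no regions, so report length 0.
    return lengths.pop() if lengths else 0


# Same coordinates as in the writer example above.
region_length = check_equal_length([1000, 5000], [1200, 5200])
print(region_length)
```

With a pandas DataFrame such as ``coords``, the equivalent check is ``(coords["end"] - coords["start"]).nunique() == 1``.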
| 41 | +The returned ``wds_file_lists`` contains the output file paths; each file holds about 7000 samples. |
| 42 | + |
| 43 | +Note the ``transforms`` parameter: it accepts a dictionary of functions, and each function is called on the output that its key refers to. In this example, the add-1 lambda function is called on each ``chrom`` tensor. More complicated transformations, such as standardizing the tensor, can be applied in the same way. |
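The dictionary-of-functions idea can be sketched in plain Python. This only shows the dispatch pattern under stated assumptions: ``apply_transforms`` and ``standardize`` are hypothetical helpers, the sample is a dict of plain lists rather than tensors, and the library's internal handling may differ.

```python
# Illustrative only: hypothetical helpers showing key-based transform dispatch.
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    # A constant track has sigma == 0; map it to zeros instead of dividing by zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def apply_transforms(sample, transforms):
    """Call transforms[key] on sample[key]; keys without a transform pass through."""
    if not transforms:
        return dict(sample)
    return {
        key: transforms[key](value) if key in transforms else value
        for key, value in sample.items()
    }


sample = {"chrom": [1.0, 2.0, 3.0], "label": 1}
out = apply_transforms(sample, {"chrom": standardize})
print(out)
```

Only the ``chrom`` entry is changed here; ``label`` is returned untouched because no function is registered under that key.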
| 44 | + |
| 45 | +Loader |
| 46 | +------ |
| 47 | + |
| 48 | +To load the webdataset files generated by ``seqchromloader.dump_data_webdataset`` above: |
| 49 | + |
| 50 | +.. code-block:: python3 |
| 51 | +
|
| 52 | +    from seqchromloader import SeqChromDatasetByWds |
| 53 | +
|
| 54 | +    dataloader = SeqChromDatasetByWds(wds_file_lists, transforms=None, rank=0, world_size=1) |
| 55 | +    seq, chrom, target, label = next(iter(dataloader)) |
| 56 | +
|
| 57 | +If you are using multiple GPUs, you can use ``rank`` and ``world_size`` to shard the dataset so that each GPU gets a non-overlapping piece of it. |
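To make the idea of non-overlapping shards concrete, here is one common way to split a file list by ``rank`` and ``world_size``. It is a sketch under assumptions: ``shard_files`` is a hypothetical helper, the file names are made up, and ``seqchromloader`` may assign shards differently internally.

```python
# Illustrative only: a hypothetical round-robin split, not seqchromloader's own logic.
def shard_files(files, rank, world_size):
    """Give process `rank` every world_size-th file, starting at index `rank`."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return files[rank::world_size]


# Made-up shard names for illustration.
files = [f"dataset/test_{i}.tar.gz" for i in range(5)]
shards = [shard_files(files, rank, world_size=2) for rank in range(2)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Every file ends up in exactly one shard, so no two processes read the same samples; shard sizes differ by at most one file.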
| 58 | + |
| 59 | +A more straightforward way is to use ``seqchromloader.SeqChromDatasetByBed``, which outputs tensors directly from a bed file and the other required files. |
| 60 | + |
| 61 | +.. code-block:: python3 |
| 62 | +
|
| 63 | +    from seqchromloader import SeqChromDatasetByBed |
| 64 | +
|
| 65 | +    dataloader = SeqChromDatasetByBed(bed="regions.bed", |
| 66 | +                                      genome_fasta="mm10.fa", |
| 67 | +                                      bigwig_filelist=["h3k4me3.bw", "atacseq.bw"], |
| 68 | +                                      target_bam="foxa1.bam", |
| 69 | +                                      transforms={"label": lambda x: x-1}, |
| 70 | +                                      dataloader_kws={"num_workers": 4}) |
| 71 | +    seq, chrom, target, label = next(iter(dataloader)) |
| 72 | +
|
| 73 | +Here ``dataloader_kws`` is a dictionary of keyword arguments that are passed on to ``torch.utils.data.DataLoader``; in this example it increases the number of workers (default is 1). See the `PyTorch DataLoader documentation <https://pytorch.org/docs/stable/data.html>`_ for more ways to control DataLoader behavior. |
| 74 | + |
| 75 | +API |
| 76 | +--- |
| 77 | + |
| 78 | +.. autofunction:: seqchromloader.dump_data_webdataset |
| 79 | + |
| 80 | +.. autofunction:: seqchromloader.SeqChromDatasetByBed |
| 81 | + |
| 82 | +.. autofunction:: seqchromloader.SeqChromDatasetByWds |