forked from bimberlabinternal/BimberLabKeyModules
-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathreleaseNotes.html
More file actions
57 lines (50 loc) · 5.69 KB
/
releaseNotes.html
File metadata and controls
57 lines (50 loc) · 5.69 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
<h4>Release 3.0:</h4>
<ul>
<li>This release involves a major change in the processing of variants. All prior releases omitted variants in complex or repetitive regions. We originally excluded these because genotypes can be less accurate; however, we made this change because some repetitive regions overlap coding regions and can contain valuable information. This also results in a significant increase in the total number of variants.</a></li>
<li>This is the first release to include a second species. The dataset now contains both Rhesus macaques and Japanese macaques (which are separated into a separate VCF / browser track).</li>
</ul>
<h4>Release 2.5:</h4>
<ul>
<li>This release includes a major change in how we perform variant annotation, <a href="mgap-annotation.view">described in more detail here.</a></li>
<li>This release includes 107 additional datasets. The total number of variants dropped relative to release 2.4 due to a change in our filtering. Previously, certain conditions would allow a variant to remain in the dataset, even if no subjects had this genotype. After this correction, ~4m sites were removed, although no genotype calls should be changed.</li>
<li>We introduced an entirely new search feature! The new 'Variant Search' page allows you to perform genome-wide queries based on variant annotations (e.g. predicted coding impact or overlap with databases of human pathogenic variants)</li>
</ul>
<h4>Release 2.4:</h4>
<ul>
<li>This is an additional 470 animals over the prior version. All processing and informatics steps are identical.</li>
</ul>
<h4>Release 2.3:</h4>
<ul>
<li>This is an additional 560 animals over the prior version.</li>
<li>There are a sizable number of data processing changes, largely adaptations to handle the rapidly growing dataset size:</li>
<ol>
<li>All data used <a href="https://gatk.broadinstitute.org/hc/en-us/articles/4405443600667-ReblockGVCF">GATK Reblocked gVCFs</a> as inputs. This reduces processing, but can reduce sensitivity at homozygous-reference sites (resulting in greater numbers of no-call genotypes at homozygous ref sites)</li>
<li>To adapt to larger data size, we changed the structure of data processing. Previously, samples were each aggregated into one GenomicsDB workspace per data type (WGS or WXS). Next, GenotypeGVCFs was run on each workspace, with one job per contig. The resulting VCFs were filtered and merged. In this release, the upfront aggregation step was dropped, and we instead: 1) use reblocked gVCFs as input (entire set of samples), 2) chunk the genome into ~1000 bins with one job/bin, 3) per bin, run GenomicsDbImport to make a transient workspace using the job's intervals +/- 1000bp, 4) run GenotypeGVCFs against that workspace, 5) filter the result, including technology-aware thresholds (i.e. different depth filters for WGS/WXS). This process is both considerably more efficient and has the advantage of joint-genotyping across the entire cohort at once.</li>
</ol>
</ul>
<h4>Release 2.2:</h4>
<ul>
<li>This is an additional 103 animals over the prior version.</li>
<li>We made two changes to how variants are filters:</li>
<ol>
<li>We now use the official NCBI RepeatMasker data to mask variants in repetitive regions, instead of a custom RepeatMasker file generated in-house for version 2.1 and lower. In practice the result should be very similar.</li>
<li>We removed the SnpCluster filter from processing. In previous versions, any SNPs falling into a window with at least 3 SNPs in a 10bp window were filtered. As the dataset grew, this captured too many valid variants. This change resulted in a sizable increase in passing variants in this release</li>
</ol>
<li>We upgraded the mGAP website to use JBrowse 2, a faster, more powerful version of the JBrowse 1 browser used in prior versions. Please use the help link to ask any questions or report issues.</li>
</ul>
<h4>Release 2.1:</h4>
<ul>
<li>This is an iterative improvement to the dataset, adding 275 additional animals.</li>
<li>We removed 2 animals (m05247 and m03764), which we believe are poor quality/contaminated samples.</li>
<li>There was an update to how all whole exome data is processed, applied across all data. We now run MarkDuplicates on the BAM file, to mark/ignore duplicate reads prior to variant calling. This should also remove some previous artifacts from the dataset.</li>
</ul>
<h4>Release 2.0:</h4>
<ul>
<li>Substantial revamp of all data. All samples have been realigned to the <a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_003339765.1/">MMul_10 reference genome</a>, followed by our <a href="mgap-dataProcessing.view">standard GenotypeGVCFs pipeline</a>. The MMul_10 is the most complete rhesus macaque assembly to date, and we expect this should improve accuracy of variant calls. Further, because our data are now aligned to the same assembly as NCBI/Ensembl, it should be easier to translate between mGAP and other databases.</li>
<li>Our internal variant calling process has switched to use GATK's GenomicsDB to pre-aggregate data prior to calling with GenotypeGVCFs, as opposed to CombineGVCFs, which was used in prior releases. This should be a purely technical difference with no change in the resulting data</li>
</ul>
<h4>Future Plans:</h4>
<ul>
<li>We will support other modes of viewing and downloading variant data, in particular tabular views by gene.</li>
<li>We recognize that the mGAP release VCF can be enormous, particularly because of all the site-specific functional annotation. To support different types of users, upcoming releases will include 'slim' versions of the data, which will be downloadable files with certain information removed to save file size.</li>
</ul>