Commit 0028167

add analysis to characterize StrainID scores

For Supplementary Figure 2 and Supplementary Table 11, we pulled 100 ENCODE datasets and ran StrainID on them using the full hg38_DepMap reference to build a table and histogram of StrainID scores, showing that "correct" StrainID scores separate cleanly from "incorrect" ones.

1 parent fbb8778 commit 0028167

8 files changed

Lines changed: 408 additions & 0 deletions


paper/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ ENCODEdata-CellLines/results/SupplementaryTable10.txt
 ENCODEdata-CellLines/results/BAM
 ENCODEdata-CellLines/results/BAM-nospike
 ENCODEdata-CellLines/results/ID
+ENCODEdata-CellLines/results/hg38_ID
 BY4742-chipseq/logs/*.out
 BY4742-chipseq/logs/*.err
 BY4742-chipseq/results/FASTQ

paper/ENCODEdata-CellLines/230612_hg38_100-TF-ChIP-seq.txt

Lines changed: 102 additions & 0 deletions
Large diffs are not rendered by default.

paper/ENCODEdata-CellLines/README.md

Lines changed: 20 additions & 0 deletions
@@ -5,7 +5,9 @@ Run StrainID on ENCODE data and evaluate StrainID's performance.
 | | |
 | :--: | -- |
 | Figure 6C | `ENCODEdata-CellLines/results/ID/` |
+| Supplementary Figure 2 | `ENCODEdata-CellLines/results/hg38_ID/` |
 | Supplementary Table 10 | `ENCODEdata-CellLines/results/SupplementaryTable10.txt.gz` |
+| Supplementary Table 11 | `ENCODEdata-CellLines/results/SupplementaryTable11.txt.gz` |
 
 # Reference files
 
@@ -14,6 +16,9 @@ ENCODE metadata was pulled on May 12, 2021 using the `scripts/get_metadata.py` s
 
 Command used: `python scripts/get_metadata.py > 210512_sample_metadata.txt`
 
+## 230612_hg38_100-TF-ChIP-seq.txt
+ENCODE metadata was pulled on June 12, 2023, according to the retrieval URL in the header. The filter criteria restricted files to BAM format, the hg38 genome build, the TF ChIP-seq assay, and a cell-line background from the core `hg19_VCF` set. Only the top 100 BAM files were saved here.
+
 # Download ENCODE CellLine data and run through StrainID
 Use the 210512_sample_metadata.txt file with ENCODE accessions to download and process the
 data to the `results` directory. Then run the data through StrainID.
@@ -47,3 +52,18 @@ python scripts/build_violinscatter.py -i results/SupplementaryTable10.txt -o res
 python scripts/build_violinscatter.py -i results/SupplementaryTable10.txt -o results/Figure6C_CAGE.png -a "CAGE"
 python scripts/build_violinscatter.py -i results/SupplementaryTable10.txt -o results/Figure6C_small-RNA-seq.png -a "small RNA-seq"
 ```
+
+# Characterize StrainID scores from 100 ChIP-seq datasets
+To better understand the distribution of scores StrainID outputs for "correct" and "incorrect" strain backgrounds, we ran StrainID against our largest reference set of over 1,000 cell lines (`hg38_DepMap`) on a sample of TF ChIP-seq datasets from ENCODE.
+
+## Download data & Run StrainID
+```
+qsub job/02_indexed_runSID.pbs
+```
+
+## Compile the results with the metadata information
+Compile the results into a table and plot the scores as two overlapping histograms.
+```
+python scripts/merge_sidscores.py -i results/hg38_ID/ -m 230612_hg38_100-TF-ChIP-seq.txt -o results/SupplementaryTable11.txt
+python scripts/build_sidscore_histogram.py -i results/hg38_ID/ -m 230612_hg38_100-TF-ChIP-seq.txt -o results/SupplementaryFigure2.png
+```
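The metadata pull described in the README above goes through the ENCODE search API; the exact retrieval URL is recorded in the header of `230612_hg38_100-TF-ChIP-seq.txt` and is not reproduced here. Below is a minimal sketch of that kind of query. The filter parameter names and values are illustrative assumptions, not the exact criteria used for the file.

```
#!/usr/bin/env python
# Hedged sketch of an ENCODE search-API query; the filter values below are
# illustrative assumptions, not the exact criteria recorded in the file header.
import requests

params = {
    "type": "File",                # search over File objects
    "file_format": "bam",          # BAM format only
    "assembly": "GRCh38",          # hg38 genome build
    "assay_title": "TF ChIP-seq",  # TF ChIP-seq assay
    "format": "json",
    "limit": "100",                # keep the top 100 hits
}
resp = requests.get("https://www.encodeproject.org/search/", params=params,
                    headers={"Accept": "application/json"})
resp.raise_for_status()
for f in resp.json()["@graph"]:
    print(f["accession"])
```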
paper/ENCODEdata-CellLines/job/02_indexed_runSID.pbs

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
#!/bin/bash
#PBS -l nodes=1:ppn=6
#PBS -l pmem=24gb
#PBS -l walltime=05:00:00
#PBS -A open
#PBS -o logs/download.data.log.out
#PBS -e logs/download.data.log.err
#PBS -t 1-100

module load samtools
#module load anaconda
#source activate my-genopipe-env

WRK=/path/to/GenoPipe/paper/ENCODE-CellLines

# Execute from /path/to/GenoPipe/paper/ENCODE-CellLines

[ -d logs ] || mkdir logs
[ -d results/hg38_BAM ] || mkdir -p results/hg38_BAM
[ -d results/hg38_BAM-nospike ] || mkdir -p results/hg38_BAM-nospike
[ -d results/hg38_ID ] || mkdir -p results/hg38_ID

# Pull this array task's metadata row (the metadata file carries two header
# lines: the retrieval URL and the column names)
METADATA=230612_hg38_100-TF-ChIP-seq.txt
INFO=$(sed '1,2d' $METADATA | sed "${PBS_ARRAYID}q;d")
ENCFF=$(echo "$INFO" | awk -F'\t' '{print $2}')
#echo $INFO

# Store directory paths
DATABASE=$WRK/../db/hg38_DepMap
GENOME=$WRK/../input/hg38.fa
SEED=$PBS_ARRAYID
GENOPIPE=$WRK/../..
BAM=$WRK/results/hg38_BAM/$ENCFF
NOSPIKE=$WRK/results/hg38_BAM-nospike/$ENCFF
ID=$WRK/results/hg38_ID

# ENCODE data download
HREF=/files/$ENCFF/@@download/$ENCFF.bam
echo "Fetching from https://www.encodeproject.org$HREF"
wget -c https://www.encodeproject.org$HREF -O $BAM.bam

# Checksum of resulting BAM
MD5SUM=$(echo "$INFO" | awk -F'\t' '{print $3}')
#if [[ $(md5sum $BAM.bam) =~ $MD5SUM ]]; then
#	echo "($PBS_ARRAYID) $BAM passed."
#else
#	echo "($PBS_ARRAYID) $BAM md5checksum failed!"
#	rm $BAM.bam
#	exit
#fi

# Index BAM file
[ -f $BAM.bam.bai ] || samtools index $BAM.bam

# Strip spike-in reads
python scripts/filter_spikein.py -b $BAM.bam -g ../input/hg38.fa -o $NOSPIKE.bam
samtools index $NOSPIKE.bam

# Count bases
[ -f $BAM.bam.bai ] && python scripts/estimate_bp_sequenced.py -b $BAM.bam > ${BAM}_bpcount.txt
[ -f $NOSPIKE.bam.bai ] && python scripts/estimate_bp_sequenced.py -b $NOSPIKE.bam > ${NOSPIKE}_bpcount.txt

# Check that the spike-stripped BAM file was generated first
if [ ! -f $NOSPIKE.bam ];
then
	echo "BAM input for $NOSPIKE/$ENCFF does not exist. Exiting."
	exit
fi

# Check that the BAM index file exists
if [ ! -f $NOSPIKE.bam.bai ];
then
	echo "BAI missing for $ENCFF. Exiting."
	exit
fi

# Set up a temp directory with symlinks so StrainID sees a single input BAM
TEMP=$WRK/temp-$PBS_ARRAYID
[ -d $TEMP ] || mkdir $TEMP
cd $TEMP
echo $TEMP
ln -s $NOSPIKE.bam
ln -s $NOSPIKE.bam.bai

## Execute single StrainID and record time
cd $GENOPIPE/StrainID
echo "**Begin executing StrainID for ${ENCFF}..."
{ time bash identify-Strain.sh -i $TEMP -g $GENOME -v $DATABASE -s $SEED -o $ID > $ID/$ENCFF.std ; } 2> $ID/$ENCFF.time
echo "...single StrainID for ($PBS_ARRAYID) ${ENCFF} finished."
cd $WRK

## Clean-up
rm -r $TEMP
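The md5 verification that the commented-out block above intends can also be run out-of-band after all array tasks finish. A minimal sketch follows; the header-line count and field positions (field 2 = accession, field 3 = md5sum) mirror the `sed`/`awk` calls in the PBS script, and the BAM path pattern mirrors its `wget -O` target.

```
#!/usr/bin/env python
# Hedged sketch: verify downloaded ENCODE BAMs against the metadata md5sum
# column. Assumes two header lines and the column order used by the PBS
# script above (field 2 = accession, field 3 = md5sum).
import hashlib

def md5_of(path, chunk=1 << 20):
    '''Stream the file so multi-GB BAMs are not loaded into memory'''
    h = hashlib.md5()
    with open(path, 'rb') as fh:
        for block in iter(lambda: fh.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

with open("230612_hg38_100-TF-ChIP-seq.txt") as metadata:
    for line in metadata:
        if not line.startswith("/files/"):
            continue  # skip the header lines
        tokens = line.rstrip("\n").split("\t")
        encff, expected = tokens[1], tokens[2]
        observed = md5_of("results/hg38_BAM/%s.bam" % encff)
        print("%s %s" % (encff, "passed" if observed == expected else "FAILED"))
```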
Binary files changed (312 KB and 912 KB): not shown.
paper/ENCODEdata-CellLines/scripts/build_sidscore_histogram.py

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
#!/usr/bin/env python
from os.path import isfile, join
import argparse
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Python 3 needed for encoding feature for UTF-8
# (ENCODE uses some capital delta chars in summary descriptions of GeneticModifications)

# Check Seaborn documentation: https://seaborn.pydata.org/generated/seaborn.histplot.html

# Map ENCODE biosample names to the VCF names used by the StrainID database
ENCODEtoStrainID = {
    "HeLa-S3":"HELA",
    "LNCAP":"LNCAPCLONEFGC",
    "MCF-7":"MCF7",
    "SK-N-SH":"SKNSH"
}

# ENCODE dataset counts per cell line:
# K562 2640
# A549 1189
# MCF-7 556
# SK-N-SH 210
# HeLa-S3 196
# HCT116 96

def getParams():
    '''Parse parameters from the command line'''
    parser = argparse.ArgumentParser(description='Build histogram characterization plot from hg38 ENCODE StrainID results.')
    parser.add_argument('-m','--metadata', metavar='metadata_fn', required=True, help='the metadata file downloaded with ENCODE dataset')
    parser.add_argument('-i','--input-dir', metavar='input_dir', required=True, help='the directory where all the StrainID output files were saved (*strain.tab)')
    parser.add_argument('-o','--output', metavar='png_fn', required=True, help='the output figure image')
    parser.add_argument('-a','--assay', metavar='assay_name', default=None, help='the ENCODE assay name to filter datasets by (default: no filter)')
    args = parser.parse_args()
    return(args)

# Example StrainID output (ENCFF000DZC.bam):
#LnCap.vcf	-5.082117812158647
#MCF7.vcf	-6.1143012059424935
#SKnSH.vcf	-5.7641601741217645
#HepG2.vcf	-5.595833186705702
#K562.vcf	1.8812984639660986
#A549.vcf	-6.059318584695944
#HCT116.vcf	-4.847450343904915
#HELA.vcf	-4.906670711358038
def parse_file(var_file, expected):
    '''Parse a *_strain.tab file and flag each strain as matching the expected cell line or not'''
    scores = pd.read_table(var_file, sep='\t', header=0, names=['Strain','Scores'])
    # Add filename info
    scores['Filename'] = var_file
    # Add match information
    scores['Match'] = scores['Strain']==expected
    # Return scores
    return(scores)

if __name__ == "__main__":
    '''Plot matched vs mismatched score histograms'''
    args = getParams()

    # Hardcoded filter of parental/related strain VCFs (e.g., P2URK562
    # derives from K562) so they do not count as "incorrect" scores
    strains2filter = ['P2URK562.vcf']
    strains2filter.extend(['HCT15.vcf', 'HCT8.vcf'])
    strains2filter.extend(['HEL9217.vcf', 'HEL.vcf'])
    strains2filter.extend(['HEP3B217.vcf'])
    strains2filter.extend(['MCF10A.vcf', 'MCF12A.vcf'])
    strains2filter.extend(['LN18.vcf', 'LN215.vcf', 'LN229.vcf', 'LN235.vcf', 'LN319.vcf', 'LN340.vcf', 'LN382.vcf', 'LN405.vcf', 'LN428.vcf', 'LN443.vcf', 'LN464.vcf', 'LNZTA3WT4.vcf', 'LNZ308.vcf'])
    strains2filter.extend(['SKN3.vcf', 'SKNAS.vcf', 'SKNBE2.vcf', 'SKNDZ.vcf', 'SKNEP1.vcf', 'SKNFI.vcf', 'SKNMC.vcf', 'SKNMM.vcf', 'SKNO1.vcf','SKN.vcf'])
    strains2filter.extend(['SKBR3.vcf', 'SKBR5.vcf', 'SKBR7.vcf', 'SKCO1.vcf', 'SKES1.vcf', 'SKGIIIA.vcf', 'SKGII.vcf', 'SKGI.vcf', 'SKGT2.vcf', 'SKGT4.vcf',
        'SKHEP1.vcf', 'SKLMS1.vcf', 'SKLU1.vcf', 'SKM1.vcf', 'SKMEL19.vcf', 'SKMEL1.vcf', 'SKMEL24.vcf', 'SKMEL28.vcf', 'SKMEL2.vcf', 'SKMEL30.vcf',
        'SKMEL31.vcf', 'SKMEL3.vcf', 'SKMEL5.vcf', 'SKMES1.vcf', 'SKMG1.vcf', 'SKMM2.vcf', 'SKOV3.vcf', 'SKPNDW.vcf', 'SKRC20.vcf', 'SKRC31.vcf', 'SKUT1.vcf'])

    # Parse metadata (row 2 of the file holds the column names)
    filedata = pd.read_csv(args.metadata, sep='\t', header=1)
    filedata['BIOSAMPLE_NAME'] = None
    df_list_scores = []

    # Loop through each sample
    for index, row in filedata.iterrows():
        # Map ENCODE-formatted strain name to StrainID-formatted name
        filedata.loc[index,'BIOSAMPLE_NAME'] = ENCODEtoStrainID.get(row['Biosample name'], row['Biosample name'])
        expected_vcfname = filedata.loc[index,'BIOSAMPLE_NAME'] + ".vcf"

        # Check file exists
        id_file = join(args.input_dir,"%s_strain.tab" % row['Accession'])
        if(not isfile(id_file)):
            continue

        # Parse ID file and add scores to final dataframe
        scores = parse_file(id_file, expected_vcfname)
        df_list_scores.append(scores)

    # Concatenate the scores together
    all_scores = pd.concat(df_list_scores)

    # Apply the hardcoded filter for parental/related strains
    for FCL in strains2filter:
        all_scores = all_scores[all_scores['Strain']!=FCL]

    # Split samples with all-NaN scores from samples with real values
    data_nans = pd.DataFrame(all_scores[pd.isnull(all_scores['Scores'])])
    data_value = pd.DataFrame(all_scores[~pd.isnull(all_scores['Scores'])])

    # Plot mismatched (cyan) and matched (orange) score histograms on twin axes
    fig, ax = plt.subplots()
    sns.histplot(ax=ax, x="Scores", binwidth=.1, data=data_value[~data_value['Match']], color='cyan')
    ax2 = ax.twinx()
    sns.histplot(ax=ax2, x="Scores", binwidth=.1, data=data_value[data_value['Match']], color='orange')

    # Format figure
    ax.set_xlabel("StrainID -log2 score")
    ax.set_ylabel("number of scores for every sample x other cell lines (cyan)")
    ax2.set_ylabel("number of scores for sample x matching cell line (orange)")
    fig.set_size_inches(12,8)
    plt.tight_layout()

    # Save figure
    plt.savefig(args.output, dpi=500)
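For a concrete sense of the `Match` flag that splits the two histograms, here is the example ENCFF000DZC output from the comments above run through the same logic (a toy sketch with abbreviated values):

```
# Toy sketch of the Match logic in build_sidscore_histogram.py, using the
# example ENCFF000DZC scores from the comments above (values abbreviated).
import pandas as pd

scores = pd.DataFrame({
    "Strain": ["LnCap.vcf", "MCF7.vcf", "K562.vcf", "A549.vcf"],
    "Scores": [-5.082, -6.114, 1.881, -6.059],
})
scores["Match"] = scores["Strain"] == "K562.vcf"  # expected strain for this sample
# The single Match row feeds the orange histogram; all others feed the cyan one.
print(scores)
```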
paper/ENCODEdata-CellLines/scripts/merge_sidscores.py

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
import argparse
import pandas as pd

def getParams():
    '''Parse parameters from the command line'''
    parser = argparse.ArgumentParser(description='Merge all tab files from StrainID results into a single table.')
    parser.add_argument('-m','--metadata', metavar='metadata_fn', required=True, help='the metadata file downloaded with ENCODE dataset')
    parser.add_argument('-i','--input-dir', metavar='input_dir', required=True, help='the directory where all the StrainID output files were saved (*strain.tab)')
    parser.add_argument('-o','--output', metavar='txt_fn', required=True, help='the output tab-delimited table path')

    args = parser.parse_args()
    return(args)

if __name__ == "__main__":
    '''Merge results'''

    args = getParams()

    merged_df = None
    # Loop through each metadata row and its StrainID results file
    reader = open(args.metadata, 'r')
    for line in reader:
        # Skip header lines; data rows begin with the '/files/' download href
        if(line.find('/files/')!=0):
            continue
        tokens = line.split('\t')
        # Load results and name the VCF filename column "CellLine"
        temp_df = pd.read_csv(args.input_dir + "/" + tokens[1] + "_strain.tab", sep='\t', header=0)
        temp_df.columns.values[0] = 'CellLine'
        # Record the expected cell line (the leading '0' sorts this row to the top)
        temp_df.loc[len(temp_df.index)] = ['0Correct_Strain', tokens[12].strip()]
        # Merge into merged_df
        if (merged_df is None):
            merged_df = temp_df
            continue
        merged_df = pd.merge(merged_df, temp_df, on='CellLine')
    reader.close()
    # Sort values before saving
    merged_df = merged_df.sort_values(by=['CellLine'])
    # Write merged_df to output file
    merged_df.to_csv(args.output, sep='\t')
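A quick way to sanity-check the merged output is to load it back and confirm the layout the script writes: one `CellLine` column, one score column per accession, and a `0Correct_Strain` row sorted to the top. A minimal sketch (assuming the default `to_csv` index column is read back with `index_col=0`):

```
# Hedged sketch: load SupplementaryTable11.txt and confirm the expected layout.
import pandas as pd

merged = pd.read_csv("results/SupplementaryTable11.txt", sep="\t", index_col=0)
# The first row (after sorting) carries the expected cell line per sample
assert merged.iloc[0]["CellLine"] == "0Correct_Strain"
print(merged.shape)   # (#strains + 1, #samples + 1)
print(merged.head())
```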
