Perform joint calling and VQSR filtering using GATK. Before calculating allele frequencies, ensure the data has undergone quality control (QC) and preprocessing.
# Quality Control
bcftools view -f PASS -m 2 -M 2 -v snps NGS2144.jointcalling.VQSR.variants.vcf.gz |bgzip -@ 16 -c > NGS2144.jointcalling.VQSR.variants.PASS.biallelic.SNPs.vcf.gz
tabix -p vcf NGS2144.jointcalling.VQSR.variants.PASS.biallelic.SNPs.vcf.gz1.Prepare Working Directory
# Create project directory and navigate into it
mkdir PGG_allele_frequency_pipeline
cd PGG_allele_frequency_pipeline2.Prepare Input Files Place the following files in their respective directories: Input VCF file (gzip-compressed format) → Save to /PGG_allele_frequency_pipeline/input/
Most downstream population genetic analyses use biallelic SNPs after QC. If your data is already filtered, skip this step.
This pipeline guides you through:
- Extracting genomic regions from VCF files
- Computing allele frequencies for different populations
- Using only command-line tools (
bcftools,vcftools,plink)
Here's how to install bcftools, vcftools, and PLINK for bioinformatics analysis:
Recommended method (via conda):
conda install -c bioconda bcftoolsAlternative methods:
# Ubuntu/Debian
sudo apt-get install bcftools
# CentOS/RHEL
sudo yum install bcftools
# From source
wget https://github.com/samtools/bcftools/releases/download/1.17/bcftools-1.17.tar.bz2
tar -xjf bcftools-1.17.tar.bz2
cd bcftools-1.17
make
sudo make installRecommended method (via conda):
conda install -c bioconda vcftoolsAlternative methods:
# Ubuntu/Debian
sudo apt-get install vcftools
# From source
wget https://github.com/vcftools/vcftools/releases/download/v0.1.16/vcftools-0.1.16.tar.gz
tar -xzf vcftools-0.1.16.tar.gz
cd vcftools-0.1.16
./configure
make
sudo make installRecommended method (via conda):
conda install -c bioconda plinkAlternative methods:
# Ubuntu/Debian
sudo apt-get install plink
# Mac (Homebrew)
brew install plink
# Direct download (Linux/Mac)
wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231017.zip # Linux
unzip plink_linux_x86_64_20231017.zip
chmod +x plink
sudo mv plink /usr/local/bin/
# PLINK 2 (newer version)
wget https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20231017.zip
unzip plink2_linux_avx2_20231017.zipbcftools --version
vcftools --version
plink --version
conda create -n genomics bcftools vcftools plink -c bioconda
conda activate genomics-
Prepare input files
- Place the gzipped VCF file in
input/
- Place the gzipped VCF file in
-
Compute allele frequencies
# Calculating Allele Frequencies Using vcftools
vcftools --vcf your_file.vcf --freq --out output_frequency- Performance optimization
vcftoolsmay consume high memory for large files.
# Compress VCF file first
bgzip your_file.vcf
tabix -p vcf your_file.vcf.gz
# Then run using compressed file
vcftools --gzvcf your_file.vcf.gz --freq --out output_frequency
# For extremely large files:
# Process by Chromosome
for chr in {1..22} X Y; do
vcftools --vcf your_file.vcf --chr $chr --freq --out output_chr${chr}
doneAfter generating .frq files using vcftools/bcftools:
- Reformat the frequency data
- Calculate genotype frequencies
- Add metadata (sample size, dataset name)
- Convert to the required database format
🔹 For large joint-called files:
Use Linux awk for efficient processing.
# AF-based Frequency Processing Pipeline Usage:
bash af_processing_pipeline.sh [input_VCF] [output_directory]
# Simply run this shell script directly, and edit the input/output paths inline using vim as required.If the dataset contains multiple populations/provinces (e.g., Han, Balti, Tibetan), compute frequencies separately.
Han HG00123
Han HG00124
Han HG00125
Han HG00126
...
Shanghai HG00123
Shanghai HG00124
Beijing HG00125
Xizang HG00126
...
WGC022072D Xizang Sherpa
WGC022073D Xizang Sherpa
WGC022074D Xizang Sherpa
...
# Simplified Population-specific Frequency Calculation Usage:
bash pop_freq_pipeline.sh
# Simply run this shell script directly, and edit the input/output paths inline using vim as required.The genotype in the af_freq_pipeline is a placeholder and requires additional calculation of genotype frequency. Use PLINK/VCFtools to compute genotype frequency.
# Simplified Genotype Frequency Calculation Usage:
bash genotype_freq_pipeline.sh
# Simply run this shell script directly, and edit the input/output paths inline using vim as required.
If scripts fail after downloading via GitHub on Windows:
- Convert line endings to Unix format:
vim pipeline.sh #Switch the shell panel :set ff=unix :wq
less result.tsv Example Output:
| chr | rs_id | pos | ref | alt | ref_freq | alt_freq | dataset | sample_size | ... |
|---|---|---|---|---|---|---|---|---|---|
| chr1 | rs12345 | 100 | A | G | 0.75 | 0.25 | MyDataset | 100 | ... |
.fam/.bam/.bed/.bim.frq