- demultiplexes reads by barcode
- aligns reads with BWA
- sorts and indexes BAMs with samtools
- counts features with featureCounts (these core steps are sketched below)
- generates output artifacts
- writes outputs to the results bucket
- updates run status and output metadata in Cloud SQL
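The alignment and counting stages map directly onto the command-line tools listed above. The sketch below shows one way the job might invoke them for a single demultiplexed sample; the file paths, the single-end read assumption, and the assumption that the reference has already been indexed with `bwa index` are illustrative, not the job's exact implementation.

```python
# Minimal sketch of the alignment, sorting, and counting stages.
# Paths and sample names are hypothetical; the actual job derives them from
# the demultiplexed FASTQs and the environment variables described below.
import subprocess

def align_and_count(fastq: str, sample: str, reference: str, annotation: str) -> str:
    """Align one demultiplexed FASTQ, sort/index the BAM, and count features."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    counts = f"{sample}.counts.txt"

    # Align single-end reads with BWA-MEM (assumes `bwa index` was run on the reference).
    with open(sam, "w") as out:
        subprocess.run(["bwa", "mem", reference, fastq], stdout=out, check=True)

    # Sort and index the alignments with samtools.
    subprocess.run(["samtools", "sort", "-o", bam, sam], check=True)
    subprocess.run(["samtools", "index", bam], check=True)

    # Count reads per feature with featureCounts against the GTF annotation.
    subprocess.run(["featureCounts", "-a", annotation, "-o", counts, bam], check=True)
    return counts
```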
The pipeline scans a GCS folder for FASTQ files, for example:
gs://<ingest-bucket>/data/
A batch is triggered by uploading:
gs://<ingest-bucket>/data/READY.txt
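Once the marker file appears, the job needs to discover the batch contents. A minimal sketch using the google-cloud-storage client is shown below; the bucket name and the `.fastq` suffix filter are assumptions, not the job's exact logic.

```python
# Sketch: list the FASTQ files under the data/ prefix of the ingest bucket.
from google.cloud import storage

def list_batch_fastqs(bucket_name: str, prefix: str = "data/") -> list[str]:
    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return [b.name for b in blobs if b.name.endswith(".fastq")]
```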
The analysis job expects reference files such as:
reference.fasta
annotation.gtf
These are typically stored in a separate reference bucket.
For each pipeline run, the job writes:
- QC summary JSON: run-level statistics such as reads processed, assigned reads, GC percentage, and the processed input files (an illustrative example appears below)
- Count matrix CSV: gene-level counts by sample
- BED file: read-level genomic intervals derived from the alignments
Example output layout:
gs://<results-bucket>/qc-results/run_<RUN_ID>.json
gs://<results-bucket>/count-matrices/count_matrix_run_<RUN_ID>.csv
gs://<results-bucket>/beds/reads_run_<RUN_ID>.bed
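The snippet below shows the general shape of the QC summary artifact. The field names and values are hypothetical, chosen only to illustrate the kind of run-level statistics described above, and are not the job's actual schema.

```python
# Illustrative QC summary; field names and values are hypothetical.
import json

qc_summary = {
    "run_id": "RUN_ID_PLACEHOLDER",
    "reads_processed": 300000,
    "reads_assigned": 270000,
    "gc_percent": 48.2,
    "input_files": ["data/sample1.fastq", "data/sample2.fastq", "data/sample3.fastq"],
}

with open("run_RUN_ID_PLACEHOLDER.json", "w") as fh:
    json.dump(qc_summary, fh, indent=2)
```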
Barcode Demultiplexing
The current implementation uses fixed barcodes to split reads into samples before alignment.
Example mapping:
ACGTACGT → sample1
TGCATGCA → sample2
GATTACAG → sample3
These values are currently hard-coded in the analysis job and can be extended or externalized in future versions.
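A minimal sketch of that hard-coded mapping and the per-read assignment is shown below. It assumes the barcode appears as a prefix of the read sequence; the matching logic in the actual job may differ.

```python
# Fixed barcode-to-sample mapping, as described above.
BARCODES = {
    "ACGTACGT": "sample1",
    "TGCATGCA": "sample2",
    "GATTACAG": "sample3",
}

def assign_sample(read_sequence: str) -> str | None:
    """Return the sample whose barcode prefixes the read, or None if unmatched."""
    for barcode, sample in BARCODES.items():
        if read_sequence.startswith(barcode):
            return sample
    return None
```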
Tech Stack
- Python 3.11
- Google Cloud Storage
- Google Cloud Functions Gen 2
- Google Cloud Run Jobs
- Google Cloud SQL
- Docker
- BWA
- samtools
- featureCounts
Repository Structure
Deployment Summary
Cloud Run Job
The Cloud Run Job runs the analysis container and expects environment variables such as the following; a sketch of how the job might read them follows the list.
- DATA_DIR
- REFERENCE_FASTA
- ANNOTATION_GTF
- RESULTS_BUCKET
- RUN_ID
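A minimal sketch of reading these values at container startup; the `RUN_ID` fallback is illustrative only.

```python
# Sketch: read the job's configuration from the environment variables above.
import os

DATA_DIR = os.environ["DATA_DIR"]
REFERENCE_FASTA = os.environ["REFERENCE_FASTA"]
ANNOTATION_GTF = os.environ["ANNOTATION_GTF"]
RESULTS_BUCKET = os.environ["RESULTS_BUCKET"]
RUN_ID = os.environ.get("RUN_ID", "local-test")  # fallback is illustrative only
```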
Trigger Function
The Cloud Function:
- listens to the ingest bucket
- ignores normal FASTQ uploads
- launches the Cloud Run Job only when data/READY.txt is uploaded (sketched below)
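The sketch below shows one way such a Gen 2 function could be written with `functions_framework` and the `google-cloud-run` client. The job resource name is a placeholder, and the event-field handling assumes a standard GCS object-finalized event delivered via Eventarc; the deployed function may differ.

```python
# Sketch of the trigger function: react to GCS object events and run the
# analysis job only when the READY.txt marker lands under data/.
import functions_framework
from google.cloud import run_v2

JOB_NAME = "projects/PROJECT_ID/locations/REGION/jobs/analysis-job"  # placeholder

@functions_framework.cloud_event
def trigger_pipeline(cloud_event):
    data = cloud_event.data
    if data.get("name") != "data/READY.txt":
        return  # ignore FASTQ uploads and any other objects

    # Kick off the Cloud Run Job; run_job returns a long-running operation
    # that the function does not need to wait on.
    client = run_v2.JobsClient()
    client.run_job(request=run_v2.RunJobRequest(name=JOB_NAME))
```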
Example Usage
Upload FASTQ files:
gsutil cp data/sample1.fastq gs://<ingest-bucket>/data/
gsutil cp data/sample2.fastq gs://<ingest-bucket>/data/
gsutil cp data/sample3.fastq gs://<ingest-bucket>/data/
Upload marker file:
echo ready > READY.txt
gsutil cp READY.txt gs://<ingest-bucket>/data/READY.txt
What This Project Demonstrates
This project is intended to show:
- cloud-native pipeline orchestration
- event-driven batch processing
- managed container execution
- run tracking with a relational database
- integration of standard bioinformatics tools into a reproducible workflow
- generation of structured output artifacts for downstream analysis
Current Status
This pipeline is a strong prototype and architecture demonstration. It uses standard alignment and counting tools, but current testing has focused primarily on synthetic datasets and workflow validation. Additional validation on public biological datasets, stricter filtering, and expanded QC are natural next steps.
Future Improvements
Planned or possible next steps include:
- validation on real public datasets
- stricter alignment filtering and QC thresholds
- support for paired-end reads
- external barcode/sample manifests
- richer QC reporting
- downstream visualization dashboards
- workflow packaging with WDL or Nextflow
- support for larger references and more realistic transcriptomic workflows
Why This Project Matters
Many pipelines work locally but fail when moved into a cloud production context. This project focuses on solving the engineering side of that transition: automated triggering, containerized execution, cloud-based tracking, and reproducible artifact generation.
It is designed as a practical example of how bioinformatics workflows can be deployed as managed, event-driven systems in the cloud.