gFetch is a simple, lightweight Python tool built around the NCBI Datasets CLI. It simplifies downloading and summarizing genome, gene, and virus data from NCBI, and automatically switches to dehydrated download mode for large assemblies, with the threshold set at 10 GB.
Note
If you have suggestions for new features or run into a bug, open an issue or send me a message at mikeph526@outlook.com. I'm happy to help.
gFetch requires Python 3.8+ and the following packages.
Install packages:
pip install requests rich
Install datasets via conda:
conda install -c conda-forge ncbi-datasets-cli
Or follow the official NCBI Datasets installation instructions.
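Before running gfetch, it can help to confirm that the requirements above are actually visible to Python and to the shell. The snippet below is a minimal sketch and is not part of gfetch itself: the helper name `missing_requirements` is hypothetical, and it assumes the conda package puts an executable named `datasets` on your PATH.

```python
import importlib.util
import shutil


def missing_requirements(packages, executables):
    """Return the names from both lists that cannot be found locally.

    Hypothetical helper for illustration; gfetch does not ship it.
    """
    missing = []
    for name in packages:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in executables:
        # shutil.which returns None when the program is not on PATH.
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: the NCBI Datasets CLI is installed as "datasets".
    missing = missing_requirements(["requests", "rich"], ["datasets"])
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

If anything is reported as missing, rerun the matching install command above.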
Clone the repository and install dependencies:
git clone https://github.com/mikeph52/gfetch.git
To run gfetch.py:
python gfetch.py
Usage:
gfetch [mode] [type] <taxon>
Modes: download/summary
Types: -genome/-gene/-virus
Currently, gfetch only supports NCBI taxon IDs; support for accession numbers will be added in a future update.
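To make the usage pattern concrete, here is a minimal sketch of how a command line of this shape could be parsed with argparse. It is an illustration under assumptions, not gfetch's actual parser: `build_parser` and `taxon_id` are hypothetical names, and rejecting non-numeric input simply mirrors the taxon-ID-only limitation described above.

```python
import argparse


def taxon_id(value):
    """Accept only numeric NCBI taxon IDs (accessions are rejected)."""
    if not value.isdigit():
        raise argparse.ArgumentTypeError(
            f"expected a numeric taxon ID, got {value!r}"
        )
    return value


def build_parser():
    """Hypothetical parser for: gfetch [mode] [type] <taxon>."""
    parser = argparse.ArgumentParser(prog="gfetch")
    parser.add_argument("mode", choices=["download", "summary"])
    # Exactly one of -genome / -gene / -virus must be given.
    types = parser.add_mutually_exclusive_group(required=True)
    for name in ("genome", "gene", "virus"):
        types.add_argument(
            f"-{name}", dest="data_type", action="store_const", const=name
        )
    parser.add_argument("taxon", type=taxon_id)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["download", "-genome", "136037"])
    print(args.mode, args.data_type, args.taxon)
```

With this sketch, passing an accession such as GCF_000696155.1 in place of the taxon ID exits with a usage error.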
For genomes, when the download size exceeds the 10 GB limit, the .zip file is downloaded as a dehydrated package. The limit was chosen after a lot of experimentation with the NCBI Datasets CLI.
There's also an option to download only the reference genomes for the selected taxa.
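The two genome download options above can be sketched as a small function that assembles the datasets command. This is an assumption-laden illustration rather than gfetch's real code: `build_download_command` and its parameters are hypothetical, and the size estimate is taken as an input because how gfetch obtains it is not described here. The `--reference`, `--dehydrated`, and `--filename` flags belong to the NCBI Datasets CLI.

```python
DEHYDRATED_THRESHOLD_GB = 10  # limit stated in this README


def build_download_command(taxon, estimated_size_gb, reference_only=False,
                           filename="ncbi_dataset.zip"):
    """Build a `datasets download genome taxon` command as an argument list.

    Hypothetical helper: the size estimate must be supplied by the caller.
    """
    cmd = ["datasets", "download", "genome", "taxon", str(taxon),
           "--filename", filename]
    if reference_only:
        # Restrict the download to reference genomes.
        cmd.append("--reference")
    if estimated_size_gb > DEHYDRATED_THRESHOLD_GB:
        # Large package: fetch metadata only, sequence data comes later.
        cmd.append("--dehydrated")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_download_command(136037, 0.5, reference_only=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. A dehydrated package is unzipped first and then completed with `datasets rehydrate --directory <dir>`.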
In summary mode, gfetch uses a Terminal User Interface (TUI) to display fields extracted from the JSON file that datasets generates. An example for the organism Zootermopsis nevadensis is shown below:
Genomic Summary
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Field        ┃ Value                   ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Organism     │ Zootermopsis nevadensis │
│ Taxon ID     │ 136037                  │
│ Accession    │ GCF_000696155.1         │
│ Assembly     │ ZooNev1.0               │
│ Level        │ Scaffold                │
│ Release Date │ 2014-07-22              │
└──────────────┴─────────────────────────┘
For genomes, there is also an option to display summaries of only the reference genomes for the selected taxa.
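As a rough illustration of how a table like the one above could be produced, the sketch below pulls the six fields out of one genome report and renders them with rich, falling back to plain text when rich is not installed. It is not gfetch's actual implementation: `summary_rows` and `render` are hypothetical names, the JSON key names are assumptions about the layout of the datasets summary output, and the sample record is hand-written from the values shown in the table.

```python
import json

# Hand-written sample; key names are assumed, values come from the table above.
SAMPLE = """
{"reports": [{"accession": "GCF_000696155.1",
  "organism": {"organism_name": "Zootermopsis nevadensis", "tax_id": 136037},
  "assembly_info": {"assembly_name": "ZooNev1.0",
                    "assembly_level": "Scaffold",
                    "release_date": "2014-07-22"}}]}
"""


def summary_rows(report):
    """Turn one report dict into (field, value) pairs, 'N/A' if absent."""
    organism = report.get("organism", {})
    assembly = report.get("assembly_info", {})
    return [
        ("Organism", str(organism.get("organism_name", "N/A"))),
        ("Taxon ID", str(organism.get("tax_id", "N/A"))),
        ("Accession", str(report.get("accession", "N/A"))),
        ("Assembly", str(assembly.get("assembly_name", "N/A"))),
        ("Level", str(assembly.get("assembly_level", "N/A"))),
        ("Release Date", str(assembly.get("release_date", "N/A"))),
    ]


def render(rows, title="Genomic Summary"):
    """Print the rows as a rich table, or as plain text without rich."""
    try:
        from rich.console import Console
        from rich.table import Table
    except ImportError:
        for field, value in rows:
            print(f"{field}: {value}")
        return
    table = Table(title=title)
    table.add_column("Field")
    table.add_column("Value")
    for field, value in rows:
        table.add_row(field, value)
    Console().print(table)


if __name__ == "__main__":
    render(summary_rows(json.loads(SAMPLE)["reports"][0]))
```

Keeping the field extraction separate from the rendering makes it easy to reuse the same rows for gene or virus summaries with a different set of keys.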
gFetch is built on the NCBI Datasets tool:
O'Leary NA, Cox E, Holmes JB, Anderson WR, Falk R, Hem V, Tsuchiya MTN, Schuler GD, Zhang X, Torcivia J, et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets. Sci Data. 2024 Jul 5;11(1):732. doi: 10.1038/s41597-024-03571-y.
NCBI Datasets is a product of the National Center for Biotechnology Information (NCBI), National Library of Medicine, NIH.
gFetch is released under the MIT License. You are free to use, modify, and distribute this software without restriction. See LICENSE for details.
The underlying NCBI Datasets CLI and data are subject to NCBI's own usage policies.
- Fix Issue #2.
- Fix summaries for genes and viruses.
- Added a Terminal User Interface (TUI) in the summary function using rich.
- Added option for reference genomes in download mode.
- Help function fixed.
- First pre-release version.
- Fixed CLI logic.
- Added summary for viruses.