Datasets integration for PubTrends

Prerequisites:

- Python 3
- uv
To set up the project, run the setup.sh script:
scripts/setup.sh
This script will install the prerequisite packages using the uv package manager and configure the project.
After the script finishes, copy the config.properties file to
~/.pubtrends-datasets/config.properties. Feel free to edit this file if you need to override the default configurations.
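For reference, an overridden config.properties might look like the fragment below. The keys are the ones documented later in this README; the values are illustrative examples, not verified defaults.

```properties
# Embeddings service (Relevant Datasets feature)
embeddings_service_url=http://localhost:5001
max_tokens_per_chunk=256
overlap_sentences=1
chunking_workers=4

# Backfilling
dataset_download_folder=/data/pubtrends-datasets
show_backfill_progress=true
```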
The app uses a local sentence-transformers service to generate text embeddings for the Relevant Datasets feature.
To use the --gpu flag, you must have a CUDA-capable GPU and the NVIDIA Container Toolkit installed.
- Build the sentence-transformers container:
  scripts/build_sentence_transformers_container.sh
- Run the container:
  scripts/run_sentence_transformers_container.sh
Once started, the pubtrends-embeddings container will be available on port 5001.
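The HTTP API of the embeddings container is not documented here; the sketch below only shows how a client request to it could be assembled. The `/embed` path and the `{"texts": [...]}` payload shape are assumptions for illustration, not a confirmed interface.

```python
import json
from urllib.request import Request

def build_embed_request(base_url: str, texts: list[str]) -> Request:
    # Assumption: the service accepts POST <base_url>/embed with a JSON
    # body {"texts": [...]}. Adjust to the container's actual API.
    body = json.dumps({"texts": texts}).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_embed_request("http://localhost:5001", ["GEO dataset about mouse liver"])
print(req.full_url)  # http://localhost:5001/embed
```

Sending the request (e.g. with `urllib.request.urlopen`) requires the container from the previous step to be running.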
The Relevant Datasets feature supports the following properties in config.properties:
- embeddings_service_url - Base URL of the sentence-transformers embeddings service
- max_tokens_per_chunk - Maximum number of tokens per semantic-search text chunk
- overlap_sentences - Number of overlapping sentences between consecutive chunks
- chunking_workers - Number of worker processes used for text chunking
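To illustrate how max_tokens_per_chunk and overlap_sentences interact, here is a simplified sentence-based chunker. It is a sketch of the general technique (whitespace-separated words stand in for real tokenizer tokens), not the project's actual implementation.

```python
def chunk_sentences(sentences, max_tokens_per_chunk, overlap_sentences):
    """Greedily pack sentences into chunks of at most max_tokens_per_chunk
    tokens, repeating the last overlap_sentences sentences of each chunk
    at the start of the next one."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_tokens + n > max_tokens_per_chunk:
            chunks.append(current)
            # Carry over trailing sentences so consecutive chunks overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
            current_tokens = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks

sents = ["GEO hosts many datasets.", "Each dataset has metadata.",
         "Embeddings enable semantic search."]
print(chunk_sentences(sents, max_tokens_per_chunk=8, overlap_sentences=1))
```

With these settings the second chunk starts with the last sentence of the first, so no sentence boundary is lost between chunks.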
You can start the app using this command:
uv run -- flask --app src.app.app run --port 5002
The app will be available at http://localhost:5002.
The API documentation is available at http://localhost:5002/apidocs.
Use the geometadb backfilling tool to synchronize the database with currently available GEO datasets:
# Backfill from March 6, 2024 (geometadb cutoff date), to the current date
uv run python -m src.db.utils.backfill_geometadb 2024-03-06 --ignore-failures
Positional arguments:
- start_date - Start of the date range for which to download datasets
- end_date - End of the date range for which to download datasets (default: today)
Flags:
- --ignore-failures - Continue processing even if some dataset updates fail
- --skip-existing - Skip datasets already present in the local database (default behavior is to process them)
- --dont-redownload - Do not redownload dataset archive files that were already downloaded; they will still be processed
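The interface above can be mirrored with argparse; this is a hedged sketch of the CLI as documented in this README, not the tool's actual source.

```python
import argparse
from datetime import date

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented interface; names and defaults are illustrative.
    parser = argparse.ArgumentParser(prog="backfill_geometadb")
    parser.add_argument("start_date", type=date.fromisoformat,
                        help="Start of the date range for which to download datasets")
    parser.add_argument("end_date", type=date.fromisoformat, nargs="?",
                        default=date.today(),
                        help="End of the date range (default: today)")
    parser.add_argument("--ignore-failures", action="store_true",
                        help="Continue processing even if dataset updates fail")
    parser.add_argument("--skip-existing", action="store_true",
                        help="Skip datasets already present in the local database")
    parser.add_argument("--dont-redownload", action="store_true",
                        help="Reuse already-downloaded archive files")
    return parser

args = build_parser().parse_args(["2024-03-06", "--ignore-failures"])
print(args.start_date, args.ignore_failures)
```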
To keep the database up to date, we suggest adding the following cron job via crontab -e:
0 23 * * * cd <path to this repository> && /home/<username>/.local/bin/uv run python -m src.db.utils.backfill_geometadb --ignore-failures $(date -d "now-2 days" "+\%Y-\%m-\%d")
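The $(date -d "now-2 days" "+\%Y-\%m-\%d") substitution computes the start date as two days before the current date; the equivalent computation in Python:

```python
from datetime import date, timedelta

def backfill_start_date(today: date, days_back: int = 2) -> str:
    # Same arithmetic as: date -d "now-2 days" "+%Y-%m-%d"
    return (today - timedelta(days=days_back)).isoformat()

print(backfill_start_date(date(2024, 3, 8)))  # 2024-03-06
```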
Note
It seems that GEO datasets published within the last 24 hours are not indexed by ESearch. As a result, these datasets cannot be downloaded using the backfilling tool.
Tweak these properties in config.properties to optimize performance on your hardware:
- max_ncbi_connections - Maximum number of concurrent connections to NCBI's GEO download host
- big_gzip_threshold_mb - Size threshold in MB above which a dataset is treated as large
- big_dataset_parser_workers - Number of parallel worker processes for parsing large datasets
- small_dataset_parser_workers - Number of parallel worker processes for parsing small datasets
- archive_parser_chunk_size - Number of small datasets processed at a time in a single worker process
Warning
RAM Management: A high big_dataset_parser_workers count can exhaust RAM when parsing large files. Start with one or two workers and monitor memory usage before scaling up.
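How big_gzip_threshold_mb routes datasets between the two worker pools, and how archive_parser_chunk_size batches small datasets, can be sketched as follows. This illustrates the rules described above, not the project's actual code.

```python
def choose_parser_pool(size_mb, big_gzip_threshold_mb, big_workers, small_workers):
    """Route a dataset to the large- or small-dataset parser pool by size.
    Datasets strictly larger than the threshold count as large."""
    if size_mb > big_gzip_threshold_mb:
        return ("big", big_workers)
    return ("small", small_workers)

def batch_small_datasets(paths, archive_parser_chunk_size):
    """Group small datasets into batches of archive_parser_chunk_size,
    each batch handled by a single worker process."""
    return [paths[i:i + archive_parser_chunk_size]
            for i in range(0, len(paths), archive_parser_chunk_size)]

print(choose_parser_pool(120.0, big_gzip_threshold_mb=50.0,
                         big_workers=2, small_workers=8))
print(batch_small_datasets(["a", "b", "c", "d", "e"], archive_parser_chunk_size=2))
```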
To customize the backfilling process, change these properties:
- dataset_download_folder - Path for storing downloaded datasets
- show_backfill_progress - Boolean that toggles the CLI progress bar
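config.properties uses simple key=value lines; a minimal loader sketch (the app's actual loading code may differ, e.g. in comment and whitespace handling):

```python
def load_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

example = """
# backfill settings
dataset_download_folder = /data/geo
show_backfill_progress = true
"""
print(load_properties(example))
```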
- Build the docker image for testing:
docker build -f resources/docker/test/Dockerfile -t biolabs/pubtrends-datasets-test --platform linux/amd64 .
- Run the tests:
docker run --rm --platform linux/amd64 \
--name pubtrends-datasets-test \
--volume=$(pwd)/src:/pubtrends-datasets/src \
--volume=$(pwd)/pyproject.toml:/pubtrends-datasets/pyproject.toml \
--volume=$(pwd)/uv.lock:/pubtrends-datasets/uv.lock \
--volume=$(pwd)/resources/docker/test/test.config.properties:/home/user/.pubtrends-datasets/config.properties \
-i -t biolabs/pubtrends-datasets-test \
/bin/bash -c "cd /pubtrends-datasets; uv sync --locked; uv run python -m unittest discover src/test"