
HCA Ingest Tools

A collection of tools for ingesting biological data into the Human Cell Atlas (HCA) infrastructure, including smart synchronization, manifest generation, and submission workflows.

Overview

This repository contains tools that help researchers submit biological data to HCA atlas projects:

  • Smart-Sync: Intelligent S3 synchronization with manifest generation and integrity verification
  • Upload Helpers: Batch upload utilities and progress tracking
  • Submission Tools: Workflow automation for data submission pipelines

Project Structure

hca-ingest-tools/
├── smart-sync/
│   ├── docs/                    # User documentation
│   ├── src/
│   │   └── hca_smart_sync/      # Main sync tool
│   │       ├── config/          # Configuration management
│   │       ├── cli.py           # Command-line interface
│   │       ├── sync_engine.py   # Core sync logic
│   │       ├── manifest.py      # Manifest generation
│   │       └── checksum.py      # Integrity verification
│   ├── tests/                   # Test suite
│   └── pyproject.toml           # Poetry configuration
└── README.md                    # This file

Installation

This project uses Poetry for dependency management. To install:

# Clone the repository
git clone https://github.com/clevercanary/hca-ingest-tools.git
cd hca-ingest-tools

# Install with Poetry
poetry install

# Or install globally with pipx (recommended for end users)
pipx install hca-ingest-tools

Quick Start

Smart-Sync Usage

# Sync local data to S3 with manifest generation
hca-smart-sync /path/to/local/data s3://bucket/bionetwork/atlas/source-datasets/

# Preview changes without uploading
hca-smart-sync --dry-run /path/to/local/data s3://bucket/bionetwork/atlas/source-datasets/

# Upload with custom manifest metadata
hca-smart-sync --metadata '{"study": "gut-atlas-v1", "submitter": "researcher@university.edu"}' \
  /path/to/local/data s3://bucket/gut/gut-v1/source-datasets/
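To illustrate how a `--metadata` JSON string might end up in a manifest, here is a minimal Python sketch. The manifest layout (`generated_at`, `metadata`, `files`) and the function name `build_manifest` are assumptions made for illustration; the actual format written by `manifest.py` may differ.

```python
import json
from datetime import datetime, timezone


def build_manifest(checksums: dict[str, str], metadata_json: str = "{}") -> dict:
    """Combine per-file checksums with user-supplied metadata (hypothetical layout)."""
    metadata = json.loads(metadata_json)
    if not isinstance(metadata, dict):
        # The Quick Start example passes a JSON object, so reject anything else.
        raise ValueError("metadata must be a JSON object")
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata,
        # Sort by path so the manifest is stable across runs.
        "files": [
            {"path": path, "sha256": checksum}
            for path, checksum in sorted(checksums.items())
        ],
    }


if __name__ == "__main__":
    manifest = build_manifest(
        {"sample.h5ad": "0" * 64},
        '{"study": "gut-atlas-v1", "submitter": "researcher@university.edu"}',
    )
    print(json.dumps(manifest, indent=2))
```

Validating the metadata before anything is uploaded means a malformed JSON string fails fast instead of producing a partial submission.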

Features

🚀 Smart Synchronization

  • Intelligent diffing - Only uploads changed files
  • Integrity verification - SHA256 checksums for all files
  • Manifest generation - Automatic submission metadata
  • Progress tracking - Real-time upload progress
  • Resume capability - Interrupted uploads can be resumed
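The diffing and checksum ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `sha256_of` and `files_to_upload` are hypothetical, and the remote state is modeled as a plain dictionary of relative path to SHA256 hex digest rather than anything read from S3. The real logic lives in `sync_engine.py` and `checksum.py` and may work differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_upload(local_dir: Path, remote_checksums: dict[str, str]) -> list[str]:
    """Return relative paths that are new or whose checksum differs from the remote record."""
    changed = []
    for path in sorted(local_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(local_dir).as_posix()
        if remote_checksums.get(relative) != sha256_of(path):
            changed.append(relative)
    return changed
```

Comparing content hashes rather than timestamps means a file that was touched but not modified is not uploaded again.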

🔒 Security & Compliance

  • AWS profile integration - Uses your existing AWS credentials
  • IAM policy compliance - Respects bucket access restrictions
  • Data integrity - End-to-end checksum verification
  • Audit trails - Complete upload history and manifests

🧬 Biological Data Focus

  • H5AD file support - Optimized for biological data formats
  • Atlas structure - Understands bionetwork/atlas/folder hierarchy
  • Metadata enrichment - Biological context in manifests
  • Validation integration - Works with hca-validation-tools
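As an illustration of the bionetwork/atlas/folder hierarchy, the following Python sketch splits a destination URL such as `s3://bucket/gut/gut-v1/source-datasets/` into its parts. The `AtlasDestination` class and `parse_destination` function are hypothetical names, and the rule that the key has exactly three segments is an assumption drawn from the Quick Start examples, not a documented constraint.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class AtlasDestination:
    bucket: str
    bionetwork: str
    atlas: str
    folder: str


def parse_destination(s3_url: str) -> AtlasDestination:
    """Split s3://bucket/bionetwork/atlas/folder/ into named parts."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    # Drop empty segments produced by leading or trailing slashes.
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) != 3:
        raise ValueError(
            f"expected bionetwork/atlas/folder, got {len(parts)} path segment(s)"
        )
    bionetwork, atlas, folder = parts
    return AtlasDestination(parsed.netloc, bionetwork, atlas, folder)


if __name__ == "__main__":
    print(parse_destination("s3://bucket/gut/gut-v1/source-datasets/"))
```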

Configuration

AWS Setup

# Configure AWS profile (if not already done)
aws configure --profile your-profile-name

# Set default profile for HCA tools
export HCA_AWS_PROFILE=your-profile-name
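A tool that honors `HCA_AWS_PROFILE` has to decide which profile wins when several are available. The Python sketch below shows one plausible precedence order: an explicit value first, then `HCA_AWS_PROFILE`, then the standard `AWS_PROFILE`. The function name `resolve_aws_profile` and this ordering are assumptions for illustration, not the documented behavior of hca-smart-sync.

```python
import os
from typing import Mapping, Optional


def resolve_aws_profile(
    cli_profile: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick an AWS profile name: explicit value, then HCA_AWS_PROFILE, then AWS_PROFILE."""
    env = os.environ if env is None else env
    # Empty strings are treated as unset; None means "use the SDK default".
    return cli_profile or env.get("HCA_AWS_PROFILE") or env.get("AWS_PROFILE") or None


if __name__ == "__main__":
    print(resolve_aws_profile())
```

The resolved name could then be passed to `boto3.Session(profile_name=...)`; a value of `None` leaves boto3 to use its default credential chain.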

Tool Configuration

# Initialize configuration
hca-smart-sync --init

# Edit configuration file
vim ~/.config/hca-ingest-tools/config.yaml

Integration with HCA Ecosystem

This tool integrates with the broader HCA infrastructure:

hca-validation-tools  →  Validates data quality
         ↓
hca-ingest-tools      →  Uploads validated data (this repo)
         ↓
hca-atlas-tracker     →  Tracks submitted data

Workflow Integration

  1. Upload with hca-smart-sync (this tool)
  2. Track submission status in hca-atlas-tracker
  3. Validate data with hca-validation-tools
  4. Monitor validation results and processing

Development

Setup Development Environment

# Clone and setup
git clone https://github.com/clevercanary/hca-ingest-tools.git
cd hca-ingest-tools

See the smart-sync README for additional information.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (poetry run pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Related Projects
