A collection of tools for ingesting biological data into the Human Cell Atlas (HCA) infrastructure, including smart synchronization, manifest generation, and submission workflows.
This repository contains tools that help researchers submit biological data to HCA atlas projects:
- Smart-Sync: Intelligent S3 synchronization with manifest generation and integrity verification
- Upload Helpers: Batch upload utilities and progress tracking
- Submission Tools: Workflow automation for data submission pipelines
hca-ingest-tools/
├── smart-sync/
│   ├── docs/                  # User documentation
│   ├── src/
│   │   └── hca_smart_sync/    # Main sync tool
│   │       ├── config/        # Configuration management
│   │       ├── cli.py         # Command-line interface
│   │       ├── sync_engine.py # Core sync logic
│   │       ├── manifest.py    # Manifest generation
│   │       └── checksum.py    # Integrity verification
│   ├── tests/                 # Test suite
│   └── pyproject.toml         # Poetry configuration
└── README.md                  # This file
This project uses Poetry for dependency management. To install:
# Clone the repository
git clone https://github.com/clevercanary/hca-ingest-tools.git
cd hca-ingest-tools
# Install with Poetry
poetry install
# Or install globally with pipx (recommended for end users)
pipx install hca-ingest-tools

# Sync local data to S3 with manifest generation
hca-smart-sync /path/to/local/data s3://bucket/bionetwork/atlas/source-datasets/
# Preview changes without uploading
hca-smart-sync --dry-run /path/to/local/data s3://bucket/bionetwork/atlas/source-datasets/
# Upload with custom manifest metadata
hca-smart-sync --metadata '{"study": "gut-atlas-v1", "submitter": "researcher@university.edu"}' \
  /path/to/local/data s3://bucket/gut/gut-v1/source-datasets/

- Intelligent diffing - Only uploads changed files
- Integrity verification - SHA256 checksums for all files
- Manifest generation - Automatic submission metadata
- Progress tracking - Real-time upload progress
- Resume capability - Interrupted uploads can be resumed
- AWS profile integration - Uses your existing AWS credentials
- IAM policy compliance - Respects bucket access restrictions
- Data integrity - End-to-end checksum verification
- Audit trails - Complete upload history and manifests
- H5AD file support - Optimized for biological data formats
- Atlas structure - Understands bionetwork/atlas/folder hierarchy
- Metadata enrichment - Biological context in manifests
- Validation integration - Works with hca-validation-tools
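
The diffing, checksum, and manifest features listed above can be illustrated with a minimal sketch. This is not the hca-smart-sync implementation: the function names (`sha256_of`, `local_checksums`, `files_to_upload`, `build_manifest`) and the manifest layout are hypothetical, and the remote state is assumed to be available as a plain mapping of relative path to SHA256 digest rather than read from S3.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def local_checksums(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style path relative to root to its SHA256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def files_to_upload(local: dict[str, str], remote: dict[str, str]) -> list[str]:
    """Return paths that are new locally or whose digest differs from the remote one."""
    return sorted(name for name, digest in local.items() if remote.get(name) != digest)


def build_manifest(local: dict[str, str], metadata: dict[str, str]) -> str:
    """Serialize a hypothetical manifest: user metadata plus one entry per file."""
    files = [{"path": name, "sha256": digest} for name, digest in sorted(local.items())]
    return json.dumps({"metadata": metadata, "files": files}, indent=2)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.h5ad").write_bytes(b"unchanged")
        (root / "b.h5ad").write_bytes(b"edited locally")
        local = local_checksums(root)
        # Assumed remote state: a.h5ad matches, b.h5ad has a stale digest.
        remote = {"a.h5ad": local["a.h5ad"], "b.h5ad": "0" * 64}
        print(files_to_upload(local, remote))
        print(build_manifest(local, {"study": "gut-atlas-v1"}))
```

Comparing content digests rather than timestamps or sizes means a file is skipped only when its bytes are identical to what was previously recorded, which is the property the integrity-verification bullets describe.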
# Configure AWS profile (if not already done)
aws configure --profile your-profile-name
# Set default profile for HCA tools
export HCA_AWS_PROFILE=your-profile-name

# Initialize configuration
hca-smart-sync --init
# Edit configuration file
vim ~/.config/hca-ingest-tools/config.yaml

This tool integrates with the broader HCA infrastructure:
hca-validation-tools → Validates data quality
        ↓
hca-ingest-tools → Uploads validated data (this repo)
        ↓
hca-atlas-tracker → Tracks submitted data
- Upload with hca-smart-sync (this tool)
- Track submission status in hca-atlas-tracker
- Validate data with hca-validation-tools
- Monitor validation results and processing
# Clone and setup
git clone https://github.com/clevercanary/hca-ingest-tools.git
cd hca-ingest-tools

See the smart-sync README for additional information.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Run the test suite (poetry run pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: docs/
- Issues: GitHub Issues
- hca-validation-tools - Data validation and quality tools
- hca-atlas-tracker - Submission tracking and management