Merged
42 changes: 0 additions & 42 deletions .github/workflows/poetry-package-test.yml

This file was deleted.

28 changes: 28 additions & 0 deletions .github/workflows/uv-package-test.yml
@@ -0,0 +1,28 @@
name: Python CI with uv and pytest
on:
  workflow_dispatch:
  push:
    branches: ["main", "develop"]
  pull_request:
    branches: ["main", "develop"]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]
    steps:
      - uses: actions/checkout@v5

      - name: Setup uv
        uses: astral-sh/setup-uv@v7
        with:
          version: "0.9.15"
          python-version: ${{ matrix.python-version }}

      - name: Install the project
        run: uv sync --all-extras --dev

      - name: Run tests with pytest
        run: uv run pytest tests
32 changes: 32 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,32 @@
ci:
  autoupdate_commit_msg: "chore: update pre-commit hooks"
  autofix_commit_msg: "style: pre-commit fixes"

exclude: "^(tests/testdata/)"

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: "v6.0.0"
    hooks:
      - id: check-added-large-files
      - id: check-case-conflict
      - id: check-merge-conflict
      - id: check-symlinks
      - id: check-yaml
      - id: debug-statements
      - id: end-of-file-fixer
      - id: mixed-line-ending
      - id: name-tests-test
        args: ["--pytest-test-first"]
      - id: requirements-txt-fixer
      - id: trailing-whitespace

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: "v0.14.13"
    hooks:
      # first, lint + autofix
      - id: ruff
        types_or: [python, pyi, jupyter]
        args: ["--fix", "--show-fixes"]
      # then, format
      - id: ruff-format
2 changes: 1 addition & 1 deletion .vscode/settings.json
@@ -9,4 +9,4 @@
"[python]": {
"editor.defaultFormatter": "ms-python.black-formatter"
}
}
16 changes: 8 additions & 8 deletions CONTRIBUTING.md
@@ -45,21 +45,21 @@ In short:

## 🔀 Development Workflow

Development is managed through the git repository at https://github.com/DataONEorg/hashstore. The repository is organized into several branches, each with a specific purpose.

**main**. The `main` branch represents the stable branch that is constantly maintained with the current release. It should generally be safe to install and use the `main` branch the same way as binary releases. The version number in all configuration files and the README on the `main` branch follows [semantic versioning](https://semver.org/) and should always be set to the current stable release, for example `2.8.5`.

**develop**. Development takes place on a single branch for integrated development and testing of the set of features
targeting the next release. Commits should only be pushed to this branch once they are ready to be deployed to
production immediately after being pushed. This keeps the `develop` branch in a state of readiness for the next release.
Any unreleased code changes on the `develop` branch represent changes that have been tested and staged for the next
release.
The tip of the `develop` branch always represents the set of features that are awaiting the next release. The develop
branch represents the opportunity to integrate changes from multiple features for integrated testing before release.

Version numbers on the `develop` branch represent either the planned next release number (e.g., `2.9.0`), or the planned next release number with a `beta` designator or release candidate `rc` designator appended as appropriate. For example, `2.8.6-beta1` or `2.9.0-rc1`.

**feature**. To isolate development on a specific set of capabilities, especially if it may be disruptive to other
developers working on the `develop` branch, feature branches should be created.

Feature branches are named as `feature-` + `{issue}` + `-{short-description}`, with `{issue}` being the GitHub issue number related to that new feature, for example `feature-23-refactor-storage`.
@@ -73,11 +73,11 @@ been tested and are awaiting release. Thus, each `feature-*` branch can be test
### Development flow overview

```mermaid
%%{init: { 'theme': 'base',
'gitGraph': {
'rotateCommitLabel': false,
'showCommitLabel': false
},
'themeVariables': {
'commitLabelColor': '#ffffffff',
'commitLabelBackground': '#000000'
@@ -110,8 +110,8 @@ gitGraph
changes that are desired in a release are merged into the `develop` branch, we run
the full set of tests on a clean checkout of the `develop` branch.
2. After testing, the `develop` branch is merged to `main`, and the `main` branch is tagged with
the new version number (e.g. `2.11.2`). At this point, the tip of the `main` branch will
reflect the new release and the `develop` branch can be fast-forwarded to sync with `main` to
start work on the next release.
3. Releases can be downloaded from the [GitHub releases page](https://github.com/DataONEorg/hashstore/releases).

78 changes: 43 additions & 35 deletions README.md
@@ -16,18 +16,18 @@ Version: 1.1.0

Cite this software as:

> Dou Mok, Matthew Brooke, Jing Tao, Jeanette Clarke, Ian Nesbitt, Matthew B. Jones. 2024.
> HashStore: hash-based object storage for DataONE data packages. Arctic Data Center.
> [doi:10.18739/A2ZG6G87Q](https://doi.org/10.18739/A2ZG6G87Q)

## Introduction

HashStore is a server-side Python package that implements a hash-based object storage file system
for storing and accessing data and metadata for DataONE services. The package is used in DataONE
system components that need direct, filesystem-based access to data objects, their system
metadata, and extended metadata about the objects. This package is a core component of the
[DataONE federation](https://dataone.org), and supports large-scale object storage for a variety
of repositories, including the [KNB Data Repository](http://knb.ecoinformatics.org), the [NSF
Arctic Data Center](https://arcticdata.io/catalog/), the [DataONE search service](https://search.dataone.org), and other repositories.

DataONE in general, and HashStore in particular, are open source, community projects.
@@ -38,17 +38,17 @@ contributions with us.

## Documentation

The documentation around HashStore's initial design phase can be found in the [Metacat
repository](https://github.com/NCEAS/metacat/blob/feature-1436-storage-and-indexing/docs/user/metacat/source/storage-subsystem.rst#physical-file-layout)
as part of the storage re-design planning. Future updates will include documentation here as the
package matures.

## HashStore Overview

HashStore is a hash-based object storage system that provides persistent file-based storage using
content hashes to de-duplicate data. The system stores data objects, references (refs) and
metadata in their respective directories and utilizes an identifier-based API for interacting
with the store. HashStore storage classes (like `filehashstore`) must implement the HashStore
interface to ensure the consistent and expected usage of HashStore.

### Public API Methods
@@ -160,11 +160,11 @@ metadata_cid_two = hashstore.store_metadata(pid, metadata, format_id)

### Working with objects (store, retrieve, delete)

In HashStore, data objects begin as temporary files while their content identifiers are
calculated. Once the hashes for the default list of algorithms are generated, objects are stored
in their permanent locations using the hash value of the store's configured algorithm, and
then divided accordingly based on the configured width and depth. Lastly, objects are 'tagged'
with a given identifier (e.g. a persistent identifier (pid)). This process produces reference
files, which allow objects to be found and retrieved with a given identifier.
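
As a rough illustration of the width/depth layout described above, the sketch below splits a hex digest into `depth` directory levels of `width` characters each and uses the remaining characters as the file name. The function name `shard_path` and the values `depth=3` and `width=2` are assumptions chosen for this example, not HashStore's actual API or a required configuration.

```py
import hashlib
from pathlib import PurePosixPath


def shard_path(hex_digest: str, depth: int = 3, width: int = 2) -> PurePosixPath:
    """Split a hex digest into `depth` directories of `width` characters each.

    Hypothetical helper for illustration only; the remainder of the digest
    becomes the file name.
    """
    tokens = [hex_digest[i * width:(i + 1) * width] for i in range(depth)]
    remainder = hex_digest[depth * width:]
    return PurePosixPath(*tokens, remainder)


# A short literal digest makes the split easy to follow.
print(shard_path("abcdef0123"))  # ab/cd/ef/0123

# With a real content hash, the same split yields the relative object path.
cid = hashlib.sha256(b"example data").hexdigest()
print(shard_path(cid))
```

Sharding the digest this way keeps any single directory from accumulating a very large number of entries, which is the usual motivation for a width/depth scheme.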

- Note 1: An identifier can only be used once
@@ -176,9 +176,9 @@ files, which allow objects to be found and retrieved with a given identifier.
By calling the various interface methods for `store_object`, the calling app/client can validate,
store and tag an object simultaneously if the relevant data is available. In the absence of an
identifier (ex. persistent identifier (pid)), `store_object` can be called to solely store an
object. The client is then expected to call `delete_if_invalid_object` when the relevant
metadata is available to confirm that the object is what is expected. To finalize the data-only
storage process (to make the object discoverable), the client calls `tag_object`. In summary, there
are two expected paths to store an object:

```py
@@ -263,16 +263,16 @@ ex. `store_metadata(stream, pid, format_id)`).

### What are HashStore reference files?

HashStore assumes that every data object is referenced by its respective identifier. This
identifier is then used when storing, retrieving and deleting an object. In order to facilitate
this process, we create two types of reference files:

- pid (persistent identifier) reference files
- cid (content identifier) reference files

These reference files are implemented in HashStore under the hood with no expectation for
modification from the calling app/client. The one and only exception to this process is when the
calling client/app does not have an identifier available (i.e. they receive the stream to store
the data object first without any metadata, thus calling `store_object(stream)`).

**'pid' Reference Files**
@@ -282,7 +282,7 @@ the data object first without any metadata, thus calling `store_object(stream)`)
- If an identifier is not available at the time of storing an object, the calling app/client must
create this association between a pid and the object it represents by calling `tag_object`
separately.
- Each pid reference file contains a single string that represents the content identifier of the
object it references
- Just as objects are stored once and only once, there is also only one pid reference file for each
data object.
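
The relationship between a pid and its reference file can be sketched as follows. This is a simplified illustration, not HashStore's implementation: the helper names (`pid_ref_path`, `tag`, `resolve`), the `pids` directory, and the choice of hashing the pid with SHA-256 to derive a file name are all assumptions made for the example.

```py
import hashlib
import tempfile
from pathlib import Path


def pid_ref_path(refs_root: Path, pid: str) -> Path:
    # Assumption: hash the pid so that arbitrary identifiers map to safe file names.
    return refs_root / "pids" / hashlib.sha256(pid.encode("utf-8")).hexdigest()


def tag(refs_root: Path, pid: str, cid: str) -> None:
    """Create the single pid reference file, whose only content is the cid."""
    path = pid_ref_path(refs_root, pid)
    if path.exists():
        # An identifier can only be used once.
        raise FileExistsError(f"pid already tagged: {pid}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(cid, encoding="utf-8")


def resolve(refs_root: Path, pid: str) -> str:
    """Return the content identifier recorded for a pid."""
    return pid_ref_path(refs_root, pid).read_text(encoding="utf-8")


# Usage: tag a made-up pid with a made-up cid, then look it up again.
with tempfile.TemporaryDirectory() as tmp:
    refs = Path(tmp)
    tag(refs, "doi:10.0000/example", "abcdef0123")
    print(resolve(refs, "doi:10.0000/example"))  # abcdef0123
```

Because the reference file holds only the cid, retrieving an object by pid becomes a two-step lookup: pid to cid, then cid to the object's location in the store.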
@@ -297,10 +297,10 @@ the data object first without any metadata, thus calling `store_object(stream)`)

## Concurrency in HashStore

HashStore is both threading and multiprocessing safe, and by default synchronizes calls to store &
delete objects/metadata with Python's threading module. If you wish to use multiprocessing to
parallelize your application, please declare a global environment variable `USE_MULTIPROCESSING`
as `True` before initializing HashStore. This will direct the relevant Public API calls to
synchronize using the Python `multiprocessing` module's locks and conditions.
Please see below for example:

@@ -316,13 +316,23 @@ use_multiprocessing = os.getenv("USE_MULTIPROCESSING", "False") == "True"

## Development build

HashStore is a python package, and built using the [Python Poetry](https://python-poetry.org)
build tool.

To install `hashstore` locally, create a virtual environment for python 3.9+,
install poetry, and then install or build the package with `poetry install` or `poetry build`,
respectively. Note, installing `hashstore` with poetry will also make the `hashstore` command
available through the command line terminal (see `HashStore Client` section below for details).
HashStore is a Python package. We recommend installing it with `uv`; instructions for installing and setting up `uv` can be found in [this guide](https://gist.github.com/datadavev/3975f244e5db500ba0328ef771ca74dd).

Friendly Notes:
- You may run into a `command not found: compdef` error when adding code to your `.zshrc` file. This can be resolved by adjusting the code to be:
```sh
# .zshrc
autoload -Uz compinit
compinit
eval "$(uv generate-shell-completion zsh)"
eval "$(uvx --generate-shell-completion zsh)"
```
- When downloading the script `uv-python-symlink`, an extension may be added to it, for example: `uv-python-symlink.txt`. It may also not have an executable status. You can execute the following to adjust it:
```sh
mv uv-python-symlink uv-python-symlink.sh
chmod +x uv-python-symlink.sh
```
- After following the steps and navigating to the Python project, `uv` may not have sufficient permissions to run. Follow the given prompts and execute `direnv allow`.

To run tests, navigate to the root directory and run `pytest`. The test suite contains tests that
take a longer time to run (relating to the storage of large files) - to execute all tests, run
@@ -404,5 +414,3 @@ California.
[![DataONE_footer](https://user-images.githubusercontent.com/6643222/162324180-b5cf0f5f-ae7a-4ca6-87c3-9733a2590634.png)](https://dataone.org)

[![nceas_footer](https://www.nceas.ucsb.edu/sites/default/files/2020-03/NCEAS-full%20logo-4C.png)](https://www.nceas.ucsb.edu)


2 changes: 1 addition & 1 deletion hashstore.code-workspace
@@ -5,4 +5,4 @@
}
],
"settings": {}
}