53 commits
93eaf6c
Add support for folder store / retrieve. This is a WIP.
datadavev Jan 15, 2026
154a1cb
Added notes about folders in hashstore
datadavev Jan 15, 2026
e3056b7
Adjust typehints for 3.9
datadavev Jan 15, 2026
c5a5e72
Adjust check_string to check for leading or trailing whitespace
datadavev Jan 15, 2026
b8ff3bf
Rename header and update folder hierarchy example
datadavev Jan 20, 2026
1b72b4d
KeyError instead of ValueError, add list_pids()
datadavev Feb 20, 2026
040868e
some refactoring for folder support
datadavev Apr 30, 2026
74f1870
Dependency updates require 3.10 minimum python version
datadavev May 8, 2026
fda4af6
Adjust folder related method signatures
datadavev May 8, 2026
880c355
Dependency updates
datadavev May 8, 2026
e84f4de
Simplify folder methods, build recursion is responsibility of caller;…
datadavev May 8, 2026
067955b
Add structure for folder entries
datadavev May 8, 2026
45ef848
Make find_object less noisy for common and expected cases
datadavev May 8, 2026
97ffe89
Make find_object part of base hashstore, tweak hints
datadavev May 8, 2026
d324b9f
WIP: adjust cli for revised hashstore folder support
datadavev May 8, 2026
ca34815
Merge branch 'feat-152_folder_store' of https://github.com/DataONEorg…
datadavev May 8, 2026
0970b1e
Added option to capture object creation events to an index file
datadavev May 8, 2026
f7b24be
Added folder inspection methods
datadavev May 8, 2026
9b12931
Enable folder creation without folderEntry cids
datadavev May 11, 2026
9cf4f21
Change type to bool for efficiency
datadavev May 11, 2026
67e2668
Use alternate path delimiter
datadavev May 11, 2026
b1403b3
switch path delim, path as list instead of str
datadavev May 12, 2026
8e852e3
Refactor to use path segments in api
datadavev May 12, 2026
8c2a673
Fix recursion step, allow delimtier to be specified
datadavev May 12, 2026
07e32a7
Initial store - WIP
datadavev May 14, 2026
fa1eb13
Limit code changes to folder support
datadavev May 14, 2026
dfa8dc7
WIP folder mutations, reduce overly agressive type checking
datadavev May 28, 2026
f9104cb
Setting up types for folder mutation operations
datadavev May 29, 2026
0967277
Overzealous parameter checking, revist after adding ful typehint support
datadavev May 29, 2026
fba35fd
Add start of clie delete action, folder delete todo
datadavev May 29, 2026
7dde564
Fix unclosed file resource
datadavev May 30, 2026
9011ba0
Fix unclosed file resource
datadavev May 30, 2026
2af0fb0
wrap open resource with try finally
datadavev May 30, 2026
f1213f2
use pathlib semantics
datadavev May 30, 2026
7087e92
Tests for FileHashStoreProperties
datadavev Jun 1, 2026
44cdaab
WIP - working through tests.
datadavev Jun 1, 2026
013461a
Make properties defaults match original
datadavev Jun 1, 2026
3088d55
Generalize hashstore properties, enforce property validation, path co…
datadavev Jun 1, 2026
0dcf91a
Add missing tests
datadavev Jun 1, 2026
70228f1
Merge branch 'develop' into feature-156-config
datadavev Jun 1, 2026
64242b5
Merge branch 'develop' into feat-152_folder_store
datadavev Jun 2, 2026
f2ea626
Enable ruff pre-commit
datadavev Jun 2, 2026
0efea46
Merge branch 'feature-156-config' into feat-152_folder_store
datadavev Jun 2, 2026
b194a06
update pre-commit to use ruff
datadavev Jun 2, 2026
c860f8e
Fix formating
datadavev Jun 2, 2026
36d8735
Dependencies no longer support python 3.9, remove from tests
datadavev Jun 2, 2026
ee889e6
Fleshout baseclass typehints
datadavev Jun 2, 2026
db3dd6b
Move ObjectProperties to base
datadavev Jun 2, 2026
908606a
Adjust min python version, mypy settings
datadavev Jun 2, 2026
c1bbb8e
Make ObjectMetadata top level
datadavev Jun 2, 2026
c6555b8
updated dependencies
datadavev Jun 2, 2026
c63156b
Fix imports
datadavev Jun 2, 2026
241554d
Move ObjectMetadata to base
datadavev Jun 2, 2026
2 changes: 1 addition & 1 deletion .github/workflows/uv-package-test.yml
@@ -11,7 +11,7 @@ jobs:
strategy:
fail-fast: false
matrix:
- python-version: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]
+ python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
steps:
- uses: actions/checkout@v5

12 changes: 12 additions & 0 deletions .pre-commit-config.yaml
@@ -30,3 +30,15 @@ repos:
args: ["--fix", "--show-fixes"]
# then, format
- id: ruff-format

# - repo: https://github.com/pre-commit/mirrors-mypy
# rev: "v1.19.1"
# hooks:
# - id: mypy
# name: mypy
# entry: uv run mypy
# files: src
# types: [python]
# pass_filenames: true
# args: []
# language: system
19 changes: 9 additions & 10 deletions .vscode/settings.json
@@ -1,12 +1,11 @@
{
- "python.terminal.activateEnvInCurrentTerminal": true,
- "python.testing.pytestArgs": [
- "tests"
- ],
- "python.testing.unittestEnabled": false,
- "python.testing.pytestEnabled": true,
- "editor.formatOnSave": true,
- "[python]": {
- "editor.defaultFormatter": "ms-python.black-formatter"
- }
+ "python.terminal.activateEnvInCurrentTerminal": true,
+ "python.testing.pytestArgs": ["tests"],
+ "python.testing.unittestEnabled": false,
+ "python.testing.pytestEnabled": true,
+ "editor.formatOnSave": true,
+ "[python]": {
+ "editor.defaultFormatter": "ms-python.black-formatter",
+ },
+ "python-envs.pythonProjects": [],
}
164 changes: 164 additions & 0 deletions folder_operations.md
@@ -0,0 +1,164 @@
# Folders in HashStore

This document describes how directory trees are stored in hashstore (`hs`).

## Assumptions

- The root of a folder hierarchy is identified by a PID
- A folder hierarchy (including content) identified by a PID is immutable
- A mutation to a folder hierarchy results in a new folder hierarchy identified by a new PID
- Any subfolder may optionally be identified by a PID
- Any file contained within a folder hierarchy may be identified by a PID
- Permissions are associated with a PID and so apply to the content of PID-identified containers or files
- A folder hierarchy may reference all or part of another identified folder hierarchy
- A folder is represented by a `container` in hashstore

## Virtual hashstore

When a folder is added to `hs`, it is necessary to calculate file and folder hashes and compare these with any existing content in the target `hs`. The efficiency of updating an existing folder entry in `hs` can be significantly improved by computing the hashes locally and determining what may need to be sent to the target `hs`. This is especially important for large folder structures that may have isolated changes.

A virtual `hs` (`vhs`) is a local folder structure similar to an `hs`, except that only hashes of the content are stored, not the content bytes (containers are the exception and are stored in full). Time stamps of the hash entries are compared with the time stamps of the content to identify candidates for hash recalculation. If a hash value has changed, the file is tagged for upload to the target `hs`.

A `vhs` is composed of CID and PID ref files, and container files for folder hashes. Even though content ids are calculated, the content files are not stored.
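
The time stamp comparison described above could look roughly like the following. This is a minimal sketch, not the `vhs` implementation: the function name `needs_rehash` and the idea of one hash entry file per content file are assumptions made for illustration.

```python
from pathlib import Path


def needs_rehash(content_path: Path, hash_entry_path: Path) -> bool:
    """Return True when the recorded hash may be stale for content_path.

    Hypothetical helper: compares modification times only, so it selects
    candidates for recalculation; it does not prove the content changed.
    """
    if not hash_entry_path.exists():
        # No hash has been recorded yet, so one must be computed.
        return True
    # Content modified after the hash entry was written is a candidate.
    content_mtime = content_path.stat().st_mtime_ns
    return content_mtime > hash_entry_path.stat().st_mtime_ns
```

Only files flagged this way need to be re-read and re-hashed, which keeps the cost of checking a large, mostly unchanged folder low.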



## Containers

Hashstore is augmented with an additional type of content, a `container`, whose contents represent a single folder. A `container` has two types of entries: `file`, which represents a single file, and `folder`, which represents a single subfolder. Each entry in a `container` has the properties `type`, `cid`, and `name`, where:

`type` - Indicates if the entry is a folder (`0`) or file (`1`).

`cid` - The content ID for the respective file or container.

`name` - The name component of the path to the entry, that is, the last path segment for a subfolder or the file name (without path) for a file.

The CID for a container is computed from the serialized content of the container, which includes the CID values of any subfolders. Hence, computing the CIDs for the folders in a hierarchy requires a depth-first (post-order) traversal, in which the CIDs for the leaves of a branch are computed before the CID of the branch itself.

A container is serialized as space-delimited rows in a text file. Each row represents an entry in the container, with the values `type`, `cid`, and `name` in that order. Since folder or file names *may* contain whitespace, the `name` value consumes the remainder of the row.
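
Reading such rows back can be sketched as below, assuming one entry per line separated by `\n`. The names `ContainerEntry` and `parse_container` are hypothetical and are not taken from the hashstore code base.

```python
from typing import NamedTuple


class ContainerEntry(NamedTuple):
    type: int  # 0 = folder, 1 = file
    cid: str
    name: str


def parse_container(text: str) -> list[ContainerEntry]:
    """Parse serialized container rows into entries."""
    entries = []
    for row in text.split("\n"):
        if not row:
            continue  # ignore the empty string after a trailing newline
        # Split on the first two spaces only; the rest of the row is the name,
        # so names containing spaces survive intact.
        type_str, cid, name = row.split(" ", 2)
        entries.append(ContainerEntry(int(type_str), cid, name))
    return entries
```

Limiting the split to two separators is what lets `name` consume the remainder of the row.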

Since the CID for a container depends on its content, rows are sorted by their `type` and `cid` values so that hashing is consistent. Because folders have `type` `0` and files have `type` `1`, rows referencing subfolder containers always appear before rows referencing files.
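
A sketch of the ordering and hashing step follows. Several details are assumptions rather than statements about hashstore: SHA-256 as the digest (suggested by the 64-character hex values in these notes), a trailing newline after the last row, UTF-8 encoding, and the use of `name` as a final tie-breaker when two rows share both `type` and `cid`.

```python
import hashlib


def serialize_container(entries: list[tuple[int, str, str]]) -> bytes:
    """Serialize (type, cid, name) entries as sorted, space-delimited rows."""
    # Tuple sorting orders by type, then cid; name breaks any remaining tie
    # (an assumption) so the result never depends on the input order.
    rows = [f"{t} {cid} {name}" for t, cid, name in sorted(entries)]
    return ("\n".join(rows) + "\n").encode("utf-8")


def container_cid(entries: list[tuple[int, str, str]]) -> str:
    """Hash the serialized rows to obtain the container CID."""
    return hashlib.sha256(serialize_container(entries)).hexdigest()
```

With this ordering, two listings of the same folder produce the same bytes, and therefore the same CID, regardless of the order in which the file system returned the entries.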

For example, given the folder hierarchy:

```
PID_1 <- dbc15
├── A <- ad5eb
│ ├── a1.txt
│ └── a2.txt
└── B <- cc08d
└── b1.csv
```

The following `container` entries are created (`cid` values are truncated):

Container `ad5eb`:
```
1 10fbd a1.txt
1 c880c a2.txt
```

Container `cc08d`:
```
1 00e99 b1.csv
```

Container `dbc15`:
```
0 ad5eb A
0 cc08d B
```

The hashstore entry for `PID_1` might be:
```
$ cat refs/pids/53/b2/f2/58a2f3061a7bee4ba8b157aab217795c4692e2a2d8856e2fd97eb7fa3f
dbc1516e49e7437ea441f279570d32b1e2f149c44ab0a77682629215f4a5970b

$ cat refs/cids/db/c1/51/6e49e7437ea441f279570d32b1e2f149c44ab0a77682629215f4a5970b
PID_1
```
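
The reference paths in this example appear to split a hex digest into three two-character directory levels followed by the remainder. The sketch below reproduces that layout for the CID shown above; the function name `sharded_path` and the `depth` and `width` defaults are assumptions inferred from the example paths, not values read from a hashstore configuration.

```python
from pathlib import PurePosixPath


def sharded_path(digest: str, depth: int = 3, width: int = 2) -> PurePosixPath:
    """Split a hex digest into `depth` directories of `width` characters each."""
    # The first depth * width characters become directory names;
    # whatever is left is used as the file name.
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(digest[depth * width:])
    return PurePosixPath(*parts)


cid = "dbc1516e49e7437ea441f279570d32b1e2f149c44ab0a77682629215f4a5970b"
print(PurePosixPath("refs/cids") / sharded_path(cid))
```

The `refs/pids` path presumably follows the same scheme applied to a digest of the PID string, though that is not stated in these notes.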

Each container is resolvable by the combination of PID and path. For example,
the folder `B` within the context of `PID_1` can be resolved using the identifier `PID_1 B`.
Similarly, the file `A/a2.txt` can be resolved with the identifier `PID_1 A/a2.txt`.
Corresponding entries in hashstore `refs/pids` and `refs/cids` are created.
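
Resolution by path can also be pictured as a walk through the containers, one path segment at a time. The following sketch uses the truncated CIDs from the example above in an in-memory dictionary; the `resolve` function and the `containers` mapping are hypothetical stand-ins for reading container files from hashstore.

```python
FOLDER, FILE = 0, 1

# Containers from the example hierarchy, keyed by (truncated) CID.
containers = {
    "dbc15": [(FOLDER, "ad5eb", "A"), (FOLDER, "cc08d", "B")],
    "ad5eb": [(FILE, "10fbd", "a1.txt"), (FILE, "c880c", "a2.txt")],
    "cc08d": [(FILE, "00e99", "b1.csv")],
}


def resolve(root_cid: str, path: str) -> tuple[int, str]:
    """Return (type, cid) for `path` relative to the container `root_cid`."""
    entry_type, cid = FOLDER, root_cid
    for segment in path.split("/"):
        if entry_type != FOLDER:
            raise KeyError(path)  # cannot descend into a file
        for t, c, name in containers[cid]:
            if name == segment:
                entry_type, cid = t, c
                break
        else:
            raise KeyError(path)  # no entry with that name
    return entry_type, cid
```

Here `resolve("dbc15", "A/a2.txt")` returns `(1, "c880c")` and `resolve("dbc15", "B")` returns `(0, "cc08d")`, matching the container listings above.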

## Operations

### Get an object by path

Given a PID and a path, retrieve the corresponding object (file or folder) from hashstore.

Persistent identifiers for objects within a folder hierarchy are constructed by concatenating the PID with the path using a space as a delimiter. For example, to retrieve the object at path `data/file1.txt` within the folder hierarchy identified by PID `abc123`, the identifier would be `abc123 data/file1.txt`.

```
hashstore = HashStore(...)
path_pid = "<PID>" + " " + "<path>"
object_stream = hashstore.retrieve_object(path_pid)
```

### Store a new folder hierarchy

To store a new folder hierarchy, recursively create `container` entries for each folder in the hierarchy, starting from the leaves and working up to the root. For each folder, create a `container` with entries for its subfolders and files, compute the CID for the container, and store it in hashstore. Finally, associate the root container's CID with the PID representing the entire folder hierarchy.

This is achieved by the `hashstore.store_folder()` method.

```
hashstore = HashStore(...)
pid = "<PID>"
source_path = "<local_folder_path>"
hashstore.store_folder(pid, source_path)
```
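
The bottom-up construction described in this section can be sketched without hashstore itself. In the example below, `file_cid` and `folder_cid` are hypothetical helpers, SHA-256 and the row layout are assumptions, and serialized containers are collected in a plain dictionary instead of being written to a store.

```python
import hashlib
from pathlib import Path


def file_cid(path: Path) -> str:
    # Reads the whole file at once; a real store would likely hash in chunks.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def folder_cid(folder: Path, containers: dict[str, bytes]) -> str:
    """Compute the CID of `folder`, recording each serialized container by CID."""
    entries = []
    for child in folder.iterdir():
        if child.is_dir():
            # Recurse first: a subfolder's CID is needed for its parent's row.
            entries.append((0, folder_cid(child, containers), child.name))
        elif child.is_file():
            entries.append((1, file_cid(child), child.name))
    # Sort by type, then cid, so the serialized bytes are reproducible.
    rows = [f"{t} {cid} {name}" for t, cid, name in sorted(entries)]
    data = ("\n".join(rows) + "\n").encode("utf-8")
    cid = hashlib.sha256(data).hexdigest()
    containers[cid] = data
    return cid
```

The value returned for the top-level folder is the root container CID that would then be associated with the PID. A change to any file alters the CID of every container on the path from that file up to the root, while untouched branches keep their CIDs.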

### Retrieve folder hierarchy structure

To retrieve the structure of a folder hierarchy identified by a PID, recursively resolve each `container` starting from the root PID. For each folder, read its `container` entries to identify subfolders and files, and continue resolving subfolders until the entire hierarchy is reconstructed.

This is achieved by the `hashstore.retrieve_folder()` method.

```
hashstore = HashStore(...)
pid = "<PID>"
destination_path = "<local_folder_path>"
hashstore.retrieve_folder(pid, destination_path)
```
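
The recursive resolution can be illustrated with an in-memory copy of the example containers from the Containers section. This is a sketch only: `walk` is a hypothetical function, and the dictionary stands in for container files that would be read from hashstore.

```python
from collections.abc import Iterator

# Containers from the example hierarchy, keyed by (truncated) CID.
# Each entry is (type, cid, name), with type 0 = folder and 1 = file.
containers = {
    "dbc15": [(0, "ad5eb", "A"), (0, "cc08d", "B")],
    "ad5eb": [(1, "10fbd", "a1.txt"), (1, "c880c", "a2.txt")],
    "cc08d": [(1, "00e99", "b1.csv")],
}


def walk(cid: str, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, cid) for every entry below the container `cid`."""
    for entry_type, entry_cid, name in containers[cid]:
        path = f"{prefix}{name}"
        yield path, entry_cid
        if entry_type == 0:
            # Folder entry: descend into the referenced container.
            yield from walk(entry_cid, prefix=f"{path}/")


for path, cid in walk("dbc15"):
    print(cid, path)
```

Each yielded path, combined with the root PID, gives the identifier form used for retrieving an individual object by path.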


---

## `add`

`add(PID:str, path:pathlib.Path)->None`

Add an object or folder to `vhs`.


## `init`

`init(path:pathlib.Path)->None`

Initializes a `vhs` folder within the current folder.


## `status`

`status()->VhsStatus`

Reports the status of the entries in the `vhs` relative to the current state of
the registered content.


## `update`

`update(PID:str|None)`

Recalculates CID values based on the current content of registered entries.


## `commit`

`commit()`

Makes entries in the `vhs` immutable, preventing any further updates to existing
PIDs. Any further changes require a new PID.
