Parallel decompression, stoppable callbacks, and example implementations #21

Merged
jaleman-vdr-wikimedia merged 2 commits into wikimedia-enterprise:main from jaleman-vdr-wikimedia:main on Nov 3, 2025


@jaleman-vdr-wikimedia
Contributor

  • Replace standard gzip decompression in read_all with zlib_ng_threaded using threads=-1 for parallel processing, improving performance on multi-core systems.

  • Modify read_all to use tarfile's streaming mode (r|) and iterate with tar.next(), removing the need to read the full member list and preventing seeks on the non-seekable stream.

  • Implement true end-to-end streaming in read_all by removing the io.BytesIO(f.read()) buffering, making it safe for large files within archives.

  • Add the _TarfileStreamWrapper helper class to bridge compatibility issues between tarfile's streaming file objects and io.TextIOWrapper (missing .seekable(), .closed, .flush()).

  • Update ReadCallback type hint to Callable[[dict], bool].

  • Modify _read_loop, _subscribe_to_entity, and read_all to check the boolean return value of the callback, allowing users to gracefully stop processing early.

  • Added a try... block to the teardown class in the integration suite, so that cleanup is safer after the tests have run.

  • Removed the hardcoded date from the integration suite and replaced it with automatic date determination.

  • Fixed the batches, snapshots, and streaming examples to accommodate and showcase stoppable callbacks.

  • Added a "callback" example that specifically showcases stoppable callbacks, along with its README doc.

  • Discovered that running the examples back to back causes issues during token revocation cleanup. Because of this, auth_client now handles errors during token revocation.
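
For reviewers who want to picture the stoppable-callback contract, here is a minimal sketch. Only the `ReadCallback` type hint comes from this PR. The `read_loop` function is a hypothetical stand-in for `_read_loop`, `_subscribe_to_entity`, and `read_all`, and the convention that `True` means "keep going" while `False` means "stop" is an assumption that should be checked against the SDK code.

```python
from typing import Callable, Iterable

# Matches the updated type hint described in this PR.
ReadCallback = Callable[[dict], bool]


def read_loop(items: Iterable[dict], callback: ReadCallback) -> int:
    """Hypothetical stand-in for the SDK read loops; returns items handled."""
    handled = 0
    for item in items:
        handled += 1
        # Assumed convention: a falsy return value asks the loop to stop.
        if not callback(item):
            break
    return handled


def make_limited_callback(limit: int, sink: list) -> ReadCallback:
    """Build a callback that collects `limit` items, then signals a stop."""
    def callback(article: dict) -> bool:
        sink.append(article["name"])
        return len(sink) < limit
    return callback


collected: list = []
articles = ({"name": f"Article {i}"} for i in range(100))
handled = read_loop(articles, make_limited_callback(3, collected))
print(handled, collected)
```

Because the source is a generator, stopping after three items means the remaining 97 are never produced, which is the point of letting the callback end processing early.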

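
The streaming changes to `read_all` can be approximated with the standard library alone. The sketch below is not the SDK implementation: stdlib `gzip` is swapped in for the threaded zlib-ng reader, and `stream_archive`, `build_sample_archive`, and the NDJSON layout are hypothetical names and assumptions made for illustration. It parses each line as bytes, so it does not need `io.TextIOWrapper` or the `_TarfileStreamWrapper` helper.

```python
import gzip
import io
import json
import tarfile
from typing import BinaryIO, Callable

ReadCallback = Callable[[dict], bool]


def build_sample_archive() -> io.BytesIO:
    """Create a small in-memory .tar.gz with one NDJSON file (demo data only)."""
    payload = b"".join(
        json.dumps({"id": i}).encode("utf-8") + b"\n" for i in range(5)
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="sample.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def stream_archive(raw: BinaryIO, callback: ReadCallback) -> None:
    """Decompress and untar in one forward pass, calling callback per line."""
    # Stdlib gzip stands in here for the threaded zlib-ng reader.
    with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
        # "r|" reads the tar as a forward-only stream: no seeks, no member index.
        with tarfile.open(fileobj=gz, mode="r|") as tar:
            while (member := tar.next()) is not None:
                if not member.isfile():
                    continue
                extracted = tar.extractfile(member)
                if extracted is None:
                    continue
                # Iterate line by line instead of buffering the whole member.
                for line in extracted:
                    # Assumed convention: a False return stops all processing.
                    if not callback(json.loads(line)):
                        return


seen: list = []


def stop_after_three(record: dict) -> bool:
    seen.append(record["id"])
    return len(seen) < 3


stream_archive(build_sample_archive(), stop_after_three)
print(seen)
```

Returning from inside the nested `with` blocks closes both the tar stream and the gzip reader, so an early stop does not leave the decompressor open.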
@jaleman-vdr-wikimedia self-assigned this Oct 30, 2025
@jaleman-vdr-wikimedia added the documentation and enhancement labels Oct 30, 2025
@jaleman-vdr-wikimedia merged commit cb7b1ed into wikimedia-enterprise:main Nov 3, 2025
1 check passed
