Parallel decompression, stoppable callbacks, and example implementations #21
Merged
jaleman-vdr-wikimedia merged 2 commits into wikimedia-enterprise:main on Nov 3, 2025
- Replace standard gzip decompression in read_all with zlib_ng_threaded using threads=-1 for parallel processing, improving performance on multi-core systems.
- Modify read_all to use tarfile's streaming mode (r|) and iterate with tar.next(), removing the need to read the full member list and preventing seeks on the non-seekable stream.
- Implement true end-to-end streaming in read_all by removing the io.BytesIO(f.read()) buffering, making it safe for large files within archives.
- Add the _TarfileStreamWrapper helper class to bridge compatibility issues between tarfile's streaming file objects and io.TextIOWrapper (missing .seekable(), .closed, .flush()).
- Update the ReadCallback type hint to Callable[[dict], bool].
- Modify _read_loop, _subscribe_to_entity, and read_all to check the boolean return value of the callback, allowing users to stop processing early and gracefully.
- Added a try... block to the teardown class in the integration suite, so the cleanup process is safer after tests are executed.
- Removed the hardcoded date from the integration suite and replaced it with automatic date determination.
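For reviewers who want to see the streaming approach in isolation, here is a minimal sketch of the pattern described in the list above. It is not the code from this PR: `stream_records` and `NonSeekableReader` are hypothetical names standing in for read_all and _TarfileStreamWrapper, the archive members are assumed to contain newline-delimited JSON, and the callback is assumed to return True to continue and False to stop. To stay self-contained it uses the standard-library gzip module where the PR uses the threaded zlib-ng reader.

```python
import gzip
import io
import json
import tarfile
from typing import BinaryIO, Callable

ReadCallback = Callable[[dict], bool]


class NonSeekableReader(io.RawIOBase):
    """Hypothetical stand-in for _TarfileStreamWrapper.

    io.RawIOBase supplies .closed and .flush(); we add readable(),
    seekable() and readinto() so io.TextIOWrapper accepts the object.
    """

    def __init__(self, raw):
        super().__init__()
        self._raw = raw

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        # The underlying tar stream can only be read forward.
        return False

    def readinto(self, buffer) -> int:
        data = self._raw.read(len(buffer))
        buffer[: len(data)] = data
        return len(data)


def stream_records(fileobj: BinaryIO, callback: ReadCallback) -> int:
    """Stream NDJSON records out of a .tar.gz; return how many were delivered."""
    delivered = 0
    # Assumption: stdlib gzip here; the PR swaps in a threaded zlib-ng reader.
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz, \
            tarfile.open(fileobj=gz, mode="r|") as tar:
        while True:
            member = tar.next()  # one header at a time, no member list
            if member is None:
                break
            extracted = tar.extractfile(member)
            if extracted is None:  # directories, links, etc.
                continue
            text = io.TextIOWrapper(
                io.BufferedReader(NonSeekableReader(extracted)),
                encoding="utf-8",
            )
            for line in text:  # decoded line by line, never fully buffered
                if not line.strip():
                    continue
                delivered += 1
                if not callback(json.loads(line)):
                    return delivered  # assumed convention: False stops
    return delivered
```

Because members are consumed strictly in order and each one is decoded line by line, memory use stays bounded by the text buffer size rather than by the size of the largest file in the archive.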
- Fixed the batches, snapshots, and streaming examples to accommodate and showcase stoppable callbacks.
- Added a "callback" example to specifically showcase stoppable callbacks, along with its README doc.
- Discovered that running examples back to back causes issues during token revocation cleanup. Because of this, edited auth_client to handle errors during token revocation.
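As a rough illustration of what a stoppable callback looks like from the caller's side, the sketch below collects the first few records and then asks the reader to stop. The names `take_first` and `read_loop` are hypothetical and are not taken from the examples in this PR; the sketch also assumes that returning True means "keep going" and False means "stop", which the description above does not spell out.

```python
from typing import Callable, Iterable, List

ReadCallback = Callable[[dict], bool]


def take_first(limit: int, sink: List[dict]) -> ReadCallback:
    """Build a callback that stores records in `sink` and stops after `limit`."""

    def callback(record: dict) -> bool:
        sink.append(record)
        # Assumed convention: True = continue, False = stop.
        return len(sink) < limit

    return callback


def read_loop(records: Iterable[dict], callback: ReadCallback) -> None:
    """Hypothetical reader loop: deliver records until the callback declines."""
    for record in records:
        if not callback(record):
            break
```

A caller would pass something like `take_first(3, collected)` wherever a ReadCallback is expected; once the third record arrives, the loop exits without pulling any further data from the source.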
Merged commit cb7b1ed into wikimedia-enterprise:main
1 check passed