Open
Conversation
- Create abstract DataBackend interface for supporting multiple dataframe libraries - Implement PandasBackend for backward compatibility with existing code - Implement PolarsBackend for faster processing and larger-than-memory datasets - Add automatic backend detection via get_backend() function - Support both eager and lazy evaluation modes in Polars This addresses issue pytorch-tabular#402 by providing the foundation for supporting Polars, Spark, and other dataframe frameworks in TabularDatamodule.
- Create TabularDatamoduleV2 class with automatic backend detection - Add Polars as optional dependency in pyproject.toml - Update __init__.py to export new classes and backend utilities - Add comprehensive example showing Polars usage patterns - Create detailed documentation for multi-backend support - Support both eager (DataFrame) and lazy (LazyFrame) Polars modes - Include sampling utility for transformer fitting on large datasets - Maintain full backward compatibility with existing pandas-based code This implementation provides: - 2-5x faster CSV reading with Polars - Better memory efficiency through Arrow memory format - Support for larger-than-memory datasets via LazyFrames - Foundation for future Spark and Dask backend support Addresses issue pytorch-tabular#402 - Re-write DataModule for larger-than-memory support
- Test PandasBackend operations (shape, columns, transforms, etc.) - Test PolarsBackend with both eager and lazy DataFrames - Test TabularDatamoduleV2 with pandas and polars backends - Test sampling utility for large dataset transform fitting - Add pytest skipif for optional polars dependency Tests verify: - Automatic backend detection - Backend operation correctness - Lazy loading capability flags - Integration with TabularDatamoduleV2
Author
|
@fkiraly @manujosephv whenever you are free, please review this pr. Thanks! |
Collaborator
|
Thanks a lot @arnavk23 for the PR. I just skimmed through it and had a question. This PR doesnt enable lazy mode in Polars, right? just introduces a decoupled backend and adds Polars as an option? |
Author
|
@manujosephv This PR does not implement end-to-end lazy training in Polars. It mainly:
|
Collaborator
|
@fkiraly Tagging you here since you are leading major dev in here.. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This implementation provides the architectural foundation requested in #402, with initial support for Polars (including larger-than-memory datasets). Future work can add Spark and other backends following the same pattern.