Commit f240e83

Merge pull request #301 from Roche/dev
version 1.3.1
2 parents d74a751 + 0d50e12 commit f240e83

28 files changed

Lines changed: 20777 additions & 16229 deletions

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ authors:
 given-names: "Otto"
 orcid: "https://orcid.org/0000-0002-3363-9287"
 title: "Pyreadstat"
-version: 1.3.0
+version: 1.3.1
 doi: 10.5281/zenodo.6612282
 date-released: 2018-09-24
 url: "https://github.com/Roche/pyreadstat"

README.md

Lines changed: 42 additions & 35 deletions
@@ -1,7 +1,7 @@
 # pyreadstat

 A python package to read and write sas (sas7bdat, sas7bcat, xport), spps (sav, zsav, por) and stata (dta) data files
-into/from pandas dataframes.
+into/from pandas and polars dataframes.
 <br>

 This module is a wrapper around the excellent [Readstat](https://github.com/WizardMac/ReadStat) C library by
@@ -133,7 +133,8 @@ brings a big hit in performance. The situation can be improved tough by reading

 ## Dependencies

-The module depends on pandas, which you normally have installed if you got Anaconda (highly recommended.)
+The module depends on numpy and narwhals, a package to interface with pandas and polars. In addition you will need to have installed
+either pandas or polars.

 In order to compile from source you will need a C compiler (see installation).
 Only if you want to do changes to the cython source code, you will need cython (normally not necessary).
@@ -222,7 +223,7 @@ the folder build, otherwise you may be installing the old compilation again).

 #### Reading files

-Pass the path to a file to any of the functions provided by pyreadstat. It will return a pandas data frame and a metadata
+Pass the path to a file to any of the functions provided by pyreadstat. It will return a pandas or polars data frame and a metadata
 object. <br>
 The dataframe uses the column names. The metadata object contains the column names, column labels, number_rows,
 number_columns, file label
@@ -234,7 +235,8 @@ For example, in order to read a sas7bdat file:
 ```python
 import pyreadstat

-df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')
+# output format by default is pandas. You can use polars to get a polars dataframe.
+df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', output_format="pandas")

 # done! let's see what we got
 print(df.head())
@@ -257,25 +259,38 @@ df.columns = meta.column_labels
 df.columns = meta.column_names
 ```

+As mentioned before you can very easily read into a polars dataframe by using the output_format argument:
+
+```python
+import pyreadstat
+
+# this time df will be polars
+df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', output_format="polars")
+
+# done! let's see what we got
+print(df.head())
+```
+
 #### Writing files

 Pyreadstat can write STATA (dta), SPSS (sav and zsav, por currently nor supported) and SAS (Xport, sas7bdat and sas7bcat
-currently not supported) files from pandas data frames.
+currently not supported) files from pandas or polars dataframes.

-write functions take as first argument a pandas data frame (other data structures are not supported), as a second argument
+write functions take as first argument a pandas or polars dataframe (other data structures are not supported), as a second argument
 the path to the destination file. Optionally you can also pass a file label and a list with column labels.

 ```python
 import pandas as pd
 import pyreadstat

+# this would work the same for a polars dataframe
 df = pd.DataFrame([[1,2.0,"a"],[3,4.0,"b"]], columns=["v1", "v2", "v3"])
 # column_labels can also be a dictionary with variable name as key and label as value
 column_labels = ["Variable 1", "Variable 2", "Variable 3"]
 pyreadstat.write_sav(df, "path/to/destination.sav", file_label="test", column_labels=column_labels)
 ```

-Some special arguments are available depending on the function. write_sav can take also notes as string, wheter to
+Some special arguments are available depending on the function. write_sav can take also notes as string or list of strings, wheter to
 compress or not as zsav or apply row compression, variable display widths and variable measures. write_dta can take a stata version.
 write_xport a name for the dataset. User defined missing values and value labels are also supported. See the
 [Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html) for more details.
@@ -434,7 +449,7 @@ function. The original values will be replaced by the values in the catalog.
 ```python
 import pyreadstat

-# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. There is also formats_as_ordered_category to get an ordered category, this by default is False.
+# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas/polars category column. There is also formats_as_ordered_category to get an ordered category, this by default is False.
 df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', catalog_file='/path/to/a/file.sas7bcat', formats_as_category=True, formats_as_ordered_category=False)
 ```

@@ -449,7 +464,7 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')
 # read_sas7bdat returns an emtpy data frame and the catalog
 df_empty, catalog = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bcat')
 # enrich the dataframe with the catalog
-# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
+# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas/polars category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
 df_enriched, meta_enriched = pyreadstat.set_catalog_to_sas(df, meta, catalog,
 formats_as_category=True, formats_as_ordered_category=False)
 ```
@@ -461,7 +476,7 @@ when reading the file using the option apply_value_formats, ...
 import pyreadstat

 # apply_value_formats is by default False, so you have to set it to True manually if you want the labels
-# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
+# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas/polars category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
 df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=True,
 formats_as_category=True, formats_as_ordered_category=False)
 ```
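
The options in the hunk above control whether coded values are replaced by their labels and whether the result becomes a (possibly ordered) category column. A minimal pandas-only sketch of that idea follows. It is not pyreadstat's implementation: the column `sex`, the `value_labels` mapping, and the choice to order categories by numeric code are hypothetical assumptions made for illustration.

```python
import pandas as pd

# Hypothetical coded column and value labels, standing in for what a file
# and its metadata would provide.
df = pd.DataFrame({"sex": [1, 2, 1]})
value_labels = {1: "Male", 2: "Female"}

# Replace codes by labels, then convert to an unordered categorical,
# analogous to formats_as_category=True, formats_as_ordered_category=False.
labelled = df["sex"].map(value_labels).astype("category")

# An ordered categorical, analogous to formats_as_ordered_category=True.
# Assumption: the category order follows the numeric codes.
ordered = pd.Categorical(
    df["sex"].map(value_labels),
    categories=[value_labels[code] for code in sorted(value_labels)],
    ordered=True,
)
```

With an ordered categorical, comparisons such as `ordered < "Female"` become meaningful, which is the practical difference between the two options.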
@@ -530,9 +545,9 @@ example if one has a categorical variable representing if the person passed a te
 1 for pass, and as user defined missing variables 2 for did not show up for the test, 3 for unable to process the results,
 etc.

-**By default both cases are represented by NaN when
+**By default both cases are represented by NaN in pandas and null in polars when
 read with pyreadstat**. Notice that the only possible missing value in pandas is NaN (Not a Number) for both string and numeric
-variables, date, datetime and time variables have NaT (Not a Time).
+variables, date, datetime and time variables have NaT (Not a Time). Polars use null for all datatypes.

 ##### SPSS

@@ -599,16 +614,16 @@ translated as NaN by default and to the correspoding string value if
 user_missing is set to True. meta.missing_ranges will show the string
 value as well.

-When writing a pandas dataframe to a sav file, if user defined missing values are not set, NaNs are translated to
+When writing a dataframe to a sav file, if user defined missing values are not set, NaNs are translated to
 empty strings, as there is no other possibility to represent those missing values and user defined missing values
 are not set automatically.

-When reading a sav into a pandas dataframe, if the value in
-a character variable is an empty string (''), it will not be translated to NaN, but will stay as an empty string. This
+When reading a sav into a dataframe, if the value in
+a character variable is an empty string (''), it will not be translated to NaN/null, but will stay as an empty string. This
 is because the empty string is a valid character value in SPSS and pyreadstat preserves that property.

 This behaviour generates an asymetrical situation that has to be managed by the user. You can convert
-empty strings to nan very easily with pandas if you think it is appropiate
+empty strings to nan very easily if you think it is appropiate
 for your dataset, or you can use defined missing values as described before.


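
The asymmetry described in the hunk above can be handled after reading. A minimal pandas sketch, assuming a hypothetical frame in place of the dataframe returned by `pyreadstat.read_sav`; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical frame standing in for the result of pyreadstat.read_sav.
df = pd.DataFrame({"name": ["ann", "", "bob"], "score": [1.0, 2.0, 3.0]})

# Series.mask replaces entries where the condition is True with a missing
# value, so empty strings in the character column become missing (NaN).
df["name"] = df["name"].mask(df["name"] == "")
```

Whether this conversion is appropriate depends on the dataset, since an empty string is a valid character value in SPSS. For a polars dataframe the missing marker would be null rather than NaN.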
@@ -700,42 +715,34 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', encoding="LATIN1
 ```

 You can preserve the original pandas behavior regarding dates (meaning dates are converted to pandas datetime) with the
-dates_as_pandas_datetime option
+dates_as_pandas_datetime option. This option is effective for pandas only, not for polars.

 ```python
 import pyreadstat

 df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', dates_as_pandas_datetime=True)
 ```

-You can get a dictionary of numpy arrays instead of a pandas dataframe when reading any file format.
-In order to do that, set the parameter output_format='dict' (default is 'pandas'). This is useful if
-you want to transform the data to some other format different to pandas, as transforming the data to pandas is a costly
-process both in terms of speed and memory. Here for example an efficient way to transform the data to a polars dataframe:
-
-```python
-import pyreadstat
-import polars
-
-dicdata, meta = pyreadstat.read_sav('/path/to/a/file.sav', output_format='dict')
-df = polars.DataFrame(dicdata)
-```
+You can get a dictionary of numpy arrays instead of a pandas or polars dataframe when reading any file format.
+In order to do that, set the parameter output_format='dict' (default is 'pandas', the other option is 'polars'). This is useful if
+you want to transform the data to some other format different to pandas/polars, as transforming the data to pandas is a costly
+process both in terms of speed and memory.

 For more information, please check the [Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html).

 ### More writing options

 #### File specific options

-Some special arguments are available depending on the function. write_sav can take also notes as string, wheter to
+Some special arguments are available depending on the function. write_sav can take also notes as string or list of strings, wheter to
 compress or not as zsav or apply row compression, variable display widths and variable measures. write_dta can take a stata version.
 write_xport a name for the dataset. See the
 [Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html) for more details.

 #### Writing value labels

 The argument variable_value_labels can be passed to write_sav and write_dta to write value labels. This argument must be a
-dictionary where keys are variable names (names must match column names in the pandas data frame). Values are another dictionary where
+dictionary where keys are variable names (names must match column names in the dataframe). Values are another dictionary where
 keys are the value present in the dataframe and values are the labels (strings).

 ```python
@@ -812,7 +819,7 @@ for the documentation of the original application.
 In the case of SPSS we have some presets for some formats:
 * restricted_integer: with leading zeros, equivalent to N + variable width (e.g N4)
 * integer: Numeric with no decimal places, equivalent to F + variable width + ".0" (0 decimal positions). A
-pandas column of type integer will also be translated into this format automatically.
+column of type integer will also be translated into this format automatically.

 ```python
 import pandas as pd
@@ -828,12 +835,12 @@ There is some information about the possible formats [here](https://www.gnu.org/

 #### Variable type conversion

-The following rules are used in order to convert from pandas/numpy/python types to the target file types:
+The following rules are used in order to convert from pandas/polars/numpy/python types to the target file types:

 | Python Type | Converted Type |
 | ------------------- | --------- |
-| np.int32 or lower | integer (stata), numeric (spss, sas) |
-| int, np.int64, np.float | double (stata), numeric (spss, sas) |
+| np.int32/pl.int32 or lower | integer (stata), numeric (spss, sas) |
+| int, np.int64, pl.int64, np.float, pl.float64 | double (stata), numeric (spss, sas) |
 | str | character |
 | bool | integer (stata), numeric (spss, sas) |
 | datetime, date, time | numeric with datetime/date/time formatting |

change_log.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+# 1.3.1 (github, pypi and conda 2025.08.14)
+* make list of notes writable, solves #292
+* enable support for polars, solves #282
+
 # 1.3.0 (github, pypi and conda 2025.06.27)
 * updated Readstat sources to commit b2d5407d62caf3c33caadc0495c9f7684b6a0df7
 solves #128, #165, #261, #284,
0 Bytes
Binary file not shown.

docs/_build/doctrees/index.doctree

-1.2 KB
Binary file not shown.

docs/_build/html/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 88463747b8ee475dc12dfb96b6180375
+config: 371405f5ed6b8ef2c10a8837783eb668
 tags: 645f666f9bcd5a90fca523b33c5a78b7

docs/_build/html/.buildinfo.bak

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
-# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: c408a8fb34ee89dd28b50ef275d0f07c
+# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: 88463747b8ee475dc12dfb96b6180375
 tags: 645f666f9bcd5a90fca523b33c5a78b7

docs/_build/html/_static/documentation_options.js

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 const DOCUMENTATION_OPTIONS = {
-VERSION: '1.3.0',
+VERSION: '1.3.1',
 LANGUAGE: 'en',
 COLLAPSE_INDEX: false,
 BUILDER: 'html',

docs/_build/html/genindex.html

Lines changed: 2 additions & 2 deletions
@@ -5,12 +5,12 @@
 <head>
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-<title>Index &mdash; pyreadstat 1.3.0 documentation</title>
+<title>Index &mdash; pyreadstat 1.3.1 documentation</title>
 <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=03e43079" />
 <link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=e59714d7" />


-<script src="_static/documentation_options.js?v=1f29e9d3"></script>
+<script src="_static/documentation_options.js?v=bb516dca"></script>
 <script src="_static/doctools.js?v=9bcbadda"></script>
 <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 <script src="_static/js/theme.js"></script>
