11# pyreadstat
22
33A Python package to read and write SAS (sas7bdat, sas7bcat, xport), SPSS (sav, zsav, por) and Stata (dta) data files
4- into/from pandas dataframes.
4+ into/from pandas and polars dataframes.
55<br >
66
77This module is a wrapper around the excellent [Readstat](https://github.com/WizardMac/ReadStat) C library by
@@ -133,7 +133,8 @@ brings a big hit in performance. The situation can be improved though by reading
133133
134134## Dependencies
135135
136- The module depends on pandas, which you normally have installed if you got Anaconda (highly recommended.)
136+ The module depends on numpy and narwhals, a package to interface with pandas and polars. In addition you will need to have installed
137+ either pandas or polars.
137138
138139In order to compile from source you will need a C compiler (see installation).
139140You will need cython only if you want to make changes to the cython source code (normally not necessary).
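
Since either pandas or polars has to be installed, it can help to check which one is available before reading files. The following is a minimal sketch using only the standard library; the helper name `available_backends` is hypothetical and not part of pyreadstat.

```python
from importlib.util import find_spec


def available_backends(candidates=("pandas", "polars")):
    """Return the names of the candidate packages that can be imported."""
    return [name for name in candidates if find_spec(name) is not None]


backends = available_backends()
if backends:
    print("dataframe backends found:", ", ".join(backends))
else:
    print("install pandas or polars before using pyreadstat")
```

The first entry of the returned list could then be passed as the `output_format` argument described in the reading section.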
@@ -222,7 +223,7 @@ the folder build, otherwise you may be installing the old compilation again).
222223
223224#### Reading files
224225
225- Pass the path to a file to any of the functions provided by pyreadstat. It will return a pandas data frame and a metadata
226+ Pass the path to a file to any of the functions provided by pyreadstat. It will return a pandas or polars data frame and a metadata
226227object. <br >
227228The dataframe uses the column names. The metadata object contains the column names, column labels, number_rows,
228229number_columns, file label
@@ -234,7 +235,8 @@ For example, in order to read a sas7bdat file:
234235``` python
235236import pyreadstat
236237
237- df, meta = pyreadstat.read_sas7bdat(' /path/to/a/file.sas7bdat' )
238+ # The default output format is pandas. Pass output_format="polars" to get a polars dataframe instead.
239+ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', output_format="pandas")
238240
239241# done! let's see what we got
240242print(df.head())
@@ -257,25 +259,38 @@ df.columns = meta.column_labels
257259df.columns = meta.column_names
258260```
259261
262+ As mentioned before you can very easily read into a polars dataframe by using the output_format argument:
263+
264+ ``` python
265+ import pyreadstat
266+
267+ # this time df will be polars
268+ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', output_format="polars")
269+
270+ # done! let's see what we got
271+ print(df.head())
272+ ```
273+
260274#### Writing files
261275
262276Pyreadstat can write STATA (dta), SPSS (sav and zsav, por currently not supported) and SAS (Xport, sas7bdat and sas7bcat
263- currently not supported) files from pandas data frames .
277+ currently not supported) files from pandas or polars dataframes.
264278
265- write functions take as first argument a pandas data frame (other data structures are not supported), as a second argument
279+ The write functions take a pandas or polars dataframe as the first argument (other data structures are not supported) and, as the second argument,
266280the path to the destination file. Optionally you can also pass a file label and a list with column labels.
267281
268282``` python
269283import pandas as pd
270284import pyreadstat
271285
286+ # this would work the same for a polars dataframe
272287df = pd.DataFrame([[1, 2.0, "a"], [3, 4.0, "b"]], columns=["v1", "v2", "v3"])
273288# column_labels can also be a dictionary with variable name as key and label as value
274289column_labels = ["Variable 1", "Variable 2", "Variable 3"]
275290pyreadstat.write_sav(df, "path/to/destination.sav", file_label="test", column_labels=column_labels)
276291```
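
The comment in the example above notes that column_labels can also be a dictionary. A minimal sketch of building such a dictionary from two lists, using only the standard library; the variable names are illustrative, and the resulting dictionary would be passed as `column_labels` in place of the list.

```python
column_names = ["v1", "v2", "v3"]
labels = ["Variable 1", "Variable 2", "Variable 3"]

# map each variable name to its label (names and labels are paired by position)
column_labels = dict(zip(column_names, labels))
print(column_labels)
```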
277292
278- Some special arguments are available depending on the function. write_sav can take also notes as string, wheter to
293+ Some special arguments are available depending on the function. write_sav can also take notes as a string or list of strings, whether to
279294compress or not as zsav or apply row compression, variable display widths and variable measures. write_dta can take a stata version.
280295write_xport can take a name for the dataset. User defined missing values and value labels are also supported. See the
281296[Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html) for more details.
@@ -434,7 +449,7 @@ function. The original values will be replaced by the values in the catalog.
434449``` python
435450import pyreadstat
436451
437- # formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. There is also formats_as_ordered_category to get an ordered category, this by default is False.
452+ # formats_as_category is by default True, and it means the replaced values will be transformed to a pandas/polars category column. There is also formats_as_ordered_category to get an ordered category, this by default is False.
438453df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', catalog_file='/path/to/a/file.sas7bcat', formats_as_category=True, formats_as_ordered_category=False)
439454```
440455
@@ -449,7 +464,7 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')
449464# read_sas7bdat returns an empty data frame and the catalog
450465df_empty, catalog = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bcat')
451466# enrich the dataframe with the catalog
452- # formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
467+ # formats_as_category is by default True, and it means the replaced values will be transformed to a pandas/polars category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
453468df_enriched, meta_enriched = pyreadstat.set_catalog_to_sas(df, meta, catalog,
454469                         formats_as_category=True, formats_as_ordered_category=False)
455470```
@@ -461,7 +476,7 @@ when reading the file using the option apply_value_formats, ...
461476import pyreadstat
462477
463478# apply_value_formats is by default False, so you have to set it to True manually if you want the labels
464- # formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
479+ # formats_as_category is by default True, and it means the replaced values will be transformed to a pandas/polars category column. formats_as_ordered_category is by default False meaning by default categories are not ordered.
465480df, meta = pyreadstat.read_sav("/path/to/sav/file.sav", apply_value_formats=True,
466481                         formats_as_category=True, formats_as_ordered_category=False)
467482```
@@ -530,9 +545,9 @@ example if one has a categorical variable representing if the person passed a te
5305451 for pass, and as user defined missing variables 2 for did not show up for the test, 3 for unable to process the results,
531546etc.
532547
533- ** By default both cases are represented by NaN when
548+ **By default both cases are represented by NaN in pandas and null in polars when
534549read with pyreadstat**. Notice that the only possible missing value in pandas is NaN (Not a Number) for both string and numeric
535- variables, date, datetime and time variables have NaT (Not a Time).
550+ variables; date, datetime and time variables have NaT (Not a Time). Polars uses null for all data types.
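
To make the pandas side of this concrete, here is a small sketch with made-up data (no file is read) showing that numeric and string columns hold NaN while a datetime column holds NaT, and that `isna` detects both.

```python
import numpy as np
import pandas as pd

# made-up data standing in for what a read function would return
df = pd.DataFrame({
    "num": [1.0, np.nan],
    "txt": ["a", np.nan],
    "when": pd.to_datetime(["2020-01-01", None]),
})

# count the missing values per column
print(df.isna().sum().to_dict())
# the missing datetime is the NaT singleton
print(df["when"].iloc[1] is pd.NaT)
```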
536551
537552##### SPSS
538553
@@ -599,16 +614,16 @@ translated as NaN by default and to the corresponding string value if
599614user_missing is set to True. meta.missing_ranges will show the string
600615value as well.
601616
602- When writing a pandas dataframe to a sav file, if user defined missing values are not set, NaNs are translated to
617+ When writing a dataframe to a sav file, if user defined missing values are not set, NaNs are translated to
603618empty strings, as there is no other possibility to represent those missing values and user defined missing values
604619are not set automatically.
605620
606- When reading a sav into a pandas dataframe, if the value in
607- a character variable is an empty string (''), it will not be translated to NaN, but will stay as an empty string. This
621+ When reading a sav into a dataframe, if the value in
622+ a character variable is an empty string (''), it will not be translated to NaN/null , but will stay as an empty string. This
608623is because the empty string is a valid character value in SPSS and pyreadstat preserves that property.
609624
610625This behaviour generates an asymmetrical situation that has to be managed by the user. You can convert
611- empty strings to nan very easily with pandas if you think it is appropiate
626+ empty strings to NaN/null very easily if you think it is appropriate
612627for your dataset, or you can use defined missing values as described before.
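
One possible way to do the conversion in pandas is sketched below. The dataframe is made up and stands in for the result of reading a sav file; whether the replacement is appropriate depends on your dataset.

```python
import numpy as np
import pandas as pd

# made-up data: the empty string in "name" is what a sav file would give back
df = pd.DataFrame({"name": ["ann", "", "bob"], "score": [1.0, 2.0, np.nan]})

# turn empty strings in the character column into missing values
df["name"] = df["name"].replace("", np.nan)
print(df["name"].isna().tolist())
```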
613628
614629
@@ -700,42 +715,34 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', encoding="LATIN1
700715```
701716
702717You can preserve the original pandas behavior regarding dates (meaning dates are converted to pandas datetime) with the
703- dates_as_pandas_datetime option
718+ dates_as_pandas_datetime option. This option is effective for pandas only, not for polars.
704719
705720``` python
706721import pyreadstat
707722
708723df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', dates_as_pandas_datetime=True)
709724```
710725
711- You can get a dictionary of numpy arrays instead of a pandas dataframe when reading any file format.
712- In order to do that, set the parameter output_format='dict' (default is 'pandas'). This is useful if
713- you want to transform the data to some other format different to pandas, as transforming the data to pandas is a costly
714- process both in terms of speed and memory. Here for example an efficient way to transform the data to a polars dataframe:
715-
716- ``` python
717- import pyreadstat
718- import polars
719-
720- dicdata, meta = pyreadstat.read_sav(' /path/to/a/file.sav' , output_format = ' dict' )
721- df = polars.DataFrame(dicdata)
722- ```
726+ You can get a dictionary of numpy arrays instead of a pandas or polars dataframe when reading any file format.
727+ In order to do that, set the parameter output_format='dict' (default is 'pandas', the other option is 'polars'). This is useful if
728+ you want to transform the data to some other format different to pandas/polars, as transforming the data to pandas is a costly
729+ process both in terms of speed and memory.
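
As a sketch of what working with such a dictionary could look like, the example below uses a hand-made dictionary of numpy arrays as a stand-in for the one returned with output_format='dict'; the column names and values are invented for illustration.

```python
import numpy as np

# stand-in for: dicdata, meta = pyreadstat.read_sav(path, output_format='dict')
dicdata = {
    "id": np.array([1, 2, 3]),
    "weight": np.array([60.5, 72.0, 81.3]),
}

# compute per-column statistics directly on the arrays, without a dataframe
means = {name: float(col.mean()) for name, col in dicdata.items()}

# or iterate over the data row by row
rows = list(zip(*dicdata.values()))
print(means["id"], len(rows))
```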
723730
724731For more information, please check the [Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html).
725732
726733### More writing options
727734
728735#### File specific options
729736
730- Some special arguments are available depending on the function. write_sav can take also notes as string, wheter to
737+ Some special arguments are available depending on the function. write_sav can also take notes as a string or list of strings, whether to
731738compress or not as zsav or apply row compression, variable display widths and variable measures. write_dta can take a stata version.
732739write_xport can take a name for the dataset. See the
733740[Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html) for more details.
734741
735742#### Writing value labels
736743
737744The argument variable_value_labels can be passed to write_sav and write_dta to write value labels. This argument must be a
738- dictionary where keys are variable names (names must match column names in the pandas data frame ). Values are another dictionary where
745+ dictionary where keys are variable names (names must match column names in the dataframe ). Values are another dictionary where
739746keys are the value present in the dataframe and values are the labels (strings).
740747
741748``` python
@@ -812,7 +819,7 @@ for the documentation of the original application.
812819In the case of SPSS we have some presets for some formats:
813820* restricted_integer: with leading zeros, equivalent to N + variable width (e.g. N4)
814821* integer: Numeric with no decimal places, equivalent to F + variable width + ".0" (0 decimal positions). A
815- pandas column of type integer will also be translated into this format automatically.
822+ column of type integer will also be translated into this format automatically.
816823
817824```python
818825import pandas as pd
@@ -828,12 +835,12 @@ There is some information about the possible formats [here](https://www.gnu.org/
828835
830837#### Variable type conversion
830837
831- The following rules are used in order to convert from pandas/ numpy/ python types to the target file types:
838+ The following rules are used in order to convert from pandas/polars/numpy/python types to the target file types:
832839
833840| Python Type | Converted Type |
834841| ------------------- | --------- |
835- | np.int32 or lower | integer (stata), numeric (spss, sas) |
836- | int , np.int64, np.float | double (stata), numeric (spss, sas) |
842+ | np.int32/pl.Int32 or lower | integer (stata), numeric (spss, sas) |
843+ | int, np.int64, pl.Int64, np.float, pl.Float64 | double (stata), numeric (spss, sas) |
837844| str | character |
838845| bool | integer (stata), numeric (spss, sas) |
839846| datetime, date, time | numeric with datetime/date/time formatting |
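
Going by the first two rows of the table, an integer column stored as np.int32 would be written as a Stata integer, while a 64-bit integer column would be written as a double. A minimal pandas sketch of casting a column before writing, with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3], "ratio": [0.5, 1.5, 2.5]})

# cast the integer column down so that it falls in the "np.int32 or lower" row
df["count"] = df["count"].astype(np.int32)
print(df.dtypes.to_dict())
```

The cast is only safe when all values fit into 32 bits.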