@@ -333,7 +333,8 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', usecols=["variab
 A challenge when reading large files is the time consumed in the operation. In order to alleviate this
 pyreadstat provides a function "read_file_multiprocessing" to read a file in parallel processes using
 the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If
- that is not the case look at Reading rows in chunks (next section)
+ that is not the case look at Reading rows in chunks (next section). Notice however that you can combine reading in parallel
+ with reading in chunks as described in the next section.
 
 Speed ups in the process will depend on a number of factors such as number of processes available, RAM,
 content of the file etc.
@@ -598,11 +599,17 @@ translated as NaN by default and to the corresponding string value if
 user_missing is set to True. meta.missing_ranges will show the string
 value as well.
 
- If the value in
+ When writing a pandas dataframe to a sav file, if user defined missing values are not set, NaNs are translated to
+ empty strings, as there is no other possibility to represent those missing values and user defined missing values
+ are not set automatically.
+
+ When reading a sav into a pandas dataframe, if the value in
 a character variable is an empty string (''), it will not be translated to NaN, but will stay as an empty string. This
- is because the empty string is a valid character value in SPSS and pyreadstat preserves that property. You can convert
+ is because the empty string is a valid character value in SPSS and pyreadstat preserves that property.
+
+ This behaviour generates an asymmetrical situation that has to be managed by the user. You can convert
 empty strings to nan very easily with pandas if you think it is appropriate
- for your dataset.
+ for your dataset, or you can use defined missing values as described before.
 
 
 ##### SAS and STATA
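Note on the SPSS hunk above: the empty-string-to-NaN conversion it mentions can be sketched with plain pandas. This is only an illustration and not part of the commit. The column names and sample values are made up, and the frame stands in for whatever a sav read would return.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the result of reading a sav file:
# the empty string in the character column is kept as '' rather than NaN.
df = pd.DataFrame(
    {"name": ["alice", "", "carol"], "age": [31.0, 45.0, np.nan]}
)

# Series.mask replaces values with NaN wherever the condition is True,
# so only the empty strings in the character column are affected.
df["name"] = df["name"].mask(df["name"] == "")

# Count the values now treated as missing in the character column.
missing_names = int(df["name"].isna().sum())
print(missing_names)
```

Restricting the conversion to chosen character columns leaves numeric columns untouched and keeps the decision explicit, which matters when an empty string is a meaningful value in the dataset.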
@@ -641,7 +648,6 @@ df, meta = pyreadstat.read_dta("/path/to/file.dta", user_missing=True, apply_val
 
 Empty strings are still translated as empty strings and not as NaN.
 
-
 The information about what values are user missing is stored in the meta object, in the variable missing_user_values.
 This is a list listing all user defined missing values.
 
@@ -798,7 +804,7 @@ pyreadstat.write_sav(df, path, variable_format=formats)
 ```
 
 The appropriate formats to use are beyond the scope of this documentation. Probably you want to read a file
- produced in the original application and use meta.original_value_formats to get the formats. Otherwise look
+ produced in the original application and use meta.original_variable_types to get the formats. Otherwise look
 for the documentation of the original application.
 
 ##### SPSS