{"id":19964719,"url":"https://github.com/roche/pyreadstat","last_synced_at":"2025-05-16T03:03:45.222Z","repository":{"id":32910844,"uuid":"145536189","full_name":"Roche/pyreadstat","owner":"Roche","description":"Python package to read sas, spss and stata files into pandas data frames. It is a wrapper for the C library readstat.","archived":false,"fork":false,"pushed_at":"2025-05-12T09:00:05.000Z","size":42286,"stargazers_count":355,"open_issues_count":22,"forks_count":62,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-05-12T10:23:04.473Z","etag":null,"topics":["conversion","pandas-dataframe","python","readstat","sas7bdat","spss","stata-files"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Roche.png","metadata":{"files":{"readme":"README.md","changelog":"change_log.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-08-21T08:58:10.000Z","updated_at":"2025-04-30T14:35:19.000Z","dependencies_parsed_at":"2025-04-13T12:22:23.346Z","dependency_job_id":"43242a40-4ed8-49a5-896b-ba67a792f689","html_url":"https://github.com/Roche/pyreadstat","commit_stats":{"total_commits":292,"total_committers":16,"mean_commits":18.25,"dds":0.2534246575342466,"last_synced_commit":"06dbeec2b367bc8b465416d37c041d07c40912dd"},"previous_names":[],"tags_count":41,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roche%2Fpyreadstat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roche%2Fpyreadstat/tags","release
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roche%2Fpyreadstat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Roche%2Fpyreadstat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Roche","download_url":"https://codeload.github.com/Roche/pyreadstat/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254459084,"owners_count":22074604,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conversion","pandas-dataframe","python","readstat","sas7bdat","spss","stata-files"],"created_at":"2024-11-13T02:25:00.608Z","updated_at":"2025-05-16T03:03:40.214Z","avatar_url":"https://github.com/Roche.png","language":"C","readme":"# pyreadstat\n\nA python package to read and write sas (sas7bdat, sas7bcat, xport), spss (sav, zsav, por) and stata (dta) data files\ninto/from pandas dataframes.\n\u003cbr\u003e\n\nThis module is a wrapper around the excellent [Readstat](https://github.com/WizardMac/ReadStat) C library by\n[Evan Miller](https://www.evanmiller.org/). 
Readstat is the library used under the hood by the R library\n[Haven](https://github.com/tidyverse/haven),\nmeaning pyreadstat is a python equivalent to R Haven.\n\nDetailed documentation on all available methods is in the\n[Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html)\n\nIf you would like to read R RData and Rds files into python in an easy way,\ntake a look at [pyreadr](https://github.com/ofajardo/pyreadr), a wrapper\naround the C library [librdata](https://github.com/WizardMac/librdata)\n\nIf you would like to effortlessly produce beautiful summaries from pandas dataframes take\na look at [pysummaries](https://github.com/Genentech/pysummaries)!\n\n\n**DISCLAIMER**\n\n**Pyreadstat is not a validated package. The results may have inaccuracies deriving from the fact that most of the data formats\nare not open. Do not use it for critical tasks such as reporting to the authorities. Pyreadstat is not meant to replace\nthe original applications in this regard.**  \n\n## Table of Contents\n\n* [Motivation](#motivation)\n* [Dependencies](#dependencies)\n* [Installation](#installation)\n  + [Using pip](#using-pip)\n  + [Using conda](#using-conda)\n  + [From the latest sources](#from-the-latest-sources)\n  + [Compiling on Windows and Mac](#compiling-on-windows-and-mac)\n* [Usage](#usage)\n  + [Basic Usage](#basic-usage)\n    - [Reading Files](#reading-files)\n    - [Writing Files](#writing-files)\n  + [More reading options](#more-reading-options)\n    - [Reading only the headers](#reading-only-the-headers)\n    - [Reading selected columns](#reading-selected-columns)\n    - [Reading files in parallel processes](#reading-files-in-parallel-processes)\n    - [Reading rows in chunks](#reading-rows-in-chunks)\n    - [Reading value labels](#reading-value-labels)\n    - [Missing Values](#missing-values)\n      + [SPSS](#spss)\n      + [SAS and STATA](#sas-and-stata)\n    - [Reading datetime and date 
columns](#reading-datetime-and-date-columns)\n    - [Other options](#other-options)\n  + [More writing options](#more-writing-options)\n    - [File specific options](#file-specific-options)\n    - [Writing value labels](#writing-value-labels)\n    - [Writing user defined missing values](#writing-user-defined-missing-values)\n    - [Setting variable formats](#setting-variable-formats)\n    - [Variable type conversion](#variable-type-conversion)\n* [Roadmap](#roadmap)\n* [CD/CI and wheels](#cdci_and_wheels)\n* [Known limitations](#known-limitations)\n* [Python 2.7 support.](#python-27-support)\n* [Change log](#change-log)\n* [License](#license)\n* [Contributing](#contributing)\n* [People](#people)\n\n\n## Motivation\n\nThe original motivation came from reading sas7bdat files in python. That is already possible using either the (pure\npython) package [sas7bdat](https://pypi.org/project/sas7bdat/) or the (cythonized) method\n[read_sas](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html)\nfrom pandas. However, those methods are slow (which matters if you want to read several large files), do not give the\npossibility to recover value labels (stored in\nthe file itself in the case of spss or stata, or in catalog files in sas), convert both date and datetime variables to datetime,\nand require you to specify the encoding, otherwise in python 3 you get bytes instead of strings.\n\nThis package corrects those problems.\n\n**1. Good Performance:** Here is a comparison of reading a 190 Mb sas7bdat file having 202 K rows\nby 70 columns with numeric, character and date-like columns using different methods. As you can see,\npyreadstat is the fastest for python and matches the speed of R Haven.\n\n| Method | time  |\n| :----- | :-----------------: |\n| Python 3 - sas7bdat | 6 min |\n| Python 3 - pandas | 42 s |\n| Python 3 - pyreadstat | 7 s  |\n| R - Haven | 7 s |\n\n**2. 
Reading Value Labels** Neither sas7bdat nor pandas.read_sas gives the possibility to read sas7bcat catalog files.\nPyreadstat can do that and also extract value labels from SPSS and STATA files.\n\n**3. Reading dates and datetimes** sas7bdat and pandas.read_sas convert both date and datetime variables into datetime.\nThat means if you have a date such as '01-01-2018' it will be transformed to '01-01-2018 00:00:00' (it always inserts a\ntime), making it impossible to know, looking only at the data, whether the variable was originally a datetime (if it had a\ntime) or not.\nPyreadstat transforms dates to dates and datetimes to datetimes, so that you have a better correspondence with the original\ndata. However, it is possible to keep the original pandas behavior and always get datetimes.\n\n**4. Encoding** On python 3, pandas.read_sas reads all strings as bytes. If you want strings you have to specify the encoding manually.\nPyreadstat reads strings as str. That is possible because readstat extracts the original encoding and translates it\nto utf-8, so that you don't have to care about that anymore. However, it is still possible to set the encoding manually.\n\nIn addition, pyreadstat exposes the variable labels in an easy way (see later). As pandas dataframes cannot handle value\nlabels, you as the user will have to decide whether to use those labels or not. Pandas read_sas reads those labels,\nbut in order to recover them you have to work a bit harder.\n\nCompared to R Haven, pyreadstat offers the possibility to read only the headers: sometimes you want to take a\nlook at many (sas) files looking for the datasets that contain\nsome specific columns, and you want to do it quickly. 
This package offers the possibility to read only the metadata, making\nvery fast metadata scraping possible (pandas read_sas can also do it if you pass iterator=True).\nIn addition it offers the capability to read sas7bcat files separately from the sas7bdat files.\n\nMore recently there has been a lot of interest from users in using pyreadstat to read SPSS sav files. After improvements\nin pyreadstat 1.0.3, some benchmarks are presented below. The small file is 200K rows x 100 columns (152 Mb)\ncontaining only numeric columns and\nthe big file is 294K rows x 666 columns (1.5 Gb). There are two versions of the big file: one containing numeric\ncolumns only and one with a mix of numeric and character. Pyreadstat gives two ways to read files: reading in\na single process using read_sav and reading in multiple processes using read_file_multiprocessing (see later\nin the readme for more information).\n\n| Method | small  | big numeric | big mixed |\n| :----- | :----: | :---------: | :-------: |\n| pyreadstat read_sav | 2.3 s | 28 s | 40 s |\n| pyreadstat read_file_multiprocessing | 0.8 s | 10 s | 21 s |\n\nAs you can see, performance degrades in pyreadstat when reading a table with both numeric and character types. This\nis because numpy and pandas do not have a native type for strings; they use a generic object type instead, which\nbrings a big performance hit. The situation can be improved, though, by reading files in multiple processes.\n\n\n## Dependencies\n\nThe module depends on pandas, which you normally have installed if you got Anaconda (highly recommended).\n\nIn order to compile from source you will need a C compiler (see installation).\nYou will need cython only if you want to change the cython source code (normally not necessary).\nIf you want to compile for python 2.7 or windows, you will need cython (see python 2.7 support\nlater).\n\nReadstat depends on the C library iconv to handle character encodings. 
On mac, the library is found on the system, but\nusers have sometimes reported problems. In those cases it may help to install libiconv with conda (see later, compilation\non mac). Readstat also depends on zlib; it was reported not to be installed by default on Lubuntu. If you face this problem, installing the\nlibrary solves it.\n\n## Installation\n\n### Using pip\n\nProbably the easiest way: from your conda, virtualenv or just base installation do:\n\n```\npip install pyreadstat\n```\n\nIf you are running on a machine without admin rights and you want to install against your base installation, you can do:\n\n```\npip install pyreadstat --user\n```\n\nAt the moment we offer pre-compiled wheels for windows, mac and\nlinux. Look at the [pypi webpage](https://pypi.org/project/pyreadstat/) to find out which python versions\nare currently supported. If there is no pre-compiled\nwheel available, pip will attempt to compile the source code.\n\n### Using conda\n\nThe package is also available in [conda-forge](https://anaconda.org/conda-forge/pyreadstat) for windows, mac and linux\n64 bit. Visit the conda-forge webpage to find out which python versions are currently supported.\n\nIn order to install:\n\n```\nconda install -c conda-forge pyreadstat\n```\n\n### From the latest sources\n\nDownload or clone the repo, open a command window and type:\n\n```\npython3 setup.py install\n```\n\nIf you don't have admin privileges to the machine (for example on Bee) do:\n\n```\npython3 setup.py install --user\n```\n\nYou can also install from the github repo directly (without cloning). 
Use the flag --user if necessary.\n\n```\npip install git+https://github.com/Roche/pyreadstat.git\n```\n\nYou need a working C compiler and cython \u003e=3.0.0.\n\n### Compiling on Windows and Mac\n\nCompiling on linux is very easy, but on windows you need some extra preparation.\nSome instructions are found [here](https://github.com/Roche/pyreadstat/blob/master/windows_compilation.md)\n\nCompiling on mac is usually easy. Readstat depends however on the C library iconv to handle character encodings; while\non linux it is part of glibc, on mac it is a separate shared library found on the system (the header file is in /usr/include and the shared\nlibrary in /usr/lib). While compiling against this usually works fine, some users have reported problems (for example\na missing symbol _iconv, or a libiconv version too old). In those cases it helped to install libiconv with conda:\n\n```\nconda install libiconv\n```\n\nand then recompile (be sure to delete any cache: if using pip, pass --no-cache-dir; if using setup.py, remove\nthe build folder; otherwise you may be installing the old compilation again).\n\n## Usage\n\n### Basic Usage\n\n#### Reading files\n\nPass the path to a file to any of the functions provided by pyreadstat. It will return a pandas data frame and a metadata\nobject. \u003cbr\u003e\nThe dataframe uses the column names. The metadata object contains the column names, column labels, number_rows,\nnumber_columns, file label\n(if any), file encoding (if applicable), notes and objects about value labels (if present). Be aware that file_label and\nfile_encoding may be None, not all columns may have labels, notes may not be present and there may be no value labels.\n\nFor example, in order to read a sas7bdat file:\n\n```python\nimport pyreadstat\n\ndf, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')\n\n# done! 
let's see what we got\nprint(df.head())\nprint(meta.column_names)\nprint(meta.column_labels)\nprint(meta.column_names_to_labels)\nprint(meta.number_rows)\nprint(meta.number_columns)\nprint(meta.file_label)\nprint(meta.file_encoding)\n# there are other metadata pieces extracted. See the documentation for more details.\n```\n\nYou can replace the column names by column labels very easily (but check first that all columns have distinct labels!):\n\n```python\n# replace column names with column labels\ndf.columns = meta.column_labels\n# to go back to column names\ndf.columns = meta.column_names\n```\n\n#### Writing files\n\nPyreadstat can write STATA (dta), SPSS (sav and zsav; por currently not supported) and SAS (Xport; sas7bdat and sas7bcat\ncurrently not supported) files from pandas data frames.\n\nWrite functions take as a first argument a pandas data frame (other data structures are not supported) and as a second argument\nthe path to the destination file. Optionally you can also pass a file label and a list with column labels.\n\n```python\nimport pandas as pd\nimport pyreadstat\n\ndf = pd.DataFrame([[1,2.0,\"a\"],[3,4.0,\"b\"]], columns=[\"v1\", \"v2\", \"v3\"])\n# column_labels can also be a dictionary with variable name as key and label as value\ncolumn_labels = [\"Variable 1\", \"Variable 2\", \"Variable 3\"]\npyreadstat.write_sav(df, \"path/to/destination.sav\", file_label=\"test\", column_labels=column_labels)\n```\n\nSome special arguments are available depending on the function. write_sav can also take notes as a string, whether to\ncompress as zsav or apply row compression, variable display widths and variable measures. write_dta can take a stata version,\nwrite_xport a name for the dataset. User defined missing values and value labels are also supported. 
See the\n[Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html) for more details.\n\nHere is an overview of all available functions.\nYou can also check the [Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html).\n\n| Function in this package | Purpose |\n| ------------------- | ----------- |\n| read_sas7bdat       | read SAS sas7bdat files |\n| read_xport          | read SAS Xport (XPT) files |\n| read_sas7bcat       | read SAS catalog files |\n| read_dta            | read STATA dta files |\n| read_sav            | read SPSS sav and zsav files  |\n| read_por            | read SPSS por files  |\n| set_catalog_to_sas  | enrich sas dataframe with catalog formats |\n| set_value_labels    | replace values by their labels |\n| read_file_in_chunks | generator to read files in chunks |\n| write_sav           | write SPSS sav and zsav files |\n| write_por           | write SPSS Portable (POR) files |\n| write_dta           | write STATA dta files |\n| write_xport         | write SAS Xport (XPT) files version 8 and 5 |\n\n\n### More reading options\n\n#### Reading only the headers\n\nAll functions accept a keyword argument \"metadataonly\" which by default is False. If True, then no data will be read,\nbut still both the metadata and the dataframe will be returned. The metadata will contain all fields as usual, but\nthe dataframe will be empty, although with the correct column names. Sometimes number_rows may be None if it was not\npossible to determine the number of rows without reading the data.\n\n```python\nimport pyreadstat\n\ndf, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', metadataonly=True)\n```\n\n#### Reading selected columns\n\nAll functions accept a keyword \"usecols\" which should be a list of column names. Only the columns whose names match those\nin the list will be imported (case sensitive). This decreases memory consumption and speeds up the process. 
Usecols must\nalways be a list, even if there is only one member.\n\n```python\nimport pyreadstat\n\ndf, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', usecols=[\"variable1\", \"variable2\"])\n\n```\n\n#### Reading files in parallel processes\n\nA challenge when reading large files is the time consumed in the operation. In order to alleviate this,\npyreadstat provides a function \"read\\_file\\_multiprocessing\" to read a file in parallel processes using\nthe python multiprocessing library. As it reads the whole file in one go, you need to have enough RAM for the operation. If\nthat is not the case, look at Reading rows in chunks (next section).\n\nSpeed-ups will depend on a number of factors such as the number of processes available, RAM,\ncontent of the file, etc.\n\n```python\nimport pyreadstat\n\nfpath = \"path/to/file.sav\"\ndf, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath, num_processes=4)\n```\n\nnum_processes is the number of workers and it defaults to 4 (or the number of cores if less than 4). You can play with it to see where you\nget the best performance. You can also get the number of all available workers like this:\n\n```python\nimport multiprocessing\nnum_processes = multiprocessing.cpu_count()\n```\n\n**Notes for Xport, Por and some defective SAV files not having the number of rows in the metadata**\n1. In all Xport, Por and some defective SAV files, the number of rows cannot be determined from the metadata. In such cases,\n   you can set the parameter num\\_rows to a number equal to or larger than the number of rows in the dataset. This number can be obtained\n   by reading the file without multiprocessing, reading it in another application, etc.\n\n**Notes for windows**\n\n1. For this to work you must include a __name__ == \"__main__\" section in your script. 
See [this issue](#85)\nfor more details.\n\n```python\nimport pyreadstat\n\nif __name__ == \"__main__\":\n    df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, 'sample.sav')\n```\n2. If you include too many workers or you run out of RAM you may get a message about not enough page file\nsize. See [this issue](#87)\n\n#### Reading rows in chunks\n\nReading large files with hundreds of thousands of rows can be challenging due to memory restrictions. In such cases, it may be helpful\nto read the files in chunks.\n\nEvery reading function has two arguments, row_limit and row_offset, that help achieve this. row_offset skips a number of rows before\nreading starts. row_limit stops reading after a number of rows. Combining both, you can read the file in chunks inside or outside a loop.\n\n```python\nimport pyreadstat\n\ndf, meta = pyreadstat.read_sas7bdat(\"/path/to/file.sas7bdat\", row_offset=1, row_limit=1)\n# df will contain only the second row of the file\n```\n\nPyreadstat also has a convenience function read_file_in_chunks, which returns a generator that helps you iterate through the file in\nchunks. This function takes as a first argument a pyreadstat reading function and as a second argument a path to a file. Optionally you can\nchange the size of the chunks with chunksize (default 100000), and also add an offset and limit. You can use any keyword argument\nyou wish to pass to the pyreadstat reading function.\n\n```python\nimport pyreadstat\nfpath = \"path/to/file.sas7bdat\"\nreader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, fpath, chunksize=10, offset=2, limit=100, disable_datetime_conversion=True)\n\nfor df, meta in reader:\n    print(df) # df will contain 10 rows, except for the last chunk\n    # do some cool calculations here for the chunk\n```\n\nFor very large files it may be convenient to speed up the process by reading each chunk in parallel. For\nthis purpose you can pass the argument multiprocess=True. 
This is a combination of read_file_in_chunks and\nread_file_multiprocessing. Here you can use the arguments row_offset and row_limit to start reading the\nfile from an offset and stop after row_offset+row_limit rows.\n\n```python\nimport pyreadstat\nfpath = \"path/to/file.sav\"\nreader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000, multiprocess=True, num_processes=4)\n\nfor df, meta in reader:\n    print(df) # df will contain 10000 rows, except for the last chunk\n    # do some cool calculations here for the chunk\n```\n\n**If using multiprocessing, please read the notes in the previous section regarding Xport, Por and some defective SAV files not\nhaving the number of rows in the metadata.**\n\n**For Windows, please check the notes in the previous section, Reading files in parallel processes.**\n\n#### Reading value labels\n\nFor sas7bdat files, value labels are stored in separate sas7bcat files. You can use them in combination with the sas7bdat files\nor read them separately.\n\nIf you want to read them in combination with the sas7bdat files, pass the path to the sas7bcat file to the read_sas7bdat\nfunction. The original values will be replaced by the values in the catalog.\n\n```python\nimport pyreadstat\n\n# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. 
There is also formats_as_ordered_category to get an ordered category; this is by default False.\ndf, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', catalog_file='/path/to/a/file.sas7bcat', formats_as_category=True, formats_as_ordered_category=False)\n```\n\nIf you prefer to read the sas7bcat file separately, you can apply the formats later with the function set_catalog_to_sas.\nIn this way you can have two copies of the dataframe, one with the catalog applied and one without.\n\n```python\nimport pyreadstat\n\n# this df will have the original values\ndf, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat')\n# read_sas7bcat returns an empty data frame and the catalog\ndf_empty, catalog = pyreadstat.read_sas7bcat('/path/to/a/file.sas7bcat')\n# enrich the dataframe with the catalog\n# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False, meaning by default categories are not ordered.\ndf_enriched, meta_enriched = pyreadstat.set_catalog_to_sas(df, meta, catalog, \n                             formats_as_category=True, formats_as_ordered_category=False)\n```\n\nFor SPSS and STATA files, the value labels are included in the files. You can choose to replace the values by the labels\nwhen reading the file using the option apply_value_formats, ...\n\n```python\nimport pyreadstat\n\n# apply_value_formats is by default False, so you have to set it to True manually if you want the labels\n# formats_as_category is by default True, and it means the replaced values will be transformed to a pandas category column. formats_as_ordered_category is by default False, meaning by default categories are not ordered.\ndf, meta = pyreadstat.read_sav(\"/path/to/sav/file.sav\", apply_value_formats=True, \n                                formats_as_category=True, formats_as_ordered_category=False)\n```\n\n... 
or to do it later with the function set_value_labels:\n\n```python\nimport pyreadstat\n\n# This time no value labels.\ndf, meta = pyreadstat.read_sav(\"/path/to/sav/file.sav\", apply_value_formats=False)\n# now let's add them to a second copy\ndf_enriched = pyreadstat.set_value_labels(df, meta, formats_as_category=True, formats_as_ordered_category=False)\n```\n\nInternally each variable is associated with a label set. This information is stored in meta.variable_to_label. Each\nlabel set contains a map of the actual value in the variable to the label; this information is stored in\nmeta.value_labels. By combining both you can get a dictionary of variable names to a dictionary of actual\nvalues to labels.\n\nFor SPSS and STATA:\n\n```python\nimport pyreadstat\n\ndf, meta = pyreadstat.read_sav(\"test_data/basic/sample.sav\")\n# the variables mylabl and myord are associated to the label sets labels0 and labels1 respectively\nprint(meta.variable_to_label)\n#{'mylabl': 'labels0', 'myord': 'labels1'}\n\n# labels0 and labels1 contain a dictionary of actual value to label\nprint(meta.value_labels)\n#{'labels0': {1.0: 'Male', 2.0: 'Female'}, 'labels1': {1.0: 'low', 2.0: 'medium', 3.0: 'high'}}\n\n# both things have been joined by pyreadstat for convenient use\nprint(meta.variable_value_labels)\n#{'mylabl': {1.0: 'Male', 2.0: 'Female'}, 'myord': {1.0: 'low', 2.0: 'medium', 3.0: 'high'}}\n\n```\n\nSAS is very similar, except that meta.variable_to_label comes from the sas7bdat file and meta.value_labels comes from the\nsas7bcat file. That means if you read a sas7bdat file and a sas7bcat file together, meta.variable_value_labels will be\nfilled in. If you read only the sas7bdat file, only meta.variable_to_label will be available, and if you read only the\nsas7bcat file, only meta.value_labels will be available. 
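When the two pieces are available separately, the join pyreadstat performs for variable_value_labels can be reproduced with plain dictionaries. A minimal sketch, using illustrative dictionaries that mirror the sample output shown above:

```python
# illustrative dictionaries mirroring meta.variable_to_label (from the data file)
# and meta.value_labels (from the label sets / catalog)
variable_to_label = {'mylabl': 'labels0', 'myord': 'labels1'}
value_labels = {'labels0': {1.0: 'Male', 2.0: 'Female'},
                'labels1': {1.0: 'low', 2.0: 'medium', 3.0: 'high'}}

# join them: variable name -> {value: label}, the same shape as meta.variable_value_labels
variable_value_labels = {var: value_labels[label_set]
                         for var, label_set in variable_to_label.items()
                         if label_set in value_labels}
print(variable_value_labels)
# {'mylabl': {1.0: 'Male', 2.0: 'Female'}, 'myord': {1.0: 'low', 2.0: 'medium', 3.0: 'high'}}
```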
If you read a sas7bdat file and there are no associated label\nsets, SAS will by default assign the variable format as the label set.\n\n```python\nimport pyreadstat\n\ndf, meta = pyreadstat.read_sas7bdat(\"test_data/sas_catalog/test_data_linux.sas7bdat\")\nmeta.variable_to_label\n{'SEXA': '$A', 'SEXB': '$B'}\n\ndf2, meta2 = pyreadstat.read_sas7bcat(\"test_data/sas_catalog/test_formats_linux.sas7bcat\")\nmeta2.value_labels\n{'$A': {'1': 'Male', '2': 'Female'}, '$B': {'2': 'Female', '1': 'Male'}}\n\n```\n\n\n#### Missing Values\n\nThere are two types of missing values: system and user defined. System missing values are assigned by the program by default. User defined missing values are\nvalid values that the user decided to give the meaning of missing in order to differentiate between several situations. For\nexample, if one has a categorical variable representing if the person passed a test, you could have 0 for did not pass,\n1 for pass, and as user defined missing values 2 for did not show up for the test, 3 for unable to process the results,\netc.\n\n**By default both cases are represented by NaN when\nread with pyreadstat**. Notice that the only possible missing value in pandas is NaN (Not a Number) for both string and numeric\nvariables; date, datetime and time variables have NaT (Not a Time).\n\n##### SPSS\n\nIn the case of SPSS sav files, the user can assign to a numeric variable either up to three discrete missing values or\none range plus one discrete missing value. 
As mentioned, by default all of these possibilities are translated into NaN,\nbut one can get those original values by passing the argument user_missing=True to the read_sav function:\n\n```python\n# user set with default missing values\nimport pyreadstat\ndf, meta = pyreadstat.read_sav(\"/path/to/file.sav\")\nprint(df)\n\u003e\u003e test_passed\n   1\n   0\n   NaN\n   NaN\n```\n\nNow, reading the user defined missing values:\n\n```python\n# user set with user defined missing values\nimport pyreadstat\ndf, meta = pyreadstat.read_sav(\"/path/to/file.sav\", user_missing=True)\nprint(df)\n\u003e\u003e test_passed\n   1\n   0\n   2\n   3\n```\n\nAs you can see, now instead of NaN the values 2 and 3 appear. In case the dataset had value labels, we could bring those in:\n```python\n# user set with user defined missing values and labels\nimport pyreadstat\ndf, meta = pyreadstat.read_sav(\"/path/to/file.sav\", user_missing=True, apply_value_formats=True)\nprint(df)\n\u003e\u003e test_passed\n   \"passed\"\n   \"not passed\"\n   \"not shown\"\n   \"not processed\"\n```\n\nFinally, the information about which values are user missing is stored in the meta object, in the variable missing_ranges.\nThis is a dictionary with the key being the name of the variable, and as value a list of dictionaries; each dictionary\ncontains the elements \"lo\" and \"hi\" representing the lower and upper bound of the range. For discrete values,\nas in the example, both boundaries are present although the value is the same in both cases.\n\n```python\n# user set with user defined missing values\nimport pyreadstat\ndf, meta = pyreadstat.read_sav(\"/path/to/file.sav\", user_missing=True, apply_value_formats=True)\nprint(meta.missing_ranges)\n\u003e\u003e\u003e {'test_passed':[{'hi':2, 'lo':2}, {'hi':3, 'lo':3}]}\n```\n\nSPSS sav files also support up to 3 discrete user defined missing values for non numeric (character) variables.\nPyreadstat is able to read those and the behavior is the same as for 
discrete\nnumerical user defined missing values. That means those values will be\ntranslated as NaN by default and to the corresponding string value if\nuser_missing is set to True. meta.missing_ranges will show the string\nvalue as well.\n\nIf the value in\na character variable is an empty string (''), it will not be translated to NaN, but will stay as an empty string. This\nis because the empty string is a valid character value in SPSS and pyreadstat preserves that property. You can convert\nempty strings to nan very easily with pandas if you think it is appropriate\nfor your dataset.\n\n\n##### SAS and STATA\n\nIn SAS the user can assign values from .A to .Z and ._ as user defined missing values. In Stata, values from\n.a to .z can be used. As in SPSS, those are normally translated to\nNaN. However, using user_missing=True with read_sas7bdat or read_dta\nwill produce values from A to Z and _ for SAS and a to z for dta. In addition, a variable\nmissing_user_values will appear in the metadata object, a list of the values that are user defined missing\nvalues.\n\n```python\nimport pyreadstat\n\ndf, meta = pyreadstat.read_sas7bdat(\"/path/to/file.sas7bdat\", user_missing=True)\n\ndf, meta = pyreadstat.read_dta(\"/path/to/file.dta\", user_missing=True)\n\nprint(meta.missing_user_values)\n\n```\n\nThe user may also assign a label to user defined missing values. 
In such a case, passing the corresponding sas7bcat file to read_sas7bdat or using
the option apply_value_formats with read_dta will show those labels instead
of the user defined missing value.

```python
import pyreadstat

df, meta = pyreadstat.read_sas7bdat("/path/to/file.sas7bdat", catalog_file="/path/to/file.sas7bcat", user_missing=True)

df, meta = pyreadstat.read_dta("/path/to/file.dta", user_missing=True, apply_value_formats=True)
```

Empty strings are still translated as empty strings and not as NaN.

The information about which values are user missing is stored in the meta object, in the variable missing_user_values,
a list of all user defined missing values.

User defined missing values are currently not supported for file types other than sas7bdat,
sas7bcat and dta.

#### Reading datetime and date columns

SAS, SPSS and STATA represent datetimes, dates and other similar concepts as numeric columns and then apply a
display format on top. Roughly speaking, internally there are two possible representations: one for concepts with
a granularity of a day or coarser (date, week, quarter, year, etc.) and one for concepts with a granularity finer
than a day (datetime, time, hour, etc.). The first group can be converted to python date objects and the second to
python datetime objects.

Pyreadstat attempts to automatically read columns with datetime, date and time formats that are convertible
to python datetime, date and time objects. However, there are other formats that are not fully convertible to
any of these, for example SAS "YEAR" (displaying only the year) or "MMYY" (displaying only month and year).
Because there are too many of these formats and they keep changing, it is not possible to implement a rule for each of
them; such columns are therefore not transformed and the user will obtain a numeric column.
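As an illustration of what such an untransformed column contains: SAS stores dates as a count of days since the epoch 1960-01-01, so raw numeric date values can also be shifted manually. The sketch below relies only on that convention; the helper name and sample values are illustrative and not part of pyreadstat's API.

```python
# Minimal sketch: manually converting raw SAS date numbers (days since the
# SAS epoch, 1960-01-01) into python date objects. The helper name and the
# sample values are made up for illustration.
import datetime

SAS_DATE_EPOCH = datetime.date(1960, 1, 1)

def sas_days_to_date(days):
    """Convert a raw SAS date count to a datetime.date; keep NaN as None."""
    if days != days:  # NaN is the only value not equal to itself
        return None
    return SAS_DATE_EPOCH + datetime.timedelta(days=int(days))

print(sas_days_to_date(0))    # day zero is the epoch itself: 1960-01-01
print(sas_days_to_date(366))  # 1960 is a leap year, so this is 1961-01-01
```

The same idea applies to SAS datetimes (seconds rather than days since the epoch); SPSS and Stata use different epochs, so the offset would need to be adapted.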
In order to cope with this issue, each reader function offers two options, extra\_datetime\_formats and
extra\_date\_formats, that allow the user to pass such datetime or date formats so that the numeric values are
transformed into python datetime or date objects. The user can then format those columns appropriately, for example
extracting only the year into an integer column in the case of 'YEAR', or formatting to a string 'YYYY-MM' in the
case of 'MMYY'. The choice between the datetime or the date option depends on the granularity of the data as
explained above.

These arguments are also useful in case you have a valid datetime, date or time format that is currently not recognized by pyreadstat.
In those cases, feel free to file an issue asking for it to be added to the list; in the meantime you can use these arguments to do
the conversion.

```python
import pyreadstat

df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', extra_date_formats=["YEAR", "MMYY"])
```

#### Other options

You can set the encoding of the original file manually. The encoding must be an [iconv-compatible encoding](https://gist.github.com/hakre/4188459).
This is absolutely necessary if you are handling old xport files with
non-ascii characters: those files do not have the encoding stamped in the
file itself, therefore the encoding must be set manually.
For SPSS POR files it is not possible to set the encoding; those files are assumed to always be encoded in UTF-8.


```python
import pyreadstat

df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', encoding="LATIN1")
```

You can preserve the original pandas behavior regarding dates (meaning dates are converted to pandas datetime) with the
dates_as_pandas_datetime option:

```python
import pyreadstat

df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', dates_as_pandas_datetime=True)
```

You can get a dictionary of numpy arrays instead of a pandas dataframe when reading any file format.
In order to do that, set the parameter output_format='dict' (the default is 'pandas'). This is useful if
you want to transform the data into some format other than pandas, as transforming the data into pandas is a costly
process in terms of both speed and memory. Here is, for example, an efficient way to transform the data into a polars dataframe:

```python
import pyreadstat
import polars

dicdata, meta = pyreadstat.read_sav('/path/to/a/file.sav', output_format='dict')
df = polars.DataFrame(dicdata)
```

For more information, please check the [Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html).

### More writing options

#### File specific options

Some special arguments are available depending on the function. write_sav can also take notes as a string, whether to
compress as zsav or to apply row compression, variable display widths and variable measures. write_dta can take a stata version.
write_xport can take a name for the dataset. See the
[Module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html) for more details.

#### Writing value labels

The argument variable_value_labels can be passed to write_sav and write_dta to write value labels.
This argument must be a
dictionary where the keys are variable names (they must match column names in the pandas data frame). Each value is another
dictionary where the keys are the values present in the dataframe and the values are the labels (strings).

```python
import pandas as pd
import pyreadstat
df = pd.DataFrame([[1,1],[2,2],[1,3]], columns=['mylabl', 'myord'])
variable_value_labels = {'mylabl': {1: 'Male', 2: 'Female'}, 'myord': {1: 'low', 2: 'medium', 3: 'high'}}
path = "/path/to/somefile.sav"
pyreadstat.write_sav(df, path, variable_value_labels=variable_value_labels)
```

#### Writing user defined missing values

##### SPSS

The argument missing_ranges can be passed to write_sav to write user defined missing values.
This argument must be a dictionary whose keys are variable names matching variable
names in the dataframe. Each value must be a list. Each element in that list can be
either a discrete numeric or string value (max 3 per variable) or a dictionary with keys 'hi' and 'lo' to
indicate the upper and lower bound of a range for numeric values (max 1 range value + 1 discrete value per
variable). hi and lo may also be the same value, in which case it will be interpreted as a discrete
missing value. For this to be effective, the values in the dataframe must be the ones reported here and not NaN.

```python
import pandas as pd
import pyreadstat
df = pd.DataFrame([["a",1],["c",2],["c",3]], columns=['mychar', 'myord'])
missing_ranges = {'mychar':['a'], 'myord': [{'hi':2, 'lo':1}]}
path = "/path/to/somefile.sav"
pyreadstat.write_sav(df, path, missing_ranges=missing_ranges)
```

##### STATA

The argument missing_user_values can be passed to write_dta to write user defined missing values, for numeric variables only.
This argument must be a dictionary whose keys are variable names matching variable
names in the dataframe. Each value must be a list of missing values; valid values are single character strings
between a and z.
Optionally, a value label can also be attached to those missing values using variable_value_labels.

```python
import pandas as pd
import pyreadstat
df = pd.DataFrame([["a", 1],[2.2, 2],[3.3, "b"]], columns=['Var1', 'Var2'])
variable_value_labels = {'Var1': {'a': 'a missing value'}}
missing_user_values = {'Var1': ['a'], 'Var2': ['b']}
path = "/path/to/somefile.dta"
pyreadstat.write_dta(df, path, missing_user_values=missing_user_values, variable_value_labels=variable_value_labels)
```

#### Setting variable formats

Numeric types in SPSS, SAS and STATA can have formats that affect how those values are displayed to the user
in the application. Pyreadstat automatically sets the formatting in some cases, for example when translating
dates or datetimes (which in SPSS/SAS/STATA are just numbers with a special format). The user can however specify custom formats
for their columns with the argument "variable_format", which is
a dictionary with the column names as keys and format strings as values:

```python
import pandas as pd
import pyreadstat

path = "path/where/to/write/file.sav"
df = pd.DataFrame({'restricted':[1023, 10], 'integer':[1,2]})
formats = {'restricted':'N4', 'integer':'F1.0'}
pyreadstat.write_sav(df, path, variable_format=formats)
```

The appropriate formats to use are beyond the scope of this documentation. You probably want to read a file
produced in the original application and use meta.original_value_formats to get the formats. Otherwise, look
for the documentation of the original application.

##### SPSS

In the case of SPSS there are presets for some formats:
* restricted_integer: with leading zeros, equivalent to N + variable width (e.g. N4)
* integer: numeric with no decimal places, equivalent to F + variable width + ".0" (0 decimal positions).
A pandas column of type integer will also be translated into this format automatically.

```python
import pandas as pd
import pyreadstat

path = "path/where/to/write/file.sav"
df = pd.DataFrame({'restricted':[1023, 10], 'integer':[1,2]})
formats = {'restricted':'restricted_integer', 'integer':'integer'}
pyreadstat.write_sav(df, path, variable_format=formats)
```

There is some information about the possible formats [here](https://www.gnu.org/software/pspp/pspp-dev/html_node/Variable-Record.html).

#### Variable type conversion

The following rules are used in order to convert from pandas/numpy/python types to the target file types:

| Python Type              | Converted Type                             |
| ------------------------ | ------------------------------------------ |
| np.int32 or lower        | integer (stata), numeric (spss, sas)       |
| int, np.int64, np.float64 | double (stata), numeric (spss, sas)       |
| str                      | character                                  |
| bool                     | integer (stata), numeric (spss, sas)       |
| datetime, date, time     | numeric with datetime/date/time formatting |
| category                 | depends on the original dtype              |
| any other object         | character                                  |
| column all missing       | integer (stata), numeric (spss, sas)       |
| column with mixed types  | character                                  |

Columns with mixed types are translated to character. This does not apply to columns
containing np.nan, where the missing values are correctly translated. It also does not apply to columns with
user defined missing values in stata/sas, where characters (a to z, A to Z, \_) will be recorded as numeric.

## Roadmap

* Include latest releases from Readstat as they come out.

## CD/CI and Wheels

A CD/CI pipeline producing the wheels is available [here](https://github.com/ofajardo/pyreadstat_wheels4). Contributions
are welcome.

## Known limitations

pyreadstat builds on top of Readstat and therefore inherits its limitations. Currently those include:

* Cannot write SAS sas7bdat.
Such files can be written but not read back in SAS itself, and therefore writing them is not supported
in pyreadstat (see [here](https://github.com/WizardMac/ReadStat/issues/98)).

Converting data types from foreign applications into python sometimes also brings some limitations:

* Pyreadstat transforms date, datetime and time like variables, which are internally represented in the original application as
 numbers, into python datetime objects. Python datetime objects are however limited in the range of dates they can represent
 (the maximum year is 9999), while in other applications it is possible (although probably an error in the data)
 to have very high or very low dates. In these cases pyreadstat raises an error:

 ```
 OverflowError: date value out of range
 ```

  Workarounds include using the keyword argument disable_datetime_conversion, so that you get
  numbers instead of datetime objects, or skipping such columns with the argument usecols.

## Python 2.7 support

As of version 1.2.3, Python 2.7 is no longer supported. In previous versions it was possible to compile for
mac and linux but not for windows, and no wheels were provided. On linux and mac it will fail if
the file path contains non-ascii characters.

## Change log

A log with the changes for each version can be found [here](https://github.com/Roche/pyreadstat/blob/master/change_log.md).


## License

pyreadstat is distributed under the Apache 2.0 license. Readstat is distributed under the MIT license. See the License file for
more information.


## Contributing

Contributions are welcome! Those include corrections to the documentation, bug reports, testing,
and of course code pull requests.
For code pull requests please
consider opening an issue explaining what you plan to do, so that we can get aligned before you start investing time on
it (this also avoids duplicated efforts).

The ReadStat code in this repo (under the subfolder src) comes from the main ReadStat trunk and should not be
modified, in order to keep full compatibility with the original. That way, improvements in ReadStat can be taken here with almost
no effort. If you would like to propose new features involving changes in the ReadStat code, please submit a
pull request to ReadStat first.

## People

[Otto Fajardo](https://github.com/ofajardo) - author, maintainer

[Matthew Brett](http://matthew.dynevor.org/) - contributor: [python wheels](https://github.com/MacPython/pyreadstat-wheels)

[Jonathon Love](https://jona.thon.love/) - contributor: open files with international characters. Function to open files for writing.

[Clemens Brunner](https://github.com/cbrnr) - integration with pandas.read_spss

[Thomas Grainger](https://github.com/graingert) - corrections and suggestions to source code

[benjello](https://github.com/benjello), [maxwell8888](https://github.com/maxwell8888), [drcjar](https://github.com/drcjar), [labenech](https://github.com/labenech): improvements to documentation

[alchemyst](https://github.com/alchemyst): improvements to docstrings

[bmwiedemann](https://github.com/bmwiedemann), [toddrme2178](https://github.com/toddrme2178), [Martin Thorsen Ranang](https://github.com/mtr): improvements to source code