{"id":13423434,"url":"https://github.com/ofajardo/pyreadr","last_synced_at":"2025-05-15T01:04:31.882Z","repository":{"id":38364469,"uuid":"163268635","full_name":"ofajardo/pyreadr","owner":"ofajardo","description":"Python package to read and write R RData and Rds files into/from pandas dataframes. No R or other external dependencies required.","archived":false,"fork":false,"pushed_at":"2025-03-04T08:59:54.000Z","size":9083,"stargazers_count":308,"open_issues_count":11,"forks_count":25,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-12T03:14:16.727Z","etag":null,"topics":["pandas-dataframe","python","r","rdata","rds","rds-files"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ofajardo.png","metadata":{"files":{"readme":"README.md","changelog":"change_log.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-12-27T08:29:04.000Z","updated_at":"2025-04-01T20:10:09.000Z","dependencies_parsed_at":"2024-07-29T17:11:21.793Z","dependency_job_id":"a8fdd7a5-5230-487d-b8a4-3f01a76a3418","html_url":"https://github.com/ofajardo/pyreadr","commit_stats":{"total_commits":162,"total_committers":8,"mean_commits":20.25,"dds":"0.22839506172839508","last_synced_commit":"c7152ff67aa517d313c26addbe699c066b6a2752"},"previous_names":[],"tags_count":38,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofajardo%2Fpyreadr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofajardo%2Fpyreadr/tags","releases_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofajardo%2Fpyreadr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofajardo%2Fpyreadr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ofajardo","download_url":"https://codeload.github.com/ofajardo/pyreadr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248785676,"owners_count":21161333,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pandas-dataframe","python","r","rdata","rds","rds-files"],"created_at":"2024-07-31T00:00:34.461Z","updated_at":"2025-04-13T21:30:17.556Z","avatar_url":"https://github.com/ofajardo.png","language":"C","funding_links":[],"categories":["C"],"sub_categories":[],"readme":"# py\u003cspan style=\"color:blue\"\u003er\u003c/span\u003eead\u003cspan style=\"color:blue\"\u003er\u003c/span\u003e\n\nA python package to read and write R RData and Rds files into/from \npandas dataframes. It does not need to have R or other external\ndependencies installed.\n\u003cbr\u003e \n\n**It can read mainly R data frames and tibbles. Also supports vectors, matrices, arrays and tables.\nR lists and R S4 objects (such as those from Bioconductor) are not supported. 
Please read the\nKnown limitations section and the section on what objects can be read for more information.**\n\u003cbr\u003e\n\nThis package is based on the [librdata](https://github.com/WizardMac/librdata) C library by \n[Evan Miller](https://www.evanmiller.org/) and a modified version of the cython wrapper around \nlibrdata\n[jamovi-readstat](https://github.com/jamovi/jamovi-readstat)\nby the [Jamovi](https://www.jamovi.org/) team.\n\nDetailed documentation on all available methods is in the \n[Module documentation](https://ofajardo.github.io/pyreadr/)\n\nIf you would like to read SPSS, SAS or STATA files into python in an easy way,\ntake a look at [pyreadstat](https://github.com/Roche/pyreadstat), a wrapper\naround the C library [ReadStat](https://github.com/WizardMac/ReadStat).\n\nIf you would like to effortlessly produce beautiful summary tables from pandas \ndataframes, take a look at [pysummaries](https://github.com/Genentech/pysummaries)\n\n## Table of Contents\n\n- [Dependencies](#dependencies)\n- [Installation](#installation)\n  * [Using pip](#using-pip)\n  * [Using conda](#using-conda)\n  * [From the latest sources](#from-the-latest-sources)\n- [Usage](#usage)\n  * [Basic Usage: reading files](#basic-usage--reading-files)\n  * [Basic Usage: writing files](#basic-usage--writing-files)\n  * [Reading files from internet](#reading-files-from-internet)\n  * [Reading selected objects](#reading-selected-objects)\n  * [List objects and column names](#list-objects-and-column-names)\n  * [Reading timestamps and timezones](#reading-timestamps-and-timezones)\n  * [What objects can be read and written](#what-objects-can-be-read-and-written)\n  * [More on writing files](#more-on-writing-files)\n- [Known limitations](#known-limitations)\n- [Contributing](#contributing)\n- [Change Log](#change-log)\n- [People](#people)\n\n## Dependencies\n\nThe package depends on pandas, which you normally have installed if you use Anaconda (highly recommended). 
If creating\na new conda or virtual environment or if you don't have it in your base installation, pandas should get installed automatically.\n\nIf you are reading 3D arrays, you will need to install xarray manually. This is not installed automatically as most users\nwon't need it.\n\nIn order to compile from source, you will need a C compiler (see installation) and cython \n(version \u003e= 0.28).\n\nlibrdata also depends on zlib, bzip2 and lzma; these were reported not to be installed on Lubuntu or docker base ubuntu\nimages. If you face this problem, installing the libraries solves it.\n\n## Installation\n\n### Using pip\n\nProbably the easiest way: from your conda, virtualenv or just base installation do:\n\n```\npip install pyreadr\n```\n\nIf you are running on a machine without admin rights, and you want to install against your base installation you can do:\n\n```\npip install pyreadr --user\n```\n\nWe offer pre-compiled wheels for Windows,\nLinux and macOS.\n\n### Using conda\n\nThe package is also available in [conda-forge](https://anaconda.org/conda-forge/pyreadr) \nfor Windows, Mac and Linux 64 bit.\n\nIn order to install:\n\n```\nconda install -c conda-forge pyreadr \n```\n\n### From the latest sources\n\nDownload or clone the repo, open a command window and type:\n\n```\npython3 setup.py install\n```\n\nIf you don't have admin privileges to the machine do:\n\n```\npython3 setup.py install --user\n```\n\nYou can also install from the github repo directly (without cloning). Use the flag --user if necessary.\n\n```\npip install git+https://github.com/ofajardo/pyreadr.git\n```\n\nYou need a working C compiler and cython. 
You may also need to install bzlib (on ubuntu install libbz2-dev).\n\nIn order to run the tests:\n\n```\npython tests/test_basic.py\n``` \n\nYou can also install and test in place with:\n\n```\npython setup.py build_ext --inplace\npython tests/test_basic.py --inplace\n```\n\n## Usage\n\n### Basic Usage: reading files\n\nPass the path to an RData or Rds file to the function read_r. It will return a dictionary \nwith object names as keys and pandas data frames as values.\n\nFor example, in order to read an RData file:\n\n```python\nimport pyreadr\n\nresult = pyreadr.read_r('test_data/basic/two.RData')\n\n# done! let's see what we got\nprint(result.keys()) # let's check what objects we got\ndf1 = result[\"df1\"] # extract the pandas data frame for object df1\n```\n\nReading an Rds file is equally simple. Rds files have one single object, \nwhich you can access with the key None:\n\n```python\nimport pyreadr\n\nresult = pyreadr.read_r('test_data/basic/one.Rds')\n\n# done! let's see what we got\nprint(result.keys()) # let's check what objects we got: there is only None\ndf1 = result[None] # extract the pandas data frame for the only object available\n```\n\nHere is a list of all available functions. \nYou can also check the [Module documentation](https://ofajardo.github.io/pyreadr/).\n\n| Function in this package | Purpose |\n| ------------------- | ----------- |\n| read_r        | reads RData and Rds files |\n| list_objects  | lists objects and column names contained in an RData or Rds file |\n| download_file | downloads a file from the internet |\n| write_rdata   | writes RData files |\n| write_rds     | writes Rds files   |\n\n### Basic Usage: writing files\n\nPyreadr allows you to write one single pandas data frame into a single R dataframe\nand store it into an RData or Rds file. Other python or R object types \nare not supported. 
Writing more than one object is not supported.\n\n\n```python\nimport pyreadr\nimport pandas as pd\n\n# prepare a pandas dataframe\ndf = pd.DataFrame([[\"a\",1],[\"b\",2]], columns=[\"A\", \"B\"])\n\n# let's write into RData\n# df_name is the name for the dataframe in R, by default dataset\npyreadr.write_rdata(\"test.RData\", df, df_name=\"dataset\")\n\n# now let's write a Rds\npyreadr.write_rds(\"test.Rds\", df)\n\n# done!\n\n```\n\nNow you can check the result in R:\n\n```r\nload(\"test.RData\")\nprint(dataset)\n\ndataset2 \u003c- readRDS(\"test.Rds\")\nprint(dataset2)\n\n```\n\nBy default the resulting files will be uncompressed. You can activate gzip compression\nby passing the option compress=\"gzip\". This is useful in case you have big files.\n\n\n```python\nimport pyreadr\nimport pandas as pd\n\n# prepare a pandas dataframe\ndf = pd.DataFrame([[\"a\",1],[\"b\",2]], columns=[\"A\", \"B\"])\n\n# write a compressed RData file\npyreadr.write_rdata(\"test.RData\", df, df_name=\"dataset\", compress=\"gzip\")\n\n# write a compressed Rds file\npyreadr.write_rds(\"test.Rds\", df, compress=\"gzip\")\n\n```\n\n### Reading files from internet\n\nLibrdata, the C backend of pyreadr, absolutely needs a file on disk and only a string with the path\ncan be passed as argument; therefore you cannot pass a URL to pyreadr.read_r. 
\n\nIn order to help with this limitation, pyreadr provides a function download_file which, as its name\nsuggests, downloads a file from a URL to disk:\n\n```python\nimport pyreadr\n\nurl = \"https://github.com/hadley/nycflights13/blob/master/data/airlines.rda?raw=true\"\ndst_path = \"/some/path/on/disk/airlines.rda\"\ndst_path_again = pyreadr.download_file(url, dst_path)\nres = pyreadr.read_r(dst_path)\n```\n\nAs you can see, download_file returns the path where the file was written, so you can pass it\nto pyreadr.read_r directly:\n\n```python\nimport pyreadr\n\nurl = \"https://github.com/hadley/nycflights13/blob/master/data/airlines.rda?raw=true\"\ndst_path = \"/some/path/on/disk/airlines.rda\"\nres = pyreadr.read_r(pyreadr.download_file(url, dst_path))\n```\n\n\n### Reading selected objects\n\nYou can use the argument use_objects of the function read_r to specify which objects\nshould be read. \n\n```python\nimport pyreadr\n\nresult = pyreadr.read_r('test_data/basic/two.RData', use_objects=[\"df1\"])\n\n# done! let's see what we got\nprint(result.keys()) # let's check what objects we got, now only df1 is listed\ndf1 = result[\"df1\"] # extract the pandas data frame for object df1\n```\n\n### List objects and column names\n\nThe function list_objects gives a dictionary with object names contained in the\nRData or Rds file as keys and a list of column names as values.\nIt is not always possible to retrieve column names without reading the whole file;\nin those cases you would get None instead of a column name.\n\n```python\n\nimport pyreadr\n\nobject_list = pyreadr.list_objects('test_data/basic/two.RData')\n\n# done! 
let's see what we got\nprint(object_list) # let's check what objects we got and what columns those have\n\n```\n\n### Reading timestamps and timezones\n\nR Date objects are read as datetime.date objects.\n\nR datetime objects (POSIXct and POSIXlt) are internally stored as UTC timestamps, and may have additional timezone\ninformation if the user set it explicitly. If no timezone information\nwas set by the user, R uses the local timezone for display. \n\nlibrdata cannot retrieve that timezone information, therefore pyreadr displays UTC time by default, which will not match the\ndisplay in R. You can explicitly set a timezone (your local timezone for example) with the argument timezone of the\nfunction read_r:\n\n```python\nimport pyreadr\n\nresult = pyreadr.read_r('test_data/basic/two.RData', timezone='CET')\n\n```\n\nIf you would like to just use your local timezone as R does, you can \nget it with tzlocal (you need to install it first with pip) and pass the \ninformation to read_r:\n\n```python\n\nimport tzlocal\nimport pyreadr\n\n# get_localzone_name() exists in recent tzlocal releases; older ones exposed tzlocal.get_localzone().zone\nmy_timezone = tzlocal.get_localzone_name()\nresult = pyreadr.read_r('test_data/basic/two.RData', timezone=my_timezone)\n\n```\n\nIf you have control over the data in R, a good option to avoid all of this is to transform\nthe POSIX object to character, then transform it to a datetime in python.\n\nWhen writing these kinds of objects pyreadr transforms them to characters. Those can be easily\ntransformed back to POSIX with as.POSIXct/lt (see later).\n\n### What objects can be read and written\n\nData frames composed of character, numeric (double), integer, timestamp (POSIXct \nand POSIXlt), date, logical atomic vectors. Factors are also supported.\n\nTibbles are also supported.\n\nAtomic vectors as described before can also be directly read and are \ntranslated to a pandas data frame with one column. 
\n\nMatrices, arrays and tables are also read and translated to pandas data frames\n(because those objects in R can be named, and plain numpy arrays do not support\ndimension names). The only exception is 3D arrays, which are translated to an\nxarray DataArray (as pandas does not support more than 2 dimensions). This is also\nthe only time that an object different from a pandas dataframe is returned by read_r.\n\nFor 3D arrays, consider that python prints these differently than R does, but\nyou are still looking at the same array (see for example [here](https://rstudio.github.io/reticulate/articles/arrays.html#displaying-arrays) for an explanation).\n\nOnly single pandas data frames can be written into R data frames.\n\nLists and S4 objects (such as those coming from Bioconductor) are not supported. Please read the Known limitations section for more\ninformation.\n\n### More on writing files\n\nFor converting python/numpy types to R types the following rules are\nfollowed:\n\n| Python Type         | R Type    |\n| ------------------- | --------- |\n| np.int32 or lower   | integer   |\n| np.int64, np.float  | numeric   |\n| str                 | character |\n| bool                | logical   |\n| datetime, date      | character |\n| category            | depends on the original dtype |\n| any other object    | character |\n| column all missing  | logical   |\n| column with mixed types | character |\n\n\n* datetime and date objects are translated to character to avoid problems\nwith timezones. These characters can be easily translated back to POSIXct/lt in R\nusing as.POSIXct/lt. The format of the datetimes/dates is prepared for this\nbut can be controlled with the arguments dateformat and datetimeformat \nfor write_rdata and write_rds. Those arguments take python standard\nformatting strings.\n\n* Pandas categories are NOT translated to R factors. Instead the original\ndata type of the category is preserved and transformed according to the\nrules. 
This is because R factors are integers and levels are always\nstrings; in pandas, factors can be any type and levels any type as well, therefore\nit is not always adequate to coerce everything to the integer/character system.\nOn the other hand, pandas category level information is lost in the process.\n\n* Any other object is transformed to a character using the str representation\nof the object.\n\n* Columns with mixed types are translated to character. This does not apply to columns\ncontaining np.nan, where the missing values are correctly translated.\n\n* R integers are 32 bit. Therefore python 64 bit integers have to be \npromoted to numeric in order to fit.\n\n* A pandas column containing only missing values is transformed to logical,\nfollowing R's behavior.\n\n* librdata writes numeric missing values as NaN instead of NA. In pandas we only have np.nan both as \nNaN and missing value representation, and it will always be written as NaN in R.\n\n## Known limitations\n\n* POSIXct and POSIXlt objects in R are stored internally as UTC timestamps and may have\nin addition time zone information. librdata does not return time zone information and\ntherefore the display of the timestamps in R and in pandas may differ.\n\n* Librdata reads arrays with a maximum of 3 dimensions. If more dimensions are present\nyou will get an error. Please submit an issue if this is the case. \n\n* **Lists are not read**.\n\n* **S4 Objects and probably other kinds of objects, including those that depend on non base R packages (Bioconductor for example) cannot be read.**\n The error code in this case is as follows:\n\n```python\n\"pyreadr.custom_errors.LibrdataError: The file contains an unrecognized object\"\n```\n\n* Data frames with special values like arrays, matrices and other data frames\nare not supported.\n\n* librdata first de-compresses the file in memory and then extracts the\ndata. That means you need more free RAM than the decompressed file occupies\nin memory. 
RData and Rds files are highly compressed: they can easily occupy\n40 or even more times as much space in memory as on disk. Take it into\naccount in case you get an \"Unable to allocate memory\" error (see [this](https://github.com/ofajardo/pyreadr/issues/3) )\n\n* When writing, numeric missing values are translated\nto NaN instead of NA.\n\n* Writing rownames is currently not supported.\n\n* Writing is supported only for a single pandas data frame to a single\nR data frame. Other data types are not supported. Multiple data frames\nfor rdata files are not supported.\n\n* RData and Rds files produced by R are (by default) compressed. Files produced\nby pyreadr are not compressed by default and therefore pretty bulky in comparison. You\ncan pass the option compress=\"gzip\" to write_rds or write_rdata in order to activate \ngzip compression.\n\n* Pyreadr writing is a relatively slow operation\ncompared to doing it in R.\n\n* Cannot read RData or rds files in encodings other than utf-8.\n\nSolutions to some of these limitations have been proposed in the upstream librdata [issues](https://github.com/WizardMac/librdata/issues) (points 1-4 are addressed by issue 12, point 5 by issue 16 and point 7 by issue 17). However there is no guarantee that these changes will be made and there are no timelines either. If you think it would be nice if these issues are solved, please express your support in the librdata issues.\n\n## Contributing\n\nContributions are welcome! Please check the document [CONTRIBUTING.md](CONTRIBUTING.md) for more details.\n\n## Change Log\n\nA log with the changes for each version can be found [here](https://github.com/ofajardo/pyreadr/blob/write_support/change_log.md)\n\n## People\n\n[Otto Fajardo](https://github.com/ofajardo) - author, maintainer\n\n[Jonathon Love](https://jona.thon.love/) - contributor (original cython wrapper from jamovi-readstat and msvc compatible librdata)\n\n[deenes](https://github.com/deeenes) -  reading lzma compression\n\n[Daniel M. 
Sullivan](www.danielmsullivan.com) - added license information to setup.py\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofajardo%2Fpyreadr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fofajardo%2Fpyreadr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofajardo%2Fpyreadr/lists"}