{"id":21714970,"url":"https://github.com/floswald/psidr","last_synced_at":"2025-04-12T18:43:08.219Z","repository":{"id":7873940,"uuid":"9247059","full_name":"floswald/psidR","owner":"floswald","description":"R package to easily build panel data sets from the PSID","archived":false,"fork":false,"pushed_at":"2024-11-07T16:02:54.000Z","size":5459,"stargazers_count":54,"open_issues_count":1,"forks_count":39,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-03T20:12:57.637Z","etag":null,"topics":["dataset","panel-data","psid","r"],"latest_commit_sha":null,"homepage":"http://floswald.github.io/psidR/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/floswald.png","metadata":{"files":{"readme":"readme.md","changelog":"NEWS","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-04-05T17:48:43.000Z","updated_at":"2025-02-24T13:15:35.000Z","dependencies_parsed_at":"2022-08-24T13:39:39.653Z","dependency_job_id":"fa63c5f2-8798-4e1c-9ac7-7b0a32e3e35f","html_url":"https://github.com/floswald/psidR","commit_stats":{"total_commits":145,"total_committers":6,"mean_commits":"24.166666666666668","dds":"0.12413793103448278","last_synced_commit":"df6e57b68b8dc3a9b6fd6f8e8bcc6c0c78415668"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floswald%2FpsidR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floswald%2FpsidR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floswald%2FpsidR/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floswald%2FpsidR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/floswald","download_url":"https://codeload.github.com/floswald/psidR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248617127,"owners_count":21134190,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","panel-data","psid","r"],"created_at":"2024-11-26T00:39:49.198Z","updated_at":"2025-04-12T18:43:08.186Z","avatar_url":"https://github.com/floswald.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n\n# psidR: make building panel data from PSID easy\n\n\n| Build  | DOI  | Docs  |\n|---|---|---|\n[![R-CMD-check](https://github.com/floswald/psidR/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/floswald/psidR/actions/workflows/check-standard.yaml) | [![DOI](https://zenodo.org/badge/9247059.svg)](https://zenodo.org/badge/latestdoi/9247059) | [![DOCS](https://img.shields.io/badge/docs-Documentation-blue)](https://floswald.github.io/psidR)|\n\n\nThis R package provides a function to easily build panel data from PSID raw data.\n\n\u003e**Warning**: the wealth-supplement setup has changed on the PSID system. wealth variables are now part of the family files for waves 1999 onwards. The `wealth=TRUE` option has therefore been removed from the package. 
See [this issue](https://github.com/floswald/psidR/issues/34) for more details.\n\u003e\n\n## How to install this package\n\nThe package is on CRAN, so you can install it with\n\n```r\ninstall.packages('psidR')\n```\n\nAlternatively, to get the most up-to-date version from this repository,\n\n```r\ninstall.packages('devtools')\ndevtools::install_github(\"floswald/psidR\")\n```\n\n### PSID\n\nThe [Panel Study of Income Dynamics](http://psidonline.isr.umich.edu/) is a publicly available dataset. \n\n* You can use the [data center](http://simba.isr.umich.edu/default.aspx) to build simple datasets.\n* This is not workable for larger datasets:\n  * some variables don't show up (although you know they exist)\n  * the ftp interface gets slower the more periods you are looking at\n  * the click-and-scroll exercise of selecting the right variables in each period is extremely error prone\n* Merging the data manually is non-trivial.\n\n### psidR\n\nThis package helps with the task of building a panel dataset. The user downloads ASCII data from the PSID server directly into `R`, **without the need** for any other software like Stata or SAS. To build the panel, the user must then specify the variable names in each wave of the questionnaire in a data.frame `fam.vars`, as well as the variables from the individual index in `ind.vars`. The helper function `getNamesPSID` helps to find the corresponding variable names across waves - see examples below.\n\n\n### Quick Start and `API`\n\n1. You must supply at least one data.frame with variables to read from the family file. Most of the time you will also supply a data.frame with variables to read from the individual file.\n2. Those data.frames **must** be in the following format: column `year` is an integer and indicates the calendar year, and the names of the other columns are the _variable names which will appear in your panel_. 
\n\n```R\n\u003e head(i)  # individual file example\n   year     age    educ empstat  weight\n1: 1968 ER30004 ER30010    \u003cNA\u003e ER30019\n2: 1969 ER30023    \u003cNA\u003e    \u003cNA\u003e ER30042        # NOTICE THE NA for educ HERE!!\n3: 1970 ER30046 ER30052    \u003cNA\u003e ER30066\n4: 1971 ER30070 ER30076    \u003cNA\u003e ER30090\n5: 1972 ER30094 ER30100    \u003cNA\u003e ER30116\n6: 1973 ER30120 ER30126    \u003cNA\u003e ER30137\n\n\u003e head(f)  # family file example\n   year age_youngest_child debt empstat_ faminc hours hvalue ...\n1: 1968               V120 \u003cNA\u003e     V196    V81   V47     V5 ...\n2: 1969              V1013 \u003cNA\u003e     V639   V529  V465   V449 ...\n3: 1970              V1243 \u003cNA\u003e    V1278  V1514 V1138  V1122 ...\n4: 1971              V1946 \u003cNA\u003e    V1983  V2226 V1839  V1823 ...\n5: 1972              V2546 \u003cNA\u003e    V2581  V2852 V2439  V2423 ...\n6: 1973              V3099 \u003cNA\u003e    V3114  V3256 V3027  V3021 ...\n```\n\nExample usage:\n\n\n```R\n\u003e library(psidR)\n\n\u003e build.psid(datadir = \"~/data/PSID\", small = TRUE)  # directory `datadir` must exist!\nINFO [2021-07-13 10:34:26] Will download missing datasets now\nINFO [2021-07-13 10:34:26] will download family files: 2013, 2015\nINFO [2021-07-13 10:34:26] will download latest individual index: IND2019ER\nThis can take several hours/days to download.\n want to go ahead? 
give me 'yes' or 'no'.yes\nplease enter your PSID username: *****\nplease enter your PSID password: *****\nINFO [2021-07-13 10:34:41] downloading file ~/data/PSID/FAM2013ER\nINFO [2021-07-13 10:34:56] now reading and processing SAS file ~/data/PSID/FAM2013ER into R\nINFO [2021-07-13 10:40:06] downloading file ~/data/PSID/FAM2015ER          \nINFO [2021-07-13 10:40:22] now reading and processing SAS file ~/data/PSID/FAM2015ER into R\nINFO [2021-07-13 10:45:34] downloading file ~/data/PSID/IND2019ER          \nINFO [2021-07-13 10:46:39] now reading and processing SAS file ~/data/PSID/IND2019ER into R\nINFO [2021-07-13 11:15:04] finished downloading files to ~/data/PSID/       \nINFO [2021-07-13 11:15:04] continuing now to build the dataset\nINFO [2021-07-13 11:15:04] psidR: Loading Family data from .rda files\nINFO [2021-07-13 11:15:12] psidR: loaded individual file: ~/data/PSID/IND2019ER.rda\nINFO [2021-07-13 11:15:12] psidR: total memory load in MB: 1538\nINFO [2021-07-13 11:15:12] psidR: currently working on data for year 2013\nINFO [2021-07-13 11:15:12] full 2013 sample has 82573 obs\nINFO [2021-07-13 11:15:12] you selected 34856 obs belonging to SRC\nINFO [2021-07-13 11:15:12] dropping non-heads leaves 5450 obs\nINFO [2021-07-13 11:15:14] psidR: currently working on data for year 2015\nINFO [2021-07-13 11:15:14] full 2015 sample has 82573 obs\nINFO [2021-07-13 11:15:14] you selected 34856 obs belonging to SRC\nINFO [2021-07-13 11:15:14] dropping non-heads leaves 5318 obs\nINFO [2021-07-13 11:15:16] End of build.panel\n```\n\n### Usage\n\nWe first present a real-world example that builds a full 1968-2017 panel. 
Then we show some tests.\n\n\n### Real World Example: With Missing Variables\n\n* You want a `data.table` with the following columns: `PID,year,income,wage,age,educ` and some more variables.\n* You went to the [PSID variable search](https://simba.isr.umich.edu/VS/s.aspx) to look up the relevant variable names in each year in either the `individual-level` or `family-level` datasets.\n* You created a list of those variables as I did in [`inst/psid-lists`](inst/psid-lists) of this package\n* You noted that there is **NO EDUCATION** variable in the individual index file in 1968 and 1969\n    * Instead of the variable name for `EDUC` in 1968 and 1969 you want to put `NA`\n* You noted that there is **NO HOURLY WAGE** variable in the family index file in 1993\n    * Instead of the variable name for `HOURLY WAGE` in 1993 you want to put `NA`\n\n```R\n# Build panel with income, wage, age, education and several other variables\n# [this is the body of the function build.psid()]\nlibrary(psidR)\nlibrary(data.table)\nr = system.file(package=\"psidR\")\nf = fread(file.path(r,\"psid-lists\",\"famvars.txt\"))\ni = fread(file.path(r,\"psid-lists\",\"indvars.txt\"))\n\n\u003e i\n                           dataset year variable                  label   name\n  1: PSID Individual Data by Years 1968  ER30019   INDIVIDUAL WEIGHT 68 weight\n  2: PSID Individual Data by Years 1969  ER30042   INDIVIDUAL WEIGHT 69 weight\n  3: PSID Individual Data by Years 1970  ER30066   INDIVIDUAL WEIGHT 70 weight\n  4: PSID Individual Data by Years 1971  ER30090   INDIVIDUAL WEIGHT 71 weight\n  5: PSID Individual Data by Years 1972  ER30116   INDIVIDUAL WEIGHT 72 weight\n ---                                                                          \n143:    PSID Individual Data Index 2009  ER34020 HIGHEST GRADE FINISHED   educ\n144:    PSID Individual Data Index 2011  ER34119 HIGHEST GRADE FINISHED   educ\n145:    PSID Individual Data Index 2013  ER34230 HIGHEST GRADE FINISHED   educ\n146:    PSID 
Individual Data Index 2015  ER34349 HIGHEST GRADE FINISHED   educ\n147:    PSID Individual Data Index 2017  ER34548 HIGHEST GRADE FINISHED   educ\n\n\u003e f\n                   dataset year variable                     label            name\n  1: PSID Main Family Data 1968      V47 HD ANN HRS WORKED LAST YR           hours\n  2: PSID Main Family Data 1969     V465 HD ANN HRS WORKED LAST YR           hours\n  3: PSID Main Family Data 1970    V1138 HD ANN HRS WORKED LAST YR           hours\n  4: PSID Main Family Data 1971    V1839 HD ANN HRS WORKED LAST YR           hours\n  5: PSID Main Family Data 1972    V2439 HD ANN HRS WORKED LAST YR           hours\n ---                                                                              \n609:     PSID Family-level 2009  ER42139  A52 LIKELIHOOD OF MOVING likelihood_move\n610:     PSID Family-level 2011  ER47447  A52 LIKELIHOOD OF MOVING likelihood_move\n611:     PSID Family-level 2013  ER53147  A52 LIKELIHOOD OF MOVING likelihood_move\n612:     PSID Family-level 2015  ER60162  A52 LIKELIHOOD OF MOVING likelihood_move\n613:     PSID Family-level 2017  ER66163  A52 LIKELIHOOD OF MOVING likelihood_move\n\n# alternatively, use `getNamesPSID`:\n# cwf \u003c- openxlsx::read.xlsx(\"http://psidonline.isr.umich.edu/help/xyr/psid.xlsx\")\n# Suppose you know the name of the variable in a certain year, and it is\n# \"ER17013\". 
Then get the corresponding name in another year with\n# getNamesPSID(\"ER17013\", cwf, years = 2001)  # 2001 only\n# getNamesPSID(\"ER17013\", cwf, years = 2003)  # 2003\n# getNamesPSID(\"ER17013\", cwf, years = NULL)  # all years\n# getNamesPSID(\"ER17013\", cwf, years = c(2005, 2007, 2009))   # some years\n\n# next, bring into required shape:\n\ni = dcast(i[,list(year,name,variable)],year~name, value.var = \"variable\")\nf = dcast(f[,list(year,name,variable)],year~name, value.var = \"variable\")\n\n\u003e head(i)\n   year     age    educ empstat  weight\n1: 1968 ER30004 ER30010    \u003cNA\u003e ER30019\n2: 1969 ER30023    \u003cNA\u003e    \u003cNA\u003e ER30042        # NOTICE THE NA for educ HERE!!\n3: 1970 ER30046 ER30052    \u003cNA\u003e ER30066\n4: 1971 ER30070 ER30076    \u003cNA\u003e ER30090\n5: 1972 ER30094 ER30100    \u003cNA\u003e ER30116\n6: 1973 ER30120 ER30126    \u003cNA\u003e ER30137\n\n\u003e head(f)\n   year age_youngest_child debt empstat_ faminc hours hvalue ...\n1: 1968               V120 \u003cNA\u003e     V196    V81   V47     V5 ...\n2: 1969              V1013 \u003cNA\u003e     V639   V529  V465   V449 ...\n3: 1970              V1243 \u003cNA\u003e    V1278  V1514 V1138  V1122 ...\n4: 1971              V1946 \u003cNA\u003e    V1983  V2226 V1839  V1823 ...\n5: 1972              V2546 \u003cNA\u003e    V2581  V2852 V2439  V2423 ...\n6: 1973              V3099 \u003cNA\u003e    V3114  V3256 V3027  V3021 ...\n\n# call the builder function (`datadr` is the path to your PSID data directory)\n\nd = build.panel(datadir=datadr,fam.vars=f,ind.vars=i, heads.only = TRUE,sample=\"SRC\",design=\"all\")\n\n# d contains your panel\n\nsaveRDS(d,file=\"~/psid.Rds\")\n```\n\nHere are some tests:\n\n```R\n# one year test, no ind file\n# call function `small.test.noind()`\n# get var names from cross walk\ncwf = openxlsx::read.xlsx(system.file(package=\"psidR\",\"psid-lists\",\"psid.xlsx\"))\nhead_age_var_name \u003c- getNamesPSID(\"ER17013\", cwf, years=c(2003))\n# create family vars data.frame\nfamvars = 
data.frame(year=c(2003),variable=head_age_var_name$variable)\n# call function\nbuild.panel(fam.vars=famvars,datadir=dd)\n\n# one year test, ind file\n# call function `small.test.ind()`\n\ncwf = openxlsx::read.xlsx(system.file(package=\"psidR\",\"psid-lists\",\"psid.xlsx\"))\nhead_age_var_name \u003c- getNamesPSID(\"ER17013\", cwf, years=c(2003))\neduc = getNamesPSID(\"ER30323\",cwf,years=2003)\nfamvars = data.frame(year=c(2003),variable=head_age_var_name$variable)\nindvars = data.frame(year=c(2003),variable=educ$variable)\nbuild.panel(fam.vars=famvars,ind.vars=indvars,datadir=dd)\n\n\n# three year test, ind file\n# call function `medium.test.ind()`\n\ncwf = openxlsx::read.xlsx(system.file(package=\"psidR\",\"psid-lists\",\"psid.xlsx\"))\nhead_age_var_name \u003c- getNamesPSID(\"ER17013\", cwf, years=c(2003,2005,2007))\n educ = getNamesPSID(\"ER30323\",cwf,years=c(2003,2005,2007))\nfamvars = data.frame(year=c(2003,2005,2007),variable=head_age_var_name$variable)\nindvars = data.frame(year=c(2003,2005,2007),variable=educ$variable)\nbuild.panel(fam.vars=famvars,ind.vars=indvars,datadir=dd)\n\n# etc for\nmedium.test.noind()\n\n# example output:\n\nINFO [2018-10-10 10:58:23] Will download missing datasets now\nINFO [2018-10-10 10:58:23] will download family files: 2003, 2005, 2007\nThis can take several hours/days to download.\n want to go ahead? 
give me 'yes' or 'no'.yes\nplease enter your PSID username: *******\nplease enter your PSID password: *******\nINFO [2018-10-10 10:58:46] downloading file ~/psid/FAM2003ER\nINFO [2018-10-10 10:58:50] now reading and processing SAS file ~/psid/FAM2003ER into R\nINFO [2018-10-10 11:07:02] downloading file ~/psid/FAM2005ER               \nINFO [2018-10-10 11:07:05] now reading and processing SAS file ~/psid/FAM2005ER into R\nINFO [2018-10-10 11:14:44] downloading file ~/psid/FAM2007ER               \nINFO [2018-10-10 11:14:48] now reading and processing SAS file ~/psid/FAM2007ER into R\nINFO [2018-10-10 11:28:25] finished downloading files to ~/psid/           \nINFO [2018-10-10 11:28:25] continuing now to build the dataset\nINFO [2018-10-10 11:28:25] psidR: Loading Family data from .rda files\nINFO [2018-10-10 11:28:34] psidR: loaded individual file: ~/psid/IND2015ER.rda\nINFO [2018-10-10 11:28:34] psidR: total memory load in MB: 1252\nINFO [2018-10-10 11:28:34] \nINFO [2018-10-10 11:28:34] psidR: currently working on data for year 2003\nINFO [2018-10-10 11:28:36] \nINFO [2018-10-10 11:28:36] psidR: currently working on data for year 2005\nINFO [2018-10-10 11:28:37] \nINFO [2018-10-10 11:28:37] psidR: currently working on data for year 2007\nINFO [2018-10-10 11:28:39] balanced design reduces sample from 97377 to 89571\nINFO [2018-10-10 11:28:39] End of build.panel\n\u003e x\n       age interview ID1968 pernum sequence relation.head     pid year\n    1:  92         1    848      2        1            10  848002 2003\n    2:  64         2   1173      1        1            10 1173001 2003\n    3:  48         3   1866     32        2            30 1866032 2003\n    4:  48         3   1866    171        1            10 1866171 2003\n    5:  48         3   1866    175        0             0 1866175 2003\n   ---                                                                \n89567:  49      8332   6069      4        2            20 6069004 2007\n89568:  49      8332   6069 
    30        0             0 6069030 2007\n89569:  49      8332   6069    171        3            33 6069171 2007\n89570:  49      8332   6069    173        1            10 6069173 2007\n89571:  49      8332   6069    174        0             0 6069174 2007\n\n# etc for \nmedium.test.ind.NA()\n```\n\n\n### Example Usage\n\nThe main function in the package is `build.panel`. It comes with a reproducible example, which you can run by typing\n\n```r\nlibrary(psidR)\nexample(build.panel)\n```\n\n### Supplemental Datasets\n\nThe PSID has a wealth of add-on datasets. Once you have a panel, those are easy to merge on. The panel will have a variable `interview`, which is the identifier in the supplemental dataset. \n\n\n## Citation\n\nIf you use `psidR` in your work, please consider citing it. You can get the citation with\n\n```R\n\u003e citation(package=\"psidR\")\n\nTo cite the 'psidR' package in publications use:\n\n  Florian Oswald (2021). psidR: Build Panel Data Sets from PSID Raw Data. R package version\n  2.1.\n\nA BibTeX entry for LaTeX users is\n\n  @Manual{,\n    title = {psidR: Build Panel Data Sets from PSID Raw Data},\n    author = {Florian Oswald},\n    year = {2021},\n    note = {R package version 2.1},\n    url = {https://github.com/floswald/psidR},\n  }\n```\n\nThanks!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffloswald%2Fpsidr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffloswald%2Fpsidr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffloswald%2Fpsidr/lists"}