{"id":13693317,"url":"https://github.com/opensdp/OpenSDPsynthR","last_synced_at":"2025-05-02T21:31:52.996Z","repository":{"id":84502029,"uuid":"83469502","full_name":"OpenSDP/OpenSDPsynthR","owner":"OpenSDP","description":"Codebase to generate simulated data for OpenSDP project","archived":false,"fork":false,"pushed_at":"2020-06-16T21:11:29.000Z","size":2082,"stargazers_count":16,"open_issues_count":12,"forks_count":5,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-26T12:41:09.149Z","etag":null,"topics":["education","r","simulation","synthetic"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenSDP.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2017-02-28T19:12:19.000Z","updated_at":"2024-06-23T06:23:25.000Z","dependencies_parsed_at":"2023-03-12T23:10:32.385Z","dependency_job_id":null,"html_url":"https://github.com/OpenSDP/OpenSDPsynthR","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenSDP%2FOpenSDPsynthR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenSDP%2FOpenSDPsynthR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenSDP%2FOpenSDPsynthR/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenSDP%2FOpenSDPsynthR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenSDP","download_url":"https://codeload.github.com/O
penSDP/OpenSDPsynthR/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252108879,"owners_count":21696155,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["education","r","simulation","synthetic"],"created_at":"2024-08-02T17:01:08.359Z","updated_at":"2025-05-02T21:31:52.200Z","avatar_url":"https://github.com/OpenSDP.png","language":"HTML","funding_links":[],"categories":["Process-driven methods"],"sub_categories":["Tabular"],"readme":"---\r\noutput: github_document\r\n---\r\n\r\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\r\n\r\n```{r, echo = FALSE}\r\nknitr::opts_chunk$set(\r\n  collapse = TRUE,\r\n  comment = \"#\u003e\",\r\n  fig.path = \"tools/figs/README-\",\r\n  message = FALSE,\r\n  warning = FALSE\r\n)\r\n```\r\n\r\n# OpenSDPsynthR\r\n\r\n![](tools/figs/open_sdp_logo_red.png)\r\n\r\n\r\nA project to generate realistic synthetic unit-level longitudinal education data\r\nto empower collaboration in education analytics.\r\n\r\n## Design Goals\r\n\r\n1. Generate synthetic education data that is realistic for use by analysts\r\nacross the education sector. Realistic means messy, and reflective of the\r\ngeneral pattern of relationships found in the U.S. education sector.\r\n2. Synthetic data should be able to be generated on-demand and responsive to\r\ninputs from the user. These inputs should allow the user to configure the\r\nprocess to produce data that resembles the patterns of data in their agency.\r\n3. 
The package should be modular and extensible, allowing new data topics to be\r\ngenerated as needed so synthetic data coverage can grow.\r\n\r\n## Structure\r\n\r\nThe package is organized into the following functions:\r\n\r\n- `simpop()` is the overall function that runs the simulation; it calls\r\nmany subfunctions to simulate different elements of the student data\r\n- `cleaners` are functions that take the output from the `simpop` function and\r\nreshape it into data formats for different analyses. Currently only two cleaners\r\nare supported -- `ceds_cleaner()` and `sdp_cleaner()`, which prepare the data in a\r\nCEDS-like format and in the Strategic Data Project college-going analysis file\r\nspecification, respectively.\r\n- `sim_control()` -- a function that controls all of the parameters of the `simpop`\r\nsimulation. The details of this function are covered in the vignettes.\r\n\r\n# Get Started\r\n\r\nTo use `OpenSDPsynthR`, follow the instructions below:\r\n\r\n## Install Package\r\n\r\nThe development version of the package can be installed with\r\n`install_github()`. To use this command, you will need to install the `devtools`\r\npackage.\r\n\r\n```{r eval=FALSE}\r\ndevtools::install_github(\"opensdp/OpenSDPsynthR\")\r\n```\r\n\r\n## Make some data\r\n\r\nLoad the package\r\n\r\n```{r, message=TRUE}\r\nlibrary(OpenSDPsynthR)\r\n```\r\n\r\nThe main function of the package is `simpop`, which generates a list of data\r\nelements corresponding to simulated educational careers, K-20, for a\r\nuser-specified number of students. In R, a list is a data structure that can contain\r\nmultiple data elements of different structures. 
This can be used to emulate\r\nthe multiple tables of a Student Information System (SIS).\r\n\r\n\r\n\r\n```{r, message=TRUE}\r\nout \u003c- simpop(nstu = 500, seed = 213, control = sim_control(nschls = 3))\r\n```\r\n\r\nCurrently ten tables are produced:\r\n\r\n```{r}\r\nnames(out)\r\n```\r\n\r\n\r\nData elements produced include:\r\n\r\n- **Student demographics:**  age, race, and sex\r\n- **Student participation:** grade advancement, ELL status, IEP status,\r\nFRPL status, gifted and talented status, attendance\r\n- **Student enrollment status:** exit type, enrollment type, transfer, graduation,\r\ndropout, etc.\r\n- **School attributes:** name, school category, school size, Title I and Title III status, etc.\r\n- **Student assessment:** math assessment, reading assessment, grade level assessed\r\n- **High school outcomes:** graduation, cumulative GPA, graduation type, cohort,\r\nclass rank, postsecondary enrollment\r\n- **High school progression:** annual class rank, cumulative credits earned, credits\r\nearned, credits by English Language Arts and by Mathematics, credits attempted, \r\nontrack status\r\n- **Postsecondary enrollment:** year of enrollment, transfer indicator, name and ID of\r\npostsecondary institution, type of institution\r\n- **Postsecondary institution:** name, city, state, online only, average net price,\r\nPell grant rate, retention four year full time, share of part time enrollment,\r\nenrollment by race, SAT and ACT score distribution for admitted students\r\n\r\nThere are two tables of metadata about the assessment data above to be used in\r\ncases where multiple types of student assessment are analyzed together.\r\n\r\n- **Assessment information:** grade, subject, ID, type, and name of assessment\r\n- **Proficiency information:** mean score, error of score, number of students tested\r\n\r\n\r\n```{r, echo=FALSE, message=FALSE, warning=FALSE, include=FALSE}\r\ntable_names \u003c- data.frame(table = NULL, column = NULL)\r\nfor(i in 
seq_along(out)){\r\n  table_name \u003c- names(out)[[i]]\r\n  columns \u003c- names(out[[i]])\r\n  tmp \u003c- data.frame(table = table_name, column = columns,\r\n                    stringsAsFactors = FALSE)\r\n  table_names \u003c- bind_rows(table_names, tmp)\r\n}\r\n\r\n```\r\n\r\n\r\n```{r, inclue=FALSE}\r\nhead(out$demog_master %\u003e% arrange(sid) %\u003e% select(1:4))\r\nhead(out$stu_year, 10)\r\n```\r\n\r\n## Cleaners\r\n\r\nYou can reformat the synthetic data for use in specific types of projects.\r\nCurrently two functions exist to format the simulated data into an analysis\r\nfile matching the SDP College-going data specification and a CEDS-like\r\ndata specification. More of these functions are planned in the future.\r\n\r\n```{r eval=FALSE}\r\ncgdata \u003c- sdp_cleaner(out)\r\nceds \u003c- ceds_cleaner(out)\r\n```\r\n\r\n\r\n## Control Parameters\r\n\r\nBy default, you only need to specify the number of students to simulate to the\r\n`simpop` command. The package has default simulation parameters that will result\r\nin creating a small school district with two schools.\r\n\r\n\r\n```{r demonstrateOptionList}\r\nnames(sim_control())\r\n```\r\n\r\nThese parameters can have complex structures to allow for conditional and random\r\ngeneration of data. 
Parameters fall into four categories:\r\n\r\n- **vectors:** a single list of parameters like school names, category names, or\r\nschool IDs\r\n- **conditional probability list:** an R list that contains a variable to group by,\r\na function to generate data with, and a list of parameters for that function for\r\neach group in the grouping variable\r\n- **outcome simulation parameters:** an R list of arguments to pass to the `simglm`\r\nfunction\r\n- **outcome adjustments:** an R list of lists, with functions that modify a variable\r\nin an existing data set\r\n\r\nFor more details, see the simulation control vignette.\r\n\r\n```{r, eval=FALSE}\r\nvignette(\"Controlling the Data Simulation\", package = \"OpenSDPsynthR\")\r\n```\r\n\r\n\r\n## Package Dependencies\r\n\r\n- `dplyr`\r\n- `lubridate`\r\n- [wakefield](https://www.github.com/trinker/wakefield)\r\n- [simglm](https://www.github.com/lebebr01/simglm)\r\n\r\n## OpenSDP\r\n\r\n`OpenSDPsynthR` is part of the OpenSDP project.\r\n\r\n[OpenSDP](https://opensdp.github.io) is an online, public repository of analytic\r\ncode, tools, and training intended to foster collaboration among education\r\nanalysts and researchers in order to accelerate the improvement of our school\r\nsystems. The community is hosted by the\r\n[Strategic Data Project](https://sdp.cepr.harvard.edu), an initiative of the\r\n[Center for Education Policy Research at Harvard University](https://cepr.harvard.edu).\r\nWe welcome contributions and feedback.\r\n\r\nThese materials were originally authored by the Strategic Data Project.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopensdp%2FOpenSDPsynthR","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopensdp%2FOpenSDPsynthR","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopensdp%2FOpenSDPsynthR/lists"}