Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kaz-yos/distributed
Comparison of Privacy-Protecting Analytic and Data-sharing Methods: a Simulation Study (Pharmacoepidemiol Drug Saf 2018)
https://github.com/kaz-yos/distributed
data-analysis epidemiology statistics
Last synced: about 1 month ago
JSON representation
Comparison of Privacy-Protecting Analytic and Data-sharing Methods: a Simulation Study (Pharmacoepidemiol Drug Saf 2018)
- Host: GitHub
- URL: https://github.com/kaz-yos/distributed
- Owner: kaz-yos
- Created: 2017-09-14T17:06:17.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-06-27T14:37:24.000Z (over 6 years ago)
- Last Synced: 2024-11-12T04:44:25.300Z (3 months ago)
- Topics: data-analysis, epidemiology, statistics
- Language: R
- Homepage: https://www.ncbi.nlm.nih.gov/pubmed/30022561
- Size: 172 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.org
- Changelog: NEWS
Awesome Lists containing this project
README
#+TITLE: Overview of the distributed package
#+AUTHOR: Kazuki Yoshida
#+OPTIONS: toc:nil
#+OPTIONS: ^:{}
# ############################################################################ #* Introduction
In distributed data networks such as the Sentinel and PCORnet, minimizing the amount of data shared across data partners is important for reducing the danger of potential privacy breach. In this simulation project, we examined the performance of data analysis methods at various levels of data sharing such as meta-analysis, summary table data, risk set data, and individual-level data. The preliminary version of the results can be found in the [[https://pcornetcommons.org/wp-content/uploads/2018/01/Aim-2-Simulation-Studies-Presentation.pdf][presentation slides]]. The final results are in the corresponding publication (/Pharmacoepidemiology and Drug Safety/ 2018 [in press]). More information of privacy-protecting analytic and data-sharing methods is available at [[https://www.distributedanalysis.org][www.distributedanalysis.org]].
* System Requirement
Running the entire simulation requires working installation of =R= as well as UNIX SAS that can be called via =sas= command. Most part of the simulation is in pure R, but the experimental weighted risk set analysis is implemented in SAS. When the sas command is not found, this part of simulation is skipped gracefully.
* Installation
The package can be installed as follows from the shell if you have an archive file. If it asks for dependencies, you may need to install these required packages first.
#+BEGIN_SRC sh
R CMD install distributed_0.3.0.tar.gz
#+END_SRCAnother way to install the package is to directly install from Github within =R=.
#+BEGIN_SRC R
## Install devtools (if you do not have it already)
install.packages("devtools")
## Install directly from github (develop branch)
devtools::install_github(repo = "kaz-yos/distributed")
#+END_SRCUsing devtools may requires some preparation, please see the following link for information.
http://www.rstudio.com/projects/devtools/
* Overview of Package Contents
The distributed package contains the functions used to generate, prepare, and analyze data. It also contains script files that are used to run the simulation study. The functions can be loaded in =R= using =library(distributed)=. The scripts are found in the =inst= subfolder in this repository. The unarchived folder contains the following =R= and shell scripts. You need to create subfolders =data=, =log=, =log_odyssey=, and =summary= for execution.
#+BEGIN_EXAMPLE
./scripts
├── 01.GenerateData.R
├── 01.GenerateData_odyssey.sh
├── 02.PrepareData.R
├── 02.PrepareData_odyssey.sh
├── 03.AnalyzeData.R
├── 03.AnalyzeData_odyssey.sh
├── 04.AggregateResults.R
├── 04.AggregateResults_odyssey.sh
├── 05.AssessMethodsByScenario.R
└── 06.AssessMethodsByScenarioSeries.R
#+END_EXAMPLE* Running Simulation
The simulation has the following distinct phases.
- Data Generation
- Data Preparation
- Data Analysis
- Result Aggregation
- Method Assessment** Data Generation
Running the following will generate files containing simulated distributed data network under the data subfolder. The 4 at the end of the command specifies the number of CPU cores to use.
#+BEGIN_SRC sh
Rscript 01.GenerateData.R 4
#+END_SRCMultiple files are generated for each scenario to lessen the resource requirement. Each new file generated under the data subfolder has a name such as =ScenarioRaw001_part001_R50.RData=, where first number indicates the scenario, part number indicates which part it is in a series of files under this scenario, and =R50= indicates the number of iterations included in the file.
If you are running the simulation in a Linux cluster environment with job managers, the following script can aid dispatching the data generation job to a node. These shell script is designed for the Harvard University's Odyssey cluster (SLURM job manager). Thus, the script will require modification according to the configuration of the cluster system you are using.
#+BEGIN_SRC sh
sh 01.GenerateData_odyssey.sh
#+END_SRC** Data Preparation
This step fits summary score models in the data and performs matching, stratification, and weighting by these estimated summary scores. Conceptually, this part corresponds to what each site does in a distributed data network. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.
#+BEGIN_SRC sh
Rscript 02.PrepareData.R ./data/ScenarioRaw001_part001_R50.RData 4
#+END_SRCThis will generate a new file named =ScenarioPrepared001_part001_R50.RData= under the data subfolder. This process can be repeated for each file via for loop, but it is better suited for a cluster system. The following script dispatches the data preparation job on each file to a separate node, thereby, allowing parallel execution. Again the files included are specialized for the cluster the authors used, and need modification before use at a different system.
#+BEGIN_SRC sh
sh 02.PrepareData_odyssey.sh ./data/ScenarioRaw*
#+END_SRC** Data Analysis
This step conducts the actual analysis of prepared data for the treatment effect of interest. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.
#+BEGIN_SRC sh
Rscript 03.AnalyzeData.R ./data/ScenarioPrepared001_part001_R50.RData 4
#+END_SRCThis will generate a new file named =ScenarioAnalyzed001_part001_R50.RData= under the data subfolder. Again this can be repeated using a for loop or dispatched to multiple nodes in a cluster system.
#+BEGIN_SRC sh
sh 03.AnalyzeData_odyssey.sh ./data/ScenarioPrepared*
#+END_SRC** Result Aggregation
This step aggregates the analysis results into a summary file. The following will load all data files with names containing =ScenarioAnalyzed= (analysis result files), and output assessment results in the summary subfolder.
#+BEGIN_SRC sh
sh 04.AggregateResults_odyssey.sh
#+END_SRCAn =R= data file named =analysis_summary_data.RData= will be generated under the =data= subfolder.
** Method Assessment
The following steps are less computationally intensive and designed for local execution with the =analysis_summary_data.RData= file in the =data= subfolder.
#+BEGIN_SRC sh
Rscript 05.AssessMethodsByScenario.R
Rscript 06.AssessMethodsByScenarioSeries.R
#+END_SRC