Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/kaz-yos/distributed

Comparison of Privacy-Protecting Analytic and Data-sharing Methods: a Simulation Study (Pharmacoepidemiol Drug Saf 2018)
https://github.com/kaz-yos/distributed

data-analysis epidemiology statistics

Last synced: about 1 month ago
JSON representation

Comparison of Privacy-Protecting Analytic and Data-sharing Methods: a Simulation Study (Pharmacoepidemiol Drug Saf 2018)

Awesome Lists containing this project

README

        

#+TITLE: Overview of the distributed package
#+AUTHOR: Kazuki Yoshida
#+OPTIONS: toc:nil
#+OPTIONS: ^:{}
# ############################################################################ #

* Introduction

In distributed data networks such as the Sentinel and PCORnet, minimizing the amount of data shared across data partners is important for reducing the danger of potential privacy breach. In this simulation project, we examined the performance of data analysis methods at various levels of data sharing such as meta-analysis, summary table data, risk set data, and individual-level data. The preliminary version of the results can be found in the [[https://pcornetcommons.org/wp-content/uploads/2018/01/Aim-2-Simulation-Studies-Presentation.pdf][presentation slides]]. The final results are in the corresponding publication (/Pharmacoepidemiology and Drug Safety/ 2018 [in press]). More information of privacy-protecting analytic and data-sharing methods is available at [[https://www.distributedanalysis.org][www.distributedanalysis.org]].

* System Requirement

Running the entire simulation requires working installation of =R= as well as UNIX SAS that can be called via =sas= command. Most part of the simulation is in pure R, but the experimental weighted risk set analysis is implemented in SAS. When the sas command is not found, this part of simulation is skipped gracefully.

* Installation

The package can be installed as follows from the shell if you have an archive file. If it asks for dependencies, you may need to install these required packages first.

#+BEGIN_SRC sh
R CMD install distributed_0.3.0.tar.gz
#+END_SRC

Another way to install the package is to directly install from Github within =R=.

#+BEGIN_SRC R
## Install devtools (if you do not have it already)
install.packages("devtools")
## Install directly from github (develop branch)
devtools::install_github(repo = "kaz-yos/distributed")
#+END_SRC

Using devtools may requires some preparation, please see the following link for information.

http://www.rstudio.com/projects/devtools/

* Overview of Package Contents

The distributed package contains the functions used to generate, prepare, and analyze data. It also contains script files that are used to run the simulation study. The functions can be loaded in =R= using =library(distributed)=. The scripts are found in the =inst= subfolder in this repository. The unarchived folder contains the following =R= and shell scripts. You need to create subfolders =data=, =log=, =log_odyssey=, and =summary= for execution.

#+BEGIN_EXAMPLE
./scripts
├── 01.GenerateData.R
├── 01.GenerateData_odyssey.sh
├── 02.PrepareData.R
├── 02.PrepareData_odyssey.sh
├── 03.AnalyzeData.R
├── 03.AnalyzeData_odyssey.sh
├── 04.AggregateResults.R
├── 04.AggregateResults_odyssey.sh
├── 05.AssessMethodsByScenario.R
└── 06.AssessMethodsByScenarioSeries.R
#+END_EXAMPLE

* Running Simulation

The simulation has the following distinct phases.

- Data Generation
- Data Preparation
- Data Analysis
- Result Aggregation
- Method Assessment

** Data Generation

Running the following will generate files containing simulated distributed data network under the data subfolder. The 4 at the end of the command specifies the number of CPU cores to use.

#+BEGIN_SRC sh
Rscript 01.GenerateData.R 4
#+END_SRC

Multiple files are generated for each scenario to lessen the resource requirement. Each new file generated under the data subfolder has a name such as =ScenarioRaw001_part001_R50.RData=, where first number indicates the scenario, part number indicates which part it is in a series of files under this scenario, and =R50= indicates the number of iterations included in the file.

If you are running the simulation in a Linux cluster environment with job managers, the following script can aid dispatching the data generation job to a node. These shell script is designed for the Harvard University's Odyssey cluster (SLURM job manager). Thus, the script will require modification according to the configuration of the cluster system you are using.

#+BEGIN_SRC sh
sh 01.GenerateData_odyssey.sh
#+END_SRC

** Data Preparation

This step fits summary score models in the data and performs matching, stratification, and weighting by these estimated summary scores. Conceptually, this part corresponds to what each site does in a distributed data network. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.

#+BEGIN_SRC sh
Rscript 02.PrepareData.R ./data/ScenarioRaw001_part001_R50.RData 4
#+END_SRC

This will generate a new file named =ScenarioPrepared001_part001_R50.RData= under the data subfolder. This process can be repeated for each file via for loop, but it is better suited for a cluster system. The following script dispatches the data preparation job on each file to a separate node, thereby, allowing parallel execution. Again the files included are specialized for the cluster the authors used, and need modification before use at a different system.

#+BEGIN_SRC sh
sh 02.PrepareData_odyssey.sh ./data/ScenarioRaw*
#+END_SRC

** Data Analysis

This step conducts the actual analysis of prepared data for the treatment effect of interest. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.

#+BEGIN_SRC sh
Rscript 03.AnalyzeData.R ./data/ScenarioPrepared001_part001_R50.RData 4
#+END_SRC

This will generate a new file named =ScenarioAnalyzed001_part001_R50.RData= under the data subfolder. Again this can be repeated using a for loop or dispatched to multiple nodes in a cluster system.

#+BEGIN_SRC sh
sh 03.AnalyzeData_odyssey.sh ./data/ScenarioPrepared*
#+END_SRC

** Result Aggregation

This step aggregates the analysis results into a summary file. The following will load all data files with names containing =ScenarioAnalyzed= (analysis result files), and output assessment results in the summary subfolder.

#+BEGIN_SRC sh
sh 04.AggregateResults_odyssey.sh
#+END_SRC

An =R= data file named =analysis_summary_data.RData= will be generated under the =data= subfolder.

** Method Assessment

The following steps are less computationally intensive and designed for local execution with the =analysis_summary_data.RData= file in the =data= subfolder.

#+BEGIN_SRC sh
Rscript 05.AssessMethodsByScenario.R
Rscript 06.AssessMethodsByScenarioSeries.R
#+END_SRC