{"id":13475394,"url":"https://github.com/jtleek/datasharing","last_synced_at":"2026-02-12T10:02:03.401Z","repository":{"id":11692352,"uuid":"14204342","full_name":"jtleek/datasharing","owner":"jtleek","description":"The Leek group guide to data sharing ","archived":false,"fork":false,"pushed_at":"2024-08-07T08:29:32.000Z","size":590,"stargazers_count":6632,"open_issues_count":912,"forks_count":243562,"subscribers_count":564,"default_branch":"master","last_synced_at":"2025-08-16T16:32:51.989Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jtleek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-11-07T13:25:07.000Z","updated_at":"2025-08-15T19:55:08.000Z","dependencies_parsed_at":"2023-02-10T00:05:11.902Z","dependency_job_id":"8a94da22-3510-42d3-be2a-a7c23f328e3b","html_url":"https://github.com/jtleek/datasharing","commit_stats":{"total_commits":20,"total_committers":10,"mean_commits":2.0,"dds":0.5,"last_synced_commit":"df97230f037c212d8a07cdaf9216348c63644c9b"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jtleek/datasharing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jtleek%2Fdatasharing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jtleek%2Fdatasharing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jtleek%2Fdatasharing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jtleek
%2Fdatasharing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jtleek","download_url":"https://codeload.github.com/jtleek/datasharing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jtleek%2Fdatasharing/sbom","scorecard":{"id":540054,"data":{"date":"2025-08-11","repo":{"name":"github.com/jtleek/datasharing","commit":"df97230f037c212d8a07cdaf9216348c63644c9b"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow 
detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":3,"reason":"Found 9/27 approved changesets -- score normalized to 3","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 11 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-20T07:59:09.024Z","repository_id":11692352,"created_at":"2025-08-20T07:59:09.024Z","updated_at":"2025-08-20T07:59:09.024Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29363010,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T08:51:36.827Z","status":"ssl_error","status_checked_at":"2026-02-12T08:51:26.849Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T16:01:20.040Z","updated_at":"2026-02-12T10:02:03.382Z","avatar_url":"https://github.com/jtleek.png","language":null,"readme":"How to share data with a statistician\n===========\n\nThis is a guide for anyone who needs to share data with a statistician or data scientist. The target audiences I have in mind are:\n\n* Collaborators who need statisticians or data scientists to analyze data for them\n* Students or postdocs in various disciplines looking for consulting advice\n* Junior statistics students whose job it is to collate/clean/wrangle data sets\n\nThe goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls\nand sources of delay in the transition from data collection to data analysis. 
The [Leek group](http://biostat.jhsph.edu/~jleek/) works with a large\nnumber of collaborators and the number one source of variation in the speed to results is the status of the data\nwhen they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.\n\nMy strong feeling is that statisticians should be able to handle the data in whatever state they arrive. It is important\nto see the raw data, understand the steps in the processing pipeline, and be able to incorporate hidden sources of\nvariability in one's data analysis. On the other hand, for many data types, the processing steps are well documented\nand standardized. So the work of converting the data from raw form to directly analyzable form can be performed \nbefore calling on a statistician. This can dramatically speed the turnaround time, since the statistician doesn't\nhave to work through all the pre-processing steps first. \n\n\nWhat you should deliver to the statistician\n====================\n\nTo facilitate the most efficient and timely analysis this is the information you should pass to a statistician:\n\n1. The raw data.\n2. A [tidy data set](http://vita.had.co.nz/papers/tidy-data.pdf) \n3. A code book describing each variable and its values in the tidy data set.  \n4. An explicit and exact recipe you used to go from 1 -\u003e 2,3 \n\nLet's look at each part of the data package you will transfer. \n\n\n### The raw data\n\nIt is critical that you include the rawest form of the data that you have access to. This ensures\nthat data provenance can be maintained throughout the workflow.  
Here are some examples of the\nraw form of data:\n\n* The strange [binary file](http://en.wikipedia.org/wiki/Binary_file) your measurement machine spits out\n* The unformatted Excel file with 10 worksheets the company you contracted with sent you\n* The complicated [JSON](http://en.wikipedia.org/wiki/JSON) data you got from scraping the [Twitter API](https://twitter.com/twitterapi)\n* The hand-entered numbers you collected looking through a microscope\n\nYou know the raw data are in the right format if you: \n\n1. Ran no software on the data\n1. Did not modify any of the data values\n1. Did not remove any data from the data set\n1. Did not summarize the data in any way\n\nIf you made any modifications to the raw data, it is no longer the raw form of the data. Reporting modified data\nas raw data is a very common way to slow down the analysis process, since the analyst will often have to do a\nforensic study of your data to figure out why the raw data look weird. (Also, imagine what would happen if new data arrived.)\n\n### The tidy data set\n\nThe general principles of tidy data are laid out by [Hadley Wickham](http://had.co.nz/) in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)\nand [this video](http://vimeo.com/33727555). While both the paper and the video describe tidy data using [R](http://www.r-project.org/), the principles\nare more generally applicable:\n\n1. Each variable you measure should be in one column\n1. Each different observation of that variable should be in a different row\n1. There should be one table for each \"kind\" of variable\n1. If you have multiple tables, they should include a column in the table that allows them to be joined or merged\n\nWhile these are the hard and fast rules, there are a number of other things that will make your data set much easier\nto handle. First is to include a row at the top of each data table/spreadsheet that contains full variable names. 
\nSo if you measured age at diagnosis for patients, you would head that column with the name `AgeAtDiagnosis` instead\nof something like `ADx` or another abbreviation that may be hard for another person to understand. \n\n\nHere is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with \n[RNA-sequencing](http://en.wikipedia.org/wiki/RNA-Seq). You have also collected demographic and clinical information\nabout the patients including their age, treatment, and diagnosis. You would have one table/spreadsheet that contains the clinical/demographic\ninformation. It would have four columns (patient id, age, treatment, diagnosis) and 21 rows (a row with variable names, then one row\nfor every patient). You would also have one spreadsheet for the summarized genomic data. Usually this type of data\nis summarized at the level of the number of counts per exon. Suppose you have 100,000 exons, then you would have a\ntable/spreadsheet that had 21 rows (a row with exon names, and one row for each patient) and 100,001 columns (one column for patient\nids and one column for each exon). \n\nIf you are sharing your data with the collaborator in Excel, the tidy data should be in one Excel file per table. They\nshould not have multiple worksheets, no macros should be applied to the data, and no columns/cells should be highlighted. \nAlternatively, share the data in a [CSV](http://en.wikipedia.org/wiki/Comma-separated_values) or [TAB-delimited](http://en.wikipedia.org/wiki/Tab-separated_values) text file. (Beware, however, that reading CSV files into Excel can sometimes lead to non-reproducible handling of date and time variables.)\n\n\n### The code book\n\nFor almost any data set, the measurements you calculate will need to be described in more detail than you can or should sneak\ninto the spreadsheet. The code book contains this information. At minimum it should contain:\n\n1. 
Information about the variables (including units!) in the data set not contained in the tidy data \n1. Information about the summary choices you made\n1. Information about the experimental study design you used\n\nIn our genomics example, the analyst would want to know what the unit of measurement for each\nclinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They \nwould also want to know how you picked the exons you used for summarizing the genomic data (UCSC/Ensembl, etc.). They\nwould also want to know any other information about how you did the data collection/study design. For example,\nare these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic\nlike age? Are they randomized to treatments? \n\nA common format for this document is a Word file. There should be a section called \"Study design\" that has a thorough\ndescription of how you collected the data. There should also be a section called \"Code book\" that describes each variable and its\nunits. \n\n### How to code variables\n\nWhen you put variables into a spreadsheet there are several main categories you will run into depending on their [data type](http://en.wikipedia.org/wiki/Statistical_data_type):\n\n1. Continuous\n1. Ordinal\n1. Categorical\n1. Missing \n1. Censored\n\nContinuous variables are anything measured on a quantitative scale that could be any fractional number. An example\nwould be something like weight measured in kg. [Ordinal data](http://en.wikipedia.org/wiki/Ordinal_data) are data that have a fixed, small (\u003c 100) number of levels but are ordered. \nThis could be for example survey responses where the choices are: poor, fair, good. [Categorical data](http://en.wikipedia.org/wiki/Categorical_variable) are data where there\nare multiple categories, but they aren't ordered. One example would be sex: male or female. This coding is attractive because it is self-documenting.  
[Missing data](http://en.wikipedia.org/wiki/Missing_data) are data\nthat are unobserved and you don't know the mechanism. You should code missing values as `NA`. [Censored data](http://en.wikipedia.org/wiki/Censoring_\\(statistics\\)) are data\nwhere you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit\nor a patient being lost to follow-up. They should also be coded as `NA` when you don't have the data. But you should\nalso add a new column to your tidy data called \"VariableNameCensored\", which should have values of `TRUE` if censored \nand `FALSE` if not. In the code book you should explain why those values are missing. It is absolutely critical to report\nto the analyst if there is a reason you know about that some of the data are missing. You should also not [impute](http://en.wikipedia.org/wiki/Imputation_\\(statistics\\))/make up/\nthrow away missing observations.\n\nIn general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy\ndata, it should be \"male\" or \"female\". The ordinal values in the data set should be \"poor\", \"fair\", and \"good\", not 1, 2, 3.\nThis will avoid potential mixups about which direction effects go and will help identify coding errors. \n\nAlways encode every piece of information about your observations using text. For example, if you are storing data in Excel and use a form of colored text or cell background formatting to indicate information about an observation (\"red variable entries were observed in experiment 1.\") then this information will not be exported (and will be lost!) when the data is exported as raw text.  Every piece of data should be encoded as actual text that can be exported.  
\n\n### The instruction list/script\n\nYou may have heard this before, but [reproducibility is a big deal in computational science](http://www.sciencemag.org/content/334/6060/1226).\nThat means, when you submit your paper, the reviewers and the rest of the world should be able to exactly replicate\nthe analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform\nsome summarization/data analysis steps before the data can be considered tidy. \n\nThe ideal thing for you to do when performing summarization is to create a computer script (in `R`, `Python`, or something else) \nthat takes the raw data as input and produces the tidy data you are sharing as output. You can try running your script\na couple of times and see if the code produces the same output. \n\nIn many cases, the person who collected the data has incentive to make it tidy for a statistician to speed the process\nof collaboration. They may not know how to code in a scripting language. In that case, what you should provide the statistician\nis something called [pseudocode](http://en.wikipedia.org/wiki/Pseudocode). It should look something like:\n\n1. Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3\n1. Step 2 - run the software separately for each sample\n1. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set\n\nYou should also include information about which system (Mac/Windows/Linux) you used the software on and whether you \ntried it more than once to confirm it gave the same results. Ideally, you will run this by a fellow student/labmate\nto confirm that they can obtain the same output file you did. \n\n\n\n\nWhat you should expect from the analyst\n====================\n\nWhen you turn over a properly tidied data set it dramatically decreases the workload on the statistician. So hopefully\nthey will get back to you much sooner. 
But most careful statisticians will check your recipe, ask questions about\nsteps you performed, and try to confirm that they can obtain the same tidy data that you did with, at minimum, spot\nchecks.\n\nYou should then expect from the statistician:\n\n1. An analysis script that performs each of the analyses (not just instructions)\n1. The exact computer code they used to run the analysis\n1. All output files/figures they generated. \n\nThis is the information you will use in the supplement to establish reproducibility and precision of your results. Each\nof the steps in the analysis should be clearly explained and you should ask questions when you don't understand\nwhat the analyst did. It is the responsibility of both the statistician and the scientist to understand the statistical\nanalysis. You may not be able to perform the exact analyses without the statistician's code, but you should be able\nto explain why the statistician performed each step to a labmate/your principal investigator. \n\n\nContributors\n====================\n\n* [Jeff Leek](http://biostat.jhsph.edu/~jleek/) - Wrote the initial version.\n* [L. Collado-Torres](http://bit.ly/LColladoTorres) - Fixed typos, added links.\n* [Nick Reich](http://people.umass.edu/nick/) - Added tips on storing data as text.\n* [Nick Horton](https://www.amherst.edu/people/facstaff/nhorton) - Minor wording suggestions.\n\n\n","funding_links":[],"categories":["Others","Literature and Media"],"sub_categories":["Presentations"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjtleek%2Fdatasharing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjtleek%2Fdatasharing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjtleek%2Fdatasharing/lists"}