{"id":20425048,"url":"https://github.com/databio/example_analysis","last_synced_at":"2026-03-19T15:46:14.264Z","repository":{"id":86287774,"uuid":"59844386","full_name":"databio/example_analysis","owner":"databio","description":null,"archived":false,"fork":false,"pushed_at":"2020-11-04T15:14:04.000Z","size":5,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-12-05T21:26:09.958Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Makefile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-05-27T15:25:14.000Z","updated_at":"2020-10-30T17:45:51.000Z","dependencies_parsed_at":"2023-03-13T09:28:30.360Z","dependency_job_id":null,"html_url":"https://github.com/databio/example_analysis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/databio/example_analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fexample_analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fexample_analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fexample_analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fexample_analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databio","download_url":"https://codeload.github.com/databio/exam
ple_analysis/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fexample_analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30150422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T21:15:50.531Z","status":"ssl_error","status_checked_at":"2026-03-05T21:15:11.173Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T07:12:07.663Z","updated_at":"2026-03-05T21:32:05.027Z","avatar_url":"https://github.com/databio.png","language":"Makefile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Analysis project template\n\nThis README contains detailed instructions on how to organize an analysis project. This repository also serves as a template repository you can clone as a starting point for a new analysis project.\n\n## Components of a project\n\nA data analysis project has four components:\n\n* Raw data\n* Metadata (descriptions of the data)\n* Code\n* Results (processed data, figures, *etc.*)\n\nWhich of these components belongs in a git repository? Only what we need under version control. Raw data and Results are usually too large, so they should not be version controlled and therefore should not be in the git repository. 
In contrast, metadata and code are typically small, text-based, and change frequently; these should therefore be under version control in a git repository. Keeping the repository small improves efficiency, and then each person working on the project can clone it into a personal area because these duplicates are cheap and small. The larger components (Raw data and results) will instead reside in a single copy on shared disk space, to be referenced by code or metadata in the git repository. \n\nOur project thus must be divided into two components: the version-controlled part and the non-controlled part, each with its own space and structure:\n\n### 1. Version-controlled components: GitHub repository organization (metadata and code)\n\nThe two version-controlled components should each live under a subfolder. So, the simple structure would look like this:\n\n```\ngithub_repository\n\t/metadata\n\t/src\n```\n\nThis is how we have organized this template repository. All code goes in `/src`. Descriptions, config files, and pointers to data go in `/metadata`. Pretty simple. The way to describe the project `metadata` is really important, too -- we want tools that can work with all our projects, so we have to define them in the same way. Therefore, we use a shared, standardized way of describing and configuring projects and sample annotation: [the PEP format](http://pepkit.github.io), which uses a `yaml` configuration file and a `csv` sample annotation sheet.\n\nYou should also think carefully about how you name this repository. See section below on [choosing a project name](#choosing-a-project-name).\n\n### 2. Non-version-controlled components: Data organization (raw data and results)\n\nData and Results never get into the repository, so they must be stored elsewhere, in a shared disk area. They should be further divided into two areas due to [the curse of enormity](http://databio.org/posts/curse_of_enormity.html) and different disk requirements for data and results:\n\n1. 
Data is stored in a parent Data folder (either `$DATA` or `$MDATA`), subdivided into subfolders by data source (which are not required to have a one-to-one relationship with a project, as multiple projects may analyze the same data). The subfolder name therefore doesn't necessarily correspond to the project name, but instead, is more descriptive of the data itself.\n\n2. Results are stored in the `$PROCESSED` folder, subdivided into a subfolder for each project. Because there *is* a one-to-one relationship between projects and result folders, **the results subfolder should be named the same as the git repository**, which should be a unique identifier for the project. Within that project subfolder, I like the concept of dividing Results into two parts: pipeline and analysis.  _Pipeline_ results are anything that takes you from raw, unaligned reads, to a processed data type, like RNA levels, methylation calls, or ChIP peaks. _Analysis_ is anything that happens afterwards. While the division is a bit ambiguous and probably not strictly necessary, I find it improves my conceptual approach to the project. Pipelines generally operate on an individual sample basis (or small subsets of samples, like 2-way comparison), and can be run as data is produced; they are generally relatively standardized by the community, with a series of steps primarily using public tools (plus some small internal supplements). Analysis is much more variable and less standardized, combining and comparing many samples to answer a particular question which varies by project. So a `$PROCESSED/project` folder would look something like this:\n```\nshared_data_folder\n\t/results_pipeline\n\t\t/sample1\n\t\t/sample2\n\t/analysis\n\t\t/fig1\n\t\t/fig2\n\t\t/tables\n```\n\n\n## How to start a new project:\n\n(I have a script that automates this in [env](https://github.com/nsheff/env/blob/master/bin/newproject.sh))\n\n1. Select a project name that fits with our naming style. 
See section below on [choosing a project name](#choosing-a-project-name).\n2. Create a new repository on GitHub in our group organization (name `YOUR_PROJECT_NAME`)\n3. Clone the `newproject` repository to establish folder structure (`git clone https://github.com/databio/newproject`)\n4. Rename the local \"newproject\" folder to your new repo name (`mv newproject YOUR_PROJECT_NAME`).\n5. Edit `metadata/project_config.yaml`. For consistency, the `output_dir` should have the same name as the project, so set `metadata.output_dir:` to `${PROCESSED}YOUR_PROJECT_NAME`. \n6. Edit this README.md file to briefly describe the project.\n7. Delete the Makefile (this just lets me purge the history of this template so your new clone starts clean)\n8. Update the remote of your local repo and push changes:\n\n```\ngit remote set-url origin git@github.com:databio/YOUR_PROJECT_NAME.git\ngit push -u origin master\n```\n\n## Choosing a project name\n\nTo keep things simple and predictable, let's keep a correspondence between project name, GitHub repository name, and results folders. This means the project name should **match the output folder name exactly**. This 1:1 correspondence makes it easier to keep track of things and find stuff when we're browsing around on disk, and also to build automated tools to look in results folders. Therefore, picking a name is pretty critical, because it's going to show up everywhere.\n\nLet's also try to keep things consistent with our analysis project names. There's no right way, but there's the way we've been doing it, so we can just stick with that. 
Here are a few example project names:  `ews_patients`, `cphg_atac`, `microtest`, `ratbrain`.\n\nSome style things:\n\n* keep the project name all lowercase\n* use underscores, not camelCase\n* keep it relatively short, but long enough to capture the unique things about the project\n* try to make it future-proof\n* don't worry about it too much\n\n\n## Project management guidelines for keeping GitHub repositories tidy\n\n1. Put all source code in the `src/` folder.\n2. Do not commit any results, data, figures, or binary files of any type into the repository.\n3. Write [good commit messages](https://gist.github.com/nsheff/868f88bdca529e4a32377e8279dae826)\n4. Keep the first line of commit messages [under 50 characters](http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html) (use Details to elaborate)\n5. Consider `--rebase` for git pull to [keep a cleaner history](http://git-scm.com/book/en/v2/Git-Branching-Rebasing)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fexample_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabio%2Fexample_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fexample_analysis/lists"}