{"id":42060210,"url":"https://github.com/statgen/bravo_data_prep","last_synced_at":"2026-01-26T07:38:54.442Z","repository":{"id":49799711,"uuid":"361024499","full_name":"statgen/bravo_data_prep","owner":"statgen","description":"Work flows and tools to create data that backs the BRAVO API.","archived":false,"fork":false,"pushed_at":"2025-11-18T21:56:15.000Z","size":19439,"stargazers_count":0,"open_issues_count":8,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-11-18T23:18:08.159Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/statgen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-04-23T23:00:19.000Z","updated_at":"2025-09-09T15:29:40.000Z","dependencies_parsed_at":"2023-02-16T22:16:10.913Z","dependency_job_id":"e131df5c-beb4-430d-ae5a-8d4ff9bc5629","html_url":"https://github.com/statgen/bravo_data_prep","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/statgen/bravo_data_prep","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Fbravo_data_prep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Fbravo_data_prep/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Fbravo_data_prep/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/statgen%2Fbravo_data_prep/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/statgen","download_url":"https://codeload.github.com/statgen/bravo_data_prep/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statgen%2Fbravo_data_prep/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28769853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-26T06:37:25.426Z","status":"ssl_error","status_checked_at":"2026-01-26T06:37:23.039Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-26T07:38:51.622Z","updated_at":"2026-01-26T07:38:54.436Z","avatar_url":"https://github.com/statgen.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BRAVO Data Pipeline\nProcessing data to power the BRowse All Variants Online (BRAVO) API\n\n1. Build, download, or install dependencies.\n    1. Compile custom tools\n    1. Install external tools\n    1. Download external data\n1. Collect data to be processed into convenient location.\n1. Modify nextflow configs to match paths on your system or cluster.\n1. 
Run nextflow workflows\n\n## Input Data\n**Naming:** The pipeline depends on the names of the input cram files having the sample ID as the first part of the filename.\nSpecifically, the ID is expected to precede the first `.` so that a call to `getSimpleName()` yields the ID.\n\n### Sequence Data\nSource cram files.  Original sequences from which the variant calls were made.\n\n### Variant calls\nSource bcf files. Generated by running the [topmed variant calling pipeline](https://github.com/statgen/topmed_variant_calling)\n\n## Data Preparation Tools\n\n### Compile Custom Tools\nIn the `tools/` directory you will find tools and scripts to prepare your data for import into the Mongo database and use in the BRAVO browser.\n\n```sh\ncd tools/cpp_tools\ncget install .\n```\nThis builds executables in `tools/cpp_tools/cget/bin`\n\n### External Tools\nThe required BamUtil, VEP, and Loftee tools are described in [dependencies.md](dependencies.md)\n\n### External Data\nThe required Gencode, Ensembl, dbSNP, and HUGO data are described in [basis\\_data.md](basis_data.md)\n\n## Nextflow Scripts\nIn the `workflows/` directory are three Nextflow configs and scripts used to prepare the runtime data for the BRAVO API.\n\nThe steps of the pipeline are detailed in [data\\_prep\\_steps.md](data_prep_steps.md).\n\nThe three nextflow pipelines are:\n1. Prepare VCF Teddy\n2. Sequences\n3. 
Coverage\n\n## Downstream data for BRAVO API\nThe `make_vignette_dir.sh` script consolidates the results from the nextflow scripts into a data directory organized for the BRAVO API.\nIt is designed for small data sets, and should be run after the three data pipelines complete.\n\nThe BRAVO API needs two data sets to run:\n- *Runtime Data* are flat files on disk read at runtime.\n- *Basis Data* are files processed and loaded into MongoDB.\n\n### Downstream data subdirectory notes\n\n```sh\ndata/\n├── cache\n├── coverage\n│   ├── bin_1\n│   ├── bin_25e-2\n│   ├── bin_50e-2\n│   ├── bin_75e-2\n│   └── full\n├── crams\n│   ├── sequences\n│   ├── variant_map.tsv.gz\n│   └── variant_map.tsv.gz.tbi\n└── reference\n    ├── hs38DH.fa\n    └── hs38DH.fa.fai\n```\n\n- `reference/` holds the reference fasta files for the genome\n- API's `SEQUENCE_DIR` config value expects the directory that contains the 'sequences' directory.\n  - sequences dirname is hardcoded\n  - `variant_map.tsv.gz` file name is hardcoded.\n  - `variant_map.tsv.gz.tbi` file name is hardcoded.\n- Under sequences/, directory structure and filenames are prescribed.\n  - All two hex character directories 00 to ff should exist as subdirectories.\n  - cram files must have the filename in the exact form of `sample_id.cram`\n  - The sub dir a cram belongs in is the first two characters of the md5 hexdigest of the sample_id.\n    - E.g. 
foobar123.cram would be in directory \"ae\"\n        ```python\n        import hashlib\n        hashlib.md5(\"foobar123\".encode()).hexdigest()[:2]\n        ```\n    - This dir structure is produced by the nextflow pipeline\n- coverage directory contents are taken from the result/ dir of the coverage workflow\n- `variant_map.tsv.gz` is an output of `RandomHetHom3`\n    \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatgen%2Fbravo_data_prep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstatgen%2Fbravo_data_prep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatgen%2Fbravo_data_prep/lists"}