https://github.com/databiosphere/encode-ingest
Batch ETL pipeline to mirror ENCODE data into the Jade Data Repository.
- Host: GitHub
- URL: https://github.com/databiosphere/encode-ingest
- Owner: DataBiosphere
- License: BSD-3-Clause
- Created: 2020-01-09T21:15:51.000Z
- Default Branch: master
- Last Pushed: 2023-08-04T15:59:11.000Z
- Language: Scala
- Size: 1.64 MB
- Stars: 3
- Watchers: 6
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# ENCODE Ingest
Batch ETL pipeline to mirror ENCODE data into the Terra Data Repository (TDR). See the [architecture documentation](https://github.com/DataBiosphere/encode-ingest/blob/master/ARCHITECTURE.md) for
further design details.
## Getting Started
Orchestration of the ETL flows in this project is implemented using [Argo Workflows](https://argoproj.github.io/argo-workflows/).
The core extraction and transformation data pipelines are implemented in [Scio](https://spotify.github.io/scio/) on top of Apache Beam.
After cloning the repository, ensure you can compile the code, auto-generate schema classes
and run the test suite from the repository root:
`sbt test`
## Development Process
All development should be done on branches off of the protected `master` branch. After review, merge to `master`
and then follow the instructions in the [monster-deploy repo](https://github.com/broadinstitute/monster-deploy/).
When modifying the Scio data pipelines, it's possible to run the pipeline locally by invoking the relevant pipeline:
* **Extraction:**
`sbt "encode-extraction / runMain org.broadinstitute.monster.encode.extraction.ExtractionPipeline --outputDir="`
* **Transformation:**
`sbt "encode-transformation-pipeline / runMain org.broadinstitute.monster.encode.transformation.TransformationPipeline --inputPrefix= --outputPrefix="`
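As a sketch of how the two invocations above fit together, the snippet below fills in the blank arguments with hypothetical local paths (`/tmp/encode/raw` and `/tmp/encode/transformed` are placeholders, not values required by the pipelines) and prints the fully assembled commands rather than executing them:

```shell
# Hypothetical local paths; substitute your own before running for real.
RAW_DIR=/tmp/encode/raw          # extraction output: raw ENCODE JSON
OUT_DIR=/tmp/encode/transformed  # transformation output: TDR-shaped rows

# Assemble the sbt invocations; the extraction output feeds the transformation input.
extract_cmd="sbt \"encode-extraction / runMain org.broadinstitute.monster.encode.extraction.ExtractionPipeline --outputDir=$RAW_DIR\""
transform_cmd="sbt \"encode-transformation-pipeline / runMain org.broadinstitute.monster.encode.transformation.TransformationPipeline --inputPrefix=$RAW_DIR --outputPrefix=$OUT_DIR\""

echo "$extract_cmd"
echo "$transform_cmd"
```

Note that the transformation's `--inputPrefix` points at the extraction's `--outputDir`, so the two stages can be run back to back locally.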
Development of Argo changes requires deployment to the DEV environment, as documented in the [monster-deploy repo](https://github.com/broadinstitute/monster-deploy/).
## Build and deploy code changes from a branch
#### *Do you have a new MacBook Pro with the M1 chip?*
1. If so, clone this repo: `https://github.com/DataBiosphere/ingest-utils.git`
1. Check out the branch `ah_m1arch`
### Steps in the encode-ingest repo
1. Before you commit changes, make sure the build succeeds (`sbt compile`) and that the tests pass (`sbt test`).
*The build may reformat some of your files. Make sure to do a diff and add any changes to your git staging area.*
1. Commit changes to local branch `git commit -m ""`
1. Create a version tag for the branch `git tag v1.0.`. For example, `v1.0.120`.
*You can run a `git log ` to see the previous version number, or go to the Actions tab for the repo `https://github.com/DataBiosphere/encode-ingest/actions` to see the previous version built. It is very important that the version format is exact: no `.` between the `v` and the `1`.*
1. Now push the branch and tag to the remote `git push origin : v1.0.`
*You can also push your branch to the remote and then separately push the tags with `git push --tags`.*
*This will start two build actions. If a build fails, you can look at the details for errors, or just run a local `sbt compile`. In some cases you may have forgotten to commit formatting changes made automatically by `sbt compile`; just add them to your staging area and start again from step 2 above.*
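The commit-tag-push sequence above can be rehearsed end to end in a throwaway repository before touching the real remote. In this sketch the file name, commit message, and version `v1.0.121` are all hypothetical, and the final push is shown commented out since the throwaway repo has no remote:

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo && cd demo
git config user.email "dev@example.com"
git config user.name "Dev"

echo "change" > pipeline.scala          # stand-in for a real code change
git add pipeline.scala
git commit -qm "Describe the change"    # step 2: commit to the local branch

git tag v1.0.121                        # step 3: exact vMAJOR.MINOR.PATCH format
git tag --list 'v1.0.*'                 # prints: v1.0.121

# step 4: push the branch and tag together (no remote in this sketch)
# git push origin master v1.0.121
```

Running the listing creates a tag that matches the `v1.0.` pattern the build actions expect; the same two commands (`git tag`, `git push`) apply unchanged in the real repository.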