https://github.com/getindata/data-pipelines-cli
CLI for data platform
- Host: GitHub
- URL: https://github.com/getindata/data-pipelines-cli
- Owner: getindata
- License: apache-2.0
- Created: 2021-11-17T14:21:25.000Z (over 4 years ago)
- Default Branch: develop
- Last Pushed: 2023-12-08T18:25:16.000Z (over 2 years ago)
- Last Synced: 2025-03-23T22:07:03.644Z (about 1 year ago)
- Language: Python
- Size: 2.11 MB
- Stars: 19
- Watchers: 8
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# data-pipelines-cli
CLI for data platform
## Documentation
Read the full documentation at [https://data-pipelines-cli.readthedocs.io/](https://data-pipelines-cli.readthedocs.io/en/latest/index.html)
## Installation
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install [dp (data-pipelines-cli)](https://pypi.org/project/data-pipelines-cli/):
```bash
# Quote the requirement so shells such as zsh do not expand the square brackets.
pip install "data-pipelines-cli[bigquery,docker,datahub,gcs]"
```
## Usage
First, create a repository with a global configuration file that you or your organization will be using. The repository
should contain a `dp.yml.tmpl` file similar to this:
```yaml
_templates_suffix: ".tmpl"
_envops:
  autoescape: false
  block_end_string: "%]"
  block_start_string: "[%"
  comment_end_string: "#]"
  comment_start_string: "[#"
  keep_trailing_newline: true
  variable_end_string: "]]"
  variable_start_string: "[["
templates:
  my-first-template:
    template_name: my-first-template
    template_path: https://github.com//.git
vars:
  username: [[ YOUR_USERNAME ]]
```
Thanks to [copier](https://copier.readthedocs.io/en/stable/), you can use `tmpl` template syntax to create
easily modifiable configuration templates. Just create a `copier.yml` file next to the `dp.yml.tmpl` one and configure
the template questions (read more in the [copier documentation](https://copier.readthedocs.io/en/stable/configuring/)).
Then, run `dp init`, pointing it at that configuration repository, to initialize **dp**. You can also omit this argument,
in which case **dp** is initialized with an empty config.
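To illustrate what the `[[ ... ]]` delimiters in `dp.yml.tmpl` do, here is a minimal, stdlib-only sketch of variable substitution. It is not how **dp** or copier is implemented: copier renders templates with Jinja2, configured through the `_envops` settings, and supports far more than this. The names `render` and `PLACEHOLDER` and the answer `alice` are assumptions made up for the example.

```python
import re

# Matches a variable reference such as "[[ YOUR_USERNAME ]]".
# Hypothetical helper: only mimics plain variable substitution.
PLACEHOLDER = re.compile(r"\[\[\s*(\w+)\s*\]\]")


def render(template: str, answers: dict) -> str:
    """Replace every [[ name ]] with its answer; raises KeyError for unknown names."""
    return PLACEHOLDER.sub(lambda match: str(answers[match.group(1)]), template)


# The last line of the dp.yml.tmpl example, rendered with a made-up answer.
line = "username: [[ YOUR_USERNAME ]]"
rendered = render(line, {"YOUR_USERNAME": "alice"})
print(rendered)
```

Using `[[` and `]]` instead of Jinja2's default `{{` and `}}` presumably keeps the template from clashing with configuration values that themselves contain `{{ ... }}` expressions, such as the dbt variables shown later in this README.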
### Project creation
You can use `dp create` with a target project path to choose one of the templates added before and create the project in
that directory. You can also give `dp create` an additional argument that points
directly to a template repository. If that argument proves to be the name of a template defined in
**dp**'s config file, `dp create` will choose the template by name instead of trying to download the repository.
`dp template-list` lists all added templates.
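The name-or-repository rule described above can be sketched as a small lookup. This is an illustration of the described behaviour only, not **dp**'s actual code; the function `resolve_template`, the `templates` dictionary layout (modelled on the `templates` section of `dp.yml.tmpl`), and the `example.com` URLs are assumptions.

```python
def resolve_template(argument: str, templates: dict) -> str:
    """Return the repository to use for a template argument.

    If the argument names a configured template, use that template's path;
    otherwise treat the argument itself as a link to a template repository.
    """
    entry = templates.get(argument)
    if entry is not None:
        return entry["template_path"]
    return argument


# Hypothetical configuration with a single named template.
templates = {
    "my-first-template": {
        "template_name": "my-first-template",
        "template_path": "https://example.com/my-first-template.git",
    }
}

by_name = resolve_template("my-first-template", templates)
by_link = resolve_template("https://example.com/other-template.git", templates)
print(by_name)
print(by_link)
```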
### Project update
To update your pipeline project, use `dp update`, pointing it at that project. It will sync your existing project with the updated
template version selected by the `--vcs-ref` option (default `HEAD`).
### Project deployment
`dp deploy` will sync with your bucket provider. The provider will be chosen automatically based on the remote URL.
Usually, it is worth pointing `dp deploy` to a JSON or YAML file with provider-specific data such as access tokens or project
names. E.g., to connect with Google Cloud Storage, one should run:
```bash
echo '{"token": "", "project_name": ""}' > gs_args.json
dp deploy --dags-path "gs://" --blob-args gs_args.json
```
However, in some cases you do not need to do so, e.g. when using `gcloud` with properly set local credentials. In such
a case, you can try to run just the `dp deploy --dags-path "gs://"` command. Please refer to the
[documentation](https://data-pipelines-cli.readthedocs.io/en/latest/usage.html#project-deployment) for more information.
When finished, call `dp clean` to remove compilation-related directories.
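The provider arguments file passed to `--blob-args` can also be generated from a script instead of `echo`, which keeps the token out of your shell history. The sketch below assumes the same two keys as the Google Cloud Storage example above; the helper `write_blob_args` and the environment variable names `GS_TOKEN` and `GS_PROJECT_NAME` are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_blob_args(path, token: str, project_name: str) -> Path:
    """Write the provider-specific arguments as JSON and return the file path."""
    target = Path(path)
    args = {"token": token, "project_name": project_name}
    target.write_text(json.dumps(args), encoding="utf-8")
    return target


# Demo: write into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    args_file = write_blob_args(
        Path(tmp_dir) / "gs_args.json",
        token=os.environ.get("GS_TOKEN", ""),  # hypothetical variable name
        project_name=os.environ.get("GS_PROJECT_NAME", ""),  # hypothetical variable name
    )
    print(args_file.read_text(encoding="utf-8"))
```

The resulting file is then passed to `dp deploy` through `--blob-args`, exactly as in the shell example.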
### Variables
You can put a dictionary of variables to be passed to `dbt` in your `config//dbt.yml` file, following the convention
presented in [the guide at the dbt site](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-variables#defining-variables-in-dbt_projectyml).
E.g., if one of the fields of `config//snowflake.yml` looks like this:
```yaml
schema: "{{ var('snowflake_schema') }}"
```
you should put the following in your `config//dbt.yml` file:
```yaml
vars:
  snowflake_schema: EXAMPLE_SCHEMA
```
and then run `dp run --env` with the name of your environment (or any similar command).
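Conceptually, the `var('snowflake_schema')` call is replaced by the value stored under `vars`. The stdlib-only sketch below mimics that lookup for the example above. It is a simplification: dbt performs the real rendering with its own Jinja context, and `resolve_vars` and `VAR_CALL` are names made up for this illustration.

```python
import re

# Matches a call such as {{ var('snowflake_schema') }} with single-quoted names only.
VAR_CALL = re.compile(r"\{\{\s*var\(\s*'([^']+)'\s*\)\s*\}\}")


def resolve_vars(value: str, variables: dict) -> str:
    """Replace each {{ var('name') }} with variables[name]; raises KeyError if missing."""
    return VAR_CALL.sub(lambda match: str(variables[match.group(1)]), value)


# Values taken from the two YAML snippets above.
variables = {"snowflake_schema": "EXAMPLE_SCHEMA"}
resolved = resolve_vars("{{ var('snowflake_schema') }}", variables)
print(resolved)
```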
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.