{"id":21514880,"url":"https://github.com/getindata/first-steps-with-data-pipelines","last_synced_at":"2025-03-17T16:15:00.168Z","repository":{"id":37738498,"uuid":"499454415","full_name":"getindata/first-steps-with-data-pipelines","owner":"getindata","description":null,"archived":false,"fork":false,"pushed_at":"2023-09-08T13:52:54.000Z","size":3448,"stargazers_count":9,"open_issues_count":0,"forks_count":3,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-01-24T02:30:33.822Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/getindata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-03T09:32:44.000Z","updated_at":"2023-02-01T04:30:15.000Z","dependencies_parsed_at":"2025-01-24T02:39:46.343Z","dependency_job_id":null,"html_url":"https://github.com/getindata/first-steps-with-data-pipelines","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Ffirst-steps-with-data-pipelines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Ffirst-steps-with-data-pipelines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Ffirst-steps-with-data-pipelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Ffirst-steps-with-data-pipelines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/getindata","download_url":"https://codeload.github.com/getindata/first-steps-with-data-pipelines/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244066191,"owners_count":20392407,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T23:53:25.729Z","updated_at":"2025-03-17T16:15:00.146Z","avatar_url":"https://github.com/getindata.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# First steps with Data Pipelines\n\n## Description\n\nThis is an example of a simple [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) project template with a small amount of data and a simple pipeline and test scenario.\nYou can learn the basics of how to work with the tool. Below we will describe the steps of how to use [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) tool for creating new projects using an existing project template and how to\nuse the tool for running simple data pipelines on [GCP BigQuery](https://cloud.google.com/bigquery).\nThis project can be used as a project template for [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) tool.\nHopefully thanks to this you will make your first steps with [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) tool faster.\n\nThis is an example of a simple [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) (DP) project. 
If you are looking for a more advanced project, a project with many pipelines, tables and\nviews, tests and seeds you can find it [here](https://github.com/getindata/tpc-h-data-pipelines-demo.git).\n\n## Prerequisites\n- A project will be run on local machine\n   and the results of our pipelines will be stored on [GCP BigQuery](https://cloud.google.com/bigquery) connected with your project\n- Access to GCP account and projects via CLI \n- Some experience with a command line\n- Basic understanding of SQL\n\n## Data used\nFor the purpose of this simple project demo we will use the data from 2 CSV files that are placed in the seeds folder.\nNo other data is being used. Data in both of the CSV files was generated.\n\n## First steps with Data Pipelines\n\n### 1. Environment preparation\n\nHere we will explain how to make it possible to run [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html). Let's first clone this repo to your machine and then install Data Pipelines CLI:\n```\npip install data-pipelines-cli[\u003cflags\u003e]\n```\nDepending on the systems that you want to integrate with you need to provide different flags in square brackets. For the purpose of our project we will use:\n```\npip install data-pipelines-cli[gcs,git,bigquery]\n```\n\nIf you want to get more information about installation of Data Pipelines CLI follow the documentation - [Data Pipelines Documentation](https://data-pipelines-cli.readthedocs.io/en/latest/installation.html)\n\n### 2. Initialization of Data Pipelines tool\n\nWe expect that the whole organization will be using the same [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) initialization project that specifies which\ntemplates (DP projects) they are using. Initialization makes it possible to set up some dp variables that can be used\nacross the whole company. 
We pass the path of the repository that defines the variables we want dp to use.\nWe can have many initialization repos depending on what we want to do with our project.\n\nHere is how you can initialize dp:\n```\ndp init \u003cpath to init repo\u003e\n```\n\nIf this is your first DP project and you do not have your own templates of projects then\nhere is an example of a publicly available DP init repository that you can use:\n```\ndp init https://github.com/getindata/data-pipelines-cli-init-example\n```\n\nFor the purpose of this demo the only variable we will be asked for is going to be\n```username``` which is used in many dp commands. We can specify our username as shown below:\n\n![](images/project_init_username_specification.png)\n\nYou can add more options to the dp.yml file with other templates of projects to choose from. Specify their template_names\nand the template_paths to git repositories. You can also specify more vars for use in your projects.\nThe example initialization asks for the name of the user. This name will be used later in other operations, but\nyou typically have to run the init command only once.\n\n### 3. Creating our own project\n\nAfter the initialization is complete we can start using DP. Now we will ```create``` a project using a project template.\nThe ```dp create``` command can look like this:\n\n```\ndp create \u003cproject path\u003e \u003ctemplate path\u003e \n```\n\n```project path``` is the folder in which the new project will be created. Usually this is just a directory name.\n```template path``` is a path of a template to use for creating a new project. This parameter can be skipped - then\nwe are able to choose one template of a project from a list specified in the ```.dp.yml``` file.\n\nFor the purpose of this demo, we will use a template already specified in the `.dp.yml` file. 
After executing this command:\n```\ndp create our-simple-project\n```\nwe should be able to choose a template that we want to use from a list. \n\n![](images/project_creation_template_specification.png)\n\nWe can switch options by pressing the up and down keys and confirm the choice by pressing enter. For this demo we are going\nto use the ```first-steps-with-data-pipelines``` template, which is actually the project that you are reading right now.\n\nAfter pressing enter, we will be asked some questions about which template to use for a new project, the name of the project,\nthe name of the GCP project that we are working on, the cron that specifies at what times the DP pipeline should run and a\ndescription of the created project. Answer these questions. Be aware that the name of the DP project should be composed of alphanumeric characters and the `_` character.\n\n![](images/project_creation_options_filled_in.png)\n\nAfter answering these questions [Copier](https://copier.readthedocs.io/en/stable/) will be used to create the contents of our project using the specified project template.\nGood job! The project should have been created successfully.\n\nNow let's enter the project folder.\n\n```\ncd our-simple-project\n```\n\n### 4. Config files in config directory\n\nIn the ```config``` directory you can find some environment configuration files. These files were generated from the project template that we used.\nWhen you want to use [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) in the future, you will be able to specify the configuration that is suitable for your project.\n\n![](images/configuration_files.png)\n\nFor the purpose of this demo you do not have to worry about making changes in these files. We can use the default configuration.\n\n### 5. 
Running pipelines and tests using Data Pipelines tool\n\nThis project consists of:\n- 2 ```seeds```\n- 1 ```model```\n- 1 ```test```\n\nTo understand more about ```models```, ```tests``` and ```seeds``` please read about them at the\n[dbt Documentation](https://docs.getdbt.com/docs/building-a-dbt-project/documentation).\n\n#### 5.1 Executing seeds\n\nWhen we have our environment ready and the project has been created, the first thing we should do is to execute the ```seeds```.\nIn this repository there are 2 CSV files specified that contain some data. dbt will use these 2 files as ```seeds```.\nAfter running the seed command, tables with the contents of the CSV files will be created in a BigQuery dataset.\nThe name of the dataset we use is created using the ```username``` value that we provided in the initialization step.\nMake sure that you are in the project folder and execute the ```seeds``` with this command:\n\n```\ndp seed\n```\n\nIf we execute the ```seeds``` more than once then the contents of the tables will be replaced with the same values.\nUnless we change the contents of the CSV files there will be no change. This is why we usually only have to run the command once, at the beginning of our work.\n\nHere is an example of what the output of this command can look like based on the contents of this repository.\n\n![](images/simple_output_seed.png)\n\nWhen the process is finished let's check the contents of our BigQuery dataset.\nBelow is a picture that presents the contents of 2 tables generated in BigQuery based on the 2 ```seed``` CSV files:\n\n![](images/simple_bigquery_seed.png)\n\n\n#### 5.2 Executing models\n\nThe contents of the tables that were created in the ```Executing seeds``` step can be later used in some of the models that we specify.\nNow we should be ready to run our models. 
In the models folder of our template we have specified 1 model that uses the 2 ```seed``` tables.\n\nExecute the command.\n\n```\ndp run\n```\n\n![](images/simple_run_output.png)\n\nThis process will look at the contents of the models directory and create corresponding tables or views in our BigQuery Dataset:\n\n![](images/simple_run_bigquery.png)\n\n#### 5.3 Executing tests\n\nNow that all the tables and views are created, we can also check whether the models work as intended by running the tests.\nWe can have tests that check if the logic behind a query works as intended for a set of data. Let's run the tests.\n\n```\ndp test\n```\n\n![](images/simple_test_output.png)\n\nWe should be able to see the summary, which shows whether everything with our models is fine and there are no errors.\n\n### Next steps\nIf you are interested in more advanced use of [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) you can check\n[this repository](https://github.com/getindata/tpc-h-data-pipelines-demo.git). By familiarizing yourself with this resource, you will get \n a better understanding of what [Data Pipelines](https://data-pipelines-cli.readthedocs.io/en/latest/index.html) could look like in your production project. 
Remember to push your work to your [Git](https://git-scm.com/doc) repository before you stop if you want to continue in the future.\n\n## Resources\n\n- More about [data-pipelines-cli](https://data-pipelines-cli.readthedocs.io/en/latest/usage.html#)\n- More about dbt [in the docs](https://docs.getdbt.com/docs/introduction)\n- [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers about `dbt`\n- Rendering project templates with [Copier](https://copier.readthedocs.io/en/stable/) \n- Data pipelines orchestration with [Airflow](https://airflow.apache.org/) ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Ffirst-steps-with-data-pipelines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetindata%2Ffirst-steps-with-data-pipelines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Ffirst-steps-with-data-pipelines/lists"}