https://github.com/vincentclaes/datajob
Build and deploy a serverless data pipeline on AWS with no effort.
aws aws-cdk data-pipeline glue glue-job machine-learning pipeline sagemaker serverless stepfunctions
- Host: GitHub
- URL: https://github.com/vincentclaes/datajob
- Owner: vincentclaes
- License: apache-2.0
- Created: 2020-10-22T19:07:31.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-02-08T04:33:47.000Z (almost 2 years ago)
- Last Synced: 2024-09-18T16:28:50.908Z (4 months ago)
- Topics: aws, aws-cdk, data-pipeline, glue, glue-job, machine-learning, pipeline, sagemaker, serverless, stepfunctions
- Language: Python
- Homepage: https://pypi.org/project/datajob/
- Size: 3.15 MB
- Stars: 110
- Watchers: 6
- Forks: 19
- Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-cdk - Datajob - Build and deploy a serverless data pipeline or machine learning pipeline on AWS with no effort. (High-Level Frameworks / Multi-accounts setup)
README
[![Awesome](https://awesome.re/badge.svg)](https://github.com/kolomied/awesome-cdk#high-level-frameworks)
![logo](./assets/logo.png)
Build and deploy a serverless data pipeline on AWS with no effort.
Our goal is to let developers focus on the business logic; datajob does the rest.
- Deploy code to python shell / pyspark **AWS Glue jobs**.
- Use **AWS Sagemaker** to create ML Models.
- Orchestrate the above jobs using **AWS Stepfunctions** as simply as `task1 >> task2`.
- Let us [know](https://github.com/vincentclaes/datajob/discussions) **what you want to see next**.
:rocket: :new: :rocket:
[Check our new example of an End-to-end Machine Learning Pipeline with Glue, Sagemaker and Stepfunctions](examples/ml_pipeline_end_to_end)
:rocket: :new: :rocket:
# Installation
Datajob can be installed using pip.
Beware that we depend on [aws cdk cli](https://github.com/aws/aws-cdk)!

    pip install datajob
    npm install -g [email protected] # the latest version of datajob depends on this version

# Quickstart
You can find the full example in [examples/data_pipeline_simple](./examples/data_pipeline_simple/).
We have a simple data pipeline composed of [2 glue jobs](./examples/data_pipeline_simple/glue_jobs/) orchestrated sequentially using step functions.
```python
from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

app = core.App()

# The datajob_stack is the instance that will result in a cloudformation stack.
# We inject the datajob_stack object through all the resources that we want to add.
with DataJobStack(scope=app, id="data-pipeline-simple") as datajob_stack:
    # We define 2 glue jobs with the relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py"
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py"
    )

    # We instantiate a step functions workflow and orchestrate the glue jobs.
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        task1 >> task2

app.synth()
```
We add the above code in a file called `datajob_stack.py` in the [root of the project](./examples/data_pipeline_with_packaged_project/).
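
The `task1 >> task2` syntax relies on Python operator overloading. The sketch below is not datajob's implementation; it is a minimal, self-contained illustration of how a `>>` operator and a context manager could record the order of tasks. The `Task` and `Workflow` classes are hypothetical stand-ins for `GlueJob` and `StepfunctionsWorkflow`, and the handling of `...` mirrors the single-task form used further down in this README.

```python
class Workflow:
    """Hypothetical stand-in for StepfunctionsWorkflow: collects task edges."""

    current = None  # the workflow whose `with` block is currently active

    def __init__(self, name):
        self.name = name
        self.edges = []  # (upstream, downstream) pairs, in declaration order

    def __enter__(self):
        Workflow.current = self
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        Workflow.current = None
        return False  # do not swallow exceptions raised inside the block


class Task:
    """Hypothetical stand-in for GlueJob."""

    def __init__(self, name):
        self.name = name

    def __rshift__(self, other):
        # `task1 >> task2` calls task1.__rshift__(task2).
        # `task >> ...` passes Ellipsis, recorded here as "no downstream task".
        downstream = None if other is Ellipsis else other.name
        Workflow.current.edges.append((self.name, downstream))
        return other  # returning `other` allows chaining: a >> b >> c


task1, task2, task3 = Task("task1"), Task("task2"), Task("task3")

with Workflow("workflow") as sfn:
    task1 >> task2 >> task3

with Workflow("single") as single:
    task1 >> ...

print(sfn.edges)     # [('task1', 'task2'), ('task2', 'task3')]
print(single.edges)  # [('task1', None)]
```

Returning the right-hand operand from `__rshift__` is the design choice that makes chains such as `a >> b >> c` work, because Python evaluates `>>` from left to right.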
### Configure CDK
Follow the steps [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) to configure your credentials.

```shell script
export AWS_PROFILE=default
# use the aws cli to get your account number
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text --profile $AWS_PROFILE)
export AWS_DEFAULT_REGION=eu-west-1

# init cdk
cdk bootstrap aws://$AWS_ACCOUNT/$AWS_DEFAULT_REGION
```

### Deploy
Deploy the pipeline using CDK.
```shell
cd examples/data_pipeline_simple
cdk deploy --app "python datajob_stack.py" --require-approval never
```

### Execute
```shell script
datajob execute --state-machine data-pipeline-simple-workflow
```
The terminal will show a link to the step functions page to follow up on your pipeline run.

![sfn](./assets/sfn.png)
### Destroy
```shell script
cdk destroy --app "python datajob_stack.py"
```

# Examples
- [Data pipeline with parallel steps](./examples/data_pipeline_parallel/)
- [Data pipeline for processing big data using PySpark](./examples/data_pipeline_pyspark/)
- [Data pipeline where you package and ship your project as a wheel](./examples/data_pipeline_with_packaged_project/)
- [Machine Learning pipeline where we combine glue jobs with sagemaker](examples/ml_pipeline_end_to_end)

All our examples are in [./examples](./examples)
# Functionality
### Deploy to a stage
Specify a stage to deploy an isolated pipeline.
Typical examples would be `dev` , `prod`, ...
```shell
cdk deploy --app "python datajob_stack.py" --context stage=my-stage
```

### Using datajob's S3 data bucket
Pass the name of the `datajob_stack` data bucket to the arguments of your GlueJob dynamically by referencing
`datajob_stack.context.data_bucket_name`.

```python
import pathlib

from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = str(pathlib.Path(__file__).parent.absolute())
app = core.App()

with DataJobStack(
    scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:
    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )

    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        pyspark_job >> ...
```
You can find this example [here](./examples/data_pipeline_pyspark/glue_job/glue_pyspark_example.py).
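
Inside the job script, the `--source` and `--destination` values have to be read back from the command line. The sketch below uses `argparse` from the standard library as one possible way to do that; it is an assumption that your script parses its arguments this way (PySpark Glue jobs often use `awsglue.utils.getResolvedOptions` instead). The bucket name `my-data-bucket`, the extra flag `--some-runtime-flag` and the function name `parse_job_arguments` are placeholders.

```python
import argparse


def parse_job_arguments(argv):
    """Read the --source and --destination values passed to the job."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True)
    parser.add_argument("--destination", required=True)
    # parse_known_args ignores any extra arguments the runtime may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_job_arguments(
    [
        "--source", "s3://my-data-bucket/raw/iris_dataset.csv",
        "--destination", "s3://my-data-bucket/target/pyspark_job/iris_dataset.parquet",
        "--some-runtime-flag", "value",  # placeholder for an argument we do not declare
    ]
)
print(args.source)
print(args.destination)
```

In a real script you would pass `sys.argv[1:]` instead of a hard-coded list.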
### Deploy files to the datajob's deployment bucket
Specify the path to the folder we would like to include in the deployment bucket.
```python
from aws_cdk import core

from datajob.datajob_stack import DataJobStack

app = core.App()

with DataJobStack(
    scope=app, id="some-stack-name", include_folder="path/to/folder/"
) as datajob_stack:
    ...
```
### Package your project as a wheel and ship it to AWS
You can find the example [here](./examples/data_pipeline_with_packaged_project/)
```python
# We add the path to the project root in the constructor of DataJobStack.
# By specifying project_root, datajob will look for a .whl in
# the dist/ folder in your project_root.
with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:
    ...
```

Package your project using [poetry](https://python-poetry.org/)
```shell
poetry build
cdk deploy --app "python datajob_stack.py"
```

Package your project using [setup.py](./examples/data_pipeline_with_packaged_project)
```shell
python setup.py bdist_wheel
cdk deploy --app "python datajob_stack.py"
```
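
Both commands leave a wheel in the `dist/` folder of the project root, which is where datajob looks for it. As an illustration only (the helper `find_wheel` is hypothetical and not part of datajob, and the wheel file name is made up), a lookup of that kind could be sketched with `pathlib`:

```python
import pathlib
import tempfile


def find_wheel(project_root):
    """Return the most recently modified .whl in <project_root>/dist, or None."""
    dist_dir = pathlib.Path(project_root) / "dist"
    if not dist_dir.is_dir():
        return None
    wheels = sorted(dist_dir.glob("*.whl"), key=lambda path: path.stat().st_mtime)
    return wheels[-1] if wheels else None


# Demonstrate with a throwaway project layout.
with tempfile.TemporaryDirectory() as project_root:
    missing = find_wheel(project_root)  # no dist/ folder yet
    dist_dir = pathlib.Path(project_root) / "dist"
    dist_dir.mkdir()
    (dist_dir / "my_project-0.1.0-py3-none-any.whl").touch()
    wheel = find_wheel(project_root)
    wheel_name = wheel.name if wheel else None

print(missing)
print(wheel_name)
```

If the lookup returns `None`, the project has most likely not been built yet.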
You can also use the datajob cli to run both commands at once:
```shell
# for poetry
datajob deploy --config datajob_stack.py --package poetry

# for setup.py
datajob deploy --config datajob_stack.py --package setuppy
```

### Processing big data using a Glue Pyspark job
```python
import pathlib

from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob

current_dir = str(pathlib.Path(__file__).parent.absolute())
app = core.App()

with DataJobStack(
    scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:
    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )
```
The full example can be found in [examples/data_pipeline_pyspark](examples/data_pipeline_pyspark).

### Orchestrate stepfunctions tasks in parallel
```python
# Task2 comes after task1. Task4 comes after task3.
# Task5 depends on both task2 and task4 to be finished.
# Therefore task1 and task3 can run in parallel,
# as well as task2 and task4.
with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
    task1 >> task2
    task3 >> task4
    task2 >> task5
    task4 >> task5
```
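
To see which tasks in a workflow like this can run side by side, the dependencies can be grouped into waves with the standard library's `graphlib` (Python 3.9+). This only illustrates the ordering; how datajob translates it into Step Functions states is not shown here, and `parallel_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def parallel_waves(edges):
    """Group tasks into waves; tasks in the same wave have no unmet dependencies."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): downstream waits for upstream.
        sorter.add(downstream, upstream)
    sorter.prepare()

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


edges = [
    ("task1", "task2"),
    ("task3", "task4"),
    ("task2", "task5"),
    ("task4", "task5"),
]
waves = parallel_waves(edges)
print(waves)  # [['task1', 'task3'], ['task2', 'task4'], ['task5']]
```

Each inner list holds tasks that do not depend on one another, so they are candidates for parallel execution.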
More can be found in [examples/data_pipeline_parallel](./examples/data_pipeline_parallel)

### Orchestrate 1 stepfunction task
Use the [Ellipsis](https://docs.python.org/dev/library/constants.html#Ellipsis) object to be able to orchestrate 1 job via step functions.
```python
some_task >> ...
```

### Notify in case of error/success
Provide the parameter `notification` in the constructor of a `StepfunctionsWorkflow` object.
This will create an SNS Topic which will be triggered in case of failure or success.
The email address is subscribed to the topic and receives the notifications in its inbox.

```python
with StepfunctionsWorkflow(datajob_stack=datajob_stack,
                           name="workflow",
                           notification="[email protected]") as sfn:
    task1 >> task2
```

You can provide 1 email or a list of emails `["[email protected]", "[email protected]"]`.
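
Accepting either a single address or a list is a common pattern. The helper below is a hypothetical sketch of how such a parameter could be normalized before creating one subscription per address; it is not taken from datajob's source, and the addresses are placeholders.

```python
def normalize_recipients(notification):
    """Return a list of email addresses from a string or an iterable of strings."""
    if notification is None:
        return []
    if isinstance(notification, str):
        return [notification]
    return list(notification)


single = normalize_recipients("first@example.com")
several = normalize_recipients(["first@example.com", "second@example.com"])
print(single)   # ['first@example.com']
print(several)  # ['first@example.com', 'second@example.com']
```

The explicit `str` check matters because a string is itself iterable; without it, `list("first@example.com")` would split the address into characters.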
# Datajob in depth
The `datajob_stack` is the instance that will result in a cloudformation stack.
The path in `project_root` helps `datajob_stack` locate the root of the project where
the `setup.py` or poetry `pyproject.toml` file can be found, as well as the `dist/` folder with the wheel of your project.

```python
import pathlib

from aws_cdk import core

from datajob.datajob_stack import DataJobStack

current_dir = pathlib.Path(__file__).parent.absolute()
app = core.App()

with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:
    ...
```

When __entering the context manager__ of DataJobStack:

A [DataJobContext](./datajob/datajob_stack.py#L48) is initialized
to deploy and run a data pipeline on AWS.
The following resources are created:

1) "data bucket"
    - an S3 bucket that you can use to dump ingested data, intermediate results and the final output.
    - you can access the data bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via `datajob_stack.context.data_bucket`
    - you can access the data bucket name via `datajob_stack.context.data_bucket_name`
2) "deployment bucket"
    - an S3 bucket to deploy code, artifacts, scripts, config, files, ...
    - you can access the deployment bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via `datajob_stack.context.deployment_bucket`
    - you can access the deployment bucket name via `datajob_stack.context.deployment_bucket_name`

When __exiting the context manager__ all the resources of our DataJobStack object are created.
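
The enter/exit behaviour described above can be pictured with a small stand-in class. This is a sketch, not datajob's code: `FakeStack` and its `events` and `resources` attributes are hypothetical, and only the method names `init_datajob_context` and `create_resources` are taken from this README.

```python
class FakeStack:
    """Hypothetical stand-in that mimics the enter/exit flow of DataJobStack."""

    def __init__(self, stack_id):
        self.stack_id = stack_id
        self.events = []     # records what happened, in order
        self.resources = []  # resources registered inside the `with` block

    def init_datajob_context(self):
        self.events.append("context initialized")

    def create_resources(self):
        self.events.append(f"created {len(self.resources)} resource(s)")

    def __enter__(self):
        self.init_datajob_context()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Assumption of this sketch: only create resources when the block
        # finished without an error.
        if exc_type is None:
            self.create_resources()
        return False  # never swallow exceptions


with FakeStack(stack_id="data-pipeline-pkg") as stack:
    stack.resources.append("task1")
    stack.resources.append("task2")

print(stack.events)  # ['context initialized', 'created 2 resource(s)']
```

Deferring resource creation to `__exit__` means everything registered inside the block is known before anything is created.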
We can write the `DataJobStack` example more explicitly, without the context manager:
```python
import pathlib
from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = pathlib.Path(__file__).parent.absolute()
app = core.App()

datajob_stack = DataJobStack(scope=app, id="data-pipeline-pkg", project_root=current_dir)
datajob_stack.init_datajob_context()

task1 = GlueJob(datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py")
task2 = GlueJob(datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py")

with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as step_functions_workflow:
    task1 >> task2

datajob_stack.create_resources()
app.synth()
```

# Ideas
Any suggestions can be shared by starting a [discussion](https://github.com/vincentclaes/datajob/discussions).

These are the ideas we find interesting to implement:
- add a time based trigger to the step functions workflow.
- add an s3 event trigger to the step functions workflow.
- add a lambda that copies data from one s3 location to another.
- version your data pipeline.
- cli command to view the logs / glue jobs / s3 bucket
- implement sagemaker services
    - processing jobs
    - hyperparameter tuning jobs
    - training jobs
- implement lambda
- implement ECS Fargate
- create a serverless UI that follows up on the different pipelines deployed on possibly different AWS accounts using Datajob

> [Feedback](https://github.com/vincentclaes/datajob/discussions) is much appreciated!