https://github.com/vincentclaes/datajob

Build and deploy a serverless data pipeline on AWS with no effort.
Host: GitHub
URL: https://github.com/vincentclaes/datajob
Owner: vincentclaes
License: apache-2.0
Created: 2020-10-22T19:07:31.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-02-08T04:33:47.000Z (over 2 years ago)
Last Synced: 2025-03-01T18:18:28.524Z (4 months ago)
Topics: aws, aws-cdk, data-pipeline, glue, glue-job, machine-learning, pipeline, sagemaker, serverless, stepfunctions
Language: Python
Homepage: https://pypi.org/project/datajob/
Size: 3.15 MB
Stars: 111
Watchers: 6
Forks: 19
Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

awesome-cdk - Datajob - Build and deploy a serverless data pipeline or machine learning pipeline on AWS with no effort. (High-Level Frameworks / Multi-accounts setup)

[![Awesome](https://awesome.re/badge.svg)](https://github.com/kolomied/awesome-cdk#high-level-frameworks)

![logo](./assets/logo.png)



Build and deploy a serverless data pipeline on AWS with no effort.

Our goal is to let developers focus on the business logic; datajob does the rest.

- Deploy code to python shell / pyspark **AWS Glue jobs**.

- Use **AWS Sagemaker** to create ML Models.

- Orchestrate the above jobs using **AWS Stepfunctions**, as simply as `task1 >> task2`.

- Let us [know](https://github.com/vincentclaes/datajob/discussions) **what you want to see next**.

 



:rocket: :new: :rocket:

[Check our new example of an End-to-end Machine Learning Pipeline with Glue, Sagemaker and Stepfunctions](examples/ml_pipeline_end_to_end)

:rocket: :new: :rocket:



 

# Installation

 Datajob can be installed using pip. 


 Beware that we depend on [aws cdk cli](https://github.com/aws/aws-cdk)!

    pip install datajob

    npm install -g aws-cdk@<version> # the latest version of datajob depends on a specific version of the aws cdk cli

# Quickstart

You can find the full example in [examples/data_pipeline_simple](./examples/data_pipeline_simple/).

We have a simple data pipeline composed of [2 glue jobs](./examples/data_pipeline_simple/glue_jobs/) orchestrated sequentially using step functions.

```python
from aws_cdk import core
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

app = core.App()

# The datajob_stack is the instance that will result in a cloudformation stack.
# We inject the datajob_stack object through all the resources that we want to add.
with DataJobStack(scope=app, id="data-pipeline-simple") as datajob_stack:

    # We define 2 glue jobs with the relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py"
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py"
    )

    # We instantiate a step functions workflow and orchestrate the glue jobs.
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        task1 >> task2

app.synth()
```

We add the above code in a file called `datajob_stack.py` in the [root of the project](./examples/data_pipeline_with_packaged_project/).
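To give a feel for how a `task1 >> task2` syntax inside a `with` block can work in plain Python, here is a minimal sketch. The `Workflow` and `Task` classes are hypothetical stand-ins written for illustration; they are not datajob's implementation and do not touch AWS.

```python
class Workflow:
    """Hypothetical stand-in for a workflow that records task dependencies."""

    _active = None  # the workflow whose `with` block is currently open

    def __init__(self, name):
        self.name = name
        self.edges = []

    def __enter__(self):
        Workflow._active = self
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        Workflow._active = None
        return False  # never swallow exceptions


class Task:
    """Hypothetical stand-in for a job that can be chained with `>>`."""

    def __init__(self, name):
        self.name = name

    def __rshift__(self, other):
        # Record "self runs before other" on the workflow that is open.
        if Workflow._active is None:
            raise RuntimeError("use `>>` inside a `with Workflow(...)` block")
        Workflow._active.edges.append((self.name, other.name))
        return other  # returning the right-hand task allows a >> b >> c


task1, task2, task3 = Task("task1"), Task("task2"), Task("task3")
with Workflow("workflow") as sfn:
    task1 >> task2 >> task3

print(sfn.edges)  # [('task1', 'task2'), ('task2', 'task3')]
```

Returning the right-hand operand from `__rshift__` is what makes longer chains possible, because Python evaluates `a >> b >> c` from left to right.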

### Configure CDK

Follow the steps [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) to configure your credentials.

```shell script
export AWS_PROFILE=default
# use the aws cli to get your account number
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text --profile $AWS_PROFILE)
export AWS_DEFAULT_REGION=eu-west-1

# init cdk
cdk bootstrap aws://$AWS_ACCOUNT/$AWS_DEFAULT_REGION
```

### Deploy

Deploy the pipeline using CDK.

```shell

cd examples/data_pipeline_simple

cdk deploy --app  "python datajob_stack.py" --require-approval never

```

### Execute

```shell script

datajob execute --state-machine data-pipeline-simple-workflow

```

The terminal will show a link to the step functions page to follow up on your pipeline run.

![sfn](./assets/sfn.png)

### Destroy

```shell script

cdk destroy --app  "python datajob_stack.py"

```

# Examples

- [Data pipeline with parallel steps](./examples/data_pipeline_parallel/)

- [Data pipeline for processing big data using PySpark](./examples/data_pipeline_pyspark/)

- [Data pipeline where you package and ship your project as a wheel](./examples/data_pipeline_with_packaged_project/)

- [Machine Learning pipeline where we combine glue jobs with sagemaker](examples/ml_pipeline_end_to_end)

All our examples are in [./examples](./examples)

# Functionality

### Deploy to a stage

Specify a stage to deploy an isolated pipeline.
Typical examples would be `dev`, `prod`, ...

```shell

cdk deploy --app "python datajob_stack.py" --context stage=my-stage

```
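One way a stage can keep pipelines isolated is by making it part of every resource name, so that a `dev` and a `prod` deployment never collide. The sketch below illustrates that idea only; the `unique_name` helper and the exact naming scheme are assumptions, not datajob's documented behaviour.

```python
from typing import Optional


def unique_name(stack_id: str, component: str, stage: Optional[str] = None) -> str:
    """Hypothetical helper: join the stack id, optional stage and component name."""
    parts = [stack_id, stage, component]
    # Skip the stage when none is given, so unstaged names stay unchanged.
    return "-".join(part for part in parts if part)


print(unique_name("data-pipeline-simple", "workflow"))
print(unique_name("data-pipeline-simple", "workflow", stage="dev"))
```

Without a stage, this yields `data-pipeline-simple-workflow`, the state machine name used in the Quickstart; with `stage="dev"` the name becomes distinct.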

### Using datajob's S3 data bucket

Pass the name of the `datajob_stack` data bucket to the arguments of your GlueJob by referencing
`datajob_stack.context.data_bucket_name`.

```python
import pathlib

from aws_cdk import core
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = str(pathlib.Path(__file__).parent.absolute())

app = core.App()

with DataJobStack(
        scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:

    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )

    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        pyspark_job >> ...
```

You can find this example [here](./examples/data_pipeline_pyspark/glue_job/glue_pyspark_example.py).
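On the job side, the script has to read the `--source` and `--destination` arguments back. Inside a real Glue job this is usually done with `getResolvedOptions` from `awsglue.utils`; the sketch below uses the standard library's `argparse` instead so that it runs anywhere. The `parse_job_arguments` helper and the bucket name `my-data-bucket` are made up for illustration.

```python
import argparse


def parse_job_arguments(argv):
    """Hypothetical helper: pick --source and --destination out of the job arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True)
    parser.add_argument("--destination", required=True)
    # The job may receive extra arguments it does not know about, so ignore those.
    args, _unknown = parser.parse_known_args(argv)
    return args


example_argv = [
    "--source", "s3://my-data-bucket/raw/iris_dataset.csv",
    "--destination", "s3://my-data-bucket/target/pyspark_job/iris_dataset.parquet",
    "--JOB_NAME", "pyspark-job",
]
args = parse_job_arguments(example_argv)
print(args.source)
print(args.destination)
```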

### Deploy files to the datajob's deployment bucket

Specify the path to the folder we would like to include in the deployment bucket.

```python
from aws_cdk import core
from datajob.datajob_stack import DataJobStack

app = core.App()

with DataJobStack(
    scope=app, id="some-stack-name", include_folder="path/to/folder/"
) as datajob_stack:

    ...
```

### Package your project as a wheel and ship it to AWS

You can find the example [here](./examples/data_pipeline_with_packaged_project/)

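```python
import pathlib
from aws_cdk import core
from datajob.datajob_stack import DataJobStack

current_dir = str(pathlib.Path(__file__).parent.absolute())
app = core.App()

# We add the path to the project root in the constructor of DataJobStack.
# By specifying project_root, datajob will look for a .whl in
# the dist/ folder in your project_root.
with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:
    ...
```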

Package your project using [poetry](https://python-poetry.org/)

```shell

poetry build

cdk deploy --app "python datajob_stack.py"

```

Package your project using [setup.py](./examples/data_pipeline_with_packaged_project)

```shell

python setup.py bdist_wheel

cdk deploy --app "python datajob_stack.py"

```

You can also use the datajob cli to run both commands at once:

```shell
# for poetry
datajob deploy --config datajob_stack.py --package poetry

# for setup.py
datajob deploy --config datajob_stack.py --package setuppy
```
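Since the wheel is expected in the `dist/` folder under `project_root`, it can help to check that the build step actually produced one before deploying. The following is a small sketch of such a check; `find_wheel` is a hypothetical helper and does not reflect how datajob locates the wheel internally.

```python
import pathlib
import tempfile


def find_wheel(project_root):
    """Hypothetical helper: return the path of a wheel in <project_root>/dist."""
    dist_dir = pathlib.Path(project_root) / "dist"
    wheels = sorted(dist_dir.glob("*.whl"))
    if not wheels:
        raise FileNotFoundError(f"no .whl found in {dist_dir}; build the project first")
    return wheels[-1]  # last one in alphabetical order


# Demonstrate the helper on a throwaway project layout.
with tempfile.TemporaryDirectory() as project_root:
    dist_dir = pathlib.Path(project_root) / "dist"
    dist_dir.mkdir()
    (dist_dir / "my_project-0.1.0-py3-none-any.whl").touch()
    wheel_name = find_wheel(project_root).name

print(wheel_name)
```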

### Processing big data using a Glue Pyspark job

```python
import pathlib

from aws_cdk import core
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob

current_dir = str(pathlib.Path(__file__).parent.absolute())

app = core.App()

with DataJobStack(
        scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:

    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )
```

The full example can be found in [examples/data_pipeline_pyspark](examples/data_pipeline_pyspark).

### Orchestrate stepfunctions tasks in parallel

```python
# Task2 comes after task1. Task4 comes after task3.
# Task5 depends on both task2 and task4 to be finished.
# Therefore task1 and task3 can run in parallel,
# as well as task2 and task4.
with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
    task1 >> task2
    task3 >> task4
    task2 >> task5
    task4 >> task5
```

More can be found in [examples/data_pipeline_parallel](./examples/data_pipeline_parallel)
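One way to see which tasks can run side by side is to group them into levels with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) on the four dependencies from the example; the `parallel_levels` helper is hypothetical and is not taken from datajob's code base.

```python
from graphlib import TopologicalSorter


def parallel_levels(edges):
    """Group task names into levels; tasks within one level have no mutual dependency."""
    predecessors = {}
    for upstream, downstream in edges:
        predecessors.setdefault(upstream, set())
        predecessors.setdefault(downstream, set()).add(upstream)

    sorter = TopologicalSorter(predecessors)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        levels.append(ready)
        sorter.done(*ready)
    return levels


edges = [
    ("task1", "task2"),
    ("task3", "task4"),
    ("task2", "task5"),
    ("task4", "task5"),
]
levels = parallel_levels(edges)
print(levels)  # [['task1', 'task3'], ['task2', 'task4'], ['task5']]
```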

### Orchestrate 1 stepfunction task

Use the [Ellipsis](https://docs.python.org/dev/library/constants.html#Ellipsis) object to orchestrate a single job via step functions.

```python

some_task >> ...

```
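A plausible reason for this syntax is that a bare `some_task` statement would never invoke the `>>` operator, so the workflow would have no hook to notice the task. The sketch below shows how a `__rshift__` method can accept `...` as a right-hand operand; the `Task` class is a hypothetical stand-in, not datajob's implementation.

```python
class Task:
    """Hypothetical stand-in for a task that supports `task >> ...`."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        if other is Ellipsis:
            # Nothing follows this task; the call itself is the registration hook.
            return self
        self.downstream.append(other)
        return other


some_task = Task("some_task")
result = some_task >> ...

print(result is some_task, some_task.downstream)  # True []
```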

### Notify in case of error/success

Provide the parameter `notification` in the constructor of a `StepfunctionsWorkflow` object.
This will create an SNS Topic that is triggered when the workflow fails or succeeds.
The email address is subscribed to the topic and receives the notification in its inbox.

```python
with StepfunctionsWorkflow(datajob_stack=datajob_stack,
                           name="workflow",
                           notification="<your-email-address>") as sfn:
    task1 >> task2
```

You can provide 1 email or a list of emails `["<email-address-1>", "<email-address-2>"]`.
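A parameter that accepts either a single value or a list is commonly normalized to a list before use. The sketch below shows that pattern; `normalize_notification` and the example.com addresses are hypothetical and only illustrate the idea.

```python
from typing import List, Union


def normalize_notification(notification: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: always return a list of email addresses."""
    if isinstance(notification, str):
        return [notification]
    return list(notification)


print(normalize_notification("alice@example.com"))
print(normalize_notification(["alice@example.com", "bob@example.com"]))
```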

# Datajob in depth

The `datajob_stack` is the instance that will result in a cloudformation stack.
The path in `project_root` helps `datajob_stack` locate the root of the project where
the setup.py/poetry pyproject.toml file can be found, as well as the `dist/` folder with the wheel of your project.

```python
import pathlib
from aws_cdk import core
from datajob.datajob_stack import DataJobStack

current_dir = pathlib.Path(__file__).parent.absolute()
app = core.App()

with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:

    ...
```

When __entering the context manager__ of DataJobStack:

A [DataJobContext](./datajob/datajob_stack.py#L48) is initialized
to deploy and run a data pipeline on AWS.

The following resources are created:

1) "data bucket"

    - an S3 bucket that you can use to dump ingested data, dump intermediate results and the final output.

    - you can access the data bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via ```datajob_stack.context.data_bucket```

    - you can access the data bucket name via ```datajob_stack.context.data_bucket_name```

2) "deployment bucket"

   - an s3 bucket to deploy code, artifacts, scripts, config, files, ...

   - you can access the deployment bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via ```datajob_stack.context.deployment_bucket```

   - you can access the deployment bucket name via ```datajob_stack.context.deployment_bucket_name```

When __exiting the context manager__, all the resources of our DataJobStack object are created.

We can write the above example more explicitly...

```python
import pathlib
from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = pathlib.Path(__file__).parent.absolute()

app = core.App()

datajob_stack = DataJobStack(scope=app, id="data-pipeline-pkg", project_root=current_dir)
datajob_stack.init_datajob_context()

task1 = GlueJob(datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py")
task2 = GlueJob(datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py")

with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as step_functions_workflow:
    task1 >> task2

datajob_stack.create_resources()
app.synth()
```
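The relation between the `with` form and the explicit calls can be illustrated with a small, AWS-free sketch. The `Stack` class below is a hypothetical stand-in: it mirrors the idea of initializing a context on enter and creating resources on exit, and skipping creation after an exception is an assumption made here, not documented datajob behaviour.

```python
class Stack:
    """Hypothetical stand-in showing deferred resource creation."""

    def __init__(self, name):
        self.name = name
        self.resources = []  # resources registered while the stack is open
        self.events = []     # log of what happened, for comparison

    def init_context(self):
        self.events.append("context initialized")

    def create_resources(self):
        self.events.append(f"created {len(self.resources)} resource(s)")

    def __enter__(self):
        self.init_context()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Assumption: only create resources when the block finished cleanly.
        if exc_type is None:
            self.create_resources()
        return False


# Implicit form: the context manager makes both calls for us.
with Stack("implicit") as implicit:
    implicit.resources.append("task1")

# Explicit form: the same two calls written out by hand.
explicit = Stack("explicit")
explicit.init_context()
explicit.resources.append("task1")
explicit.create_resources()

print(implicit.events == explicit.events)  # True
```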

# Ideas

Any suggestions can be shared by starting a [discussion](https://github.com/vincentclaes/datajob/discussions).

These are the ideas we find interesting to implement:

- add a time based trigger to the step functions workflow.

- add an s3 event trigger to the step functions workflow.

- add a lambda that copies data from one s3 location to another.

- version your data pipeline.

- cli command to view the logs / glue jobs / s3 bucket

- implement sagemaker services

    - processing jobs

    - hyperparameter tuning jobs

    - training jobs

- implement lambda

- implement ECS Fargate

- create a serverless UI that follows up on the different pipelines deployed on possibly different AWS accounts using Datajob

> [Feedback](https://github.com/vincentclaes/datajob/discussions) is much appreciated!