{"id":13505624,"url":"https://github.com/vincentclaes/datajob","last_synced_at":"2026-03-17T16:34:47.873Z","repository":{"id":39586642,"uuid":"306434736","full_name":"vincentclaes/datajob","owner":"vincentclaes","description":"Build and deploy a serverless data pipeline on AWS with no effort.","archived":false,"fork":false,"pushed_at":"2023-02-08T04:33:47.000Z","size":3307,"stargazers_count":111,"open_issues_count":19,"forks_count":19,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-01-08T09:15:50.335Z","etag":null,"topics":["aws","aws-cdk","data-pipeline","glue","glue-job","machine-learning","pipeline","sagemaker","serverless","stepfunctions"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/datajob/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vincentclaes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-10-22T19:07:31.000Z","updated_at":"2025-07-10T11:11:54.000Z","dependencies_parsed_at":"2024-01-06T10:13:05.059Z","dependency_job_id":"c85810f4-fa7c-4973-9992-ee2ad6a43ea7","html_url":"https://github.com/vincentclaes/datajob","commit_stats":{"total_commits":322,"total_committers":8,"mean_commits":40.25,"dds":0.06521739130434778,"last_synced_commit":"48bfea3c752467d78af74025f232ef1ec5e7d3c3"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/vincentclaes/datajob","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentclaes%2Fdatajob","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentclaes%2Fdatajob/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentclaes%2Fdatajob/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentclaes%2Fdatajob/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vincentclaes","download_url":"https://codeload.github.com/vincentclaes/datajob/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentclaes%2Fdatajob/sbom","scorecard":{"id":922182,"data":{"date":"2025-08-11","repo":{"name":"github.com/vincentclaes/datajob","commit":"48bfea3c752467d78af74025f232ef1ec5e7d3c3"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.5,"checks":[{"name":"Code-Review","score":0,"reason":"Found 1/13 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/pr.yml:1","Info: no 
jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/pr.yml:22: update your workflow using https://app.stepsecurity.io/secureworkflow/vincentclaes/datajob/pr.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/pr.yml:25: update your workflow using https://app.stepsecurity.io/secureworkflow/vincentclaes/datajob/pr.yml/main?enable=pin","Warn: containerImage not pinned by hash: .devcontainer/devcontainer.Dockerfile:8","Warn: pipCommand not pinned by 
hash: .devcontainer/devcontainer.Dockerfile:59-85","Warn: downloadThenRun not pinned by hash: .devcontainer/devcontainer.Dockerfile:105","Warn: npmCommand not pinned by hash: .devcontainer/devcontainer.Dockerfile:112","Info:   0 out of   2 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 containerImage dependencies pinned","Info:   0 out of   1 pipCommand dependencies pinned","Info:   0 out of   1 downloadThenRun dependencies pinned","Info:   0 out of   1 npmCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the 
project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 25 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":0,"reason":"45 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2024-230 / GHSA-248v-346w-9cwc","Warn: Project is vulnerable to: PYSEC-2022-42986 / GHSA-43fp-rhv2-5gv8","Warn: Project is vulnerable to: PYSEC-2023-135 / GHSA-xqr8-7jwr-rhp7","Warn: Project is vulnerable to: GHSA-3ww4-gg4f-jr7f","Warn: Project is vulnerable to: GHSA-5cpq-8wj7-hf2v","Warn: Project is vulnerable to: PYSEC-2024-225 / GHSA-6vqw-3v5j-54x4","Warn: Project is vulnerable to: GHSA-9v9h-cgj8-h64p","Warn: Project is vulnerable to: GHSA-h4gh-qq45-vh27","Warn: Project is vulnerable to: PYSEC-2023-254 / GHSA-jfhm-5ghh-2f97","Warn: Project is vulnerable to: GHSA-jm77-qphf-c4w8","Warn: Project is vulnerable to: GHSA-v8gr-m533-ghj9","Warn: Project is vulnerable to: GHSA-w7pp-m8wf-vj6r","Warn: Project is vulnerable to: GHSA-x4qr-2fvf-3mr5","Warn: Project is vulnerable to: GHSA-wj6h-64fc-37mp","Warn: Project is vulnerable to: PYSEC-2024-60 / GHSA-jjg7-2v4v-x38h","Warn: Project is vulnerable 
to: GHSA-cpwx-vrp4-4pq7","Warn: Project is vulnerable to: GHSA-h5c8-rqwp-cp95","Warn: Project is vulnerable to: GHSA-h75v-3vvj-5mfj","Warn: Project is vulnerable to: GHSA-q2x7-8rv6-6q7h","Warn: Project is vulnerable to: GHSA-45x7-px36-x8w8","Warn: Project is vulnerable to: GHSA-8qvm-5x2c-j2w7","Warn: Project is vulnerable to: PYSEC-2022-42969","Warn: Project is vulnerable to: PYSEC-2023-117 / GHSA-mrwq-x4v8-fh7p","Warn: Project is vulnerable to: PYSEC-2024-232 / GHSA-6c5p-j8vq-pqhj","Warn: Project is vulnerable to: PYSEC-2024-233 / GHSA-cjwg-qfpm-7377","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2023-74 / GHSA-j8r2-6x86-q33q","Warn: Project is vulnerable to: GHSA-32g6-mg92-ghm2","Warn: Project is vulnerable to: GHSA-7pc3-pr3q-58vg","Warn: Project is vulnerable to: GHSA-wjvx-jhpj-r54r","Warn: Project is vulnerable to: PYSEC-2025-49 / GHSA-5rjg-fvgr-3xxf","Warn: Project is vulnerable to: GHSA-cx63-2mw6-8hw5","Warn: Project is vulnerable to: GHSA-34jh-p97f-mpxf","Warn: Project is vulnerable to: PYSEC-2023-212 / GHSA-g4mx-q9vg-27p4","Warn: Project is vulnerable to: GHSA-pq67-6m6q-mj2v","Warn: Project is vulnerable to: PYSEC-2023-192 / GHSA-v845-jxx5-vc9f","Warn: Project is vulnerable to: PYSEC-2024-187 / GHSA-rqc4-2hc7-8c8v","Warn: Project is vulnerable to: GHSA-2g68-c3qc-8985","Warn: Project is vulnerable to: GHSA-f9vj-2wh5-fj8j","Warn: Project is vulnerable to: PYSEC-2023-221 / GHSA-hrfv-mqp8-q5rw","Warn: Project is vulnerable to: PYSEC-2023-57 / GHSA-px8h-6qxv-m22q","Warn: Project is vulnerable to: GHSA-q34m-jh98-gwm2","Warn: Project is vulnerable to: PYSEC-2023-58 / GHSA-xg9f-g7g7-2323","Warn: Project is vulnerable to: GHSA-jfmj-5v4g-7637"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-25T05:18:25.275Z","repository_id":39586642,"created_at":"2025-08-25T05:18:25.275Z","updated_at":"2025-08-25T05:18:25.275Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30627186,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-17T14:16:03.965Z","status":"ssl_error","status_checked_at":"2026-03-17T14:16:03.380Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-cdk","data-pipeline","glue","glue-job","machine-learning","pipeline","sagemaker","serverless","stepfunctions"],"created_at":"2024-08-01T00:01:10.776Z","updated_at":"2026-03-17T16:34:47.811Z","avatar_url":"https://github.com/vincentclaes.png","language":"Python","funding_links":[],"categories":["High-Level Frameworks"],"sub_categories":["Multi-accounts setup"],"readme":"[![Awesome](https://awesome.re/badge.svg)](https://github.com/kolomied/awesome-cdk#high-level-frameworks)\n![logo](./assets/logo.png)\n\n\u003cdiv align=\"center\"\u003e\n \u003cb\u003eBuild and deploy a serverless data pipeline on AWS with no effort.\u003c/b\u003e\u003c/br\u003e\n \u003ci\u003eOur goal is to let developers think about the business logic, datajob does the rest...\u003c/i\u003e\n \u003c/br\u003e\n \u003c/br\u003e\n 
\u003c/br\u003e\n\u003c/div\u003e\n\n \u003c/br\u003e\n\n- Deploy code to python shell / pyspark **AWS Glue jobs**.\n- Use **AWS Sagemaker** to create ML Models.\n- Orchestrate the above jobs using **AWS Stepfunctions** as simply as `task1 \u003e\u003e task2`\n- Let us [know](https://github.com/vincentclaes/datajob/discussions) **what you want to see next**.\n\n \u003c/br\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n :rocket: :new: :rocket:\n \u003c/br\u003e\n\u003c/br\u003e\n[Check our new example of an End-to-end Machine Learning Pipeline with Glue, Sagemaker and Stepfunctions](examples/ml_pipeline_end_to_end)\n\u003c/br\u003e\n\u003c/br\u003e\n:rocket: :new: :rocket:\n\n\u003c/br\u003e\u003c/br\u003e\n\n\u003c/div\u003e\n\n \u003c/br\u003e\n\n# Installation\n\n Datajob can be installed using pip. \u003cbr/\u003e\n Beware that we depend on [aws cdk cli](https://github.com/aws/aws-cdk)!\n\n    pip install datajob\n    npm install -g aws-cdk@1.109.0 # latest version of datajob depends on this version\n\n# Quickstart\n\nYou can find the full example in [examples/data_pipeline_simple](./examples/data_pipeline_simple/).\n\nWe have a simple data pipeline composed of [2 glue jobs](./examples/data_pipeline_simple/glue_jobs/) orchestrated sequentially using step functions.\n\n```python\nfrom aws_cdk import core\n\nfrom datajob.datajob_stack import DataJobStack\nfrom datajob.glue.glue_job import GlueJob\nfrom datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow\n\napp = core.App()\n\n# The datajob_stack is the instance that will result in a cloudformation stack.\n# We inject the datajob_stack object through all the resources that we want to add.\nwith DataJobStack(scope=app, id=\"data-pipeline-simple\") as datajob_stack:\n    # We define 2 glue jobs with the relative path to the source code.\n    task1 = GlueJob(\n        datajob_stack=datajob_stack, name=\"task1\", job_path=\"glue_jobs/task.py\"\n    )\n    task2 = GlueJob(\n        
datajob_stack=datajob_stack, name=\"task2\", job_path=\"glue_jobs/task2.py\"\n    )\n\n    # We instantiate a step functions workflow and orchestrate the glue jobs.\n    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name=\"workflow\") as sfn:\n        task1 \u003e\u003e task2\n\napp.synth()\n\n```\n\nWe add the above code in a file called `datajob_stack.py` in the [root of the project](./examples/data_pipeline_with_packaged_project/).\n\n\n### Configure CDK\nFollow the steps [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) to configure your credentials.\n\n```shell script\nexport AWS_PROFILE=default\n# use the aws cli to get your account number\nexport AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text --profile $AWS_PROFILE)\nexport AWS_DEFAULT_REGION=eu-west-1\n\n# init cdk\ncdk bootstrap aws://$AWS_ACCOUNT/$AWS_DEFAULT_REGION\n```\n\n### Deploy\n\nDeploy the pipeline using CDK.\n\n```shell\ncd examples/data_pipeline_simple\ncdk deploy --app  \"python datajob_stack.py\" --require-approval never\n```\n\n### Execute\n\n```shell script\ndatajob execute --state-machine data-pipeline-simple-workflow\n```\nThe terminal will show a link to the step functions page to follow up on your pipeline run.\n\n![sfn](./assets/sfn.png)\n\n### Destroy\n\n```shell script\ncdk destroy --app  \"python datajob_stack.py\"\n```\n\n# Examples\n\n- [Data pipeline with parallel steps](./examples/data_pipeline_parallel/)\n- [Data pipeline for processing big data using PySpark](./examples/data_pipeline_pyspark/)\n- [Data pipeline where you package and ship your project as a wheel](./examples/data_pipeline_with_packaged_project/)\n- [Machine Learning pipeline where we combine glue jobs with sagemaker](examples/ml_pipeline_end_to_end)\n\nAll our examples are in [./examples](./examples)\n\n\n# Functionality\n\n\u003cdetails\u003e\n\u003csummary\u003eDeploy to a 
stage\u003c/summary\u003e\n\nSpecify a stage to deploy an isolated pipeline.\n\nTypical examples would be `dev`, `prod`, ...\n\n```shell\ncdk deploy --app \"python datajob_stack.py\" --context stage=my-stage\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003eUsing datajob's S3 data bucket\u003c/summary\u003e\n\nPass the `datajob_stack` data bucket name to the arguments of your GlueJob by referencing\n`datajob_stack.context.data_bucket_name`.\n\n```python\nimport pathlib\n\nfrom aws_cdk import core\nfrom datajob.datajob_stack import DataJobStack\nfrom datajob.glue.glue_job import GlueJob\nfrom datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow\n\ncurrent_dir = str(pathlib.Path(__file__).parent.absolute())\n\napp = core.App()\n\nwith DataJobStack(\n        scope=app, id=\"datajob-python-pyspark\", project_root=current_dir\n) as datajob_stack:\n    pyspark_job = GlueJob(\n        datajob_stack=datajob_stack,\n        name=\"pyspark-job\",\n        job_path=\"glue_job/glue_pyspark_example.py\",\n        job_type=\"glueetl\",\n        glue_version=\"2.0\",  # we only support glue 2.0\n        python_version=\"3\",\n        worker_type=\"Standard\",  # options are Standard / G.1X / G.2X\n        number_of_workers=1,\n        arguments={\n            \"--source\": f\"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv\",\n            \"--destination\": f\"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet\",\n        },\n    )\n\n    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name=\"workflow\") as sfn:\n        pyspark_job \u003e\u003e ...\n\n```\n\nYou can find this example [here](./examples/data_pipeline_pyspark/glue_job/glue_pyspark_example.py)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eDeploy files to the datajob's deployment bucket\u003c/summary\u003e\n\nSpecify the path to the folder we would like to include in the 
deployment bucket.\n\n```python\n\nfrom aws_cdk import core\nfrom datajob.datajob_stack import DataJobStack\n\napp = core.App()\n\nwith DataJobStack(\n    scope=app, id=\"some-stack-name\", include_folder=\"path/to/folder/\"\n) as datajob_stack:\n\n    ...\n\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ePackage your project as a wheel and ship it to AWS\u003c/summary\u003e\n\nYou can find the example [here](./examples/data_pipeline_with_packaged_project/)\n\n```python\n# We add the path to the project root in the constructor of DataJobStack.\n# By specifying project_root, datajob will look for a .whl in\n# the dist/ folder in your project_root.\nwith DataJobStack(\n    scope=app, id=\"data-pipeline-pkg\", project_root=current_dir\n) as datajob_stack:\n```\n\nPackage your project using [poetry](https://python-poetry.org/)\n\n```shell\npoetry build\ncdk deploy --app \"python datajob_stack.py\"\n```\n\nPackage your project using [setup.py](./examples/data_pipeline_with_packaged_project)\n\n```shell\npython setup.py bdist_wheel\ncdk deploy --app \"python datajob_stack.py\"\n```\nYou can also use the datajob cli to run both commands at once:\n```shell\n# for poetry\ndatajob deploy --config datajob_stack.py --package poetry\n\n# for setup.py\ndatajob deploy --config datajob_stack.py --package setuppy\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eProcessing big data using a Glue Pyspark job\u003c/summary\u003e\n\n```python\nimport pathlib\n\nfrom aws_cdk import core\nfrom datajob.datajob_stack import DataJobStack\nfrom datajob.glue.glue_job import GlueJob\n\ncurrent_dir = str(pathlib.Path(__file__).parent.absolute())\n\napp = core.App()\n\nwith DataJobStack(\n        scope=app, id=\"datajob-python-pyspark\", project_root=current_dir\n) as datajob_stack:\n    pyspark_job = GlueJob(\n        datajob_stack=datajob_stack,\n        name=\"pyspark-job\",\n        job_path=\"glue_job/glue_pyspark_example.py\",\n        
job_type=\"glueetl\",\n        glue_version=\"2.0\",  # we only support glue 2.0\n        python_version=\"3\",\n        worker_type=\"Standard\",  # options are Standard / G.1X / G.2X\n        number_of_workers=1,\n        arguments={\n            \"--source\": f\"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv\",\n            \"--destination\": f\"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet\",\n        },\n    )\n```\nThe full example can be found in [examples/data_pipeline_pyspark](examples/data_pipeline_pyspark).\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eOrchestrate stepfunctions tasks in parallel\u003c/summary\u003e\n\n```python\n# Task2 comes after task1. task4 comes after task3.\n# Task 5 depends on both task2 and task4 to be finished.\n# Therefore task1 and task3 can run in parallel,\n# as well as task2 and task4.\nwith StepfunctionsWorkflow(datajob_stack=datajob_stack, name=\"workflow\") as sfn:\n    task1 \u003e\u003e task2\n    task3 \u003e\u003e task4\n    task2 \u003e\u003e task5\n    task4 \u003e\u003e task5\n\n```\nMore can be found in [examples/data_pipeline_parallel](./examples/data_pipeline_parallel)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eOrchestrate 1 stepfunction task\u003c/summary\u003e\n\nUse the [Ellipsis](https://docs.python.org/dev/library/constants.html#Ellipsis) object to be able to orchestrate 1 job via step functions.\n\n```python\nsome_task \u003e\u003e ...\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eNotify in case of error/success\u003c/summary\u003e\n\nProvide the parameter `notification` in the constructor of a `StepfunctionsWorkflow` object.\nThis will create an SNS Topic which will be triggered in case of failure or success.\nThe email address is subscribed to the topic and receives the notifications in its inbox.\n\n```python\nwith StepfunctionsWorkflow(datajob_stack=datajob_stack,\n                      
     name=\"workflow\",\n                           notification=\"email@domain.com\") as sfn:\n    task1 \u003e\u003e task2\n```\n\nYou can provide 1 email or a list of emails `[\"email1@domain.com\", \"email2@domain.com\"]`.\n\n\u003c/details\u003e\n\n# Datajob in depth\n\nThe `datajob_stack` is the instance that will result in a cloudformation stack.\nThe path in `project_root` helps `datajob_stack` locate the root of the project where\nthe setup.py/poetry pyproject.toml file can be found, as well as the `dist/` folder with the wheel of your project .\n\n```python\nimport pathlib\nfrom aws_cdk import core\n\nfrom datajob.datajob_stack import DataJobStack\n\ncurrent_dir = pathlib.Path(__file__).parent.absolute()\napp = core.App()\n\nwith DataJobStack(\n    scope=app, id=\"data-pipeline-pkg\", project_root=current_dir\n) as datajob_stack:\n\n    ...\n```\n\nWhen __entering the contextmanager__ of DataJobStack:\n\nA [DataJobContext](./datajob/datajob_stack.py#L48) is initialized\nto deploy and run a data pipeline on AWS.\nThe following resources are created:\n1) \"data bucket\"\n    - an S3 bucket that you can use to dump ingested data, dump intermediate results and the final output.\n    - you can access the data bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via ```datajob_stack.context.data_bucket```\n    - you can access the data bucket name via ```datajob_stack.context.data_bucket_name```\n2) \"deployment bucket\"\n   - an s3 bucket to deploy code, artifacts, scripts, config, files, ...\n   - you can access the deployment bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via ```datajob_stack.context.deployment_bucket```\n   - you can access the deployment bucket name via ```datajob_stack.context.deployment_bucket_name```\n\nwhen __exiting the context manager__ all the resources of our DataJobStack object are 
created.\n\n\u003cdetails\u003e\n\u003csummary\u003eWe can write the above example more explicitly...\u003c/summary\u003e\n\n```python\nimport pathlib\nfrom aws_cdk import core\n\nfrom datajob.datajob_stack import DataJobStack\nfrom datajob.glue.glue_job import GlueJob\nfrom datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow\n\ncurrent_dir = pathlib.Path(__file__).parent.absolute()\n\napp = core.App()\n\ndatajob_stack = DataJobStack(scope=app, id=\"data-pipeline-pkg\", project_root=current_dir)\ndatajob_stack.init_datajob_context()\n\ntask1 = GlueJob(datajob_stack=datajob_stack, name=\"task1\", job_path=\"glue_jobs/task.py\")\ntask2 = GlueJob(datajob_stack=datajob_stack, name=\"task2\", job_path=\"glue_jobs/task2.py\")\n\nwith StepfunctionsWorkflow(datajob_stack=datajob_stack, name=\"workflow\") as step_functions_workflow:\n    task1 \u003e\u003e task2\n\ndatajob_stack.create_resources()\napp.synth()\n```\n\u003c/details\u003e\n\n# Ideas\n\nAny suggestions can be shared by starting a [discussion](https://github.com/vincentclaes/datajob/discussions)\n\nThese are the ideas we find interesting to implement:\n\n- add a time based trigger to the step functions workflow.\n- add an s3 event trigger to the step functions workflow.\n- add a lambda that copies data from one s3 location to another.\n- version your data pipeline.\n- cli command to view the logs / glue jobs / s3 bucket\n- implement sagemaker services\n    - processing jobs\n    - hyperparameter tuning jobs\n    - training jobs\n- implement lambda\n- implement ECS Fargate\n- create a serverless UI that follows up on the different pipelines deployed on possibly different AWS accounts using Datajob\n\n\u003e [Feedback](https://github.com/vincentclaes/datajob/discussions) is much 
appreciated!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvincentclaes%2Fdatajob","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvincentclaes%2Fdatajob","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvincentclaes%2Fdatajob/lists"}