{"id":14065637,"url":"https://github.com/purecloudlabs/aws_glue_etl_docker","last_synced_at":"2025-04-11T16:42:27.527Z","repository":{"id":149016070,"uuid":"148664691","full_name":"purecloudlabs/aws_glue_etl_docker","owner":"purecloudlabs","description":"Helper library to run AWS Glue ETL scripts docker container for local testing of development in a Jupyter notebook","archived":false,"fork":false,"pushed_at":"2024-02-13T23:04:41.000Z","size":25,"stargazers_count":20,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-25T12:51:22.013Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/purecloudlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-13T16:14:36.000Z","updated_at":"2022-05-24T01:40:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"ca6b044e-14e9-4d9b-88b3-6afb2c8bec8b","html_url":"https://github.com/purecloudlabs/aws_glue_etl_docker","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/purecloudlabs%2Faws_glue_etl_docker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/purecloudlabs%2Faws_glue_etl_docker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/purecloudlabs%2Faws_glue_etl_docker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/purecloudlabs%2Faws
_glue_etl_docker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/purecloudlabs","download_url":"https://codeload.github.com/purecloudlabs/aws_glue_etl_docker/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248442314,"owners_count":21104155,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-13T07:04:36.424Z","updated_at":"2025-04-11T16:42:27.492Z","avatar_url":"https://github.com/purecloudlabs.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# AWS Glue ETL in Docker and Jupyter\nThis project is a helper for creating scripts that run in [AWS Glue](https://aws.amazon.com/glue/), in [Jupyter](http://jupyter.org/) notebooks, and in Docker containers with spark-submit.  Glue supports running [Zeppelin notebooks](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-EC2-notebook.html) against a dev endpoint, but for quick development you sometimes just want to run locally against a subset of data without paying to keep the dev endpoints running.\n\n## Glue Shim\nGlue has specific methods to load and save data to S3 which won't work when running in a Jupyter notebook.  The glueshim provides a higher-level API that works in both scenarios.  
\n\n```python\nfrom pprint import pprint\n\nfrom aws_glue_etl_docker import glueshim\nshim = glueshim.GlueShim()\n\nparams = shim.arguments({'data_bucket': \"examples\"})\npprint(params)\n\n\nfiles = shim.get_all_files_with_prefix(params['data_bucket'], \"data/\")\nprint(files)\n\ndata = shim.load_data(files, 'example_data')\ndata.printSchema()\ndata.show()\n\nshim.write_parquet(data, params['data_bucket'], \"parquet\", None, 'parquetdata' )\nshim.write_parquet(data, params['data_bucket'], \"parquetpartition\", \"car\", 'partitioneddata' )\n\nshim.write_csv(data, params['data_bucket'],\"csv\", 'csvdata')\n\nshim.finish()\n```\n\n## Local environment\nRunning locally is easiest in a Docker container.\n\n1. Copy data locally, and map that folder to your Docker container at the /data/\u003cbucket\u003e/\u003cfiles\u003e path.\n2. Start the Docker container, mapping your local notebook directory to ```/home/jovyan/work```\n\n*Example Docker command*\n```docker run -p 8888:8888 -v \"$PWD/examples\":/home/jovyan/work -v \"$PWD\":/data jupyter/pyspark-notebook```\n\n### Installing package in Jupyter\n\n```python\nimport sys\n!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker\n```\n\n## AWS Deployment\nFor deployment to AWS, this library must be packaged and put into S3. You can use the helper script deploytos3.sh to package and copy.  
\n\nUsage ```./deploytos3.sh s3://example-bucket/myprefix/aws-glue-etl-jupyter.zip```\n\nThen, when starting the Glue job, use your S3 zip path in the _Python library path_ configuration.\n\n## Bookmarks\nThe shim is currently set up to delete any data in the output folder so that if you run with bookmarks enabled and then need to reprocess the entire dataset and \n\n## Converting Workbook to Python Script\n\naws_glue_etl_docker can also be used as a CLI tool to clean up Jupyter metadata from a workbook or convert it to a Python script.\n\n## Clean\n\nThe clean command will open all workbooks in a given path and remove any metadata, output, and execution information. This keeps the workbooks cleaner in source control.\n\n``` aws_glue_etl_docker clean --path /dir/to/workbooks  ```\n\n## Build\n\nThe build command will open all workbooks in a given path and convert them to Python scripts.  Build will convert any markdown cells to multiline comments.  This command will not convert any cells that contain ```#LOCALDEV``` or lines that start with ```!``` as in ```!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker```\n\n``` aws_glue_etl_docker build --path /dir/to/workbooks  ```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpurecloudlabs%2Faws_glue_etl_docker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpurecloudlabs%2Faws_glue_etl_docker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpurecloudlabs%2Faws_glue_etl_docker/lists"}