{"id":20246630,"url":"https://github.com/civicdatalab/shepherd-api","last_synced_at":"2026-01-21T10:37:40.944Z","repository":{"id":63839243,"uuid":"415053556","full_name":"CivicDataLab/shepherd-api","owner":"CivicDataLab","description":null,"archived":false,"fork":false,"pushed_at":"2024-06-10T10:30:12.000Z","size":508,"stargazers_count":2,"open_issues_count":2,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-10T03:05:28.625Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CivicDataLab.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-10-08T16:18:51.000Z","updated_at":"2024-10-16T09:48:56.000Z","dependencies_parsed_at":"2024-11-14T09:31:58.984Z","dependency_job_id":"5d80cf2c-1c97-4635-84fa-657f1c8580ab","html_url":"https://github.com/CivicDataLab/shepherd-api","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CivicDataLab/shepherd-api","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CivicDataLab%2Fshepherd-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CivicDataLab%2Fshepherd-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CivicDataLab%2Fshepherd-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CivicDataLab%2Fshepher
d-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CivicDataLab","download_url":"https://codeload.github.com/CivicDataLab/shepherd-api/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CivicDataLab%2Fshepherd-api/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28631937,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T04:47:28.174Z","status":"ssl_error","status_checked_at":"2026-01-21T04:47:22.943Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T09:31:31.636Z","updated_at":"2026-01-21T10:37:40.928Z","avatar_url":"https://github.com/CivicDataLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Overview\nThis is a Django-based application for creating and running data pipelines. The application runs [Prefect](https://docs.prefect.io/) tasks: each request made via the API is converted into the corresponding Prefect tasks. The status and history of the tasks can be monitored in the Prefect UI by running `prefect orion start` and opening the URL shown in the terminal. 
\n\n\n## Requirements\n- After cloning the repository, install the dependencies listed in `requirements.txt` by running `pip install -r requirements.txt`.\n- The project uses RabbitMQ, which can be installed from the [official website](https://www.rabbitmq.com/download.html).\n\n## Background tasks \nThis application needs several background processes to be running. \nStart all of the following processes before using the application. Run them in parallel, each in its own terminal. \n1. `python manage.py runserver` - Starts the Django server, which listens for API requests. This is the entry point to the program. \n2. `python manage.py process_tasks --queue create_pipeline` - Runs the **create-pipeline** background task. This task receives the requests accepted by `runserver`. \n3. `python manage.py runscript worker_demon.py` - Runs the RabbitMQ worker daemon. \n\n## Demo Request\nThe body of a demo request to the shepherd API looks like the following. \n\n```\n{\n  \"pipeline_name\": \"Skip_merge_anonymize on Res.271\",\n  \"res_id\" : 271,\n  \"db_action\":\"create\",\n  \"transformers_list\" : [{\"name\" : \"skip_column\", \"order_no\" : 1, \"context\": {\"columns\":[\"format\"]}},\n                        {\"name\" : \"merge_columns\", \"order_no\" : 2, \"context\": {\"column1\":\"title\", \"column2\":\"price\", \"output_column\":\"title with price\", \n                        \"separator\":\"|\"}},\n                        {\"name\" : \"anonymize\", \"order_no\" : 3, \"context\": {\"to_replace\" : \"Sir\", \n  \"replace_val\": \"Prof\", \"column\": \"author\"}}]\n}\n```\n1. _pipeline_name_ - Name of the pipeline to be created.\n2. _res_id_ - Resource ID, i.e. the ID of the resource that needs transformation. This is the input data for the pipeline. \n3. _db_action_ - Takes either **create** or **update**. 
This determines whether to **create** a new resource in the database from the transformed data or to **update** the existing resource with it.\n4. _transformers_list_ - List of JSON objects, one per task.\n   1. _name_ - Name of the task to be performed.\n   2. _order_no_ - The position of the task in the pipeline. In the example above, _skip_column_ runs first, followed by _merge_columns_ and then _anonymize_, because their order numbers are 1, 2 and 3 respectively.\n   3. _context_ - The inputs needed to perform the task. These are task-specific. \n\nThe final HTTP request looks like the following.\n```\nPOST http://127.0.0.1:8000/transformer/res_transform\nContent-Type: application/json\n\n{\n    \"pipeline_name\": \"Skip_merge_anonymize on Res.271\",\n    \"res_id\" : 271,\n    \"db_action\":\"create\",\n    \"transformers_list\" : [{\"name\" : \"skip_column\", \"order_no\" : 1, \"context\": {\"columns\":[\"format\"]}},\n                        {\"name\" : \"merge_columns\", \"order_no\" : 2, \"context\": {\"column1\":\"title\", \"column2\":\"price\", \"output_column\":\"title with price\", \n                        \"separator\":\"|\"}},\n                        {\"name\" : \"anonymize\", \"order_no\" : 3, \"context\": {\"to_replace\" : \"Sir\", \n  \"replace_val\": \"Prof\", \"column\": \"author\"}}]\n}\n```\n## Adding new tasks to the pipeline\nFollow these steps to add a new task to the pipeline. \n1. Define your task name and its context (i.e. the information needed to perform the task).\n2. Write the task (i.e. your Python function) in the [prefect_tasks](tasks/prefect_tasks.py) file as a Prefect task. \n Note: A Prefect task is a Python function decorated with `@task`. Make sure your function takes the same arguments as the other tasks defined in the file.\n\nLet's understand this through an example. 
Suppose you need to add a task named **add_prefix**, which adds a given prefix to every value in a specified column.\nThe task name is therefore `add_prefix`. To define the context, first list the inputs the task needs. \n- The task needs the column name, so the context should contain a key named `column`.\n- The task also needs the string to use as the prefix, so the second key in the context should be `prefix`.\n\nThe resulting context should look something like this. \n\n`\"context\" : {\"column\": \"\u003ccolumn_name_here\u003e\", \"prefix\":\"\u003cprefix_string_here\u003e\"}`\n\nNext, open the [prefect_tasks](tasks/prefect_tasks.py) file and add the task.\n```\nfrom prefect import task  # skip if prefect_tasks.py already imports it\n\n@task\ndef add_prefix(context, pipeline, task_obj):\n    # Read the task-specific inputs from the context\n    column = context['column']\n    prefix_string = context['prefix']\n    # Rest of your logic here..\n```\nA request to create a pipeline with this task should look something like this:\n```\nPOST http://127.0.0.1:8000/transformer/res_transform\nContent-Type: application/json\n\n{\n    \"pipeline_name\": \"Test_prefixing\",\n    \"res_id\" : 271,\n    \"db_action\":\"create\",\n    \"transformers_list\" : [\n        {\"name\" : \"add_prefix\", \n        \"order_no\" : 1, \n        \"context\": {\"column\": \"Planets\", \"prefix\": \"The\"}}\n        ]\n}\n```\n## Flow of the code\nBecause several background tasks are involved, the flow can be confusing at first. 
\nHere is how control flows once a request is made.\n\n[API end-point](datatransform/views.py)...\u003e[Pipeline creation](pipeline_creator_bg.py)...\u003e[RabbitMQ worker](worker_demon.py)...\u003e[Model to pipeline](pipeline/model_to_pipeline.py)...\u003e[Prefect tasks](tasks/prefect_tasks.py)...\u003e[Model to pipeline](pipeline/model_to_pipeline.py)...\u003e[Utils](utils.py)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcivicdatalab%2Fshepherd-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcivicdatalab%2Fshepherd-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcivicdatalab%2Fshepherd-api/lists"}