{"id":26671122,"url":"https://github.com/scaleapi/sail","last_synced_at":"2025-04-12T04:13:16.586Z","repository":{"id":40417951,"uuid":"311740032","full_name":"scaleapi/sail","owner":"scaleapi","description":null,"archived":false,"fork":false,"pushed_at":"2022-06-29T20:09:16.000Z","size":58,"stargazers_count":11,"open_issues_count":0,"forks_count":4,"subscribers_count":36,"default_branch":"main","last_synced_at":"2025-04-12T04:13:12.801Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scaleapi.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-11-10T17:48:07.000Z","updated_at":"2025-03-25T22:44:39.000Z","dependencies_parsed_at":"2022-08-09T19:50:45.734Z","dependency_job_id":null,"html_url":"https://github.com/scaleapi/sail","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2Fsail","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2Fsail/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2Fsail/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scaleapi%2Fsail/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scaleapi","download_url":"https://codeload.github.com/scaleapi/sail/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248514203,"owners_count":21116903,"icon_url":"https://github.com/github.png","version"
:null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-25T23:43:55.065Z","updated_at":"2025-04-12T04:13:16.560Z","avatar_url":"https://github.com/scaleapi.png","language":"Python","readme":"# Sail\n\nSail is a data pipeline starter kit meant to optimize how you provide data to Scale.\n\nIt's meant to have the following included:\n\n- Project level Pipeline Config (with versioning abstracted)\n- Batch Creation and Finalization (both Easy and Complex cases)\n- Task Creation based on a .csv mapping, .json file, or just a Python dictionary. A longer-term goal is to support passing in a folder location or S3 Bucket URI.\n- Concurrency for every operation so scripts can run ~30x faster\n- Logging built-in\n- Error Handling and retries on every part + Idempotency for Task creation\n\nWe've done our best to abstract the data pipeline nuances and incorporate Scale best practices throughout.\n\n# Getting started\n- Python 3.6+ is required to run these scripts.\n- `API_KEY` environment variable must be set.\n- Modify `example_schemas/schema.py`. It's an example Python dictionary describing the project and tasks to be created. It has comments on what each field is, and more detailed documentation can be found in the [Schema Section](#Schema).\n- Run the main Sail script. A __Test API Key__ can be used to try out the API and the platform. When ready to create a production project, just switch to a __Live API Key__:\n```\nAPI_KEY=live_xxx python sail.py\n```\n\n# Working with batches\nFor large projects, batches can be created to group tasks within the same project. 
There's an example schema in `example_schemas/schema_with_batches.py`.\n\nMore detailed documentation can be found in the [Schema Section](#Schema).\n\nAlso, there's a [recommended workflow](#recommended-workflow) for working with batches.\n\n# Schema\nRunning `sail.py` will create a project with batches and tasks.\n\nDetailed info on these entities and how Scale works can be found in the Scale Docs:\n\n- Scale 101: https://scale.com/docs/key-concepts \u003c- Start Here!\n- Project: https://docs.scale.com/reference#projects\n- Batch: https://docs.scale.com/reference#batches\n- Task: https://docs.scale.com/reference#task-object\n\n# Idempotency\nThere's a highly recommended, yet optional, field called `unique_id`. It will prevent the creation of duplicated tasks.\n\nIt can be set at the task level manually. Or, using the flag `generateUniqueId`, a `unique_id` is generated for every task missing that field, in the form of `\u003cproject_name\u003e_\u003cbatch_name\u003e_\u003cattachment_url\u003e`.\n\n# Recommended workflow\n1. Run as many times as necessary, using `unique_id` to ensure no duplicated tasks.\n2. After having the project, batches, and tasks created as desired, run one more time using the `--finalize-batches` flag.\n3. After a batch is finalized, tasks start being worked on.\n- Note that new tasks cannot be submitted into a finalized batch.\n\n# Task download\nThere is also a script `task_download.py`, which can be used for downloading all tasks from a project.\n\nThere's an optional `--resume` flag that allows resuming a previous run. It will download only new batches. 
Also, if errors occur while running it on a large project, this flag allows re-running the script to download only the errored batches.\n\nUsage:\n\n```\npython task_download.py --api-key live_xxxx --project project_name --resume\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscaleapi%2Fsail","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscaleapi%2Fsail","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscaleapi%2Fsail/lists"}