https://github.com/scaleapi/sail
- Host: GitHub
- URL: https://github.com/scaleapi/sail
- Owner: scaleapi
- License: apache-2.0
- Created: 2020-11-10T17:48:07.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2022-06-29T20:09:16.000Z (over 3 years ago)
- Last Synced: 2025-04-12T04:13:12.801Z (10 months ago)
- Language: Python
- Size: 56.6 KB
- Stars: 11
- Watchers: 36
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
# Sail
Sail is a data pipeline starter kit that streamlines how you provide data to Scale.
It is intended to include the following:
- Project level Pipeline Config (with versioning abstracted)
- Batch Creation and Finalization (both Easy and Complex cases)
- Task Creation based on a .csv mapping, .json file, or just a Python Dictionary. A longer-term goal is to support passing in a folder location or S3 Bucket URI.
- Concurrency for every operation so scripts can run ~30x faster
- Logging built-in
- Error Handling and retries on every part + Idempotency for Task creation
We've done our best to abstract away the data pipeline nuances and to incorporate Scale best practices throughout.
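To illustrate how concurrency and retries can fit together, here is a minimal sketch built only on the standard library. It is not Sail's actual implementation: `with_retries`, `run_concurrently`, and their parameters are hypothetical names chosen for this example, and the operation being run stands in for any API call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn(); on failure, wait with exponential backoff and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_concurrently(operation, items, max_workers=8, attempts=3, base_delay=0.1):
    """Apply operation to every item in a thread pool.

    Returns (results, errors), two dicts keyed by the item's index.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(
                with_retries, lambda item=item: operation(item), attempts, base_delay
            ): index
            for index, item in enumerate(items)
        }
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:
                errors[index] = exc  # failed even after all retries
    return results, errors
```

Collecting errors per item, rather than stopping at the first failure, lets a long run finish and report what still needs attention.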
# Getting started
- Python 3.6+ is required to run these scripts.
- `API_KEY` environment variable must be set.
- Modify `example_schemas/schema.py`. It's an example Python dictionary describing the project and tasks to be created. It has comments on what each field is, and more detailed documentation can be found in the [Schema Section](#Schema).
- Run the main Sail script. A __Test API Key__ can be used to try out the API and the platform. When ready to create a production project, just switch to a __Live API Key__:
```
API_KEY=live_xxx python sail.py
```
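A script that depends on `API_KEY` can fail early with a clear message when the variable is missing. The sketch below shows one way to do that; `load_api_key` and `is_live_key` are hypothetical helpers, and the `live_` prefix check is an assumption based only on the `live_xxx` example above.

```python
import os


def load_api_key(environ=None):
    """Return the API key from the environment, or raise if it is unset."""
    environ = os.environ if environ is None else environ
    api_key = environ.get("API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("API_KEY environment variable must be set")
    return api_key


def is_live_key(api_key):
    """Assumption: live keys start with 'live_', as in the example above."""
    return api_key.startswith("live_")
```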
# Working with batches
For large projects, batches can be used to group tasks within the same project. There's an example schema in `example_schemas/schema_with_batches.py`.
More detailed documentation can be found in the [Schema Section](#Schema).
There's also a [recommended workflow](#recommended-workflow) for working with batches.
# Schema
Running `sail.py` will create a project with batches and tasks.
Detailed info on these entities and how Scale works can be found on Scale Docs:
- Scale 101: https://scale.com/docs/key-concepts <- Start Here!
- Project: https://docs.scale.com/reference#projects
- Batch: https://docs.scale.com/reference#batches
- Task: https://docs.scale.com/reference#task-object
# Idempotency
There's a highly recommended, yet optional, field called `unique_id`. It prevents the creation of duplicate tasks.
It can be set manually at the task level. Alternatively, when the `generateUniqueId` flag is used, every task missing the `unique_id` field gets one generated in the form of `__`.
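The idea of filling in missing `unique_id` values can be sketched as follows. This is not the format Sail generates: the example swaps in a content hash of the task, prefixed with a caller-supplied string, so that the same task always maps to the same ID. `assign_unique_ids` and `prefix` are hypothetical names.

```python
import hashlib
import json


def assign_unique_ids(tasks, prefix):
    """Return copies of tasks in which every task has a unique_id.

    Tasks that already carry a unique_id keep it. For the rest, the ID is
    derived from a hash of the task content (an assumption made for this
    sketch), so re-running on the same input yields the same IDs.
    """
    result = []
    for task in tasks:
        task = dict(task)  # do not mutate the caller's data
        if not task.get("unique_id"):
            content = {k: v for k, v in task.items() if k != "unique_id"}
            payload = json.dumps(content, sort_keys=True, default=str).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()[:16]
            task["unique_id"] = "{}_{}".format(prefix, digest)
        result.append(task)
    return result
```

Because the generated IDs are deterministic, running the same input twice produces the same `unique_id` values, which is what makes repeated runs safe.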
# Recommended workflow
1. Run `sail.py` as many times as necessary, using `unique_id` to ensure no tasks are duplicated.
2. Once the project, batches, and tasks have been created as desired, run it one more time with the `--finalize-batches` flag.
3. After a batch is finalized, tasks start being worked on.
- Note that new tasks cannot be submitted into a finalized batch.
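The workflow above can be modeled with a small in-memory sketch. The `Batch` class and `run` function are hypothetical stand-ins, not the Scale API or Sail's code; they only illustrate why repeated runs are harmless and why finalization should come last.

```python
class Batch:
    """Toy in-memory stand-in for a batch; not the Scale API."""

    def __init__(self, name):
        self.name = name
        self.finalized = False
        self.tasks = {}  # unique_id -> task

    def add_task(self, task):
        """Add a task once; return False if its unique_id was already seen."""
        if self.finalized:
            raise RuntimeError("batch {!r} is finalized".format(self.name))
        unique_id = task["unique_id"]
        if unique_id in self.tasks:
            return False  # duplicate: nothing is created
        self.tasks[unique_id] = task
        return True

    def finalize(self):
        self.finalized = True


def run(batch, tasks, finalize_batches=False):
    """One script run: create missing tasks, then optionally finalize."""
    created = sum(1 for task in tasks if batch.add_task(task))
    if finalize_batches:
        batch.finalize()
    return created
```

In this model, a second run with the same tasks creates nothing, and any attempt to add a task after `finalize()` raises an error.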
# Task download
There is also a script, `task_download.py`, which can be used to download all tasks from a project.
An optional `--resume` flag resumes from a previous run and downloads only new batches. If errors occur while downloading a large project, the same flag lets you re-run the script and download only the batches that failed.
Usage:
```
python task_download.py --api-key live_xxxx --project project_name --resume
```
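One way to implement resume behavior like this is to record which batches finished in a small state file and skip them on the next run. The sketch below assumes that approach; it is not how `task_download.py` is known to work, and `load_done`, `download_project`, `download_batch`, and `state_path` are hypothetical names.

```python
import json
import os


def load_done(state_path):
    """Read the set of batch names recorded as downloaded, if any."""
    if not os.path.exists(state_path):
        return set()
    with open(state_path) as handle:
        return set(json.load(handle))


def download_project(batch_names, download_batch, state_path, resume=False):
    """Download each batch and return the names of batches that failed.

    With resume=True, batches recorded in state_path are skipped, so only
    new batches and previously failed batches are downloaded.
    """
    done = load_done(state_path) if resume else set()
    failed = []
    for name in batch_names:
        if name in done:
            continue
        try:
            download_batch(name)
        except Exception:
            failed.append(name)  # left out of the state file, retried next run
            continue
        done.add(name)
        # Persist after every batch so an interrupted run loses no progress.
        with open(state_path, "w") as handle:
            json.dump(sorted(done), handle)
    return failed
```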