{"id":19757152,"url":"https://github.com/lyst/shovel","last_synced_at":"2025-04-30T12:31:25.002Z","repository":{"id":56632455,"uuid":"96991899","full_name":"lyst/shovel","owner":"lyst","description":null,"archived":false,"fork":false,"pushed_at":"2020-10-27T21:57:25.000Z","size":54,"stargazers_count":9,"open_issues_count":16,"forks_count":7,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-03-25T20:07:08.801Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lyst.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-07-12T09:49:53.000Z","updated_at":"2024-03-25T20:07:08.802Z","dependencies_parsed_at":"2022-08-15T22:20:19.427Z","dependency_job_id":null,"html_url":"https://github.com/lyst/shovel","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Fshovel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Fshovel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Fshovel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Fshovel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lyst","download_url":"https://codeload.github.com/lyst/shovel/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224208142,"owners_count":17273721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T
11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T03:18:15.594Z","updated_at":"2024-11-12T03:18:16.096Z","avatar_url":"https://github.com/lyst.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Shovel\n\nYou version-control your code; you want to version-control your datasets too.\nMany data science workflows can be broken down into working on three stages of data:\n- \"input\": the dataset as provided to you, a query against Redshift, a query against Postgres, a query against your favourite API...\n- \"working\": various transformations that you do.\n- \"output\": various results, such as the accuracy of an ML algorithm on this dataset, summary graphs, etc.\n\nThe principle of `shovel` is to help store and version your \"input\"; combined with versioned code, this makes all of your results reproducible.\nHow you manage your \"working\" and your \"output\" is out of scope, and up to you.\nThis is the first major goal of `shovel`: making it easier to reproduce results in the future.\n\nThe second major goal is to store our datasets centrally (on S3 for now), so that everyone may access everything.\nThis is good for collaboration.\nThis is also good for organising our datasets, and for backing them up.\n\n## Installation\n\nTo install:\n```bash\npython setup.py install\n```\n\n(For development, `python setup.py develop` works.)\n\nIf you want to install directly from git, use:\n```bash\npip install git+https://github.com/lyst/shovel.git#egg=shovel\n```\n\nShovel reads its config from the environment. 
As a minimum, you need the following environment variables defined:\n- AWS_ACCESS_KEY_ID - for boto\n- AWS_SECRET_ACCESS_KEY - for boto\n- SHOVEL_DEFAULT_BUCKET - the bucket in which to store your data\n\nIn addition:\n- SHOVEL_DEFAULT_ROOT (bottomless-pit) - the root prefix for the default Pit your data will be stored in (shovel will always include this as the prefix when writing to your bucket).\n\n## Fetching datasets from your Pit\n\n`shovel` requires datasets to live in a namespace `PROJECT/DATASET/VERSION`.\n- PROJECT is the top-level project a dataset belongs to, e.g. `google-ngrams`...\n- DATASET is the name of the dataset. This is intended to contain different datasets, e.g. `eng-all-20120701`\n- VERSION is the version number and should be in the format `f\"v{int(n)}\"`. It is intended to be used if errors are found in the dataset and need correcting. It should always make sense to re-run some analyses on the latest version of the dataset.\n\nYou should consider using a pre-existing dataset over creating a new one, if an appropriate one exists.\n\nUsing the `shovel` command-line tool, fetch existing datasets with\n```bash\nshovel dig \u003cLOCAL_DIRECTORY\u003e \u003cPROJECT\u003e \u003cDATASET\u003e \u003cVERSION\u003e\n```\n\nto fetch the dataset into `LOCAL_DIRECTORY`. 
For example, `shovel dig ~/google-ngrams/english2012 google-ngrams eng-all-20120701 v0`.\n\nOr from Python\n```python\nfrom shovel import dig\n\ndig('~/google-ngrams/english2012', 'google-ngrams', 'eng-all-20120701', 'v0')\n```\n\n## Preparing and pushing datasets to S3\n\nPush a local directory containing a dataset to S3 with\n```bash\nshovel bury ~/google-ngrams/english2012 google-ngrams eng-all-20120701 v0\n```\n\nOr from Python\n```python\nfrom shovel import bury\n\nbury('~/google-ngrams/english2012', 'google-ngrams', 'eng-all-20120701', 'v0')\n```\n\n`bury` will fail if the version already exists.\n\nEnough talk...\n\n![Shovel][shovel]\n\n[shovel]: https://www.mememaker.net/static/images/memes/4104864.jpg\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyst%2Fshovel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flyst%2Fshovel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyst%2Fshovel/lists"}