{"id":28975119,"url":"https://github.com/octoenergy/s3migrate","last_synced_at":"2025-06-24T12:08:24.817Z","repository":{"id":52548971,"uuid":"185855046","full_name":"octoenergy/s3migrate","owner":"octoenergy","description":"Bulk delete/copy/move files or modify Hive/Drill/Athena partitions using pythonic pattern matching","archived":false,"fork":false,"pushed_at":"2023-10-22T13:53:28.000Z","size":188,"stargazers_count":5,"open_issues_count":7,"forks_count":2,"subscribers_count":98,"default_branch":"master","last_synced_at":"2024-03-26T08:21:45.242Z","etag":null,"topics":["data","data-science"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/octoenergy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-05-09T18:53:26.000Z","updated_at":"2023-12-25T15:52:17.000Z","dependencies_parsed_at":"2023-01-25T06:00:47.777Z","dependency_job_id":null,"html_url":"https://github.com/octoenergy/s3migrate","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/octoenergy/s3migrate","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/octoenergy%2Fs3migrate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/octoenergy%2Fs3migrate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/octoenergy%2Fs3migrate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/octoenergy%2Fs3migrate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/octoenergy","download_url":"https://codeload.github.com/oc
toenergy/s3migrate/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/octoenergy%2Fs3migrate/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261669024,"owners_count":23192362,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-science"],"created_at":"2025-06-24T12:08:22.210Z","updated_at":"2025-06-24T12:08:24.798Z","avatar_url":"https://github.com/octoenergy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![CircleCI](https://circleci.com/gh/octoenergy/s3migrate.svg?style=svg)](https://circleci.com/gh/octoenergy/s3migrate)\n\n# s3migrate\nBulk delete/copy/move files or modify Hive/Drill/Athena partitions using pythonic pattern matching \n\n## Example\n\nImagine we have a dataset as follows:\n```\ns3://bucket/training_data/2019-01-01/part1.parquet \ns3://bucket/validation_data/2019-06-01/part13.parquet\n... 
\n```\n\nTo make this dataset Hive-friendly, we want to include explicit key-value pairs in the paths, e.g.:\n```\ns3://bucket/data/split=training/execution_date=2019-01-01/part1.parquet\ns3://bucket/data/split=validation/execution_date=2019-06-01/part13.parquet\n...\n```\n\nThis can be achieved using the `s3migrate.mv` (aka `move`) command with intuitive pattern matching:\n\n```python\nold_path = \"s3://bucket/{split}_data/{execution_date}/{filename}\"\nnew_path = \"s3://bucket/data/split={split}/execution_date={execution_date}/{filename}\"\ns3migrate.mv(\n    from=old_path,\n    to=new_path,\n    dryrun=False\n)\n```\n\nIf instead we want to delete all files matching the `old_path` pattern, we can use `s3migrate.rm`:\n\n```python\ns3migrate.rm(\n    from=old_path,\n    dryrun=False\n)\n```\n\n## Supported commands\n### File-system-like operations\nThe module provides the following commands:\n\n|command|number of patterns|action|\n|---|---|---|\n|`cp`/`copy`|2|copy (duplicate) all matched files to a new location|\n|`mv`/`move`|2|move (rename) all matched files|\n|`rm`/`remove`|1| remove all matched files|\n\nEach takes one or two patterns, as well as the `dryrun` argument.\n\n\u003e **NB** when two patterns are provided, both must contain the same set of keys\n\n### General-purpose generators\n| command | use case |\n| --- | --- |\n| `iter`| iterate over all matching filenames, e.g. to read each file |\n| `iterformats` | iterate over all matched `format dictionaries`, e.g. to collect all Hive key values |\n\n`s3migrate.iter(pattern)` will yield file names `filename` matching `pattern`. 
This allows custom file processing logic downstream.\n\n`s3migrate.iterformats(pattern)` will instead yield dictionaries `fmt_dict` such that `pattern.format(**fmt_dict)` is equivalent to the matched `filename`.\n\n## Dry run mode\nDry run mode allows testing your patterns without performing any destructive operations.\n\nWith `dryrun=True` (default), information about operations to be performed is logged at the `INFO` and `DEBUG` levels. Make sure\nto set your logging level accordingly, e.g. inside a Jupyter Notebook:\n\n\n```python\nimport logging\n\nlogger = logging.getLogger()\nlogger.setLevel(logging.DEBUG)\nlogger.handlers = [logging.StreamHandler()]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foctoenergy%2Fs3migrate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foctoenergy%2Fs3migrate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foctoenergy%2Fs3migrate/lists"}