{"id":16567927,"url":"https://github.com/drorata/dvc-manual-stage","last_synced_at":"2026-04-16T16:08:57.937Z","repository":{"id":150667589,"uuid":"249722420","full_name":"drorata/dvc-manual-stage","owner":"drorata","description":null,"archived":false,"fork":false,"pushed_at":"2020-03-24T13:57:59.000Z","size":3,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-05T10:46:38.330Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/drorata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-24T13:57:43.000Z","updated_at":"2020-03-24T13:58:02.000Z","dependencies_parsed_at":"2023-04-29T10:30:41.024Z","dependency_job_id":null,"html_url":"https://github.com/drorata/dvc-manual-stage","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/drorata/dvc-manual-stage","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fdvc-manual-stage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fdvc-manual-stage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fdvc-manual-stage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fdvc-manual-stage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/drorata","download_url":"https://codeload.github.com/drorata/dvc-m
anual-stage/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drorata%2Fdvc-manual-stage/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31893443,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-16T11:36:10.202Z","status":"ssl_error","status_checked_at":"2026-04-16T11:36:09.652Z","response_time":69,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T21:07:48.741Z","updated_at":"2026-04-16T16:08:57.897Z","avatar_url":"https://github.com/drorata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Manual step as part of a `dvc` pipeline\n\n## Objective\n\nYou want to include a stage in your pipeline which involves a manual process, and you want DVC to properly track the flow.\nHere's a usecase.\nYou want to map words in a given text file using some manually crafted mapping of the existing words.\nTo achieve this you can follow these steps:\n\n1. Extract the unique values of the input\n2. Manually build the map (using the unique values)\n3. 
Map the input using the manually crafted dictionary.\n\nThe solution should provide the following:\n\n* If the raw data changes, the resulting mapped line should be invalidated\n* If the mapping changes (even if the raw data does not), the mapped line should be invalidated\n\n## Initial Solution\n\n### Unique values\n\nThis step is straightforward: given an input, generate a JSON file with the unique words in that input.\n\n```bash\ndvc add raw_data.txt\ndvc run -d unique_values.py -d raw_data.txt -o unique_values.json python unique_values.py\n```\n\n### Crafting the map\n\nCopy the output [`unique_values.json`](./unique_values.json) to [`mapping.json`](mapping.json).\nEdit the JSON as needed.\n\n### Mapping stage\n\nThe mapping stage depends on the following:\n\n* The mapping code: [`mapping.py`](mapping.py)\n* The mapping dictionary: [`mapping.json`](mapping.json)\n* Lastly, and implicitly, the unique values extracted in the first stage.\n\nSo, the following looks reasonable:\n\n```bash\ndvc run -d mapping.py -d mapping.json -d unique_values.json -o mapped_line.txt python mapping.py\n```\n\n### Gotcha\n\nConsider what happens if the unique values change (due to some change in the raw input).\nThe stage `mapped_line.txt.dvc` is invalidated because its dependency `unique_values.json` has changed, so it will be re-run on the next reproduce.\nBut `dvc` will not stop you or report that `mapping.json` is stale, because that file itself has not changed.\nThe problem is that a change in the unique values should also invalidate the mapping.\n\n## Better solution\n\nIntroduce an empty flag file, `MAPPING_IS_VALID`.\nIf this file exists, it indicates that the manual process of crafting `mapping.json` has been completed.\nThe trick is threefold:\n\n* `MAPPING_IS_VALID` will be deleted by the unique values stage\n* `MAPPING_IS_VALID` will be a dependency of the mapping stage\n* `MAPPING_IS_VALID` is tracked neither by `git` 
nor by `dvc`\n\nSo, here are the stages:\n\n```bash\n# Determine the unique values\ndvc run -d unique_values.py -d raw_data.txt -o unique_values.json \"python unique_values.py \u0026\u0026 rm -f MAPPING_IS_VALID\"\n```\n\nand\n\n```bash\n# Map the value\ndvc run -d mapping.py -d mapping.json -d unique_values.json -d MAPPING_IS_VALID -o mapped_line.txt python mapping.py\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrorata%2Fdvc-manual-stage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdrorata%2Fdvc-manual-stage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrorata%2Fdvc-manual-stage/lists"}