{"id":23131627,"url":"https://github.com/peopledoc/mlvtools","last_synced_at":"2025-08-17T08:31:36.767Z","repository":{"id":57442401,"uuid":"145836701","full_name":"peopledoc/mlvtools","owner":"peopledoc","description":"Public repository for versioning machine learning data","archived":false,"fork":false,"pushed_at":"2021-11-25T13:41:34.000Z","size":290,"stargazers_count":42,"open_issues_count":14,"forks_count":7,"subscribers_count":40,"default_branch":"master","last_synced_at":"2025-07-13T05:24:02.525Z","etag":null,"topics":["approved-public","ghec-mig-migrated","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/peopledoc.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-08-23T10:12:59.000Z","updated_at":"2024-09-23T20:00:44.000Z","dependencies_parsed_at":"2022-09-26T17:21:10.825Z","dependency_job_id":null,"html_url":"https://github.com/peopledoc/mlvtools","commit_stats":null,"previous_names":["peopledoc/ml-versionning-tools"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/peopledoc/mlvtools","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peopledoc%2Fmlvtools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peopledoc%2Fmlvtools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peopledoc%2Fmlvtools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peopledoc%2Fmlvtools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/peopledoc","download_url":"https:
//codeload.github.com/peopledoc/mlvtools/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peopledoc%2Fmlvtools/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270822973,"owners_count":24652024,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["approved-public","ghec-mig-migrated","machine-learning"],"created_at":"2024-12-17T11:15:36.537Z","updated_at":"2025-08-17T08:31:36.415Z","avatar_url":"https://github.com/peopledoc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mlvtools\n\nThe Machine Learning Versioning Tools.  \nmlvtools version 2.1.1 is the last version supporting `dvc\u003c=0.94.1`.\n\n## Installing\n\nTo install mlvtools with `pip` from PyPI:\n\n```shell\n$ pip install mlvtools\n```\n\nTo install it from sources for development:\n\n```shell\n$ git clone http://github.com/peopledoc/mlvtools.git\n$ cd mlvtools\n$ pip install -e .[dev]\n```\n\n## Tutorial\n\nA tutorial is available to showcase how to use the tools. 
See [mlvtools\ntutorial](https://github.com/peopledoc/mlvtools-tutorial).\n\n## Keywords\n\n`Step Metadata`: in this document it refers to the first code cell when it\nis used to declare metadata such as parameters, dvc inputs/outputs, etc.\n\n`Working Directory`: the project's working directory. Files specified in the\nuser configuration are relative to this directory. The `--working-directory`\n(or `-w`) flag is used to specify the Working Directory. If not specified,\nthe current directory is used.\n\n## Tools\n\n`ipynb_to_python`: this command converts a Jupyter Notebook to a parameterized and\nexecutable Python script (see the Jupyter Notebook syntax section below).\n\n```shell\n$ ipynb_to_python -n [notebook_path] -o [python_script_path]\n```\n\n`gen_dvc`: this command creates a DVC command which calls the Python script generated by\n`ipynb_to_python`.\n\n```shell\n$ gen_dvc -i [python_script] --out-py-cmd [python_command] --out-bash-cmd [dvc_command]\n```\n\n`export_pipeline`: this command exports the pipeline corresponding to the given DVC meta\nfile into a bash script.  Pipeline steps are called sequentially in dependency order.\nThis works only for local steps.\n\n```shell\n$ export_pipeline --dvc [DVC target meta file] -o [pipeline script]\n```\n\n`ipynb_to_dvc`: this command converts a Jupyter Notebook to a parameterized and\nexecutable Python script and a DVC command. It is the combination of\n`ipynb_to_python` and `gen_dvc`. It only works with a configuration file.\n\n```shell\n$ ipynb_to_dvc -n [notebook_path]\n```\n\n`check_script_consistency` and `check_all_scripts_consistency`: these commands ensure\nconsistency between a Jupyter notebook and its generated Python script. It is possible to\nuse them as a git hook or in the project's Continuous Integration. 
The consistency check\nignores blank lines and comments.\n\n```shell\n$ check_script_consistency -n [notebook_path] -s [script_path]\n```\n\n```shell\n$ check_all_scripts_consistency -n [notebook_directory]\n# Works only with a configuration file (provided or auto-detected)\n```\n\n## Configuration\n\nA configuration file can be provided, but it is not mandatory.  Its default location is\n`[working_directory]/.mlvtools`. Use the flag `--conf-path` (or `-c`) on the command\nline to specify a different configuration file path.\n\nThe configuration file format is JSON.\n\n```json\n{\n  \"path\":\n  {\n    \"python_script_root_dir\": \"[path_to_the_script_directory]\",\n    \"dvc_cmd_root_dir\": \"[path_to_the_dvc_cmd_directory]\",\n    \"dvc_metadata_root_dir\": \"[path_to_the_dvc_metadata_directory] (optional)\"\n  },\n  \"ignore_keys\": [\"keywords\", \"to\", \"ignore\"],\n  \"dvc_var_python_cmd_path\": \"MLV_PY_CMD_PATH_CUSTOM\",\n  \"dvc_var_python_cmd_name\": \"MLV_PY_CMD_NAME_CUSTOM\",\n  \"docstring_conf\": \"./docstring_conf.yml\"\n}\n```\n\nAll given paths must be relative to the Working Directory.\n\n* `path_to_the_script_directory`: the directory where Python scripts will be generated\n  using `ipynb_to_python` commands. The generated Python script names are based on the\n  notebook names.\n\n  ```shell\n  $ ipynb_to_python -n ./data/My\\ Notebook.ipynb\n  ```\n  Generated script: `[path_to_the_script_directory]/my_notebook.py`\n\n* `path_to_the_dvc_cmd_directory`: the directory where DVC commands will be generated\n  using the `gen_dvc` command. The generated command names are based on the Python script\n  names.\n\n  ```shell\n  $ gen_dvc -i ./scripts/my_notebook.py\n  ```\n  Generated command: `[path_to_the_dvc_cmd_directory]/my_notebook_dvc`\n\n* `path_to_the_dvc_metadata_directory`: the directory where DVC metadata files will be\n  generated when executing `gen_dvc` commands. 
This value is optional; by default,\n  DVC metadata files will be saved in the Working Directory.  The generated DVC\n  metadata file names are based on the Python script names.\n\n  Generated file: `[path_to_the_dvc_metadata_directory]/my_notebook.dvc`\n\n* `ignore_keys`: list of keywords used to discard a cell. Default value is\n  `['# No effect']`.  (See \"Discard cell\" section)\n\n* `dvc_var_python_cmd_path`, `dvc_var_python_cmd_name`, `dvc_var_meta_filename`: allow\n  customizing the variable names which can be used in `dvc-cmd` Docstring parameters.\n\n  They correspond, respectively, to the variables holding the Python command file path,\n  the Python command file name and the default DVC meta file name.\n\n  Default values are `MLV_PY_CMD_PATH`, `MLV_PY_CMD_NAME` and `MLV_DVC_META_FILENAME`.\n  (See DVC Command/Complex cases section for usage.)\n\n* `docstring_conf`: the path to the docstring configuration used for Jinja templating\n  (see DVC templating section).  This parameter is optional.\n\n\n## Jupyter Notebook syntax\n\nThe Step Metadata cell is used to declare script parameters and DVC outputs and\ndependencies.  This can be done using basic Docstring syntax. This Docstring must be the\nfirst statement in this cell; only comments can be written above it.\n\n\n### Good practices\n\nAvoid using relative paths in your Jupyter Notebook because they are relative to\nthe notebook location, which may differ from the location of the generated script.\n\n\n### Python Script Parameters\n\nParameters can be declared in the Jupyter Notebook using basic Docstring syntax.  This\nparameter description is used to generate configurable and executable Python scripts.\n\nParameter declaration in Jupyter Notebook:\n\nJupyter Notebook: `process_files.ipynb`\n\n\n```\n#:param [type]? 
[param_name]: [description]?\n\"\"\"\n:param str input_file: the input file\n:param output_file: the output_file\n:param rate: the learning rate\n:param int retry:\n\"\"\"\n```\n\nGenerated Python script:\n\n```py\n[...]\ndef process_file(input_file, output_file, rate, retry):\n    \"\"\"\n     ...\n    \"\"\"\n[...]\n```\n\nScript command line parameters:\n\n```\nmy_script.py -h\n\nusage: my_cmd [-h] --input-file INPUT_FILE --output-file OUTPUT_FILE --rate RATE --retry RETRY\n\nCommand for script [script_name]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --input-file INPUT_FILE\n                        the input file\n  --output-file OUTPUT_FILE\n                        the output_file\n  --rate RATE           the learning rate\n  --retry RETRY\n```\n\nAll declared arguments are required.\n\n### DVC command\n\nA DVC command is a wrapper over a `dvc run` command called on a Python script generated\nwith the `ipynb_to_python` command. It is a step of a pipeline.\n\nIt is based on data declared in the Notebook's Step Metadata.\n\nTwo modes are available:\n* describe only input/output for simple cases (recommended)\n* describe the full command for complex cases\n\n#### Simple cases\n\nSyntax:\n\n```\n:param str input_csv_file: Path to input file\n:param str output_csv_file_1: Path to output file 1\n:param str output_csv_file_2: Path to output file 2\n[...]\n\n[:dvc-[in|out][\\s{related_param}]?:[\\s{file_path}]?]*\n[:dvc-extra: {python_other_param}]?\n\n:dvc-in: ./data/filter.csv\n:dvc-in input_csv_file: ./data/info.csv\n:dvc-out: ./data/train_set_1.csv\n:dvc-out output_csv_file_1: ./data/test_set_1.csv\n:dvc-out-persist: ./data/train_set_2.csv\n:dvc-out-persist output_csv_file_2: ./data/test_set_2.csv\n:dvc-extra: --mode train --rate 12\n```\n\n* `{file_path}` can be absolute or relative to the Working Directory.\n* `{related_param}` is a parameter of the corresponding Python script; its value is\n  passed to the Python script call.\n* 
`dvc-extra` allows declaring parameters which are not dvc outputs or dependencies.\n  Those parameters are provided to the call of the Python command.\n\nGenerated command:\n\n```\npushd /working-directory\n\nINPUT_CSV_FILE=\"./data/info.csv\"\nOUTPUT_CSV_FILE_1=\"./data/test_set_1.csv\"\nOUTPUT_CSV_FILE_2=\"./data/test_set_2.csv\"\n\ndvc run \\\n-d ./data/filter.csv \\\n-d $INPUT_CSV_FILE \\\n-o ./data/train_set_1.csv \\\n-o $OUTPUT_CSV_FILE_1 \\\n--outs-persist ./data/train_set_2.csv \\\n--outs-persist $OUTPUT_CSV_FILE_2 \\\ngen_src/python_script.py --mode train --rate 12 \\\n        --input-csv-file $INPUT_CSV_FILE \\\n        --output-csv-file-1 $OUTPUT_CSV_FILE_1 \\\n        --output-csv-file-2 $OUTPUT_CSV_FILE_2\n```\n\n#### Complex cases\n\nSyntax:\n\n```\n:dvc-cmd: {dvc_command}\n\n:dvc-cmd: dvc run -o ./out_train.csv -o ./out_test.csv\n    \"$MLV_PY_CMD_PATH -m train --out ./out_train.csv \u0026\u0026\n     $MLV_PY_CMD_PATH -m test --out ./out_test.csv\"\n```\n\nThis syntax allows providing the full dvc command to generate. All paths can be\nabsolute or relative to the Working Directory.  The variables `$MLV_PY_CMD_PATH` and\n`$MLV_PY_CMD_NAME` are available. They correspond to the path and the name of the\ncorresponding Python command, respectively. The variable `$MLV_DVC_META_FILENAME`\ncontains the default name of the DVC meta file.\n\nGenerated command:\n\n```\npushd /working-directory\nMLV_PY_CMD_PATH=\"gen_src/python_script.py\"\nMLV_PY_CMD_NAME=\"python_script.py\"\n\ndvc run -f $MLV_DVC_META_FILENAME -o ./out_train.csv \\\n    -o ./out_test.csv \\\n    \"$MLV_PY_CMD_PATH -m train --out ./out_train.csv \u0026\u0026 \\\n    $MLV_PY_CMD_PATH -m test --out ./out_test.csv\"\npopd\n```\n\n### DVC templating\n\nIt is possible to use Jinja2 templates in the DVC Docstring parts. 
For example, it can\nbe useful to declare the dependencies, outputs and extra parameters of all steps.\n\nExample:\n\n```\n# Docstring in Jupyter notebook\n\"\"\"\n[...]\n:dvc-in: {{ conf.train_data_file_path }}\n:dvc-out: {{ conf.model_file_path }}\n:dvc-extra: --rate {{ conf.rate }}\n\"\"\"\n```\n\n```\n# Docstring configuration file (YAML format): ./dc_conf.yml\ntrain_data_file_path: ./data/trainset.csv\nmodel_file_path: ./data/model.pkl\nrate: 45\n```\n\n```\n# DVC command generation\ngen_dvc -i ./python_script.py --docstring-conf ./dc_conf.yml\n```\n\nThe Docstring configuration file can be provided through the main configuration or using\nthe `--docstring-conf` argument. This feature is only available for the `gen_dvc` command.\n\n### Discard cell\n\nSome cells in a Jupyter Notebook are executed only to inspect intermediate results.  In\na Python script those are statements with no effect.  Adding the comment `# No effect`\nto a cell discards its whole content, which avoids wasting time running those statements.  It is\npossible to customize the list of discard keywords; see the Configuration section.\n\n\n## Contributing\n\nWe happily welcome contributions to mlvtools. Please see our [contribution](./CONTRIBUTING.md) guide for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeopledoc%2Fmlvtools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeopledoc%2Fmlvtools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeopledoc%2Fmlvtools/lists"}