{"id":13720030,"url":"https://github.com/iterative/example-get-started","last_synced_at":"2025-06-18T02:39:20.859Z","repository":{"id":37789718,"uuid":"177689917","full_name":"iterative/example-get-started","owner":"iterative","description":"Get started DVC project (NLP, random forest)","archived":false,"fork":false,"pushed_at":"2024-05-27T17:18:14.000Z","size":23,"stargazers_count":181,"open_issues_count":2,"forks_count":183,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-06-13T23:46:04.126Z","etag":null,"topics":["dvc","example","machine-learning","nlp","python","random-forest","reproducibility","reproducible","reproducible-research"],"latest_commit_sha":null,"homepage":"https://dvc.org/doc/start","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iterative.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-26T01:08:17.000Z","updated_at":"2025-06-04T15:09:03.000Z","dependencies_parsed_at":"2024-04-19T18:50:20.215Z","dependency_job_id":"33593140-8c99-4ae2-ba34-edb769dae251","html_url":"https://github.com/iterative/example-get-started","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/iterative/example-get-started","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fexample-get-started","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fexample-get-started/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fexample-get
-started/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fexample-get-started/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iterative","download_url":"https://codeload.github.com/iterative/example-get-started/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fexample-get-started/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260476288,"owners_count":23015006,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dvc","example","machine-learning","nlp","python","random-forest","reproducibility","reproducible","reproducible-research"],"created_at":"2024-08-03T01:00:59.107Z","updated_at":"2025-06-18T02:39:15.845Z","avatar_url":"https://github.com/iterative.png","language":"Python","readme":"[![DVC](https://img.shields.io/badge/-Open_in_Studio-grey.svg?style=flat-square\u0026logo=dvc)](https://studio.iterative.ai/team/Iterative/views/example-get-started-zde16i6c4g)\n\n# DVC Get Started\n\nThis is an auto-generated repository for use in [DVC](https://dvc.org)\n[Get Started](https://dvc.org/doc/get-started). It is a step-by-step quick\nintroduction into basic DVC concepts.\n\n![](https://static.iterative.ai/img/example-get-started/readme-head.png)\n\nThe project is a natural language processing (NLP) binary classifier problem of\npredicting tags for a given StackOverflow question. 
For example, we want one\nclassifier which can predict a post that is about the R language by tagging it\n`R`.\n\n🐛 Please report any issues found in this project here -\n[example-repos-dev](https://github.com/iterative/example-repos-dev).\n\n## Installation\n\nPython 3.9+ is required to run code from this repo.\n\n```console\n$ git clone https://github.com/iterative/example-get-started\n$ cd example-get-started\n```\n\nNow let's install the requirements. But before we do that, we **strongly**\nrecommend creating a virtual environment with a tool such as\n[virtualenv](https://virtualenv.pypa.io/en/stable/):\n\n```console\n$ virtualenv -p python3 .venv\n$ source .venv/bin/activate\n$ pip install -r src/requirements.txt\n```\n\n\u003e This instruction assumes that DVC is already installed, as it is frequently\n\u003e used as a global tool like Git. If DVC is not installed, see the\n\u003e [DVC installation guide](https://dvc.org/doc/install) on how to install DVC.\n\nThis DVC project comes with a preconfigured DVC\n[remote storage](https://dvc.org/doc/commands-reference/remote) that holds raw\ndata (input), intermediate, and final results that are produced. 
This is a\nread-only HTTP remote.\n\n```console\n$ dvc remote list\nstorage https://remote.dvc.org/get-started\n```\n\nYou can run [`dvc pull`](https://man.dvc.org/pull) to download the data:\n\n```console\n$ dvc pull\n```\n\n## Running in your environment\n\nRun [`dvc exp run`](https://man.dvc.org/exp/run) to reproduce the\n[pipeline](https://dvc.org/doc/user-guide/pipelines) and create a new\n[experiment](https://dvc.org/doc/user-guide/experiment-management).\n\n```console\n$ dvc exp run\nRan experiment(s): rapid-cane\nExperiment results have been applied to your workspace.\n```\n\nIf you'd like to test commands like [`dvc push`](https://man.dvc.org/push)\nthat require write access to the remote storage, the easiest way is to set\nup a \"local remote\" on your file system:\n\n\u003e This kind of remote is located in the local file system, but is external to\n\u003e the DVC project.\n\n```console\n$ mkdir -p /tmp/dvc-storage\n$ dvc remote add local /tmp/dvc-storage\n```\n\nYou should now be able to run:\n\n```console\n$ dvc push -r local\n```\n\n## Existing stages\n\nThis project uses Git tags to reflect the sequence of actions that\nare run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel\nfree to check out any of these tags and try DVC commands in a ready-made\nplayground.\n\n- `0-git-init`: Empty Git repository initialized.\n- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory\n  created.\n- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using\n  [`dvc add`](https://man.dvc.org/add). First `.dvc` file created.\n- `3-config-remote`: Remote HTTP storage initialized. 
It's a shared read-only\n  storage that contains all data artifacts produced during the next steps.\n- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data\n  registry.\n- `5-source-code`: Source code downloaded and put into Git.\n- `6-prepare-stage`: Create `dvc.yaml` and the first pipeline stage with\n  [`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV.\n- `7-ml-pipeline`: Feature extraction and train stages created. Feature\n  extraction takes data in TSV format and produces two `.pkl` files that contain\n  serialized feature matrices. The train stage runs a random forest classifier\n  and creates the `model.pkl` file.\n- `8-evaluation`: Evaluation stage. Runs the model on a test dataset to produce\n  its performance AUC value. The result is dumped into a DVC metric file so that\n  we can compare it with other experiments later.\n- `9-bigrams-model`: Bigrams experiment. The code has been modified to extract\n  more features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time\n  to illustrate how DVC can reuse cached files and detect changes along the\n  computational graph, regenerating the model with the updated data.\n- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigram-based\n  model.\n- `11-random-forest-experiments`: Reproduce experiments to tune the random\n  forest classifier parameters and select the best experiment.\n\nThere are three additional tags:\n\n- `baseline-experiment`: First end-to-end result that we have a performance\n  metric for.\n- `bigrams-experiment`: Second experiment (model trained using bigram\n  features).\n- `random-forest-experiments`: Best of the additional experiments tuning random\n  forest parameters.\n\nThese tags can be used to illustrate the `-a` or `-T` options across different\n[DVC commands](https://man.dvc.org/).\n\n## Project structure\n\nThe data files, DVC files, and results change as stages are created one by one.\nAfter cloning and using [`dvc pull`](https://man.dvc.org/pull) to 
download\ndata, models, and plots tracked by DVC, the workspace should look like this:\n\n```console\n$ tree\n.\n├── README.md\n├── data                  # \u003c-- Directory with raw and intermediate data\n│   ├── data.xml          # \u003c-- Initial XML StackOverflow dataset (raw data)\n│   ├── data.xml.dvc      # \u003c-- .dvc file - a placeholder/pointer to raw data\n│   ├── features          # \u003c-- Extracted feature matrices\n│   │   ├── test.pkl\n│   │   └── train.pkl\n│   └── prepared          # \u003c-- Processed dataset (split and TSV formatted)\n│       ├── test.tsv\n│       └── train.tsv\n├── dvc.lock\n├── dvc.yaml              # \u003c-- DVC pipeline file\n├── eval\n│   ├── metrics.json      # \u003c-- Binary classifier final metrics (e.g. AUC)\n│   └── plots             \n│       ├── images\n│       │   └── importance.png    # \u003c-- Feature importance plot\n│       └── sklearn       # \u003c-- Data points for ROC, confusion matrix\n│           ├── cm\n│           │   ├── test.json\n│           │   └── train.json\n│           ├── prc\n│           │   ├── test.json\n│           │   └── train.json\n│           └── roc\n│               ├── test.json\n│               └── train.json\n├── model.pkl             # \u003c-- Trained model file\n├── params.yaml           # \u003c-- Parameters file\n└── src                   # \u003c-- Source code to run the pipeline stages\n    ├── evaluate.py\n    ├── featurization.py\n    ├── prepare.py\n    ├── requirements.txt  # \u003c-- Python dependencies needed in the project\n    └── train.py\n```\n","funding_links":[],"categories":["Tutorials","Python"],"sub_categories":["Iterative"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiterative%2Fexample-get-started","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiterative%2Fexample-get-started","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiterative%2Fexample-get-started/lists"}