{"id":15763806,"url":"https://github.com/dacbd/example-get-started","last_synced_at":"2025-10-24T20:51:24.067Z","repository":{"id":142596824,"uuid":"482688011","full_name":"dacbd/example-get-started","owner":"dacbd","description":"Get Started DVC project","archived":false,"fork":false,"pushed_at":"2022-04-18T02:54:15.000Z","size":99,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-20T07:32:50.865Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://dvc.org/doc/get-started","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dacbd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-18T02:23:11.000Z","updated_at":"2022-05-11T04:58:21.000Z","dependencies_parsed_at":"2023-06-07T05:30:48.520Z","dependency_job_id":null,"html_url":"https://github.com/dacbd/example-get-started","commit_stats":{"total_commits":18,"total_committers":2,"mean_commits":9.0,"dds":"0.33333333333333337","last_synced_commit":"c20cee52d6f25fceda57949e92c88f874cb93c54"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/dacbd/example-get-started","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fexample-get-started","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fexample-get-started/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fexample-get-started/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/dacbd%2Fexample-get-started/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dacbd","download_url":"https://codeload.github.com/dacbd/example-get-started/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fexample-get-started/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280865097,"owners_count":26404443,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-24T02:00:06.418Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-04T12:01:11.745Z","updated_at":"2025-10-24T20:51:24.052Z","avatar_url":"https://github.com/dacbd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DVC Get Started\n\nThis is an auto-generated repository for use in DVC\n[Get Started](https://dvc.org/doc/get-started). It is a step-by-step quick\nintroduction into basic DVC concepts.\n\n![](https://dvc.org/img/example-flow-2x.png)\n\nThe project is a natural language processing (NLP) binary classifier problem of\npredicting tags for a given StackOverflow question. 
For example, we want one\nclassifier which can predict a post that is about the Python language by tagging\nit `python`.\n\n🐛 Please report any issues found in this project here -\n[example-repos-dev](https://github.com/iterative/example-repos-dev).\n\n## Installation\n\nPython 3.6+ is required to run code from this repo.\n\n```console\n$ git clone https://github.com/iterative/example-get-started\n$ cd example-get-started\n```\n\nNow let's install the requirements. But before we do that, we **strongly**\nrecommend creating a virtual environment with a tool such as\n[virtualenv](https://virtualenv.pypa.io/en/stable/):\n\n```console\n$ virtualenv -p python3 .env\n$ source .env/bin/activate\n$ pip install -r src/requirements.txt\n```\n\n\u003e This instruction assumes that DVC is already installed, as it is frequently\n\u003e used as a global tool like Git. If DVC is not installed, see the\n\u003e [DVC installation guide](https://dvc.org/doc/install) on how to install DVC.\n\nThis DVC project comes with a preconfigured DVC\n[remote storage](https://dvc.org/doc/commands-reference/remote) that holds raw\ndata (input), intermediate, and final results that are produced. 
This is a\nread-only HTTP remote.\n\n```console\n$ dvc remote list\nstorage https://remote.dvc.org/get-started\n```\n\nYou can run [`dvc pull`](https://man.dvc.org/pull) to download the data:\n\n```console\n$ dvc pull\n```\n\n## Running in your environment\n\nRun [`dvc repro`](https://man.dvc.org/repro) to reproduce the\n[pipeline](https://dvc.org/doc/commands-reference/pipeline):\n\n```console\n$ dvc repro\nData and pipelines are up to date.\n```\n\nIf you'd like to test commands like [`dvc push`](https://man.dvc.org/push)\nthat require write access to the remote storage, the easiest way would be to set\nup a \"local remote\" on your file system:\n\n\u003e This kind of remote is located in the local file system, but is external to\n\u003e the DVC project.\n\n```console\n$ mkdir -p /tmp/dvc-storage\n$ dvc remote add local /tmp/dvc-storage\n```\n\nYou should now be able to run:\n\n```console\n$ dvc push -r local\n```\n\n## Existing stages\n\nWith the help of Git tags, this project reflects the sequence of actions that\nare run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel\nfree to check out one of them and play with the DVC commands with the\nplayground ready.\n\n- `0-git-init`: Empty Git repository initialized.\n- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory\n  created.\n- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using\n  [`dvc add`](https://man.dvc.org/add). First `.dvc` file created.\n- `3-config-remote`: Remote HTTP storage initialized. It's a shared read-only\n  storage that contains all data artifacts produced during the next steps.\n- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data\n  registry.\n- `5-source-code`: Source code downloaded and put into Git.\n- `6-prepare-stage`: Create `dvc.yaml` and the first pipeline stage with\n  [`dvc run`](https://man.dvc.org/run). 
It transforms XML data into TSV.\n- `7-ml-pipeline`: Feature extraction and train stages created. It takes data in\n  TSV format and produces two `.pkl` files that contain serialized feature\n  matrices. Train runs a random forest classifier and creates the `model.pkl` file.\n- `8-evaluation`: Evaluation stage. Runs the model on a test dataset to produce\n  its performance AUC value. The result is dumped into a DVC metric file so that\n  we can compare it with other experiments later.\n- `9-bigrams-model`: Bigrams experiment: code has been modified to extract more\n  features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time\n  to illustrate how DVC can reuse cached files and detect changes along the\n  computational graph, regenerating the model with the updated data.\n- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams-based\n  model.\n- `11-random-forest-experiments`: Reproduce experiments to tune the random\n  forest classifier parameters and select the best experiment.\n\nThere are three additional tags:\n\n- `baseline-experiment`: First end-to-end result that we have a performance metric\n  for.\n- `bigrams-experiment`: Second experiment (model trained using bigrams\n  features).\n- `random-forest-experiments`: Best of additional experiments tuning random\n  forest parameters.\n\nThese tags can be used to illustrate the `-a` or `-T` options across different\n[DVC commands](https://man.dvc.org/).\n\n## Project structure\n\nThe data files, DVC files, and results change as stages are created one by one.\nAfter cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data\ntracked by DVC, the workspace should look like this:\n\n```console\n$ tree\n.\n├── README.md\n├── data                  # \u003c-- Directory with raw and intermediate data\n│   ├── data.xml          # \u003c-- Initial XML StackOverflow dataset (raw data)\n│   ├── data.xml.dvc      # \u003c-- .dvc file - a placeholder/pointer to raw data\n│   ├── 
features          # \u003c-- Extracted feature matrices\n│   │   ├── test.pkl\n│   │   └── train.pkl\n│   └── prepared          # \u003c-- Processed dataset (split and TSV formatted)\n│       ├── test.tsv\n│       └── train.tsv\n├── dvc.lock\n├── dvc.yaml              # \u003c-- DVC pipeline file\n├── model.pkl             # \u003c-- Trained model file\n├── params.yaml           # \u003c-- Parameters file\n├── prc.json              # \u003c-- Precision-recall curve data points\n├── roc.json              # \u003c-- ROC curve data points\n├── scores.json           # \u003c-- Binary classifier final metrics (e.g. AUC)\n└── src                   # \u003c-- Source code to run the pipeline stages\n    ├── evaluate.py\n    ├── featurization.py\n    ├── prepare.py\n    ├── requirements.txt  # \u003c-- Python dependencies needed in the project\n    └── train.py\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdacbd%2Fexample-get-started","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdacbd%2Fexample-get-started","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdacbd%2Fexample-get-started/lists"}