{"id":15763874,"url":"https://github.com/dacbd/studio-test-repo","last_synced_at":"2025-03-31T10:19:23.533Z","repository":{"id":142596899,"uuid":"556990049","full_name":"dacbd/studio-test-repo","owner":"dacbd","description":"test","archived":false,"fork":false,"pushed_at":"2022-10-25T17:36:30.000Z","size":38,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-11T12:03:29.575Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dacbd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-24T22:25:11.000Z","updated_at":"2022-10-24T22:38:09.000Z","dependencies_parsed_at":"2023-06-19T04:10:51.014Z","dependency_job_id":null,"html_url":"https://github.com/dacbd/studio-test-repo","commit_stats":{"total_commits":2,"total_committers":1,"mean_commits":2.0,"dds":0.0,"last_synced_commit":"b26f458bbb1bc0b3fd2e82bee76b08046254ce2e"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fstudio-test-repo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fstudio-test-repo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fstudio-test-repo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dacbd%2Fstudio-test-repo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/dacbd","download_url":"https://codeload.github.com/dacbd/studio-test-repo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246450478,"owners_count":20779421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-04T12:01:16.499Z","updated_at":"2025-03-31T10:19:23.509Z","avatar_url":"https://github.com/dacbd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![DVC](https://img.shields.io/badge/-Open_in_Studio-grey.svg?style=flat-square\u0026logo=data-version-control)](https://studio.iterative.ai/team/Iterative/views/example-get-started-zde16i6c4g) [![DVC-metrics](https://img.shields.io/badge/dynamic/json?style=flat-square\u0026colorA=grey\u0026colorB=F46737\u0026label=Average%20Precision\u0026url=https://github.com/iterative/example-get-started/raw/main/evaluation.json\u0026query=avg_prec)](https://github.com/iterative/example-get-started/raw/main/evaluation.json)\n\n# DVC Get Started\n\nThis is an auto-generated repository for use in [DVC](https://dvc.org)\n[Get Started](https://dvc.org/doc/get-started). It is a step-by-step quick\nintroduction into basic DVC concepts.\n\n![](https://static.iterative.ai/img/example-get-started/readme-head.png)\n\nThe project is a natural language processing (NLP) binary classifier problem of\npredicting tags for a given StackOverflow question. 
For example, we want one\nclassifier which can predict a post that is about the R language by tagging it\n`R`.\n\n🐛 Please report any issues found in this project here -\n[example-repos-dev](https://github.com/iterative/example-repos-dev).\n\n## Installation\n\nPython 3.7+ is required to run code from this repo.\n\n```console\n$ git clone https://github.com/iterative/example-get-started\n$ cd example-get-started\n```\n\nNow let's install the requirements. But before we do that, we **strongly**\nrecommend creating a virtual environment with a tool such as\n[virtualenv](https://virtualenv.pypa.io/en/stable/):\n\n```console\n$ virtualenv -p python3 .venv\n$ source .venv/bin/activate\n$ pip install -r src/requirements.txt\n```\n\n\u003e This instruction assumes that DVC is already installed, as it is frequently\n\u003e used as a global tool like Git. If DVC is not installed, see the\n\u003e [DVC installation guide](https://dvc.org/doc/install) on how to install DVC.\n\nThis DVC project comes with a preconfigured DVC\n[remote storage](https://dvc.org/doc/commands-reference/remote) that holds raw\ndata (input), intermediate, and final results that are produced. 
This is a\nread-only HTTP remote.\n\n```console\n$ dvc remote list\nstorage https://remote.dvc.org/get-started\n```\n\nYou can run [`dvc pull`](https://man.dvc.org/pull) to download the data:\n\n```console\n$ dvc pull\n```\n\n## Running in your environment\n\nRun [`dvc repro`](https://man.dvc.org/repro) to reproduce the\n[pipeline](https://dvc.org/doc/commands-reference/pipeline):\n\n```console\n$ dvc repro\nData and pipelines are up to date.\n```\n\nIf you'd like to test commands such as [`dvc push`](https://man.dvc.org/push)\nthat require write access to the remote storage, the easiest way is to set\nup a \"local remote\" on your file system:\n\n\u003e This kind of remote is located in the local file system, but is external to\n\u003e the DVC project.\n\n```console\n$ mkdir -p /tmp/dvc-storage\n$ dvc remote add local /tmp/dvc-storage\n```\n\nYou should now be able to run:\n\n```console\n$ dvc push -r local\n```\n\n## Existing stages\n\nThe Git tags in this project reflect the sequence of actions that are run in the\nDVC [get started](https://dvc.org/doc/get-started) guide. Feel free to check out\nany of them and try the DVC commands with the playground ready.\n\n- `0-git-init`: Empty Git repository initialized.\n- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory\n  created.\n- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using\n  [`dvc add`](https://man.dvc.org/add). First `.dvc` file created.\n- `3-config-remote`: Remote HTTP storage initialized. It's a shared read-only\n  storage that contains all data artifacts produced during the next steps.\n- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data\n  registry.\n- `5-source-code`: Source code downloaded and put into Git.\n- `6-prepare-stage`: Create `dvc.yaml` and the first pipeline stage with\n  [`dvc run`](https://man.dvc.org/run). 
It transforms XML data into TSV.\n- `7-ml-pipeline`: Feature extraction and train stages created. Feature\n  extraction takes data in TSV format and produces two `.pkl` files that contain\n  serialized feature matrices. The train stage runs a random forest classifier\n  and creates the `model.pkl` file.\n- `8-evaluation`: Evaluation stage. It runs the model on a test dataset to\n  produce its performance AUC value. The result is dumped into a DVC metric file\n  so that we can compare it with other experiments later.\n- `9-bigrams-model`: Bigrams experiment. The code has been modified to extract\n  more features. We run [`dvc repro`](https://man.dvc.org/repro) for the first\n  time to illustrate how DVC can reuse cached files and detect changes along the\n  computational graph, regenerating the model with the updated data.\n- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams-based\n  model.\n- `11-random-forest-experiments`: Reproduce experiments to tune the random\n  forest classifier parameters and select the best experiment.\n\nThere are three additional tags:\n\n- `baseline-experiment`: First end-to-end result that we have a performance\n  metric for.\n- `bigrams-experiment`: Second experiment (model trained using bigrams\n  features).\n- `random-forest-experiments`: Best of the additional experiments tuning random\n  forest parameters.\n\nThese tags can be used to illustrate the `-a` or `-T` options across different\n[DVC commands](https://man.dvc.org/).\n\n## Project structure\n\nThe data files, DVC files, and results change as stages are created one by one.\nAfter cloning and using [`dvc pull`](https://man.dvc.org/pull) to download\ndata, models, and plots tracked by DVC, the workspace should look like this:\n\n```console\n$ tree\n.\n├── README.md\n├── data                  # \u003c-- Directory with raw and intermediate data\n│   ├── data.xml          # \u003c-- Initial XML StackOverflow dataset (raw data)\n│   ├── data.xml.dvc      # \u003c-- .dvc file - a placeholder/pointer to raw 
data\n│   ├── features          # \u003c-- Extracted feature matrices\n│   │   ├── test.pkl\n│   │   └── train.pkl\n│   └── prepared          # \u003c-- Processed dataset (split and TSV formatted)\n│       ├── test.tsv\n│       └── train.tsv\n├── evaluation\n│   ├── importance.png    # \u003c-- Feature importance plot\n│   └── plots             # \u003c-- Data points for ROC, PRC, confusion matrix\n│       ├── confusion_matrix.json\n│       ├── precision_recall.json\n│       └── roc.json\n├── dvc.lock\n├── dvc.yaml              # \u003c-- DVC pipeline file\n├── model.pkl             # \u003c-- Trained model file\n├── params.yaml           # \u003c-- Parameters file\n├── evaluation.json       # \u003c-- Binary classifier final metrics (e.g. AUC)\n└── src                   # \u003c-- Source code to run the pipeline stages\n    ├── evaluate.py\n    ├── featurization.py\n    ├── prepare.py\n    ├── requirements.txt  # \u003c-- Python dependencies needed in the project\n    └── train.py\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdacbd%2Fstudio-test-repo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdacbd%2Fstudio-test-repo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdacbd%2Fstudio-test-repo/lists"}