{"id":13476825,"url":"https://github.com/data-centric-ai/dcbench","last_synced_at":"2025-03-27T04:31:12.743Z","repository":{"id":77335173,"uuid":"390444855","full_name":"data-centric-ai/dcbench","owner":"data-centric-ai","description":"A benchmark of data-centric tasks from across the machine learning lifecycle.","archived":false,"fork":false,"pushed_at":"2022-06-08T21:03:59.000Z","size":641,"stargazers_count":70,"open_issues_count":3,"forks_count":9,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-01-28T23:12:12.691Z","etag":null,"topics":["data-science","machine-learning"],"latest_commit_sha":null,"homepage":"https://www.datacentricai.cc/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/data-centric-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-07-28T17:35:26.000Z","updated_at":"2024-01-04T16:59:49.000Z","dependencies_parsed_at":"2024-01-06T14:50:19.032Z","dependency_job_id":null,"html_url":"https://github.com/data-centric-ai/dcbench","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-centric-ai%2Fdcbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-centric-ai%2Fdcbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-centric-ai%2Fdcbench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-centric-ai%2Fdcbench/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/data-centric-ai","download_url":"https://codeload.github.com/data-centric-ai/dcbench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245784820,"owners_count":20671620,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","machine-learning"],"created_at":"2024-07-31T16:01:35.036Z","updated_at":"2025-03-27T04:31:10.625Z","avatar_url":"https://github.com/data-centric-ai.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"docs/assets/banner.png\" height=150 alt=\"banner\"/\u003e\n\n-----\n![GitHub Workflow Status](https://img.shields.io/github/workflow/status/data-centric-ai/dcbench/CI)\n![GitHub](https://img.shields.io/github/license/data-centric-ai/dcbench)\n[![Documentation Status](https://readthedocs.org/projects/dcbench/badge/?version=latest)](https://dcbench.readthedocs.io/en/latest/?badge=latest)\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit\u0026logoColor=white)](https://github.com/pre-commit/pre-commit)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dcbench)](https://pypi.org/project/dcbench/)\n[![codecov](https://codecov.io/gh/data-centric-ai/dcbench/branch/main/graph/badge.svg?token=MOLQYUSYQU)](https://codecov.io/gh/data-centric-ai/dcbench)\n\nA benchmark of data-centric tasks from across the machine learning lifecycle.\n\n[**Getting Started**](#%EF%B8%8F-quickstart)\n| 
[**What is dcbench?**](#-what-is-dcbench)\n| [**Docs**](https://dcbench.readthedocs.io/en/latest/index.html)\n| [**Contributing**](CONTRIBUTING.md)\n| [**Website**](https://www.datacentricai.cc/)\n| [**About**](#%EF%B8%8F-about)\n\u003c/div\u003e\n\n\n## ⚡️ Quickstart\n\n```bash\npip install dcbench\n```\n\u003e Optional: some parts of Meerkat rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so using something like `pip install dcbench[dev]` instead. See setup.py for a full list of optional dependencies.\n\n\u003e Installing from dev: `pip install \"dcbench[dev] @ git+https://github.com/data-centric-ai/dcbench@main\"`\n\nUsing a Jupyter notebook or some other interactive environment, you can import the library \nand explore the data-centric problems in the benchmark:\n\n```python\nimport dcbench\ndcbench.tasks\n```\nTo learn more, follow the [walkthrough](https://dcbench.readthedocs.io/en/latest/intro.html#api-walkthrough) in the docs. \n\n\n## 💡 What is dcbench?\nThis benchmark evaluates the steps in your machine learning workflow beyond model training and tuning. This includes feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they're focused on exploring and manipulating data – not training models. ``dcbench`` supports a growing list of them:\n\n* [Minimal Data Selection](https://dcbench.readthedocs.io/en/latest/tasks.html#minimal-data-selection)\n* [Slice Discovery](https://dcbench.readthedocs.io/en/latest/tasks.html#slice-discovery)\n* [Minimal Feature Cleaning](https://dcbench.readthedocs.io/en/latest/tasks.html#minimal-feature-cleaning)\n\n\n``dcbench`` includes tasks that look very different from one another: the inputs and\noutputs of the slice discovery task are not the same as those of the\nminimal data cleaning task. 
However, we think it important that\nresearchers and practitioners be able to run evaluations on data-centric\ntasks across the ML lifecycle without having to learn a bunch of\ndifferent APIs or rewrite evaluation scripts.\n\nSo, ``dcbench`` is designed to be a common home for these diverse, but\nrelated, tasks. In ``dcbench`` all of these tasks are structured in a\nsimilar manner and they are supported by a common Python API that makes\nit easy to download data, run evaluations, and compare methods.\n\n\n## ✉️ About\n`dcbench` is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdata-centric-ai%2Fdcbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdata-centric-ai%2Fdcbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdata-centric-ai%2Fdcbench/lists"}