{"id":16999862,"url":"https://github.com/davidberenstein1957/dataset-viber","last_synced_at":"2025-03-06T05:13:05.760Z","repository":{"id":252072046,"uuid":"839344278","full_name":"davidberenstein1957/dataset-viber","owner":"davidberenstein1957","description":"Dataset Viber is your chill repo for data collection, annotation and vibe checks.","archived":false,"fork":false,"pushed_at":"2024-09-05T07:13:10.000Z","size":1361,"stargazers_count":45,"open_issues_count":13,"forks_count":12,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-27T04:14:53.891Z","etag":null,"topics":["data-collection","data-quality","evaluation","human-feedback"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidberenstein1957.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-07T12:23:36.000Z","updated_at":"2025-02-26T17:32:05.000Z","dependencies_parsed_at":"2024-08-07T15:01:41.372Z","dependency_job_id":"0b2f0e6c-e8f4-4837-ae8d-6d3b2da38f23","html_url":"https://github.com/davidberenstein1957/dataset-viber","commit_stats":null,"previous_names":["davidberenstein1957/gradio-data-collectors","davidberenstein1957/awesome-data-collectors","davidberenstein1957/dataset-viber"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fdataset-viber","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fdataset-viber/tags","releases_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fdataset-viber/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fdataset-viber/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidberenstein1957","download_url":"https://codeload.github.com/davidberenstein1957/dataset-viber/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242150807,"owners_count":20080006,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-collection","data-quality","evaluation","human-feedback"],"created_at":"2024-10-14T04:10:24.795Z","updated_at":"2025-03-06T05:13:05.728Z","avatar_url":"https://github.com/davidberenstein1957.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n  \u003ca href=\"\"\u003e\u003cimg src=\"https://cdn-icons-png.flaticon.com/512/2091/2091395.png\" alt=\"dataset-viber\" width=\"150\"\u003e\u003c/a\u003e\n  \u003cbr\u003e\n  Dataset Viber\n  \u003cbr\u003e\n\u003c/h1\u003e\n\n\u003ch3 align=\"center\"\u003eAvoid the hype, check the vibe!\u003c/h3\u003e\n\nI've cooked up Dataset Viber, a cool set of tools to make your life easier when dealing with data for AI models. Dataset Viber is all about making your data prep journey smooth and fun. It's **not for team collaboration or production**, nor trying to be all fancy and formal - just a bunch of **cool tools to help you collect feedback and do vibe-checks** as an AI engineer or lover. 
Want to see it in action? Just plug it in and start vibing with your data. It's that easy!\n\n- **CollectorInterface**: Lazily collect data of model interactions without human annotation.\n- **AnnotatorInterface**: Walk through your data and annotate it with models in the loop.\n- **Synthesizer**: Synthesize data with `distilabel` in the loop.\n- **BulkInterface**: Explore your data distribution and annotate in bulk.\n\nNeed any tweaks or want to hear more about a specific tool? Just [open an issue](https://github.com/davidberenstein1957/dataset-viber/issues/new) or give me a shout!\n\n\u003e [!NOTE]\n\u003e\n\u003e - Data is logged to a local CSV or directly to the Hugging Face Hub.\n\u003e - All tools also run in `.ipynb` notebooks.\n\u003e - Models in the loop through `fn_model`.\n\u003e - Input with custom data streamers or pre-built `Synthesizer` classes with the `fn_next_input` argument.\n\u003e - It supports various tasks for `text`, `chat` and `image` modalities.\n\u003e - Import and export from the Hugging Face Hub or CSV files.\n\n\u003e [!TIP]\n\u003e\n\u003e - Code examples: [src/dataset_viber/examples](https://github.com/davidberenstein1957/dataset-viber/tree/main/src/dataset_viber/examples).\n\u003e - Hub examples: [https://huggingface.co/dataset-viber](https://huggingface.co/dataset-viber).\n\n## Installation\n\nYou can install the package via pip:\n\n```bash\npip install dataset-viber\n```\n\nOr install `Synthesizer` dependencies. 
Note that the `Synthesizer` relies on `distilabel[hf-inference-endpoints]`, but you can use other [LLMs available to distilabel](https://distilabel.argilla.io) too, for example `distilabel[ollama]`.\n\n```bash\npip install dataset-viber[synthesizer]\n```\n\nOr install `BulkInterface` dependencies:\n\n```bash\npip install dataset-viber[bulk]\n```\n\n## How are we vibing?\n\n### CollectorInterface\n\n\u003e Built on top of `gr.Interface` and `gr.ChatInterface` to lazily collect data for interactions automatically.\n\n\u003chttps://github.com/user-attachments/assets/4ddac8a1-62ab-4b3b-9254-f924f5898075\u003e\n\n[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-token-classification)\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eCollectorInterface\u003c/code\u003e\u003c/summary\u003e\n\n```python\nimport gradio as gr\nfrom dataset_viber import CollectorInterface\n\ndef calculator(num1, operation, num2):\n    if operation == \"add\":\n        return num1 + num2\n    elif operation == \"subtract\":\n        return num1 - num2\n    elif operation == \"multiply\":\n        return num1 * num2\n    elif operation == \"divide\":\n        return num1 / num2\n\ninputs = [\"number\", gr.Radio([\"add\", \"subtract\", \"multiply\", \"divide\"]), \"number\"]\noutputs = \"number\"\n\ninterface = CollectorInterface(\n    fn=calculator,\n    inputs=inputs,\n    outputs=outputs,\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=\"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\"\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eCollectorInterface.from_interface\u003c/code\u003e\u003c/summary\u003e\n\n```python\ninterface = gr.Interface(\n    fn=calculator,\n    inputs=inputs,\n    outputs=outputs\n)\ninterface = CollectorInterface.from_interface(\n   interface=interface,\n   csv_logger=False, # True if you want to log to a CSV\n   
dataset_name=\"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\"\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eCollectorInterface.from_pipeline\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom transformers import pipeline\nfrom dataset_viber import CollectorInterface\n\npipeline = pipeline(\"text-classification\", model=\"mrm8488/bert-tiny-finetuned-sms-spam-detection\")\ninterface = CollectorInterface.from_pipeline(\n    pipeline=pipeline,\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=\"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\"\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n### AnnotatorInterface\n\n\u003e Built on top of the `CollectorInterface` to collect and annotate data and log it to the Hub.\n\n\n#### Text\n\nhttps://github.com/user-attachments/assets/d1abda66-9972-4c60-89d2-7626f5654f15\n\n[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-text-classification)\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etext-classification\u003c/code\u003e/\u003ccode\u003emulti-label-text-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\ntexts = [\n    \"Anthony Bourdain was an amazing chef!\",\n    \"Anthony Bourdain was a terrible tv persona!\"\n]\nlabels = [\"positive\", \"negative\"]\n\ninterface = AnnotatorInterFace.for_text_classification(\n    texts=texts,\n    labels=labels,\n    multi_label=False, # True if you have multi-label data\n    fn_model=None, # a callable e.g. 
(function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etoken-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\ntexts = [\"Anthony Bourdain was an amazing chef in New York.\"]\nlabels = [\"NAME\", \"LOC\"]\n\ninterface = AnnotatorInterFace.for_token_classification(\n    texts=texts,\n    labels=labels,\n    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eextractive-question-answering\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nquestions = [\"Where was Anthony Bourdain located?\"]\ncontexts = [\"Anthony Bourdain was an amazing chef in New York.\"]\n\ninterface = AnnotatorInterFace.for_question_answering(\n    questions=questions,\n    contexts=contexts,\n    fn_model=None, # a callable e.g. 
(function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etext-generation\u003c/code\u003e/\u003ccode\u003etranslation\u003c/code\u003e/\u003ccode\u003ecompletion\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nprompts = [\"Tell me something about Anthony Bourdain.\"]\ncompletions = [\"Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.\"]\n\ninterface = AnnotatorInterFace.for_text_generation(\n    prompts=prompts, # source\n    completions=completions, # optional to show initial completion / target\n    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etext-generation-preference\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nprompts = [\"Tell me something about Anthony Bourdain.\"]\ncompletions_a = [\"Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.\"]\ncompletions_b = [\"Anthony Michael Bourdain was an cool guy that knew how to cook.\"]\n\ninterface = AnnotatorInterFace.for_text_generation_preference(\n    prompts=prompts,\n    completions_a=completions_a,\n    completions_b=completions_b,\n    fn_model=None, # a callable 
e.g. (function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n#### Chat and multi-modal chat\n\nhttps://github.com/user-attachments/assets/fe7f0139-95a3-40e8-bc03-e37667d4f7a9\n\n[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-chat-generation-preference)\n\n\u003e [!TIP]\n\u003e I recommend uploading the files to cloud storage and using the remote URL to avoid any issues. This can be done [using Hugging Face Datasets](https://huggingface.co/docs/datasets/en/image_load#local-files), as shown in [utils](#utils). Additionally, [GradioChatbot](https://www.gradio.app/docs/gradio/chatbot#behavior) shows how to use the chatbot interface for multi-modal chat.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nprompts = [\n    [\n        {\n            \"role\": \"user\",\n            \"content\": \"Tell me something about Anthony Bourdain.\"\n        },\n        {\n            \"role\": \"assistant\",\n            \"content\": \"Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.\"\n        }\n    ]\n]\n\ninterface = AnnotatorInterFace.for_chat_classification(\n    prompts=prompts,\n    labels=[\"toxic\", \"non-toxic\"],\n    multi_label=False, # True if you have multi-label data\n    fn_model=None, # a callable e.g. 
(function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-generation\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nprompts = [\n    [\n        {\n            \"role\": \"user\",\n            \"content\": \"Tell me something about Anthony Bourdain.\"\n        }\n    ]\n]\n\ncompletions = [\n    \"Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.\",\n]\n\ninterface = AnnotatorInterFace.for_chat_generation(\n    prompts=prompts,\n    completions=completions,\n    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-generation-preference\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nprompts = [\n    [\n        {\n            \"role\": \"user\",\n            \"content\": \"Tell me something about Anthony Bourdain.\"\n        }\n    ]\n]\ncompletions_a = [\n    \"Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.\",\n]\ncompletions_b = [\n    \"Anthony Michael Bourdain was an cool guy that knew how to cook.\"\n]\n\ninterface = AnnotatorInterFace.for_chat_generation_preference(\n    prompts=prompts,\n    
completions_a=completions_a,\n    completions_b=completions_b,\n    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n#### Image and multi-modal\n\n\u003chttps://github.com/user-attachments/assets/57d89edf-ae40-4942-a20a-bf8443100b66\u003e\n\n[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-image-question-answering)\n\n\u003e [!TIP]\n\u003e I recommend uploading the files to cloud storage and using the remote URL to avoid any issues. This can be done [using Hugging Face Datasets](https://huggingface.co/docs/datasets/en/image_load#local-files), as shown in [utils](#utils).\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eimage-classification\u003c/code\u003e/\u003ccode\u003emulti-label-image-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nimages = [\n    \"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg\",\n    \"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg\"\n]\nlabels = [\"anthony-bourdain\", \"not-anthony-bourdain\"]\n\ninterface = AnnotatorInterFace.for_image_classification(\n    images=images,\n    labels=labels,\n    multi_label=False, # True if you have multi-label data\n    fn_model=None, # a callable e.g. 
(function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eimage-generation\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nprompts = [\n    \"Anthony Bourdain laughing\",\n    \"David Chang wearing a suit\"\n]\nimages = [\n    \"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg\",\n    \"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg\",\n]\n\ninterface = AnnotatorInterFace.for_image_generation(\n    prompts=prompts,\n    completions=images,\n    fn_model=None, # a callable e.g. 
(function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\n\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eimage-description\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nimages = [\n    \"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg\",\n    \"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg\"\n]\ndescriptions = [\"Anthony Bourdain laughing\", \"David Chang wearing a suit\"]\n\ninterface = AnnotatorInterFace.for_image_description(\n    images=images,\n    descriptions=descriptions, # optional to show initial descriptions\n    fn_model=None, # a callable e.g. 
(function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eimage-question-answering\u003c/code\u003e/\u003ccode\u003evisual-question-answering\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nimages = [\n    \"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg\",\n    \"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg\"\n]\nquestions = [\"Who is this?\", \"What is he wearing?\"]\nanswers = [\"Anthony Bourdain\", \"a suit\"]\n\ninterface = AnnotatorInterFace.for_image_question_answering(\n    images=images,\n    questions=questions, # optional to show initial questions\n    answers=answers, # optional to show initial answers\n    fn_model=None, # a callable e.g. 
(function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eimage-generation-preference\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\n\nprompts = [\n    \"Anthony Bourdain laughing\",\n    \"David Chang wearing a suit\"\n]\n\nimages_a = [\n    \"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg\",\n    \"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg\",\n]\n\nimages_b = [\n    \"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg\",\n    \"https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg\"\n]\n\ninterface = AnnotatorInterFace.for_image_generation_preference(\n    prompts=prompts,\n    completions_a=images_a,\n    completions_b=images_b,\n    fn_model=None, # a callable e.g. (function or transformers pipelines) that returns `str`\n    fn_next_input=None, # a function that feeds gradio components actively with the next input\n    csv_logger=False, # True if you want to log to a CSV\n    dataset_name=None # \"\u003cmy_hf_org\u003e/\u003cmy_dataset\u003e\" if you want to log to the hub\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n### Synthesizer\n\n\u003e Built on top of `distilabel` to synthesize data with models in the loop.\n\n\u003e [!TIP]\n\u003e You can also call the synthesizer directly to generate data. Use 
`synthesizer() -\u003e Tuple` or `Synthesizer.batch_synthesize(n) -\u003e List[Tuple]` to get inputs for the various tasks.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etext-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_text_classification(\n    prompt_context=\"IMDB movie reviews\"\n)\n\ninterface = AnnotatorInterFace.for_text_classification(\n    fn_next_input=synthesizer,\n    labels=[\"positive\", \"negative\"]\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etext-generation\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_text_generation(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = AnnotatorInterFace.for_text_generation(\n    fn_next_input=synthesizer\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_chat_classification(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = AnnotatorInterFace.for_chat_classification(\n    fn_next_input=synthesizer,\n    labels=[\"positive\", \"negative\"]\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-generation\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_chat_generation(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = 
AnnotatorInterFace.for_chat_generation(\n    fn_next_input=synthesizer\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-generation-preference\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_chat_generation_preference(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = AnnotatorInterFace.for_chat_generation_preference(\n    fn_next_input=synthesizer\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003eimage-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_image_classification(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = AnnotatorInterFace.for_image_classification(\n    fn_next_input=synthesizer,\n    labels=[\"positive\", \"negative\"]\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\u003ccode\u003eimage-generation\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_image_generation(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = AnnotatorInterFace.for_image_generation(\n    fn_next_input=synthesizer\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\u003ccode\u003eimage-description\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_image_description(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = 
AnnotatorInterFace.for_image_description(\n    fn_next_input=synthesizer\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\u003ccode\u003eimage-question-answering\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_image_question_answering(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = AnnotatorInterFace.for_image_question_answering(\n    fn_next_input=synthesizer\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\u003ccode\u003eimage-generation-preference\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import AnnotatorInterFace\nfrom dataset_viber.synthesizer import Synthesizer\n\nsynthesizer = Synthesizer.for_image_generation_preference(\n    prompt_context=\"Phone company customer support.\"\n)\n\ninterface = AnnotatorInterFace.for_image_generation_preference(\n    fn_next_input=synthesizer\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n### BulkInterface\n\n\u003e Built on top of the `Dash`, `plotly-express`, `umap-learn`, and `fast-sentence-transformers` to embed and understand your distribution and annotate your data.\n\nhttps://github.com/user-attachments/assets/5e96c06d-e37f-45a0-9633-1a8e714d71ed\n\n[Hub dataset](https://huggingface.co/datasets/SetFit/ag_news)\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etext-visualization\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import BulkInterface\nfrom datasets import load_dataset\n\nds = load_dataset(\"SetFit/ag_news\", split=\"train[:2000]\")\n\ninterface: BulkInterface = BulkInterface.for_text_visualization(\n    ds.to_pandas()[[\"text\", \"label_text\"]],\n    content_column='text',\n    
label_column='label_text',\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003etext-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber import BulkInterface\nfrom datasets import load_dataset\n\nds = load_dataset(\"SetFit/ag_news\", split=\"train[:2000]\")\ndf = ds.to_pandas()[[\"text\", \"label_text\"]]\n\ninterface = BulkInterface.for_text_classification(\n    dataframe=df,\n    content_column='text',\n    label_column='label_text',\n    labels=df['label_text'].unique().tolist()\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-visualization\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber.bulk import BulkInterface\nfrom datasets import load_dataset\n\nds = load_dataset(\"argilla/distilabel-capybara-dpo-7k-binarized\", split=\"train[:1000]\")\ndf = ds.to_pandas()[[\"chosen\"]]\n\ninterface = BulkInterface.for_chat_visualization(\n    dataframe=df,\n    chat_column='chosen',\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ccode\u003echat-classification\u003c/code\u003e\u003c/summary\u003e\n\n```python\nfrom dataset_viber.bulk import BulkInterface\nfrom datasets import load_dataset\n\nds = load_dataset(\"argilla/distilabel-capybara-dpo-7k-binarized\", split=\"train[:1000]\")\ndf = ds.to_pandas()[[\"chosen\"]]\n\ninterface = BulkInterface.for_chat_classification(\n    dataframe=df,\n    chat_column='chosen',\n    labels=[\"math\", \"science\", \"history\", \"question seeking\"],\n)\ninterface.launch()\n```\n\n\u003c/details\u003e\n\n### Utils\n\n\u003cdetails\u003e\n\u003csummary\u003eShuffle inputs in the same order\u003c/summary\u003e\n\nWhen working with multiple inputs, you might want to shuffle them in the same order.\n\n```python\nimport random\n\ndef shuffle_lists(*lists):\n    if not lists:\n        return []\n\n    # Get the length of the first 
list\n    length = len(lists[0])\n\n    # Check if all lists have the same length\n    if not all(len(lst) == length for lst in lists):\n        raise ValueError(\"All input lists must have the same length\")\n\n    # Create a list of indices and shuffle it\n    indices = list(range(length))\n    random.shuffle(indices)\n\n    # Reorder each list based on the shuffled indices\n    return [\n        [lst[i] for i in indices]\n        for lst in lists\n    ]\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eRandom swap to randomize completions\u003c/summary\u003e\n\nWhen working with multiple completions, you might want to randomly reassign the completions at each index across the lists, so that the completion at index x in one list may trade places with the completion at index x in another. This is useful for preference learning, because it randomizes which list each completion ends up in.\n\n```python\nimport random\n\ndef swap_completions(*lists):\n    if not lists:\n        return []\n\n    # Use the length of the first list as the reference\n    length = len(lists[0])\n\n    # Check if all lists have the same length\n    if not all(len(lst) == length for lst in lists):\n        raise ValueError(\"All input lists must have the same length\")\n\n    # Copy the inputs into new lists so the originals are not modified\n    lists = [list(lst) for lst in lists]\n\n    # Iterate over each index\n    for i in range(length):\n        # Get the elements at index i from all lists\n        elements = [lst[i] for lst in lists]\n\n        # Randomly shuffle the elements\n        random.shuffle(elements)\n\n        # Assign the shuffled elements back to the lists\n        for j, lst in enumerate(lists):\n            lst[i] = elements[j]\n\n    return lists\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eLoad remote image URLs from Hugging Face Hub\u003c/summary\u003e\n\nWhen working with images, you might want to load remote URLs from the Hugging Face Hub.\n\n```python\nfrom datasets import Dataset, Image, load_dataset\n\ndataset = load_dataset(\n    
\"my_hf_org/my_image_dataset\"\n).cast_column(\"my_image_column\", Image(decode=False))\ndataset[0][\"my_image_column\"]\n# {'bytes': None, 'path': 'path_to_image.jpg'}\n```\n\n\u003c/details\u003e\n\n## Contribute and development setup\n\nFirst, [install PDM](https://pdm-project.org/latest/#installation).\n\nThen, install the environment, this will automatically create a `.venv` virtual env and install the dev environment.\n\n```bash\npdm install\n```\n\nLastly, run pre-commit for formatting on commit.\n\n```bash\npre-commit install\n```\n\nFollow this [guide on making first contributions](https://github.com/firstcontributions/first-contributions?tab=readme-ov-file#first-contributions).\n\n## References\n\n### Logo\n\n\u003ca href=\"https://www.flaticon.com/free-icons/keyboard\" title=\"keyboard icons\"\u003eKeyboard icons created by srip - Flaticon\u003c/a\u003e\n\n### Inspirations\n\n- \u003chttps://huggingface.co/spaces/davidberenstein1957/llm-human-feedback-collector-chat-interface-dpo\u003e\n- \u003chttps://huggingface.co/spaces/davidberenstein1957/llm-human-feedback-collector-chat-interface-kto\u003e\n- \u003chttps://medium.com/@oxenai/collecting-data-from-human-feedback-for-generative-ai-ec9e20bf01b9\u003e\n- \u003chttps://hamel.dev/notes/llm/finetuning/04_data_cleaning.html\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidberenstein1957%2Fdataset-viber","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidberenstein1957%2Fdataset-viber","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidberenstein1957%2Fdataset-viber/lists"}