https://github.com/davidberenstein1957/dataset-viber
Dataset Viber is your chill repo for data collection, annotation and vibe checks.
- Host: GitHub
- URL: https://github.com/davidberenstein1957/dataset-viber
- Owner: davidberenstein1957
- License: apache-2.0
- Created: 2024-08-07T12:23:36.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-09-05T07:13:10.000Z (10 months ago)
- Last Synced: 2025-02-27T04:14:53.891Z (4 months ago)
- Topics: data-collection, data-quality, evaluation, human-feedback
- Language: Python
- Homepage:
- Size: 1.3 MB
- Stars: 45
- Watchers: 1
- Forks: 12
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
Dataset Viber
Avoid the hype, check the vibe!
I've cooked up Dataset Viber, a cool set of tools to make your life easier when dealing with data for AI models. Dataset Viber is all about making your data prep journey smooth and fun. It's **not for team collaboration or production**, nor trying to be all fancy and formal - just a bunch of **cool tools to help you collect feedback and do vibe-checks** as an AI engineer or lover. Want to see it in action? Just plug it in and start vibing with your data. It's that easy!
- **CollectorInterface**: Lazily collect data of model interactions without human annotation.
- **AnnotatorInterface**: Walk through your data and annotate it with models in the loop.
- **Synthesizer**: Synthesize data with `distilabel` in the loop.
- **BulkInterface**: Explore your data distribution and annotate in bulk.

Need any tweaks or want to hear more about a specific tool? Just [open an issue](https://github.com/davidberenstein1957/dataset-viber/issues/new) or give me a shout!

> [!NOTE]
>
> - Data is logged to a local CSV or directly to the Hugging Face Hub.
> - All tools also run in `.ipynb` notebooks.
> - Models in the loop through `fn_model`.
> - Input with custom data streamers or pre-built `Synthesizer` classes with the `fn_next_input` argument.
> - It supports various tasks for `text`, `chat` and `image` modalities.
> - Import and export from the Hugging Face Hub or CSV files.

> [!TIP]
>
> - Code examples: [src/dataset_viber/examples](https://github.com/davidberenstein1957/dataset-viber/tree/main/src/dataset_viber/examples).
> - Hub examples: [https://huggingface.co/dataset-viber](https://huggingface.co/dataset-viber).

## Installation

You can install the package via pip:
```bash
pip install dataset-viber
```

Or install the `Synthesizer` dependencies. Note that the `Synthesizer` relies on `distilabel[hf-inference-endpoints]`, but you can also use other [LLMs available to distilabel](https://distilabel.argilla.io), for example `distilabel[ollama]`.

```bash
pip install dataset-viber[synthesizer]
```

Or install the `BulkInterface` dependencies:

```bash
pip install dataset-viber[bulk]
```

## How are we vibing?

### CollectorInterface
> Built on top of the `gr.Interface` and `gr.ChatInterface` to lazily collect data for interactions automatically.
[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-token-classification)
CollectorInterface
```python
import gradio as gr
from dataset_viber import CollectorInterface


def calculator(num1, operation, num2):
    if operation == "add":
        return num1 + num2
    elif operation == "subtract":
        return num1 - num2
    elif operation == "multiply":
        return num1 * num2
    elif operation == "divide":
        return num1 / num2


inputs = ["number", gr.Radio(["add", "subtract", "multiply", "divide"]), "number"]
outputs = "number"

interface = CollectorInterface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs,
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name="/"
)
interface.launch()
```
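
Note that the `calculator` above raises `ZeroDivisionError` when `num2` is 0 and silently returns `None` for an unknown operation. A minimal, more defensive sketch is shown below. `safe_calculator` and `OPERATIONS` are hypothetical names used for illustration and are not part of `dataset_viber`. Raising `ValueError` is just one option here.

```python
import operator

# Hypothetical lookup table mapping the radio choices to Python operators.
OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def safe_calculator(num1, operation, num2):
    """Same contract as `calculator`, but with explicit error handling."""
    if operation not in OPERATIONS:
        raise ValueError(f"Unknown operation: {operation!r}")
    if operation == "divide" and num2 == 0:
        raise ValueError("Cannot divide by zero")
    return OPERATIONS[operation](num1, num2)
```

It can be dropped in wherever `calculator` is passed as `fn`.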
CollectorInterface.from_interface
```python
interface = gr.Interface(
    fn=calculator,
    inputs=inputs,
    outputs=outputs
)
interface = CollectorInterface.from_interface(
    interface=interface,
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name="/"
)
interface.launch()
```
CollectorInterface.from_pipeline
```python
from transformers import pipeline
from dataset_viber import CollectorInterface

pipeline = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")
interface = CollectorInterface.from_pipeline(
    pipeline=pipeline,
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name="/"
)
interface.launch()
```

### AnnotatorInterface

> Built on top of the `CollectorInterface` to collect and annotate data and log it to the Hub.
#### Text
https://github.com/user-attachments/assets/d1abda66-9972-4c60-89d2-7626f5654f15
[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-text-classification)
text-classification / multi-label-text-classification
```python
from dataset_viber import AnnotatorInterFace

texts = [
    "Anthony Bourdain was an amazing chef!",
    "Anthony Bourdain was a terrible tv persona!"
]
labels = ["positive", "negative"]

interface = AnnotatorInterFace.for_text_classification(
    texts=texts,
    labels=labels,
    multi_label=False,  # True if you have multi-label data
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
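
The `fn_model` argument is described above as a callable that returns a `str`. As a rough sketch, assuming the callable receives the text to label and returns one of the `labels`, a plain function could stand in for a model. `keyword_sentiment` and its word lists are hypothetical and only show the shape of such a callable. Check the code examples for the exact input and output the interface expects.

```python
# Hypothetical keyword lists; a real setup would use a trained model instead.
POSITIVE_WORDS = {"amazing", "great", "good"}
NEGATIVE_WORDS = {"terrible", "bad", "awful"}


def keyword_sentiment(text: str) -> str:
    """Return "positive" or "negative" based on simple keyword counts."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    # Ties (including no hits at all) fall back to "positive".
    return "negative" if negative_hits > positive_hits else "positive"
```

Under the assumption stated above, it could then be passed as `fn_model=keyword_sentiment`.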
token-classification
```python
from dataset_viber import AnnotatorInterFace

texts = ["Anthony Bourdain was an amazing chef in New York."]
labels = ["NAME", "LOC"]

interface = AnnotatorInterFace.for_token_classification(
    texts=texts,
    labels=labels,
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
extractive-question-answering
```python
from dataset_viber import AnnotatorInterFace

questions = ["Where was Anthony Bourdain located?"]
contexts = ["Anthony Bourdain was an amazing chef in New York."]

interface = AnnotatorInterFace.for_question_answering(
    questions=questions,
    contexts=contexts,
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
text-generation / translation / completion
```python
from dataset_viber import AnnotatorInterFace

prompts = ["Tell me something about Anthony Bourdain."]
completions = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]

interface = AnnotatorInterFace.for_text_generation(
    prompts=prompts,  # source
    completions=completions,  # optional to show initial completion / target
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
text-generation-preference
```python
from dataset_viber import AnnotatorInterFace

prompts = ["Tell me something about Anthony Bourdain."]
completions_a = ["Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."]
completions_b = ["Anthony Michael Bourdain was an cool guy that knew how to cook."]

interface = AnnotatorInterFace.for_text_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```

#### Chat and multi-modal chat

https://github.com/user-attachments/assets/fe7f0139-95a3-40e8-bc03-e37667d4f7a9
[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-chat-generation-preference)
> [!TIP]
> I recommend uploading the files to cloud storage and using the remote URLs to avoid any issues. This can be done [using Hugging Face Datasets](https://huggingface.co/docs/datasets/en/image_load#local-files), as shown in [utils](#utils). Additionally, [GradioChatbot](https://www.gradio.app/docs/gradio/chatbot#behavior) shows how to use the chatbot interface for multi-modal data.

chat-classification
```python
from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        },
        {
            "role": "assistant",
            "content": "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian."
        }
    ]
]

interface = AnnotatorInterFace.for_chat_classification(
    prompts=prompts,
    labels=["toxic", "non-toxic"],
    multi_label=False,  # True if you have multi-label data
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
chat-generation
```python
from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]

completions = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]

interface = AnnotatorInterFace.for_chat_generation(
    prompts=prompts,
    completions=completions,
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
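
The chat examples pass `prompts` as a list of conversations, where each conversation is a list of `{"role": ..., "content": ...}` dictionaries. A small check before launching can catch malformed input early. This is a sketch only: `validate_conversation` is a hypothetical helper that is not part of `dataset_viber`, and the set of allowed roles is an assumption based on the roles used in the examples.

```python
# Roles seen in the examples above; other roles may be valid too (assumption).
ALLOWED_ROLES = {"user", "assistant"}


def validate_conversation(conversation) -> None:
    """Raise ValueError if a conversation is not a list of role/content dicts."""
    if not isinstance(conversation, list) or not conversation:
        raise ValueError("A conversation must be a non-empty list of messages")
    for index, message in enumerate(conversation):
        if not isinstance(message, dict):
            raise ValueError(f"Message {index} is not a dict")
        if set(message) != {"role", "content"}:
            raise ValueError(f"Message {index} must have exactly 'role' and 'content'")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"Message {index} has unknown role {message['role']!r}")


prompts = [[{"role": "user", "content": "Tell me something about Anthony Bourdain."}]]
for conversation in prompts:
    validate_conversation(conversation)
```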
chat-generation-preference
```python
from dataset_viber import AnnotatorInterFace

prompts = [
    [
        {
            "role": "user",
            "content": "Tell me something about Anthony Bourdain."
        }
    ]
]
completions_a = [
    "Anthony Michael Bourdain was an American celebrity chef, author, and travel documentarian.",
]
completions_b = [
    "Anthony Michael Bourdain was an cool guy that knew how to cook."
]

interface = AnnotatorInterFace.for_chat_generation_preference(
    prompts=prompts,
    completions_a=completions_a,
    completions_b=completions_b,
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```

#### Image and multi-modal

[Hub dataset](https://huggingface.co/datasets/davidberenstein1957/dataset-viber-image-question-answering)
> [!TIP]
> I recommend uploading the files to cloud storage and using the remote URLs to avoid any issues. This can be done [using Hugging Face Datasets](https://huggingface.co/docs/datasets/en/image_load#local-files), as shown in [utils](#utils).

image-classification / multi-label-image-classification
```python
from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
labels = ["anthony-bourdain", "not-anthony-bourdain"]

interface = AnnotatorInterFace.for_image_classification(
    images=images,
    labels=labels,
    multi_label=False,  # True if you have multi-label data
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
image-generation
```python
from dataset_viber import AnnotatorInterFace

prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]

interface = AnnotatorInterFace.for_image_generation(
    prompts=prompts,
    completions=images,
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
image-description
```python
from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
descriptions = ["Anthony Bourdain laughing", "David Chang wearing a suit"]

interface = AnnotatorInterFace.for_image_description(
    images=images,
    descriptions=descriptions,  # optional to show initial descriptions
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
image-question-answering / visual-question-answering
```python
from dataset_viber import AnnotatorInterFace

images = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]
questions = ["Who is this?", "What is he wearing?"]
answers = ["Anthony Bourdain", "a suit"]

interface = AnnotatorInterFace.for_image_question_answering(
    images=images,
    questions=questions,  # optional to show initial questions
    answers=answers,  # optional to show initial answers
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```
image-generation-preference
```python
from dataset_viber import AnnotatorInterFace

prompts = [
    "Anthony Bourdain laughing",
    "David Chang wearing a suit"
]

images_a = [
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
]

images_b = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Anthony_Bourdain_Peabody_2014b.jpg/440px-Anthony_Bourdain_Peabody_2014b.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/8/85/David_Chang_David_Shankbone_2010.jpg"
]

interface = AnnotatorInterFace.for_image_generation_preference(
    prompts=prompts,
    completions_a=images_a,
    completions_b=images_b,
    fn_model=None,  # a callable e.g. (function or transformers pipelines) that returns `str`
    fn_next_input=None,  # a function that feeds gradio components actively with the next input
    csv_logger=False,  # True if you want to log to a CSV
    dataset_name=None  # "/" if you want to log to the hub
)
interface.launch()
```

### Synthesizer

> Built on top of `distilabel` to synthesize data with models in the loop.

> [!TIP]
> You can also call the synthesizer directly to generate data: `synthesizer() -> Tuple` or `Synthesizer.batch_synthesize(n) -> List[Tuple]` return inputs for the various tasks.
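
Since `fn_next_input` is described as a function that feeds the next input, and the synthesizer is called as `synthesizer() -> Tuple`, a hand-written streamer can presumably follow the same shape. The sketch below assumes a zero-argument callable that returns a one-element tuple per call. `ListStreamer` is a hypothetical name, and the exact tuple layout each task expects is an assumption.

```python
from itertools import cycle
from typing import Iterable, List, Tuple


class ListStreamer:
    """Hypothetical streamer that cycles through a fixed set of texts."""

    def __init__(self, texts: Iterable[str]):
        self._texts = cycle(list(texts))

    def __call__(self) -> Tuple[str]:
        # Return the next text wrapped in a tuple, mirroring `synthesizer() -> Tuple`.
        return (next(self._texts),)

    def batch(self, n: int) -> List[Tuple[str]]:
        """Return `n` consecutive inputs as a list of tuples."""
        return [self() for _ in range(n)]


streamer = ListStreamer(["first review", "second review"])
```

Each call to `streamer()` yields the next text and wraps around once the list is exhausted.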
text-classification
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_text_classification(
    prompt_context="IMDB movie reviews"
)

interface = AnnotatorInterFace.for_text_classification(
    fn_next_input=synthesizer,
    labels=["positive", "negative"]
)
interface.launch()
```
text-generation
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_text_generation(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_text_generation(
    fn_next_input=synthesizer
)
interface.launch()
```
chat-classification
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_chat_classification(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_chat_classification(
    fn_next_input=synthesizer,
    labels=["positive", "negative"]
)
interface.launch()
```
chat-generation
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_chat_generation(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_chat_generation(
    fn_next_input=synthesizer
)
interface.launch()
```
chat-generation-preference
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_chat_generation_preference(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_chat_generation_preference(
    fn_next_input=synthesizer
)
interface.launch()
```
image-classification
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_image_classification(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_image_classification(
    fn_next_input=synthesizer,
    labels=["positive", "negative"]
)
interface.launch()
```
image-generation
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_image_generation(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_image_generation(
    fn_next_input=synthesizer
)
interface.launch()
```
image-description
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_image_description(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_image_description(
    fn_next_input=synthesizer
)
interface.launch()
```
image-question-answering
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_image_question_answering(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_image_question_answering(
    fn_next_input=synthesizer
)
interface.launch()
```
image-generation-preference
```python
from dataset_viber import AnnotatorInterFace
from dataset_viber.synthesizer import Synthesizer

synthesizer = Synthesizer.for_image_generation_preference(
    prompt_context="Phone company customer support."
)

interface = AnnotatorInterFace.for_image_generation_preference(
    fn_next_input=synthesizer
)
interface.launch()
```

### BulkInterface

> Built on top of `Dash`, `plotly-express`, `umap-learn`, and `fast-sentence-transformers` to embed your data, understand its distribution, and annotate it.

https://github.com/user-attachments/assets/5e96c06d-e37f-45a0-9633-1a8e714d71ed
[Hub dataset](https://huggingface.co/datasets/SetFit/ag_news)
text-visualization
```python
from dataset_viber import BulkInterface
from datasets import load_dataset

ds = load_dataset("SetFit/ag_news", split="train[:2000]")
interface: BulkInterface = BulkInterface.for_text_visualization(
    ds.to_pandas()[["text", "label_text"]],
    content_column='text',
    label_column='label_text',
)
interface.launch()
```
text-classification
```python
from dataset_viber import BulkInterface
from datasets import load_dataset

ds = load_dataset("SetFit/ag_news", split="train[:2000]")
df = ds.to_pandas()[["text", "label_text"]]

interface = BulkInterface.for_text_classification(
    dataframe=df,
    content_column='text',
    label_column='label_text',
    labels=df['label_text'].unique().tolist()
)
interface.launch()
```
chat-visualization
```python
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset

ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]

interface = BulkInterface.for_chat_visualization(
    dataframe=df,
    chat_column='chosen',
)
interface.launch()
```
chat-classification
```python
from dataset_viber.bulk import BulkInterface
from datasets import load_dataset

ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train[:1000]")
df = ds.to_pandas()[["chosen"]]

interface = BulkInterface.for_chat_classification(
    dataframe=df,
    chat_column='chosen',
    labels=["math", "science", "history", "question seeking"],
)
interface.launch()
```

### Utils

Shuffle inputs in the same order
When working with multiple inputs, you might want to shuffle them in the same order.
```python
import random


def shuffle_lists(*lists):
    if not lists:
        return []

    # Get the length of the first list
    length = len(lists[0])

    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")

    # Create a list of indices and shuffle it
    indices = list(range(length))
    random.shuffle(indices)

    # Reorder each list based on the shuffled indices
    return [
        [lst[i] for i in indices]
        for lst in lists
    ]
```

Random swap to randomize completions
When working with multiple completions, you might want to randomly swap the completions that share an index across the lists, so that the completion at index x in one list may trade places with the completion at index x in another. This is useful for preference learning.
```python
import random


def swap_completions(*lists):
    # Assuming all lists are of the same length
    length = len(lists[0])

    # Check if all lists have the same length
    if not all(len(lst) == length for lst in lists):
        raise ValueError("All input lists must have the same length")

    # Convert the input lists (which are tuples) to a list of lists
    lists = [list(lst) for lst in lists]

    # Iterate over each index
    for i in range(length):
        # Get the elements at index i from all lists
        elements = [lst[i] for lst in lists]

        # Randomly shuffle the elements
        random.shuffle(elements)

        # Assign the shuffled elements back to the lists
        for j, lst in enumerate(lists):
            lst[i] = elements[j]

    return lists
```

Load remote image URLs from Hugging Face Hub
When working with images, you might want to load remote URLs from the Hugging Face Hub.
```python
from datasets import Dataset, Image, load_dataset

dataset = load_dataset(
    "my_hf_org/my_image_dataset"
).cast_column("my_image_column", Image(decode=False))
dataset[0]["my_image_column"]
# {'bytes': None, 'path': 'path_to_image.jpg'}
```

## Contribute and development setup

First, [install PDM](https://pdm-project.org/latest/#installation).
Then install the environment. This automatically creates a `.venv` virtual environment and installs the dev dependencies.
```bash
pdm install
```

Lastly, install the pre-commit hooks so that formatting runs on every commit.

```bash
pre-commit install
```

Follow this [guide on making first contributions](https://github.com/firstcontributions/first-contributions?tab=readme-ov-file#first-contributions).

## References
### Logo
Keyboard icons created by srip - Flaticon