Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dennisbakhuis/pigeonXT
🐦 Quickly annotate data from the comfort of your Jupyter notebook
- Host: GitHub
- URL: https://github.com/dennisbakhuis/pigeonXT
- Owner: dennisbakhuis
- License: apache-2.0
- Fork: true (agermanidis/pigeon)
- Created: 2020-05-07T12:30:13.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-06-09T18:46:44.000Z (over 1 year ago)
- Last Synced: 2024-09-02T16:38:50.485Z (2 months ago)
- Language: Python
- Homepage:
- Size: 2.83 MB
- Stars: 272
- Watchers: 10
- Forks: 44
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🐦 pigeonXT - Quickly annotate data in Jupyter Lab
PigeonXT is an extension of the original [Pigeon](https://github.com/agermanidis/pigeon), created by [Anastasis Germanidis](https://pypi.org/user/agermanidis/).
PigeonXT is a simple widget that lets you quickly annotate a dataset of
unlabeled examples from the comfort of your Jupyter notebook.

PigeonXT currently supports the following annotation tasks:
- binary / multi-class classification
- multi-label classification
- regression tasks
- captioning tasks

Anything that can be displayed in Jupyter
(text, images, audio, graphs, etc.) can be displayed by pigeon
by providing the appropriate `display_fn` argument.

Additionally, custom hooks can be attached to each row update (`example_process_fn`),
or to the completion of the annotation task (`final_process_fn`).

There is a full blog post on the usage of PigeonXT on [Towards Data Science](https://towardsdatascience.com/quickly-label-data-in-jupyter-lab-999e7e455e9e).
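
As a rough illustration of the two hooks named above, the sketch below defines a per-row hook and a final hook and then calls them by hand, so it runs without Jupyter. The function names `example_hook` and `final_hook` and the `collected` dictionary are hypothetical. The signatures (example plus selected labels for the row hook, the annotations object for the final hook) are assumed from the custom-hooks example later in this README, not taken from the library's reference documentation.

```python
# Hypothetical stand-ins for the two hooks. Signatures are an assumption
# based on the custom-hooks example in this README.
collected = {}


def example_hook(example, selected_labels):
    """Record the labels chosen for a single example."""
    collected[example] = list(selected_labels)


def final_hook(annotations):
    """Summarise the session; this sketch ignores the annotations argument."""
    return f"{len(collected)} example(s) labelled"


# Simulate the calls the widget would make, without needing a notebook.
example_hook('Star wars', ['Adventure', 'Science fiction'])
example_hook('Killer klowns from outer space', ['Horror'])
summary = final_hook(annotations=None)
print(summary)
```

In a notebook, functions like these would be passed to `annotate` through the `example_process_fn` and `final_process_fn` arguments.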
### Contributors
- Anastasis Germanidis
- Dennis Bakhuis
- Ritesh Agrawal
- Deepak Tunuguntla
- Bram van Es

## Installation
PigeonXT needs a Jupyter Lab environment. Furthermore, it requires ipywidgets.
The widget itself can be installed using pip:
```bash
pip install pigeonXT-jupyter
```

With JupyterLab 3, installation is much simpler. To run the provided examples in a new Conda environment:
```bash
conda create --name pigeon python=3.9
conda activate pigeon
pip install numpy pandas jupyterlab ipywidgets pigeonXT-jupyter
```

For an older JupyterLab, or if you run into any other trouble, please try the old method:
```bash
conda create --name pigeon python=3.7
conda activate pigeon
conda install nodejs
pip install numpy pandas jupyterlab ipywidgets
jupyter nbextension enable --py widgetsnbextension
jupyter labextension install @jupyter-widgets/jupyterlab-manager

pip install pigeonXT-jupyter
```

Start the Jupyter Lab environment:
```bash
jupyter lab
```

### Development environment
I have moved the development environment to Poetry. To create an identical environment use:
```bash
conda env create -f environment.yml
conda activate pigeonxt
poetry install
pre-commit install
```

## Examples
Examples are also provided in the accompanying notebook.

### Binary or multi-class text classification
Code:
```python
import pandas as pd
import pigeonXT as pixt

annotations = pixt.annotate(
    ['I love this movie', 'I was really disappointed by the book'],
    options=['positive', 'negative', 'inbetween']
)
```

Preview:
![Jupyter notebook multi-class classification](/assets/multiclassexample.png)

### Multi-label text classification
Code:
```python
import pandas as pd
import pigeonXT as pixt

df = pd.DataFrame([
    {'example': 'Star wars'},
    {'example': 'The Positively True Adventures of the Alleged Texas Cheerleader-Murdering Mom'},
    {'example': 'Eternal Sunshine of the Spotless Mind'},
    {'example': 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb'},
    {'example': 'Killer klowns from outer space'},
])

labels = ['Adventure', 'Romance', 'Fantasy', 'Science fiction', 'Horror', 'Thriller']

annotations = pixt.annotate(
    df,
    options=labels,
    task_type='multilabel-classification',
    buttons_in_a_row=3,
    reset_buttons_after_click=True,
    include_next=True,
    include_back=True,
)
```

Preview:
![Jupyter notebook multi-label classification](/assets/multilabelexample.png)

### Image classification
Code:
```python
import pandas as pd
import pigeonXT as pixt

from IPython.display import display, Image

annotations = pixt.annotate(
    ['assets/img_example1.jpg', 'assets/img_example2.jpg'],
    options=['cat', 'dog', 'horse'],
    # Render each file path as an image in the notebook
    display_fn=lambda filename: display(Image(filename))
)
```

Preview:
![Jupyter notebook image classification](/assets/imagelabelexample.png)

### Audio classification
Code:
```python
import pandas as pd
import pigeonXT as pixt

# display is imported explicitly so the lambda below does not rely on notebook globals
from IPython.display import display, Audio

annotations = pixt.annotate(
    ['assets/audio_1.mp3', 'assets/audio_2.mp3'],
    task_type='regression',
    options=(1,5,1),
    display_fn=lambda filename: display(Audio(filename, autoplay=True))
)

annotations
```

Preview:
![Jupyter notebook audio classification](/assets/audiolabelexample.png)

### Multi-label text classification with custom hooks
Code:
```python
import pandas as pd
import numpy as np

from pathlib import Path
from pigeonXT import annotate

df = pd.DataFrame([
    {'example': 'Star wars'},
    {'example': 'The Positively True Adventures of the Alleged Texas Cheerleader-Murdering Mom'},
    {'example': 'Eternal Sunshine of the Spotless Mind'},
    {'example': 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb'},
    {'example': 'Killer klowns from outer space'},
])

labels = ['Adventure', 'Romance', 'Fantasy', 'Science fiction', 'Horror', 'Thriller']
shortLabels = ['A', 'R', 'F', 'SF', 'H', 'T']

df.to_csv('inputtestdata.csv', index=False)


def setLabels(labels, numClasses):
    # Build a 0/1 row with a 1 at every selected label index
    row = np.zeros([numClasses], dtype=np.uint8)
    row[labels] = 1
    return row


def labelPortion(
    inputFile,
    labels = ['yes', 'no'],
    outputFile='output.csv',
    portionSize=2,
    textColumn='example',
    shortLabels=None,
):
    if shortLabels is None:
        shortLabels = labels

    # Resume after the rows already saved to the output file, if it exists
    out = Path(outputFile)
    if out.exists():
        outdf = pd.read_csv(out)
        currentId = outdf.index.max() + 1
    else:
        currentId = 0

    indf = pd.read_csv(inputFile)
    examplesInFile = len(indf)
    indf = indf.loc[currentId:currentId + portionSize - 1]
    actualPortionSize = len(indf)
    print(f'{currentId + 1} - {currentId + actualPortionSize} of {examplesInFile}')
    sentences = indf[textColumn].tolist()

    for label in shortLabels:
        indf[label] = None

    def updateRow(example, selectedLabels):
        # Hook called on each row update: store the selection as 0/1 columns
        print(example, selectedLabels)
        labs = setLabels([labels.index(y) for y in selectedLabels], len(labels))
        indf.loc[indf[textColumn] == example, shortLabels] = labs

    def finalProcessing(annotations):
        # Hook called when annotation is complete: append this portion to the output file
        if out.exists():
            prevdata = pd.read_csv(out)
            outdata = pd.concat([prevdata, indf]).reset_index(drop=True)
        else:
            outdata = indf.copy()
        outdata.to_csv(out, index=False)

    annotated = annotate(
        sentences,
        options=labels,
        task_type='multilabel-classification',
        buttons_in_a_row=3,
        reset_buttons_after_click=True,
        include_next=False,
        example_process_fn=updateRow,
        final_process_fn=finalProcessing
    )
    return indf


def getAnnotationsCountPerlabel(annotations, shortLabels):
    countPerLabel = pd.DataFrame(columns=shortLabels, index=['count'])
    for label in shortLabels:
        countPerLabel.loc['count', label] = len(annotations.loc[annotations[label] == 1.0])
    return countPerLabel


annotations = labelPortion('inputtestdata.csv',
                           labels=labels,
                           shortLabels=shortLabels)

# counts per label
getAnnotationsCountPerlabel(annotations, shortLabels)
```

Preview:
![Jupyter notebook multi-label classification](/assets/pigeonhookfunctions.png)

The complete and runnable examples are available in the provided Notebook.
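
For readers who want the core logic of the custom-hooks example without pandas or NumPy, the following is a minimal plain-Python sketch of its two bookkeeping steps: turning a selection into a 0/1 row, and picking the next portion of rows to label. The helper names `one_hot` and `next_portion` are hypothetical and are not part of PigeonXT. They only mirror what `setLabels` and the slicing in `labelPortion` appear to do.

```python
def one_hot(selected, all_labels):
    """Return a 0/1 list marking which of all_labels were selected."""
    return [1 if label in selected else 0 for label in all_labels]


def next_portion(total_rows, rows_done, portion_size):
    """Return the row positions to label next, resuming after rows_done."""
    stop = min(rows_done + portion_size, total_rows)
    return list(range(rows_done, stop))


labels = ['Adventure', 'Romance', 'Fantasy', 'Science fiction', 'Horror', 'Thriller']

encoded = one_hot(['Adventure', 'Science fiction'], labels)
print(encoded)  # [1, 0, 0, 1, 0, 0]

# Five rows in total, four already labelled, portions of two: one row remains.
remaining = next_portion(total_rows=5, rows_done=4, portion_size=2)
print(remaining)  # [4]
```

The final portion is simply shorter when fewer rows remain than the portion size, which matches the `actualPortionSize` bookkeeping in the example above.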