https://github.com/yasufumy/sequence-label
A Tensor Creation and Label Reconstruction for Sequence Labeling
- Host: GitHub
- URL: https://github.com/yasufumy/sequence-label
- Owner: yasufumy
- License: mit
- Created: 2023-08-13T09:45:24.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-15T13:46:05.000Z (8 months ago)
- Last Synced: 2024-03-15T15:01:11.879Z (8 months ago)
- Topics: machine-learning, named-entity-recognition, natural-language-processing, nlp, sequence-labeling
- Language: Python
- Homepage:
- Size: 39.1 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# sequence-label
`sequence-label` is a Python library that streamlines creating tensors from sequence labels and reconstructing sequence labels from tensors. Whether you're working on named entity recognition, part-of-speech tagging, or any other sequence labeling task, it offers convenient utilities to simplify your workflow.
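
To make the round trip concrete, here is a small library-independent sketch of the idea: character-level spans are turned into one tag index per token, and the spans are then rebuilt from those indices. Everything in it is an assumption made for illustration, including the regex tokenizer, the BIO tagging scheme, and the helper names `tokenize`, `encode`, and `decode`. It is not how `sequence-label` is implemented and does not use its API.

```python
import re

# Hypothetical helpers for illustration only; not part of sequence-label.
LABELS = ["LOC", "PER"]
# Index 0 is "O" (outside); each label gets a B- (begin) and an I- (inside) tag.
TAGS = ["O"] + [f"{prefix}-{label}" for label in LABELS for prefix in ("B", "I")]
TAG_TO_INDEX = {tag: i for i, tag in enumerate(TAGS)}


def tokenize(text):
    """Return (start, end) character offsets of word and punctuation tokens."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def encode(spans, offsets):
    """Map character-level spans to one tag index per token (BIO scheme)."""
    indices = [TAG_TO_INDEX["O"]] * len(offsets)
    for span in spans:
        inside = False
        for i, (start, end) in enumerate(offsets):
            # A token belongs to the span when it lies fully inside it.
            if span["start"] <= start and end <= span["end"]:
                prefix = "I" if inside else "B"
                indices[i] = TAG_TO_INDEX[f"{prefix}-{span['label']}"]
                inside = True
    return indices


def decode(indices, offsets):
    """Rebuild character-level spans from per-token tag indices."""
    spans = []
    open_label = None  # label of the span the previous token belonged to
    for index, (start, end) in zip(indices, offsets):
        tag = TAGS[index]
        if tag == "O":
            open_label = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and open_label == label:
            spans[-1]["end"] = end  # extend the span that is currently open
        else:
            spans.append({"start": start, "end": end, "label": label})
        open_label = label
    return spans


text = "Tokyo is the capital of Japan."
spans = [
    {"start": 0, "end": 5, "label": "LOC"},
    {"start": 24, "end": 29, "label": "LOC"},
]
offsets = tokenize(text)
tag_indices = encode(spans, offsets)
print(tag_indices)  # [1, 0, 0, 0, 0, 1, 0]
assert decode(tag_indices, offsets) == spans
```

The sketch only round-trips cleanly when span boundaries coincide with token boundaries, which is the kind of bookkeeping an alignment between labels and tokens has to handle.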
## Basic Usage
Import the necessary dependencies:
```py
from transformers import AutoTokenizer

from sequence_label import LabelSet, SequenceLabel
from sequence_label.transformers import create_alignments
```

Start by creating sequence labels using the `SequenceLabel.from_dict` method. Define your text and associated labels:
```py
text1 = "Tokyo is the capital of Japan."
label1 = SequenceLabel.from_dict(
    tags=[
        {"start": 0, "end": 5, "label": "LOC"},
        {"start": 24, "end": 29, "label": "LOC"},
    ],
    size=len(text1),
)

text2 = "The Monster Naoya Inoue is the who's who of boxing."
label2 = SequenceLabel.from_dict(
    tags=[{"start": 12, "end": 23, "label": "PER"}],
    size=len(text2),
)

texts = [text1, text2]
labels = [label1, label2]
```

Next, tokenize your `texts` and create the `alignments` using the `create_alignments` function. `alignments` is a tuple of `LabelAlignment` instances, each of which aligns a sequence label with the tokenized result for its text:
```py
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
batch_encoding = tokenizer(texts)

alignments = create_alignments(
    batch_encoding=batch_encoding,
    lengths=list(map(len, texts)),
    padding_token=tokenizer.pad_token,
)
```

Now, create a `label_set` that will allow you to create tensors from sequence labels and reconstruct sequence labels from tensors. Use the `label_set.encode_to_tag_indices` method to create `tag_indices`:
```py
label_set = LabelSet(
    labels={"ORG", "LOC", "PER", "MISC"},
    padding_index=-1,
)

tag_indices = label_set.encode_to_tag_indices(
    labels=labels,
    alignments=alignments,
)
```

Finally, use the `label_set.decode` method to reconstruct the sequence labels from `tag_indices` and `alignments`:
```py
labels2 = label_set.decode(
    tag_indices=tag_indices,
    alignments=alignments,
)

assert labels == labels2
```

## Installation
```
pip install sequence-label
```
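
After installing, one quick way to confirm that the package is visible to the current interpreter is to look up its import name, `sequence_label` (the name used in the imports under Basic Usage). The helper name `is_importable` below is hypothetical and only for illustration; the check itself relies on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("sequence_label"):
    print("sequence-label is installed")
else:
    print("sequence-label is not installed in this environment")
```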