https://github.com/compnet/grimbert
https://github.com/compnet/grimbert
nlp novels speaker-attribution
Last synced: 5 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/compnet/grimbert
- Owner: CompNet
- License: gpl-3.0
- Created: 2023-09-13T12:29:50.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-20T11:51:04.000Z (over 1 year ago)
- Last Synced: 2025-09-10T05:06:22.929Z (9 months ago)
- Topics: nlp, novels, speaker-attribution
- Language: Python
- Homepage:
- Size: 665 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Grimbert
Speaker attribution in novels. Based on the older [bert-quote-attribution](https://gitlab.com/Aethor/bert-quote-attribution) project.
# Documentation
```python
from grimbert.model import SpeakerAttributionModel
from grimbert.predict import predict_speaker
from grimbert.datas import (
SpeakerAttributionDataset,
SpeakerAttributionDocument,
SpeakerAttributionQuote,
SpeakerAttributionMention
)
from transformers import BertTokenizerFast
model = SpeakerAttributionModel.from_pretrained(
"compnet-renard/spanbert-base-cased-literary-speaker-attribution"
)
tokenizer = BertTokenizerFast.from_pretrained(
"compnet-renard/spanbert-base-cased-literary-speaker-attribution"
)
tokens = '" This is horrible " , John said to Max .'.split(" ")
quote_start = 0
quote_end = 4
john_mention_start = 6
john_mention_end = 7
max_mention_start = 9
max_mention_end = 10
dataset = SpeakerAttributionDataset(
[
SpeakerAttributionDocument(
tokens,
[SpeakerAttributionQuote(
tokens[quote_start:quote_end], quote_start, quote_end, "John"
)],
[
SpeakerAttributionMention(
tokens[john_mention_start:john_mention_end],
john_mention_start,
john_mention_end,
"John"
),
SpeakerAttributionMention(
tokens[max_mention_start:max_mention_end],
max_mention_start,
max_mention_end,
"Max"
),
]
)
],
quote_ctx_len=512,
speaker_repr_nb=4,
tokenizer=tokenizer
)
preds = predict_speaker(dataset, model, tokenizer, batch_size=4)
```