https://github.com/lucasste/havina
A tool to generate Knowledge Graphs from sentences or evaluate language models' text comprehension.
https://github.com/lucasste/havina
bert bert-embeddings knowledge-graph language-model python pytorch
Last synced: about 1 month ago
JSON representation
A tool to generate Knowledge Graphs from sentences or evaluate language models' text comprehension.
- Host: GitHub
- URL: https://github.com/lucasste/havina
- Owner: LucasSte
- License: mit
- Created: 2023-10-24T11:55:59.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-01-11T20:58:16.000Z (over 2 years ago)
- Last Synced: 2025-02-08T23:18:26.031Z (over 1 year ago)
- Topics: bert, bert-embeddings, knowledge-graph, language-model, python, pytorch
- Language: Python
- Homepage:
- Size: 188 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Havina
Havina is a Python library that can generate knowledge graphs triplets from an input text. Its implementation
is based on the paper "[Language models are open knowledge graphs](https://arxiv.org/abs/2010.11967)" with some
tweaks to improve performance. Most notably, instead of summing the attention scores of each word in a relation,
I am calculating their average.
The reasoning behind this change is that a simple sum of scores favors longer relations even if the extra words
do not carry any relevant meaning.
Havina can be used to evaluate the language comprehension of AI models or as a tool to extract triplets from text
and build knowledge graphs.
## How to use it
Run `pip install havina` to install the library. Then, after importing the `GraphGenerator` class
from havina, simply call the object with the sentence to evaluate and an optional number of workers.
Each worker will span a different process and the algorithm will split the work between them.
For more information about the constructor parameters, check the
[Constructor parameters section](#constructor-parameters).
```python
from havina import GraphGenerator
text = 'John Lennon is a famous singer.'
generator = GraphGenerator(
top_k=4,
contiguous_token=False
)
triplets = generator(text, workers=1)
print(triplets)
```
The code above will print the following:
```
[
HeadTailRelations(
head=Entity(text='john lennon', wikidata_id=None),
tail=Entity(text='a famous singer', wikidata_id=None),
relations=['be'])
]
```
The returned type is a list of `HeadTailReations` objects, each of which contains
the head and tail entities and the possible relations between them. Relations are
Python strings.
## Example sentence
Grabbing the following paragraph from [Wikipedia](https://en.wikipedia.org/wiki/Amsterdam) and using the library as in [example.py](example.py),
we have the graph depicted below. It contains only the relations for the `amsterdam` and the
`the netherlands` nodes to avoid cluttering the image.
> Amsterdam was founded at the mouth of the Amstel River that was dammed to control flooding; the city's name
> derives from a local linguistic variation of the word dam. Originally a small fishing village in the late 12th
> century, Amsterdam became a major world port during the Dutch Golden Age of the 17th century, when the Netherlands
> was an economic powerhouse. Amsterdam was the leading centre for finance and trade, as well as a hub of
> production of secular art. In the 19th and 20th centuries, the city expanded and many new neighborhoods and
> suburbs were planned and built. The canals of Amsterdam and the 19-20th century Defence Line of Amsterdam are both
> on the UNESCO World Heritage List. Sloten, annexed in 1921 by the municipality of Amsterdam, is the oldest part of
> the city, dating to the 9th century. The city has a long tradition of openness, liberalism, and tolerance. Cycling
> is key to the city's modern character, and there are numerous biking paths and lanes spread throughout
> the entire city.

## How it works
The last layer of a transformer based language model, like BERT, outputs attention matrices for all its attention heads.
We calculate the average of all the matrices to operate the algorithm. If we are looking for head-tail relationships
with tokens from left to right, we would have a matrix like the following. We disregard
the attention scores from below the diagonal because they represent a word-to-word relationship from right to left.
We assume that each attention score represent the probability of two words being related in the sentence.
We show below an example of the beam-search input for the sentence "Joe is curious about cars".
| | Joe | is | curious | about | cars |
|---------|-----|-----|---------|-------|------|
| Joe | X | 0.1 | 0.4 | 0.2 | 0.3 |
| is | X | X | 0.1 | 0.3 | 0.1 |
| curious | X | X | X | 0.4 | 0.2 |
| about | X | X | X | X | 0.4 |
| cars | X | X | X | X | X |
We utilize spaCy to determine the noun chunks and link them to form head-tail pairs. One possible
head-tail pair for this example would be "(Joe, cars)". Taking "Joe" as the first word, we traverse
the first line of the matrix forming candidate relationships. "is", "curious", "about" are all candidates.
The beam-search only passes for its next iteration the top-k candidates, based on their average attention scores.
If k is one, the only candidate for the second iteration is "curious" with a score of 0.4.
We now traverse the third line in the matrix, looking for possible next tokens given "curious".
"curious about" is the only possibility here because "cars" belongs to the tail chunk. "curious" about
has a score of `(0.4+0.4)/2=0.4`.
The relationship we found for the head-tail pair "(Joe, cars)" is "curious about", so the triplet
looks like `(Joe, curious about, cars)`.
In a later stage, we remove prepositions for the relations and uncapitalize, so the final triplet is
`(joe, curious, cars)`.
## Using other language models
Havina can be a tool for evaluating the text understanding of language models. As it finds the relationship between
words from the attention matrices, checking the resulting triplets indicates whether the model is learning correct
parameters.
The easiest way to use the library with another language model is to derive the `LanguageModel` class for it, as
the examples in [language_model.py](havina/language_model.py).
```python
import havina
class MyModel(havina.LanguageModel):
# Implement the functions here
pass
generator = havina.GraphGenerator(model=MyModel)
```
## Constructor parameters
The `GraphGenerator` has many parameters that change the algorithm's behavior. Use table below to understand
how each of them may affect the results.
| Parameter name | Effect on results | Default value |
|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| `top_k` | Determines the number of candidates for the next iteration of beam search. A larger number improves the quality of relations, but takes longer to run. | 4 |
| `threshold` | Sets the score limit for the algorithm to eliminate low quality relations. A higher threshold returns less results, but of better quality. | 0.015 |
| `link_entity` | Whether to link nouns to Wikidata entities. If set to true, the algorithm will only find relations for nouns linked to an entity. Setting to false will increase the number of results. | `False` |
| `model` | Language model to use. Currently, BERT and Llama2 (MPT-7B) are supported. | `bert` |
| `contiguous_tokens` | Whether the relation tokens will be adjacent in the original sentence. If set to `True`, `(Joe, considers a friend, Anne)` will not be a triplet for "Joe considers Anne a friend", because the relation tokens are not adjacent. | `True` |
| `forward_tokens` | Whether the relation tokens can follow only a right-to-left order. If set to `False`, `(Joe, nice friend, Anne)` will be a potential relation for "Joe finds Anne nice and is her friend." Note that `nice` is on the left of `friends`. | `True` |
| `frequency` | How many times a relation should appear in the text corpus not to be eliminated by the algorithm. | 1 |
| `relation_length` | How long the relation can be. If the number is higher, better results may arise, but the algorithm will take more time to converge. | 8 |
| `resolve_reference` | Resolve noun and pronoun references. For example, replace "he" and "she" from the triple head or tail by the noun they refer to. | `True` |
| `device` | PyTorch device to execute the language model. If set to `None`, the model is executed on CPU. | `None` |