Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/chrislemke/deep-martin
Text simplification for a better world: Deep-Martin Transformer 🤗
deep-learning huggingface nlp python pytorch text-simplification transformers
Text simplification for a better world: Deep-Martin Transformer 🤗
- Host: GitHub
- URL: https://github.com/chrislemke/deep-martin
- Owner: chrislemke
- Created: 2021-07-04T13:15:04.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-07-04T10:09:28.000Z (over 1 year ago)
- Last Synced: 2023-07-04T11:24:36.239Z (over 1 year ago)
- Topics: deep-learning, huggingface, nlp, python, pytorch, text-simplification, transformers
- Language: Python
- Homepage:
- Size: 42 MB
- Stars: 16
- Watchers: 1
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
Deep Martin
Text simplification for the democratization of knowledge
Danach ist das In-der-Welt-sein ein Sich-vorweg-schon-sein-in-der-Welt als Sein-bei-innerweltlich-begegnendem-Seienden.
Martin Heidegger (unsimplifiable, untranslatable)
Language, as a fundamental characteristic of people and society, is at the center of NLP. It has the potential for great enlightenment as well as great concealment. Language and thinking must be brought into harmony.
Simplifying language leads to the democratization of knowledge. It can provide access to knowledge that would otherwise remain hidden. No more complex language!
Deep Martin aims to contribute to this.
The project is dedicated to different models that make complicated and complex content accessible to all.
It follows the approach of Simple Wikipedia.
About the project
How to use
Two different approaches are available.
One is to use the super nice Hugging Face library, which can be used to build various state-of-the-art sequence-to-sequence models.
The other is a self-made transformer, which is mainly about trying out different approaches.
Hugging Face
To use the Hugging Face implementation, you need to provide a dataset. It needs to have one column with the normal version (Normal) and one for the simplified version (Simple).
The HuggingFaceDataset class can help you with this.
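For illustration, here is a minimal sketch of the expected two-column layout, built with the Hugging Face datasets library. Only the column names (Normal, Simple) come from the README; the example sentences and the save path are made up, and the repository's own HuggingFaceDataset helper may do this differently:

from datasets import Dataset

# Toy records: one "Normal" sentence and its simplified counterpart.
examples = {
    "Normal": ["The committee deliberated extensively before reaching a consensus."],
    "Simple": ["The committee talked for a long time before agreeing."],
}

dataset = Dataset.from_dict(examples)
dataset.save_to_disk("/path/to/simplification_dataset")  # hypothetical path
print(dataset)  # shows the two columns and the row count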
To train a model you then simply run something like:
python /your/path/to/deep-martin/src/hf_transformer_trainer.py \
--eval_steps 5000 \ # This number should be based on the size of the dataset.
--warmup_steps 800 \ # This number should be based on the size of the dataset.
--ds_path /path/ \ # Path to your dataset.
--save_model_path /path/ \ # Path to where the trained model should be stored.
--training_output_path /path/ \ # Path to where the checkpoints and the training data should be stored.
--tokenizer_id bert-base-cased # Path or identifier of a Hugging Face tokenizer.
There are a lot more parameters. Check out hf_transformer_trainer.py to get an overview.
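Once a model has been trained and saved, it can be loaded back for inference with the standard Hugging Face API. A minimal sketch, assuming a saved sequence-to-sequence checkpoint; the path and generation settings are illustrative and not taken from the repository:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "/path/to/saved-model"  # hypothetical path to the trained model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "The committee deliberated extensively before reaching a consensus."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))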
Self-made-transformer
This transformer is more for experimenting. Have a look at the code to get an overview of what is going on.
To train the self-made transformer, a train and a test dataset are needed as CSV files. They are transformed into a suitable dataset at the beginning of training. As with the Hugging Face transformers above, the dataset needs to have one column with the normal version (Normal) and one for the simplified version (Simple).
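As an illustration of the expected CSV layout, the two files can be written with pandas. The file names match the command below; the example row and the paths are made up:

import pandas as pd

rows = [
    {"Normal": "The legislation was enacted to mitigate adverse outcomes.",
     "Simple": "The law was made to reduce bad outcomes."},
]
pd.DataFrame(rows).to_csv("/path/train_file.csv", index=False)  # hypothetical paths
pd.DataFrame(rows).to_csv("/path/test_file.csv", index=False)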
To start the training you can run:
python /your/path/to/deep-martin/src/custom_transformer_trainer.py \
--ds_path /path \ # Path of the folder which contains the `train_file.csv` and the `test_file.csv`
--train_file train_file.csv \
--test_file test_file.csv \
--epochs 3 \
--save_model_path /path/ # Path to where the trained model should be stored.
Challenges
Let's talk about the problems in this project.
Dataset
As so often, one problem lies in obtaining high-quality data.
Multiple datasets were used for this project. You can find them
here.
While the ASSET dataset offers very good quality, thanks to the multiple simplifications of each record, it is simply too small for training a transformer. This problem also applies to other datasets.
The two datasets based on Wikipedia unfortunately suffer from a lack of quality: either a record is not a simplification at all but simply the same article, or the simplification is of poor quality. In both cases, using such records led to worse results.
To increase the overall quality, the records were compared using Doc2Vec and cosine distance, and low-quality pairs were filtered out.
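The filtering idea can be sketched roughly as follows: embed the normal and the simple version of each record with Doc2Vec and drop pairs whose cosine distance is either too small (near-identical articles) or too large (unrelated or poor simplifications). The thresholds and model settings here are assumptions for illustration, not the values used in the project:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

pairs = [
    ("The committee deliberated extensively before reaching a consensus.",
     "The committee talked for a long time before agreeing."),
]

# Train a small Doc2Vec model on all texts (normal and simple versions).
texts = [text for pair in pairs for text in pair]
docs = [TaggedDocument(words=text.lower().split(), tags=[i]) for i, text in enumerate(texts)]
model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)

kept = []
for normal, simple in pairs:
    distance = cosine(model.infer_vector(normal.lower().split()),
                      model.infer_vector(simple.lower().split()))
    if 0.05 < distance < 0.6:  # assumed thresholds: related but not identical
        kept.append((normal, simple))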
Model size and computation
Transformers are huge; they need a lot of data and a lot of time to train.
Google Colab can help, but it is not the most convenient way.
With the help of AWS EC2, things can be sped up a lot, and training larger models also becomes possible.
Next steps
Since the self-made transformer is a work-in-progress project, it is never really finished.
It is made for learning and experimenting. One interesting idea is to use the transformer as a generator in a GAN to improve the overall output.