{"id":20624848,"url":"https://github.com/chrislemke/deep-martin","last_synced_at":"2025-04-15T14:59:00.463Z","repository":{"id":40953773,"uuid":"382855984","full_name":"chrislemke/deep-martin","owner":"chrislemke","description":"Text simplification for a better world: Deep-Martin Transformer 🤗","archived":false,"fork":false,"pushed_at":"2023-09-25T19:07:05.000Z","size":44038,"stargazers_count":22,"open_issues_count":5,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-15T14:58:53.931Z","etag":null,"topics":["deep-learning","huggingface","nlp","python","pytorch","text-simplification","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chrislemke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-04T13:15:04.000Z","updated_at":"2024-09-19T21:57:42.000Z","dependencies_parsed_at":"2023-02-13T02:15:51.945Z","dependency_job_id":null,"html_url":"https://github.com/chrislemke/deep-martin","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislemke%2Fdeep-martin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislemke%2Fdeep-martin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislemke%2Fdeep-martin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrislemke%2Fdeep-martin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chrislemke","download_url":"https://codeload.github.com
/chrislemke/deep-martin/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249094940,"owners_count":21211837,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","huggingface","nlp","python","pytorch","text-simplification","transformers"],"created_at":"2024-11-16T13:06:57.723Z","updated_at":"2025-04-15T14:59:00.443Z","avatar_url":"https://github.com/chrislemke.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://github.com/stoffy/deep-martin/blob/master/images/title_image.png?raw=true\" alt=\"\"\u003e\n\n\u003ch1\u003eDeep Martin\u003c/h1\u003e\n\u003ch2\u003eText simplification for the democratization of knowledge\u003c/h2\u003e\n\u003cblockquote\u003e\u003ca href=\"https://www.deepl.com/translator#de/en/Danach%20ist%20das%20In-der-Welt-sein%20ein%20Sich-vorweg-schon%20sein-in-der%20Welt%20als%20Sein-bei-innerweltlich-begegnendem-Seienden\"\u003eDanach ist das In-der-Welt-sein ein Sich-vorweg-schon sein-in-der Welt als Sein-bei-innerweltlich-begegnendem-Seienden\u003c/a\u003e\u003cbr\u003e\u003cb\u003eMartin Heidegger\u003c/b\u003e\u003c/blockquote\u003e\n\u003ch3\u003eUnsimplifiable, untranslatable\u003c/h3\u003e\n\n\u003cp\u003eLanguage as a fundamental characteristic of man and society is the center of NLP. It has the potential of great enlightenment, as well as great concealment.  Language and thinking must be brought into harmony.\nSimplification of language leads to the democratization of knowledge. 
Thus, it can provide access to knowledge that may otherwise be hidden. No more complex language!\n\u003cbr\u003e\u003cbr\u003e\nDeep Martin aims to contribute to this.\nThe project explores different models that make complicated and complex content accessible to all. \nIt follows the approach of \u003ca href=\"https://simple.wikipedia.org/wiki/Main_Page\"\u003eSimple Wikipedia\u003c/a\u003e.\u003c/p\u003e \n\n\u003ch3\u003eAbout the project\u003c/h3\u003e\n\n\u003ch3\u003eHow to use\u003c/h3\u003e\n\n\u003cp\u003eTwo approaches are available. \nOne uses the super nice \u003ca href=\"https://huggingface.co\"\u003eHugging Face\u003c/a\u003e library, \nwhich can be used to create various state-of-the-art sequence-to-sequence models. \nThe other is a self-made transformer, \nwhich is mainly for trying out different approaches.\u003c/p\u003e\n\u003ch4\u003eHugging Face\u003c/h4\u003e\n\u003cp\u003eTo use the Hugging Face implementation, you need to provide a dataset. It needs one column with the normal version (\u003ccode\u003eNormal\u003c/code\u003e)\nand one with the simplified version (\u003ccode\u003eSimple\u003c/code\u003e).\nThe \u003ccode\u003eHuggingFaceDataset\u003c/code\u003e class can help you with it.\u003cbr\u003eTo train\na model, run something like:\u003cbr\u003e\u003c/p\u003e\n\u003cpre\u003e\n# --eval_steps and --warmup_steps should be based on the size of the dataset.\n# --ds_path: path to your dataset.\n# --save_model_path: path to where the trained model should be stored.\n# --training_output_path: path to where the checkpoints and the training data should be stored.\n# --tokenizer_id: path or identifier of a Hugging Face tokenizer.\npython /your/path/to/deep-martin/src/hf_transformer_trainer.py \\\n--eval_steps 5000 \\\n--warmup_steps 800 \\\n--ds_path /path/ \\\n--save_model_path /path/ \\\n--training_output_path /path/ \\\n--tokenizer_id bert-base-cased\n\u003c/pre\u003e\n\u003cp\u003eThere are a lot more parameters. 
Check out \u003ccode\u003ehf_transformer_trainer.py\u003c/code\u003e to get an overview.\u003c/p\u003e\n\n\u003ch4\u003eSelf-made-transformer\u003c/h4\u003e\n\u003cp\u003eThis transformer is more for experimenting. Have a look at the code to get an overview of what is going on.\nTo train the self-made-transformer, a train and a test dataset are needed as CSV files. They are transformed\ninto a suitable dataset at the beginning of the training. As with the Hugging Face approach above, the dataset needs one column with the normal version (\u003ccode\u003eNormal\u003c/code\u003e)\nand one with the simplified version (\u003ccode\u003eSimple\u003c/code\u003e).\u003cbr\u003e\nTo start the training you can run:\u003cbr\u003e\u003c/p\u003e\n\u003cpre\u003e\n# --ds_path: path of the folder which contains the `train_file.csv` and the `test_file.csv`.\n# --save_model_path: path to where the trained model should be stored.\npython /your/path/to/deep-martin/src/custom_transformer_trainer.py \\\n--ds_path /path \\\n--train_file train_file.csv \\\n--test_file test_file.csv \\\n--epochs 3 \\\n--save_model_path /path/\n\u003c/pre\u003e\n\n\u003ch3\u003eChallenges\u003c/h3\u003e\n\u003cp\u003eLet's talk about the problems in this project. \u003c/p\u003e\n\u003ch4\u003eDataset\u003c/h4\u003e\n\u003cp\u003eAs so often, one problem lies in obtaining high-quality data.\nMultiple datasets were used for this project. You can find them \n\u003ca href=\"https://paperswithcode.com/task/text-simplification\"\u003ehere\u003c/a\u003e.\u003cbr\u003e\nWhile the ASSET dataset is of very good quality because it provides multiple simplifications of each record, it is simply too small for training a transformer. \nThe same is true of other datasets. \nThe two datasets based on Wikipedia unfortunately suffer from\na lack of quality. Either a record is not a simplification \nbut simply the same article, or the simplification is of poor quality. 
In both cases, using such records led to worse results.\nTo increase the overall quality, the normal and simplified versions of each record were compared using \n\u003ca href=\"https://radimrehurek.com/gensim/models/doc2vec.html\"\u003eDoc2Vec\u003c/a\u003e and cosine distance, and unsuitable records were filtered out. \n\u003c/p\u003e\n\n\u003ch4\u003eModel size and computation\u003c/h4\u003e\n\u003cp\u003e\nTransformers are huge and need a lot of data and a lot of time to train. \n\u003ca href=\"http://research.google.com/colaboratory/\"\u003eGoogle Colab\u003c/a\u003e can help, but it is not the most convenient way. \nWith the help of \u003ca href=\"https://aws.amazon.com/de/ec2\"\u003eAWS EC2\u003c/a\u003e, training can be sped up a lot, and training larger models also becomes possible.\n\u003c/p\u003e\n\n\u003ch3\u003eNext steps\u003c/h3\u003e\n\u003cp\u003eSince the self-made-transformer is a work in progress, it is never finished.\nIt is made for learning and trying things out. One interesting idea is to use the\ntransformer as a generator in a GAN to improve the overall output.\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrislemke%2Fdeep-martin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchrislemke%2Fdeep-martin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrislemke%2Fdeep-martin/lists"}