{"id":18339925,"url":"https://github.com/mxagar/text_generator","last_synced_at":"2026-04-28T08:03:18.221Z","repository":{"id":106858086,"uuid":"546630225","full_name":"mxagar/text_generator","owner":"mxagar","description":"This repository contains a text generator deep learning model based on LSTMs.","archived":false,"fork":false,"pushed_at":"2022-10-10T10:57:52.000Z","size":5990,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-09T20:47:41.012Z","etag":null,"topics":["deep-learning","language-model","lstm","neural-network","nlp","pytorch","text-generator"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mxagar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-06T11:45:18.000Z","updated_at":"2022-11-01T19:47:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"7fffc290-9f5a-4390-bc6c-ffb537b2971a","html_url":"https://github.com/mxagar/text_generator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mxagar/text_generator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Ftext_generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Ftext_generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Ftext_generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2
Ftext_generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mxagar","download_url":"https://codeload.github.com/mxagar/text_generator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Ftext_generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32371673,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T20:07:02.737Z","status":"online","status_checked_at":"2026-04-28T02:00:07.250Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","language-model","lstm","neural-network","nlp","pytorch","text-generator"],"created_at":"2024-11-05T20:19:52.093Z","updated_at":"2026-04-28T08:03:18.205Z","avatar_url":"https://github.com/mxagar.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Generation Project: Writing TV Scripts\n\nThis repository contains a text generator which works with a Recurrent Neural Network (RNN) based on LSTM units. 
The [Seinfeld Chronicles Dataset from Kaggle](https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles) is used, which contains the complete scripts from the [Seinfeld TV Show](https://en.wikipedia.org/wiki/Seinfeld).\n\nThe project is a modification of the [Character-level RNN](https://github.com/karpathy/char-rnn) [implemented by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). I have used [PyTorch](https://pytorch.org/) and materials from the [Udacity Computer Vision Nanodegree](https://www.udacity.com/course/computer-vision-nanodegree--nd891), which can be obtained in their original form in [project-tv-script-generation](https://github.com/mxagar/deep-learning-v2-pytorch/tree/master/project-tv-script-generation).\n\nIf you're interested in the topic, I recommend reading [my blog post on it](https://mikelsagardia.io/blog/text-generation-rnn.html), where I introduce Recurrent Neural Networks (RNNs) based on LSTM units and their application to language modeling.\n\nRegarding the results, even though the text that the trained model generates doesn't make much sense, it does seem to follow the general structure of the scripts in the dataset:\n\n```\njerry: you know, it's the way i can do. i don't know what the hell happened.\n\njerry: what?\n\ngeorge: what about it?\n\nelaine: i think you could be able to get out of here.\n\njerry: oh, i can't do anything about the guy.\n\njerry: what?\n\ngeorge:(smiling) yeah..........\n\ngeorge: you know, you should do the same thing.\n\njerry: i think i can.\n\njerry: oh, no, no! no. no.\n\njerry: i don't know.(to the phone) what do you think?\n\ngeorge: what?\n\njerry: oh, i think you're not a good friend.\n\n...\n```\n\n... 
and that's with very few hours of effort and GPU training, so I'd say it's a good starting point :sweat_smile:\n\nTable of Contents:\n\n- [Text Generation Project: Writing TV Scripts](#text-generation-project-writing-tv-scripts)\n  - [How to Use This](#how-to-use-this)\n    - [Overview of Files and Contents](#overview-of-files-and-contents)\n    - [Dependencies](#dependencies)\n  - [Brief Notes on RNNs and Their Application to Language Modeling](#brief-notes-on-rnns-and-their-application-to-language-modeling)\n  - [Practical Notes on the Text Generation Application](#practical-notes-on-the-text-generation-application)\n  - [Improvements, Next Steps](#improvements-next-steps)\n  - [Interesting Links](#interesting-links)\n  - [Authorship](#authorship)\n\n## How to Use This\n\nIn order to use the model, you need to install the [dependencies](#dependencies) and execute the notebook [tv_script_generation.ipynb](tv_script_generation.ipynb), which is the main application file that defines and trains the network.\n\nNext, I give a more detailed description of the contents and usage.\n\n### Overview of Files and Contents\n\nAltogether, the project directory contains the following files and folders:\n\n```\n.\n├── README.md                           # This file\n├── data/                               # Dataset folder\n│   └── Seinfeld_Scripts.txt            # Dataset: scripts\n├── tv_script_generation.ipynb          # Project notebook\n├── helper.py                           # Utilities: load/preprocess data, etc.\n├── problem_unittests.py                # Unit tests\n└── requirements.txt                    # Dependencies\n```\n\nAs already introduced, the notebook [tv_script_generation.ipynb](tv_script_generation.ipynb) takes care of almost everything. 
That file uses the following two scripts:\n\n- [helper.py](helper.py), which contains utility functions related to data preprocessing and model persistence,\n- and [problem_unittests.py](problem_unittests.py), which contains the definitions of the unit tests run across the whole notebook.\n\nAll in all, the following sections/tasks are implemented in the project notebook:\n\n- The dataset is loaded and briefly explored.\n- The dataset is preprocessed: tokenization is performed, vocabulary dictionaries are created.\n- A parametrized data loader is defined which delivers batches of token sequences with their expected target token. Basically, if we have a sequence `X` of `N` tokens, the target `y` is the next token in the text; and all that is provided in batches of a desired size.\n- An RNN is defined, which has:\n  - An embedding layer.\n  - An LSTM layer with a parametrized number of layers within it.\n  - A fully connected layer, preceded by dropout.\n- The network is trained.\n- New scripts are generated.\n\nWhen the complete notebook is executed, several other artifacts are generated:\n\n- A binary with the pre-processed text.\n- The trained models (best and last).\n- A TV script generated with the trained model.\n\n### Dependencies\n\nYou should create a Python environment (e.g., with [conda](https://docs.conda.io/en/latest/)) and install the dependencies listed in the [requirements.txt](requirements.txt) file.\n\nThe following commands set everything up:\n\n```bash\nconda create -n text-gen python=3.6\nconda activate text-gen\nconda install pytorch -c pytorch\nconda install pip\npip install -r requirements.txt\n```\n\n## Brief Notes on RNNs and Their Application to Language Modeling\n\nWhile [Convolutional Neural Networks (CNNs)](https://en.wikipedia.org/wiki/Convolutional_neural_network) are particularly good at capturing spatial relationships, [Recurrent Neural Networks 
(RNNs)](https://en.wikipedia.org/wiki/Recurrent_neural_network) model sequential structures very efficiently. Also, in recent years, the [Transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) architecture has been shown to work remarkably well with language data -- but let's set it aside for this small toy project.\n\nIn many language modeling applications, and in the particular text generation case explained here, we need to undertake the following general steps:\n\n- The text needs to be **processed** as sequences of numerical vectors.\n- We define **recurrent layers** which take those sequences of vectors and yield sequences of outputs.\n- We take the complete or partial output sequence and we **map it to the target space**, e.g., words.\n\nIf you'd like to know more about these steps, you should read my [blog post](https://mikelsagardia.io/blog/text-generation-rnn.html) on the project.\n\n## Practical Notes on the Text Generation Application\n\nThe notebook has many notes as Markdown text and code comments so that it is fairly understandable what is being done at each stage. In addition, consider these general comments:\n\n- LSTM units are defined with `nn.LSTM` in PyTorch, and although they are called *units*, they are more like a layer than a neuron, akin to `nn.RNN`; their feed-forward equivalent would be `nn.Linear`. Additionally, `nn.LSTM` can have several stacked layers inside.\n- We can pass one vector after another in a loop. However, it's more efficient to pass a sequence of vectors together in a tensor. On top of a sequence, we can define batches of sequences. While sequences are usually defined by the application programmer, I'd advise creating batches automatically with the [PyTorch `DataLoader`](https://pytorch.org/docs/stable/data.html) API, as shown in the notebook.\n- When we input a sequence, we get as output a sequence of the same length; the output sequence is composed of hidden memory state vectors. 
The size of a hidden state vector doesn't need to be the same as the size of an input vector. This can be seen in the notebook, too; if you'd like more explanations, I encourage you to read [my blog post](https://mikelsagardia.io/blog/text-generation-rnn.html).\n- RNNs have many hyperparameters and it can be overwhelming to select the correct starting set. [Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) gives a great collection of hints in his project [char-rnn](https://github.com/karpathy/char-rnn), which I have followed.\n\n## Improvements, Next Steps\n\n- [ ] Try different model weight initializations (e.g., for the embedding layer) to check if it is possible to improve model convergence.\n- [ ] Carry out hyperparameter tuning, maybe with [skorch](https://skorch.readthedocs.io/en/stable/).\n\n## Interesting Links\n\n- [My blog post on the project](https://mikelsagardia.io/blog/text-generation-rnn.html).\n- [My notes and code](https://github.com/mxagar/computer_vision_udacity) on the [Udacity Computer Vision Nanodegree](https://www.udacity.com/course/computer-vision-nanodegree--nd891).\n- My toy project on [sentiment analysis](https://github.com/mxagar/text_sentiment).\n- My toy project on [image captioning](https://github.com/mxagar/image_captioning).\n- [My notes and code](https://github.com/mxagar/deep_learning_udacity) on the [Udacity Deep Learning Nanodegree](https://www.udacity.com/course/deep-learning-nanodegree--nd101).\n- [Character-level LSTM to generate text](https://github.com/mxagar/CVND_Exercises/blob/master/2_4_LSTMs/3_1.Chararacter-Level%20RNN%2C%20Exercise.ipynb), based on [a post by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).\n- Generating Bach music: [DeepBach](https://arxiv.org/pdf/1612.01010.pdf).\n- Predicting seizures in intracranial EEG recordings: [American Epilepsy Society Seizure Prediction Challenge](https://www.kaggle.com/c/seizure-prediction).\n\n## Authorship\n\nMikel 
Sagardia, 2022.  \nNo guarantees.\n\nYou are free to use this project, but please link it back to the original source.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmxagar%2Ftext_generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmxagar%2Ftext_generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmxagar%2Ftext_generator/lists"}