{"id":25491540,"url":"https://github.com/alperiox/bookbot","last_synced_at":"2026-04-15T16:08:16.091Z","repository":{"id":239689083,"uuid":"799956754","full_name":"alperiox/bookbot","owner":"alperiox","description":"A toy project for my generative AI studies on text data. Train generative models with given book/text files with just a single script.","archived":false,"fork":false,"pushed_at":"2025-05-25T13:13:19.000Z","size":1930,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-19T08:02:29.534Z","etag":null,"topics":["generative-ai","generative-model","gpt","machine-learning","nlp","nlp-machine-learning","python","python3","pytorch","torch","transformers"],"latest_commit_sha":null,"homepage":"https://github.com/alperiox/bookbot/wiki","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alperiox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-13T12:34:38.000Z","updated_at":"2025-05-25T13:13:22.000Z","dependencies_parsed_at":"2024-10-21T03:09:03.310Z","dependency_job_id":null,"html_url":"https://github.com/alperiox/bookbot","commit_stats":null,"previous_names":["alperiox/bookbot"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/alperiox/bookbot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alperiox%2Fbookbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alperiox%2Fbookbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/alperiox%2Fbookbot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alperiox%2Fbookbot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alperiox","download_url":"https://codeload.github.com/alperiox/bookbot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alperiox%2Fbookbot/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267707123,"owners_count":24131330,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generative-ai","generative-model","gpt","machine-learning","nlp","nlp-machine-learning","python","python3","pytorch","torch","transformers"],"created_at":"2025-02-18T22:17:54.260Z","updated_at":"2026-04-15T16:08:16.085Z","avatar_url":"https://github.com/alperiox.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# bookbot\n\nA project that reads the given file and uses a neural network to generate text that resembles the book.\n\nThe built-in networks are an MLP, a Wavenet-inspired hierarchical MLP, and a GPT, all built from scratch in pure PyTorch, along with a batch normalization layer and Kaiming initialization.\n\nThanks to Andrej Karpathy for his great course on deep 
learning.\n\nCurrently supported file types:\n\n- PDF\n- TXT\n\n## Usage\n\n### Installation\n\nYou can try the project out by cloning the git repository:\n\n```bash\ngit clone https://github.com/alperiox/bookbot.git\n```\n\nThen just install the `poetry` environment and move on to the next steps.\n\n### How to train the network?\n\nRun `main.py` with the arguments described below.\n\nFor example, you can start training like this:\n\n```bash\npython main.py --file=romeo-and-juliet.txt --model gpt --max_steps 100\n```\n\nIf you want more control over training, the following arguments are available:\n\n| Argument | Default Value | Description |\n|----------|---------------|-------------|\n| train_ratio | 0.8 | Ratio of the input data that will be used for training |\n| file | - | Path to the PDF/TXT file |\n| n_embed | 15 | Embedding vector's dimension |\n| n_hidden | 400 | Hidden layer's dimensions (the hidden layers will be defined as n_hidden x n_hidden) |\n| block_size | 10 | Block size used to build the dataset; this is the context window in this project |\n| batch_size | 32 | Number of samples processed in one batch |\n| epochs | 10 | Number of epochs to train the model |\n| lr | 0.001 | Learning rate to update the weights |\n| generate | False | Runs generation mode, which generates text using a previously trained model. 
So you should train a model first |\n| max_new_tokens | 100 | Number of tokens to generate when the `generate` flag is set |\n| model | gpt | Model to train: hierarchical MLP (`hmlp`), MLP (`mlp`), or GPT (`gpt`) |\n| n_consecutive | 2 | Number of consecutive tokens to concatenate in the hierarchical model |\n| n_layers | 4 | Number of processor blocks in the model; see the models in `layers.py` for more information about its usage |\n| num_heads | 3 | Number of self-attention heads in the multi-head self-attention layer in the GPT implementation |\n| num_blocks | 2 | Number of layer blocks, depending on the model: sequential linear blocks for MLP and Hierarchical MLP, DecoderTransformerBlocks for GPT |\n| context | None | The context for text generation; try to use a context longer than `block_size` (required if `generate` is True) |\n| device | cpu | The device to train the models on; available values are `mps`, `cpu` and `cuda`. |\n\nTraining produces several artifacts and saves them in the `artifacts` directory. The saved artifacts include the model, the data loaders, the losses recorded during training, and the tokenizer, which holds the constructed character-level vocabulary.\n\n### How to generate new text?\n\nYou can generate text only __after training a model,__ because the generation pipeline uses the saved artifacts. To start generation, pass the `generate` flag:\n\n```bash\npython main.py --generate --context=\"Juliet,\" --max_new_tokens=100\n\u003e\u003e\u003e juliet, and have know lie thee why!\n```\n\nGeneration runs until the requested number of characters has been generated.\n\n## Further plans\n\n- [ ] Implement debugging tools to analyze the neural network's training performance. 
(mostly useful graphs and statistics.)\n  - [ ] graphs to check the layer outputs' distributions.\n        layer output distributions (with extra information about mean, std and the distribution plot)\n  - [ ] graphs to check the gradient flow\n    - [ ] layer gradient means\n    - [ ] layer gradient stds\n    - [ ] ratio of the amount of change in the parameters given the weights\n          we multiply the learning rate by the std of the layer's gradient and divide it by\n          the std of the parameters. this ratio will be higher if the gradient std is larger (grads vary too much from the mean)\n          and the params are smaller in comparison.\n    - [ ] layer grad distributions (with extra information about mean, std and the distribution plot)\n    - [ ] ratio of the gradient of a specific layer to its input\n          so if the ratio is too high, it means that the gradients are too high with respect to the input\n          and we actually want constant but smaller updates throughout the network to not miss any local minima, etc.\n    - [ ] ratio of the amount of change vs the weights, the stats should be saved in L7 here\n  - [ ] summary for the training\n- [ ] More modeling options such as LSTMs, RNNs, and Transformer-based architectures.\n  - [x] Wavenet? (implemented the hierarchical architecture)\n  - [x] GPT\n  - [ ] GPT-2\n- [ ] GPT tokenizer implementation to further improve the generation quality.\n\n## Contributing\n\nWhile I'm open to new feature ideas and stuff, please let me do the coding part since I'm trying to improve my overall understanding. Thus, I'd love to accept any feature requests as new PRs. You can reach me on Discord (@alperiox)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falperiox%2Fbookbot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falperiox%2Fbookbot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falperiox%2Fbookbot/lists"}