{"id":25504723,"url":"https://github.com/krypt0nn/inductor","last_synced_at":"2025-11-14T07:30:15.274Z","repository":{"id":277423740,"uuid":"932320868","full_name":"krypt0nn/inductor","owner":"krypt0nn","description":"Logical continuation of my markov-chains text generator using neural networks. Research project, not meant for real use.","archived":false,"fork":false,"pushed_at":"2025-02-13T20:44:46.000Z","size":209,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-13T21:32:41.607Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krypt0nn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-13T18:14:17.000Z","updated_at":"2025-02-13T20:44:49.000Z","dependencies_parsed_at":"2025-02-13T21:43:59.571Z","dependency_job_id":null,"html_url":"https://github.com/krypt0nn/inductor","commit_stats":null,"previous_names":["krypt0nn/inductor"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krypt0nn%2Finductor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krypt0nn%2Finductor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krypt0nn%2Finductor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krypt0nn%2Finductor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krypt0nn","do
wnload_url":"https://codeload.github.com/krypt0nn/inductor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239605094,"owners_count":19666998,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-19T05:56:08.345Z","updated_at":"2025-11-14T07:30:15.212Z","avatar_url":"https://github.com/krypt0nn.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Inductor\n\nLogical continuation of my [markov-chains](https://github.com/krypt0nn/markov-chains) project for text generation using neural networks.\n\n## Get started\n\n```bash\ncargo build --release\n```\n\n## 0. Create new project\n\nEvery time you want to experiment with a model, you need to create a new project. A project is a separate folder with\na TOML-formatted config file, a set of SQLite databases and compressed neural network models. The config file, named\n`inductor.toml` by default, contains parameters used by the tokens parser, word embeddings and text generation models,\nSQLite databases and so on.\n\n```bash\ninductor --config 'path/to/inductor.toml' init\n```\n\nBy default the `./inductor.toml` path is assumed. It's **generally recommended** to look into this file and tweak\nparameters depending on your needs because the defaults are not meant to be good for every use case.\n\n## 1. Prepare documents\n\nA document is a structured block of text which contains \"input\", \"context\" and \"output\" sections delimited by XML tags.\nBy default the input and context of a document are empty, but you can assign them manually. 
Documents are compressed and\nstored in an SQLite database to save disk space.\n\n### Insert documents to the dataset\n\n| Optional flags              | Meaning                                                                  |\n| --------------------------- | ------------------------------------------------------------------------ |\n| `--discord-chat`            | Assume given document path is a Discord chat history dump in JSON format |\n| `--discord-split-documents` | Split messages into separate documents                                   |\n| `--discord-last-n`          | Export only given amount of last messages                                |\n\n```bash\ninductor documents insert --document my_document.txt\n```\n\n## 2. Create tokens database\n\nA token is the smallest indivisible unit of a document. Depending on the documents parser configuration it can contain\none or many characters, including whitespace. The tokens database indexes unique tokens, which is needed in later steps.\n\n### Update tokens database from documents dataset\n\n```bash\ninductor tokens update\n```\n\n## 3. Train word embeddings model\n\nThe word embeddings model maps each token into a multi-dimensional vector. During training the model\ntries to figure out relations between the words in all the documents you provided it with\nand distributes all the tokens in a vast space. If two words have similar meanings, they will\nbe much closer to each other in this space than other words, which will greatly improve\ntext generation quality since the prediction error will not be so easily observable.\n\n### Train the model on given documents\n\nDuring training the model will try to learn relative positions of tokens in natural text documents. To do this\nit will read `embedding_context_radius * 2` tokens around the target token and learn to predict it using\nsurrounding tokens. 
To improve the model's performance we skip tokens with too few occurrences in the documents\n(fewer than `minimal_occurences`) and randomly skip tokens based on their frequency and `subsampling_value`.\n\nToken skipping probability is calculated as:\n\n```\nP_skip(token) = 1 - clamp(sqrt(subsampling_value / token_frequency))\n```\n\nwhere `clamp` restricts the `sqrt` value to the `[0.0, 1.0]` range.\n\n```bash\ninductor embeddings train\n```\n\n### Update embeddings using a pre-trained model\n\nAfter training, the embeddings database is updated automatically. If you\ndownloaded a pre-trained model, you can use this method with your own tokens database\nto create the word embeddings database.\n\n```bash\ninductor embeddings update\n```\n\n### Compare words to each other\n\nThis method allows you to use the word embeddings database to find words with meaning\nclosest to your input word. Useful for debugging your model's training results.\n\n```bash\ninductor embeddings compare\n```\n\n### Export word embeddings from the database\n\nWith this method you can export all the tokens and their embedding vectors to a CSV table\nto analyze them manually, e.g. by using [this website](https://www.csvplot.com).\n\n| Optional flags | Meaning                                                    |\n| -------------- | ---------------------------------------------------------- |\n| `--csv`        | Path to the CSV file where to save all the word embeddings |\n\n```bash\ninductor embeddings export --csv embeddings.csv\n```\n\n## 4. Train text generation model\n\nThe text generation model uses `context_tokens_num` tokens to predict the following one.\nIt also uses positional encoding which adds sines with different properties to the tokens' embeddings.\nTheoretically positional encoding should allow the model to distinguish the same word (embedding) placed\nat different positions within the text. 
You can disable positional encoding by setting\n`position_encoding_period` to 0.\n\n### Train the model on given documents and word embeddings\n\n```bash\ninductor text-generator train\n```\n\n### Generate text using the trained model\n\n| Optional flags | Meaning                                                    |\n| -------------- | ---------------------------------------------------------- |\n| `--context`    | Optional context string applied to the generating document |\n\n```bash\ninductor text-generator generate\n```\n\n## Bonus: host your device for remote model training\n\nIf you have access to remote devices, you can host their computation resources\nto train the model remotely. Good for building a sparse computation pool.\n\nUse it in combination with the `--remote-device` flags.\n\n| Optional flags | Meaning         |\n| -------------- | --------------- |\n| `--port`       | Connection port |\n\n```bash\ninductor serve\n```\n\n\u003e Note: burn doesn't support remote devices yet. This is a placeholder for potential future support.\n\nAuthor: [Nikita Podvirnyi](https://github.com/krypt0nn)\\\nLicensed under [GPL-3.0](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrypt0nn%2Finductor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrypt0nn%2Finductor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrypt0nn%2Finductor/lists"}