{"id":18301193,"url":"https://github.com/jaymody/simplegpt","last_synced_at":"2025-04-05T14:30:49.314Z","repository":{"id":189409466,"uuid":"680627362","full_name":"jaymody/simpleGPT","owner":"jaymody","description":"Simple implementation of a GPT (training and inference) in PyTorch.","archived":false,"fork":false,"pushed_at":"2023-12-11T20:30:05.000Z","size":10,"stargazers_count":10,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-21T05:43:02.275Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaymody.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-08-19T21:24:19.000Z","updated_at":"2024-10-14T09:29:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"48cbaf92-0bc1-4f08-876b-a8f90e6f5785","html_url":"https://github.com/jaymody/simpleGPT","commit_stats":null,"previous_names":["jaymody/simplegpt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2FsimpleGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2FsimpleGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2FsimpleGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaymody%2FsimpleGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaymody","download_url":"https://codeload.github.com/jaymody/simpleGPT/tar.gz/refs/heads/main","host":{"name"
:"GitHub","url":"https://github.com","kind":"github","repositories_count":247352273,"owners_count":20925243,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T15:14:43.240Z","updated_at":"2025-04-05T14:30:49.049Z","avatar_url":"https://github.com/jaymody.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SimpleGPT\n\nSimple implementation of a GPT (training and inference) in PyTorch.\n\nBasically my version of [nanoGPT](https://github.com/karpathy/nanoGPT) with some minor differences:\n\n* Using [lightning](https://lightning.ai) to handle training.\n* Using [einops](https://github.com/arogozhnikov/einops) for readable neural net code.\n* Using [pydantic](https://docs.pydantic.dev/latest/) instead of [Poor Man's Configurator](https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/configurator.py#L2).\n* Is even simpler (imo).\n\n## Install Dependencies\n```shell\npip install .\n```\n\nIf you're developing changes to the codebase, use:\n```shell\npip install -e \".[dev]\"\n```\n\n## Run GPT2 Inference\n\nRunning inference on pre-trained GPT2 model:\n\n```shell\npython -m simplegpt.inference \\\n    \"Alan Turing theorized that computers would one day become\" \\\n    --model_name_or_ckpt_path \"gpt2\"\n```\n\nPre-trained GPT2 models come in the sizes `gpt2`, `gpt2-medium`, `gpt2-large`,\nand `gpt2-xl`.\n\nHere's an example with more options:\n```shell\npython -m simplegpt.inference \\\n    \"Alan Turing theorized that computers would one day become\" \\\n    --model_name_or_ckpt_path 
\"gpt2\" \\\n    --n_tokens_to_generate 40 \\\n    --batch_size 4 \\\n    --seed 321 \\\n    --temperature 0.2\n```\n\n## Train a Model from Scratch\n\nLet's train a baby GPT model on the [tiny shakespeare dataset](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt).\n\nFirst, let's download the dataset text:\n\n```ShellSession\n$ mkdir data\n\n$ cd data\n\n$ wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n\n$ head -n 20 input.txt\nFirst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us kill him, and we'll have corn at our own price.\n```\n\nThen, we'll split the dataset into a train and validation set by simply splitting the file at the 90% mark by line number:\n\n```ShellSession\n$ wc -l input.txt # count number of lines\n40000\n\n$ split -l 36000 input.txt  # 40000 * 0.9 = 36000\n\n$ ls\ninput.txt\nxaa\nxab\n\n$ wc -l xaa\n36000\n\n$ wc -l xab\n4000\n\n$ mv xaa train.txt\n\n$ mv xab val.txt\n```\n\nThen, we'll need to encode the text into tokens:\n\n```ShellSession\n$ python -m simplegpt.data --text_file \"data/train.txt\" --output_file \"data/train-char.bin\" --tokenizer_name \"char\"\n100%|████████████████████████████████████████████████████████████████████████████| 36000/36000 [00:00\u003c00:00, 694079.64it/s]\nnumber of tokens = 1016242\nsaving to data/train.bin\n\n$ python -m simplegpt.data --text_file \"data/val.txt\" --output_file \"data/val-char.bin\" --tokenizer_name \"char\"\n100%|██████████████████████████████████████████████████████████████████████████████| 4000/4000 [00:00\u003c00:00, 719403.80it/s]\nnumber of tokens = 99152\nsaving to data/val.bin\n```\n\nHere, we're using a character-level tokenizer which 
just maps each character to its ASCII value (which works on this dataset since it contains only ASCII characters). We could have also used the regular BPE-based tokenizer by instead passing in `--tokenizer_name \"gpt2\"`.\n\nFinally, we train our model using the provided configuration file `configs/shakespeare_char.toml`:\n```shell\npython -m simplegpt.train configs/shakespeare_char.toml\n```\n\nIf you used the regular \"gpt2\" tokenizer instead of the character-level tokenizer, use the `configs/shakespeare.toml` config file instead.\n\nOn my 3090, this takes about 5 minutes to train. We can run inference on the\nnewly trained model by passing in the checkpoint path to `simplegpt.inference`:\n\n```shell\npython -m simplegpt.inference \\\n    \"The lady doth protest\" \\\n    --model_name_or_ckpt_path \"models/tiny_shakespeare_char/ckpts/last.ckpt\" \\\n    --n_tokens_to_generate 100\n```\n\nFor the character-level model, this gives an output of:\n\n```\n==== Result 1 ====\n that I have seen.\n\nQUEEN ELIZABETH:\nThe linealness hath been done to have it so.\n\nKING RICHARD III:\nAnd shall I live, if to hear a thousand affects\nAre to be so still a son of the king.\n\nQUEEN ELIZAB\n\n==== Result 2 ====\n; and with the time shall\nWith some men of the world and look for his head.\n\nQUEEN MARGARET:\nAnd leave me so, and leave the world alone.\n\nKING HENRY VI:\nWhat said Clarence to my son? 
what say you?\n```\n\nFor the regularly tokenized model, this gives an output of:\n\n```\n==== Result 1 ====\n; and so still so much,\nThat, were I madmen,--\n\nPAULINA:\nThat's enough.\nI must be so far gone, sir, sit by my side,\nAnd leave it as you to part your company:\nGood night.\n\nLEONTES:\nThou'rt i' the or bad;\nI have forgot already made thy beauty.\n\nPAULINA:\nA most unworthy and unnatural lord\nCan do no\n\n==== Result 2 ====\n; and so I am:\nI think there is not half the night in her\nUntil the fair ladies of York's lap,\nAnd in my state and honour beauteous inn,\nWhy should I think be so deep a maidr'd her sweetor?\n\nKING RICHARD III:\nMadam, so I am a subject.\n\nQUEEN ELIZABETH:\nAnd shall I woo her?\n```\n\n\n## Todos\n- [ ] Add support for fine-tuning.\n- [ ] Actually reproduce GPT-2 (while I don't have the compute resources for this, I can at least run the model for a couple days on my 3090 and check that the train/val loss is what it's supposed to be).\n- [ ] Add support for resuming training.\n- [ ] Add support for top-p and top-k sampling.\n- [ ] Use flash attention.\n- [ ] Add support for lower-precision training.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaymody%2Fsimplegpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaymody%2Fsimplegpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaymody%2Fsimplegpt/lists"}