{"id":26298990,"url":"https://github.com/kanavgoyal898/echogpt","last_synced_at":"2025-03-15T06:39:27.849Z","repository":{"id":249544194,"uuid":"830773963","full_name":"kanavgoyal898/echoGPT","owner":"kanavgoyal898","description":" echoGPT is a minimal GPT implementation for character-level language modeling with 25.4M parameters. Built with PyTorch, it includes multi-head self-attention, feed-forward layers, and position embeddings. Trained on text like tiny_shakespeare.txt to predict the next character.","archived":false,"fork":false,"pushed_at":"2025-01-11T04:10:27.000Z","size":7714,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-11T05:19:13.015Z","etag":null,"topics":["deep-learning","generative-pretrained-transformers","natural-language-processing","neural-networks","python","pytorch","research-implementation","scratch-implementation","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kanavgoyal898.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-19T01:14:41.000Z","updated_at":"2025-01-11T04:10:31.000Z","dependencies_parsed_at":"2024-12-24T18:37:10.246Z","dependency_job_id":null,"html_url":"https://github.com/kanavgoyal898/echoGPT","commit_stats":null,"previous_names":["kanavgoyal898/echogpt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanavgoyal898%2FechoGPT","tags_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanavgoyal898%2FechoGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanavgoyal898%2FechoGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanavgoyal898%2FechoGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kanavgoyal898","download_url":"https://codeload.github.com/kanavgoyal898/echoGPT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243695510,"owners_count":20332625,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","generative-pretrained-transformers","natural-language-processing","neural-networks","python","pytorch","research-implementation","scratch-implementation","transformers"],"created_at":"2025-03-15T06:39:27.222Z","updated_at":"2025-03-15T06:39:27.828Z","avatar_url":"https://github.com/kanavgoyal898.png","language":"Python","readme":"# echoGPT: Character-Level Large Language Model\n\n## Overview\n\n**echoGPT** is a simplified implementation of a Generative Pre-trained Transformer (GPT) model designed for character-level language modeling with **25.4M parameters**. This implementation leverages PyTorch and includes features such as multi-head self-attention, feed-forward networks, and position embeddings. 
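The character-level setup described above (mapping each character in the corpus to an integer id before feeding the transformer) can be made concrete with a small sketch; the sample string and the `encode`/`decode` helper names here are illustrative, not taken from the repository:

```python
# Illustrative character-level tokenization for next-character prediction.
# In the actual project the corpus would be the full tiny_shakespeare.txt file.
text = "First Citizen: Before we proceed any further, hear me speak."

chars = sorted(set(text))                      # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s: str) -> list[int]:
    """Turn a string into a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hear me")
assert decode(ids) == "hear me"    # round-trip is lossless
print(len(chars), ids)
```

The training data is then just this id sequence; each position's target is the id of the following character.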
The model is trained on a text dataset (in this case, `tiny_shakespeare.txt`) to predict the next character in a sequence.\n\n\u003cdiv style=\"text-align: center;\"\u003e\n  \u003cimg src=\"./transformer.png\" alt=\"Preview\" style=\"width: 64%;\"\u003e\n\u003c/div\u003e\n\n## References\n\n1. **Vaswani et al. (Google Research, 2017). [Attention is All You Need](https://arxiv.org/pdf/1706.03762)**: Introduced the transformer architecture, which utilizes self-attention mechanisms and parallel processing of input sequences, significantly improving the efficiency and scalability of deep learning models for NLP tasks.\n2. **He et al. (Microsoft Research, 2015). [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385)**: Proposed residual networks (ResNets), which introduced skip connections to solve the vanishing gradient problem, enabling the training of very deep neural networks.\n3. **Ba et al. (University of Toronto, 2016). [Layer Normalization](https://arxiv.org/pdf/1607.06450)**: Introduced layer normalization, a technique that normalizes the inputs across the features, stabilizing and speeding up the training of neural networks.\n4. **Hinton et al. (University of Toronto, 2012). [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/pdf/1207.0580)**: Proposed the use of dropout, a regularization technique that prevents overfitting by randomly dropping units during training, improving the generalization of neural networks.\n\n## Concepts\n\n1. **Transformer Architecture**:\n   - Uses self-attention mechanisms to process input sequences in parallel, making it more efficient for long-range dependencies compared to RNNs.\n   - Composed of multiple layers of self-attention and feed-forward neural networks.\n\n2. 
**Self-Attention**:\n   - Allows the model to weigh the importance of different tokens in the input sequence when making predictions.\n   - Implemented using multi-head attention, where multiple self-attention mechanisms run in parallel.\n\n3. **Feed-Forward Networks**:\n   - Each self-attention layer is followed by a feed-forward network that processes the output of the self-attention mechanism.\n\n4. **Layer Normalization**:\n   - Applied to stabilize and speed up the training process.\n\n5. **Positional Embeddings**:\n   - Since transformers do not inherently understand the order of tokens, positional embeddings are added to token embeddings to encode the position of each token in the sequence.\n\n## Hyperparameters\n\n- **batch_size**: Number of sequences processed in parallel.\n- **block_size**: Maximum context length for predictions.\n- **max_iters**: Number of training iterations.\n- **eval_intervals**: Number of training iterations between evaluations.\n- **eval_iters**: Number of iterations used to estimate the loss during evaluation.\n- **learning_rate**: Learning rate for the optimizer.\n- **fwd_mul**: Multiplier for the feed-forward network's hidden layer size.\n- **n_heads**: Number of attention heads.\n- **n_layers**: Number of transformer layers.\n- **n_embd**: Embedding size for tokens and positions.\n- **dropout**: Dropout rate for regularization.\n\n## Classes\n\n### 1. `Head`\nA single head of self-attention, which computes attention scores between tokens and produces weighted representations of values.\n\n### 2. `MultiHeadAttention`\nCombines multiple attention heads in parallel. Each head computes attention separately, and their outputs are concatenated and projected to form the final representation.\n\n### 3. `FeedForward`\nDefines a simple feed-forward network with one hidden layer and a ReLU activation. This is applied independently to each token in the sequence.\n\n### 4. 
`Block`\nRepresents a transformer block, consisting of multi-head self-attention followed by a feed-forward network. Layer normalization is applied after each sub-layer.\n\n### 5. `BigramLanguageModel`\nThe main model class that encompasses the token and positional embedding layers, multiple transformer blocks, and the final output layer for character predictions. It includes methods for:\n- Forward propagation (calculating logits and loss).\n- Generating new text sequences based on a given context.\n\n## Training\n\n- **Data Preparation**: The text data is encoded into integers based on a character-level vocabulary. The dataset is split into training and testing sets.\n- **Batch Generation**: Batches of input and target sequences are generated for training.\n- **Loss Estimation**: The model's performance is periodically evaluated on the training and testing sets.\n- **Optimization**: The AdamW optimizer is used to update the model parameters based on the computed gradients.\n\n## Usage\n\n1. **Prepare Data**: Ensure the `tiny_shakespeare.txt` file is in the `models` directory.\n2. **Train the Model**: Run the training loop, which will print the training and testing losses at specified intervals.\n3. **Generate Text**: After training, run `sample.py` to produce new text based on a given context.\n\n## Requirements\n\n- Python 3.x\n- PyTorch (version compatible with CUDA if available)\n\n## Sample Text\n\n```\nEt tu, Brute? O what is that sad?\nAnd, King Richard, no longer peace and doth ever\nOnly bury king, this sits and writing,\nAs thus penceful to make offence and please.\nRichard is the name? 
What sayestilence of due?\nFarewell, that courtesy.\n\nKING RICHARD II:\nExcept for you our purpose.\n\nSecond Citizen:\nOnce can be daily; prepare your fire to merry.\n\nOXFORD:\nStops, go to them all you and hear it;\nNor fair of the thoughty and your affection\nBy your brother, for unpiection may go.\n\nBUCKINGHAM:\nYou had scarce when Baptista's deserted:\nFor your honour mark, we'll pardon it;\nFor all servants all their flatterers coal-night,\nYour lords are monister, as they dare.\nThus it could prove at Razaunt, then create\nDo with much the effect him. Weeping, well eaten'd.\nI hear, sir; you are up howly gone to Marcius:\nWho see\nWho call him he fix'd matcher; be not found\nSave her labour in thy sort.\n\nPOMPEY:\n\nSirrah, sir;\nOut of the men body members of those will move\nI had called to have their purchasage\nTo meet hundred.'\n\nFirst Murderer:\nAy, my lord;\nAnd in my father deserves here I dain.\n\nCAPULET:\nAnd thus I was, to give her by the suit.\naid me, let me name had no wear: sooth:\nWho call thy bend the household the clouds King Richard's blood,\nOr shall pluck thou to't do again.\nNot of the grave's prayer citizens,\nTo be fear it tell thee so, and deliver thee,\nTo quench him to his and loving to see:\nBehold we laugh a tock to him or a purpose,\nWho with these none hope hast done not know him,\nBut thou more approclaiment, do do undo a part,\nTo put the voice.\n```\n\n## Statistics\n```\ntiny_shakespeare.pt\nstep   200: train loss 2.4401, test loss 2.4653\nstep   400: train loss 2.0728, test loss 2.1762\nstep   600: train loss 1.6529, test loss 1.8009\nstep   800: train loss 1.4683, test loss 1.6798\nstep  1000: train loss 1.3641, test loss 1.5981\nstep  1200: train loss 1.2827, test loss 1.5401\nstep  1400: train loss 1.2247, test loss 1.5094\nstep  1600: train loss 1.1812, test loss 1.4891\nstep  1800: train loss 1.1368, test loss 1.4898\nstep  2000: train loss 1.1038, test loss 1.4972\n```\n\n## Conclusion\n\nechoGPT is a basic 
implementation of the transformer architecture, demonstrating key concepts in NLP and deep learning. It can be further enhanced by incorporating techniques such as learning rate scheduling, advanced sampling strategies, or more sophisticated data preprocessing.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkanavgoyal898%2Fechogpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkanavgoyal898%2Fechogpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkanavgoyal898%2Fechogpt/lists"}