Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shivendrra/smalllanguagemodel-project
An LLM cookbook for building your own model from scratch, all the way from gathering data to training.
- Host: GitHub
- URL: https://github.com/shivendrra/smalllanguagemodel-project
- Owner: shivendrra
- License: MIT
- Created: 2023-12-03T11:55:19.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-06-25T11:18:43.000Z (5 months ago)
- Last Synced: 2024-06-26T10:12:43.936Z (5 months ago)
- Topics: bert-model, decoder-model, gpt, llm-cookbook, llm-training, llms, machine-learning, neural-networks, transformer
- Language: Jupyter Notebook
- Homepage:
- Size: 66.7 MB
- Stars: 111
- Watchers: 4
- Forks: 10
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SmallLanguageModel
This repository contains everything you need to build your own LLM from scratch; just follow the instructions. Inspired by Karpathy's nanoGPT and his Shakespeare generator, I made this repository to build my own LLM. It covers the whole pipeline, from data collection to the model architecture, tokenizer, and training files.
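To give a flavor of the tokenizer piece of that pipeline, here is a minimal character-level tokenizer sketch in plain Python. This is an illustrative example only, not the repository's actual tokenizer (which lives in the Models directory):

```python
class CharTokenizer:
    """Minimal character-level tokenizer: maps each unique character to an integer id."""

    def __init__(self, text: str):
        # Build the vocabulary from the characters seen in the training text.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.stoi)

    def encode(self, s: str) -> list[int]:
        # Characters not seen during vocabulary building raise KeyError here.
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("hello world")
ids = tok.encode("hello")
assert tok.decode(ids) == "hello"
```

Real subword tokenizers (BPE, WordPiece) build merged tokens instead of single characters, but the encode/decode round-trip contract is the same.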
## Repo Structure
This repo contains:
1. **Data Collector:** Directory containing web scrapers, in case you want to gather the data from scratch instead of downloading it.
2. **Data Processing:** Directory containing code to pre-process certain kinds of files, such as converting Parquet files to .txt and .csv, plus file-appending utilities.
3. **Models:** Contains all the code needed to train a model of your own: a BERT model, a GPT model, and a Seq-2-Seq model, along with tokenizer and run files.

## Prerequisites
Before setting up SmallLanguageModel, ensure that you have the following prerequisites installed:
1. Python 3.8 or higher
2. pip (Python package installer)

## How to Use
Follow these steps to train your own tokenizer or generate outputs from the trained model:
1. Clone this repository:
```shell
git clone https://github.com/shivendrra/SmallLanguageModel-project
cd SmallLanguageModel-project
```
2. Install dependencies:
```shell
pip install -r requirements.txt
```
3. Train:
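As a rough, self-contained illustration of the train-on-text idea, a character-level bigram model can be "trained" by counting transitions and then sampled from. This is a hypothetical simplification for intuition only; the repository's actual models are transformers, and the real procedure is documented in training.md:

```python
import random
from collections import defaultdict

def train_bigram(text: str):
    """'Train' a character-level bigram model by counting next-character frequencies."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start: str, length: int, seed: int = 0) -> str:
    """Sample characters from the learned bigram distribution."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:  # no observed successor: stop generating
            break
        chars, weights = zip(*nxt.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

model = train_bigram("to be or not to be")
sample = generate(model, "t", 20)  # a short sampled string starting with "t"
```

A transformer replaces the count table with a learned network conditioned on a longer context, but the loop of "fit next-token statistics, then sample" is the same.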
Read [training.md](https://github.com/shivendrra/SmallLanguageModel-project/blob/main/training.md) and follow it for detailed training instructions.

## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=shivendrra/SmallLanguageModel-project&type=Date)](https://star-history.com/#shivendrra/SmallLanguageModel-project&Date)
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.

## License
MIT License. Check out [License.md](https://github.com/shivendrra/SmallLanguageModel-project/blob/main/LICENSE) for more info.