Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shivendrra/smalllanguagemodel-project
An LLM cookbook for building your own model from scratch, all the way from gathering data to training.
- Host: GitHub
- URL: https://github.com/shivendrra/smalllanguagemodel-project
- Owner: shivendrra
- License: MIT
- Created: 2023-12-03T11:55:19.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-06-25T11:18:43.000Z (5 months ago)
- Last Synced: 2024-06-26T10:12:43.936Z (5 months ago)
- Topics: bert-model, decoder-model, gpt, llm-cookbook, llm-training, llms, machine-learning, neural-networks, transformer
- Language: Jupyter Notebook
- Homepage:
- Size: 66.7 MB
- Stars: 111
- Watchers: 4
- Forks: 10
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SmallLanguageModel
This repository contains everything you need to build your own LLM from scratch; just follow the instructions. Inspired by Karpathy's nanoGPT and his Shakespeare generator, I made this repository to build my own LLM. It covers the whole pipeline, from data collection to the model architecture, tokenizer, and training files.
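To give a flavor of the tokenizer piece of that pipeline, here is a minimal character-level tokenizer sketch in plain Python. This is an illustrative example only, not the repository's actual tokenizer (which lives in the Models directory):

```python
class CharTokenizer:
    """Minimal character-level tokenizer: maps each unique character to an integer id."""

    def __init__(self, text: str):
        # Build the vocabulary from the characters seen in the training text.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.stoi)

    def encode(self, s: str) -> list[int]:
        # Characters not seen during vocabulary building raise KeyError here.
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("hello world")
ids = tok.encode("hello")
assert tok.decode(ids) == "hello"
```

Real subword tokenizers (BPE, WordPiece) build merged tokens instead of single characters, but the encode/decode round-trip contract is the same.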
## Repo Structure
This repo contains:
1. **Data Collector:** Directory containing web scrapers, in case you want to gather the data from scratch instead of downloading it.
2. **Data Processing:** Directory containing code to pre-process certain kinds of files, such as converting Parquet files to .txt and .csv, plus file-appending utilities.
3. **Models:** Contains all the code needed to train a model of your own: a BERT model, a GPT model, and a Seq-2-Seq model, along with tokenizer and run files.

## Prerequisites
Before setting up SmallLanguageModel, ensure that you have the following prerequisites installed:
1. Python 3.8 or higher
2. pip (Python package installer)

## How to Use
Follow these steps to train your own tokenizer or generate outputs from the trained model:
1. Clone this repository:
```shell
git clone https://github.com/shivendrra/SmallLanguageModel-project
cd SmallLanguageModel-project
```
2. Install dependencies:
```shell
pip install -r requirements.txt
```
3. Train:
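As a rough, self-contained illustration of the train-on-text idea, a character-level bigram model can be "trained" by counting transitions and then sampled from. This is a hypothetical simplification for intuition only; the repository's actual models are transformers, and the real procedure is documented in training.md:

```python
import random
from collections import defaultdict

def train_bigram(text: str):
    """'Train' a character-level bigram model by counting next-character frequencies."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start: str, length: int, seed: int = 0) -> str:
    """Sample characters from the learned bigram distribution."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:  # no observed successor: stop generating
            break
        chars, weights = zip(*nxt.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

model = train_bigram("to be or not to be")
sample = generate(model, "t", 20)  # a short sampled string starting with "t"
```

A transformer replaces the count table with a learned network conditioned on a longer context, but the loop of "fit next-token statistics, then sample" is the same.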
Read [training.md](https://github.com/shivendrra/SmallLanguageModel-project/blob/main/training.md) and follow it for detailed training instructions.

## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=shivendrra/SmallLanguageModel-project&type=Date)](https://star-history.com/#shivendrra/SmallLanguageModel-project&Date)
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.

## License
MIT License. Check out [License.md](https://github.com/shivendrra/SmallLanguageModel-project/blob/main/LICENSE) for more info.