https://github.com/neemiasbsilva/developing-nanogpt2-fineweb
Developing a custom nano GPT-2 from scratch using PyTorch on the FineWeb dataset.
- Host: GitHub
- URL: https://github.com/neemiasbsilva/developing-nanogpt2-fineweb
- Owner: neemiasbsilva
- License: mit
- Created: 2024-06-14T04:42:28.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-06-19T02:15:42.000Z (10 months ago)
- Last Synced: 2025-01-05T13:13:38.396Z (4 months ago)
- Topics: attention, deep-learning, gpt, openai-gpt2, openai-gpt3, pytorch, transformers
- Language: Python
- Homepage:
- Size: 229 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Developing a nano GPT-2 from scratch using PyTorch and training on the FineWeb dataset

## Table of contents
- [About](#about)
- [Project Organization](#project-organization)
- [Train resources](#train-resources)
- [Usage for text generation](#usage-for-text-generation)

## About
Developing a custom nano GPT-2 from scratch using PyTorch and training it on the FineWeb dataset. This repository reproduces the [OpenAI GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf), using the training hyper-parameters from the [OpenAI GPT-3 paper](https://arxiv.org/abs/2005.14165). The dataset used was [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (the smallest version, around 10B GPT-2 tokens).
Example of the dataset used for the training and evaluation phases. For more details about the dataset, visit the Hugging Face FineWeb page.
**Note**: These experiments were based on [Andrej Karpathy](https://karpathy.ai)'s work, [nanoGPT](https://github.com/karpathy/nanoGPT).
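
As a rough illustration of how the ~10B-token FineWeb sample can be streamed and tokenized with the GPT-2 BPE vocabulary, the sketch below uses the Hugging Face `datasets` and `tiktoken` libraries. The dataset config name `sample-10BT` and the streaming approach are assumptions for illustration; the repository's own pipeline lives in `src/data/manager_data.py`.
```
# Minimal sketch: stream the ~10B-token FineWeb sample and tokenize it
# with the GPT-2 BPE tokenizer. Illustrative only; the actual data
# pipeline of this repository is implemented in src/data/manager_data.py.
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, vocab size 50257

# "sample-10BT" is assumed to be the ~10B-token subset of FineWeb.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for doc in fineweb:
    tokens = enc.encode_ordinary(doc["text"])  # token ids for one document
    # ... write tokens to shards on disk for training ...
    break
```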
---
## Project Organization
```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description,
│                         e.g. `1.0-nbs-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for custom_nanogpt2_fineweb
│                         and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    ├── data           <- Scripts to manage the data
    │   └── manager_data.py
    │
    ├── configs        <- Configs for data, training, and the GPT model
    │   ├── setup.py
    │   └── config.yaml
    │
    ├── model          <- Scripts to build the GPT-2 model
    │   ├── transformer_blocks.py
    │   └── gpt2_model.py
    │
    ├── train.py       <- Script to train the GPT-2 model
    └── generate.py    <- Script to generate answers from the custom trained GPT-2 model
```
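
For orientation, the model code in `src/model/` presumably assembles a standard decoder-only transformer. The sketch below shows a GPT-2 (124M)-sized configuration and a causal self-attention module of the kind `transformer_blocks.py` would contain; the class and attribute names here are illustrative assumptions, not the repository's actual API (the real hyper-parameters live in `src/configs/config.yaml`).
```
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Hyper-parameters of the 124M-parameter GPT-2 model (illustrative).
    block_size: int = 1024    # maximum context length
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size
    n_layer: int = 12         # number of transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # embedding / hidden dimension

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention, as used in each GPT-2 block."""
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # joint QKV projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused scaled-dot-product attention with a causal mask
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```

---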
## Train Resources
Training was conducted on a system with four NVIDIA GeForce RTX 3090 GPUs, an Intel® Core™ i7-10700 CPU running at 2.90 GHz, and 130 GB of RAM.
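
For multi-GPU training with PyTorch's DistributedDataParallel, a run across the four GPUs could be launched with `torchrun`; the entry point and flags below are an assumption for illustration, not a documented command of this repository:
```
torchrun --standalone --nproc_per_node=4 src/train.py
```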
---
## Usage for text generation
Clone the repository and create a conda environment:
```
conda env create --name envname --file=environments.yml
```

Download the model file available in the link below and put it in the `models/` directory ([download the model checkpoint](https://drive.google.com/file/d/1YZwtNmsrfY1zjMr1_38kF0Lo9bDicA7O/view?usp=share_link)).
After that, open the inference config file (`src/config/config_inference.yaml`) and set the message you want (e.g. `message: "Hello GPT, can you explain what is machine learning?"`).
Finally, run the inference (no GPU is required) with:
```
python generate.py
```

The generated text will be stored in `reports/generation.json`.
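The exact schema of `generation.json` is not documented here; one simple way to inspect the result (a sketch, assuming the file is plain JSON):
```
import json

# Load and pretty-print whatever structure generate.py wrote out.
with open("reports/generation.json") as f:
    print(json.dumps(json.load(f), indent=2, ensure_ascii=False))
```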