https://github.com/gardner/gsd
GPU Swarm for Datasets
- Host: GitHub
- URL: https://github.com/gardner/gsd
- Owner: gardner
- License: gpl-3.0
- Created: 2024-02-02T10:33:46.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-05T21:36:20.000Z (over 2 years ago)
- Last Synced: 2025-08-01T22:20:31.714Z (11 months ago)
- Language: Python
- Homepage:
- Size: 20.5 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# GSD (GPU Swarm for Datasets)
## Join the Swarm
```shell
docker run -it --rm --shm-size=2g --gpus all gardner/gsd:latest --server https://q.llm.nz
```
## Introduction
GSD (GPU Swarm for Datasets) is a FOSS initiative designed to democratize the improvement of datasets for broader R&D. Volunteers contribute GPU time to enhance text datasets, making them more useful for the community at large. The project is inspired by the paper ["Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling"](https://arxiv.org/abs/2401.16380) by Maini et al. (January 29, 2024) and takes a community approach to processing and refining data efficiently and at scale.
GSD is similar to a decentralized version of Facebook's [cc_net](https://github.com/facebookresearch/cc_net).
## How It Works
Participants (you!) contribute by running a Docker container that connects to a work queue, fetches work packets, processes the text locally using [vLLM](https://github.com/vllm-project/vllm), and submits the results. The output is collated and batched before being published to Hugging Face.
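The fetch, process, submit cycle described above can be sketched roughly as follows. This is an illustration only, not the actual GSD worker: the packet shape (`id`, `texts`), the function names, and the in-memory queue are all assumptions made for this example, since the README does not describe the wire format or the server API.

```python
from typing import Callable, Optional

# Hypothetical packet shape: {"id": str, "texts": [str, ...]}.
# The real GSD wire format is not described in this README.
Packet = dict


def run_worker(
    fetch: Callable[[], Optional[Packet]],
    rephrase: Callable[[list], list],
    submit: Callable[[dict], None],
    max_packets: Optional[int] = None,
) -> int:
    """Fetch packets, process them locally, submit results; return the count."""
    done = 0
    while max_packets is None or done < max_packets:
        packet = fetch()
        if packet is None:  # queue is empty, so this sketch simply stops
            break
        outputs = rephrase(packet["texts"])
        submit({"id": packet["id"], "outputs": outputs})
        done += 1
    return done


# In-memory stand-ins so the sketch runs without a server or a GPU.
queue = [
    {"id": "p1", "texts": ["hello world"]},
    {"id": "p2", "texts": ["foo", "bar"]},
]
results = []


def fake_fetch() -> Optional[Packet]:
    return queue.pop(0) if queue else None


def fake_rephrase(texts: list) -> list:
    # Placeholder for local model inference.
    return [t.upper() for t in texts]


processed = run_worker(fake_fetch, fake_rephrase, results.append)
print(processed, results)
```

In a real worker, `fetch` and `submit` would presumably be HTTP calls to the address passed via `--server`, and `rephrase` would run the model through vLLM. Passing them in as callables keeps the loop itself testable without either.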
## Getting Started
### Prerequisites
- Docker
- A stable internet connection
- A GPU that can run Mistral 7B (~12GB of VRAM). We are accepting pull requests for CPU workers!
- A commitment to Getting Shit Done
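To check the GPU prerequisite before pulling the image, something like the following sketch may help. It is not part of GSD. It assumes an NVIDIA GPU with `nvidia-smi` on the `PATH`, and the 12,000 MiB threshold is only a rough reading of the "~12GB" figure above.

```python
import subprocess

# Assumption: a rough reading of "~12GB"; adjust as needed.
REQUIRED_MIB = 12_000


def parse_vram_mib(smi_output: str) -> list:
    """Parse one integer (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def has_enough_vram(smi_output: str, required_mib: int = REQUIRED_MIB) -> bool:
    """Return True if at least one GPU reports enough total memory."""
    return any(mib >= required_mib for mib in parse_vram_mib(smi_output))


def query_nvidia_smi() -> str:
    """Ask nvidia-smi for total memory per GPU in MiB (needs NVIDIA drivers)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout


# Sample output for a machine with a 24 GiB and an 8 GiB card.
sample = "24576\n8192\n"
print(has_enough_vram(sample))
```

On a machine with NVIDIA drivers installed, `has_enough_vram(query_nvidia_smi())` performs the same check against the real hardware.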
### Join the Swarm
To join the swarm, simply run the docker container:
```shell
docker run -it --rm --shm-size=2g --gpus all gardner/gsd:latest --server https://q.llm.nz
```
### Contributing
1. **Clone the Repository**:
```bash
git clone https://github.com/gardner/gsd.git
```
2. **Run the Docker Container**: In the repository's directory, execute:
```bash
docker-compose up --build
```
## Acknowledgements
A huge thank you to every volunteer and to the pioneers who inspired GSD, especially the authors of the influential work by Maini et al. Your contributions drive this project forward.
## License
GSD is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE) file for details.