https://github.com/huggingface/cosmopedia
- Host: GitHub
- URL: https://github.com/huggingface/cosmopedia
- Owner: huggingface
- License: apache-2.0
- Created: 2024-02-19T09:34:19.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-20T10:42:28.000Z (about 1 year ago)
- Last Synced: 2025-09-30T18:02:26.743Z (3 months ago)
- Language: Python
- Size: 11.4 MB
- Stars: 541
- Watchers: 13
- Forks: 50
- Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-datacentric-llm
README
# Cosmopedia
[🤗 Cosmopedia dataset](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) | [🤖 1B-LLM trained on Cosmopedia](https://huggingface.co/HuggingFaceTB/cosmopedian-1b) | [📰 Blog post]
## Description
Here you can find the code used for creating [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia), a dataset of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). It contains over **30 million files and 25 billion tokens**, making it the largest open synthetic dataset to date.
Cosmopedia covers a variety of topics: we tried to map the world knowledge present in web datasets like RefinedWeb and RedPajama and to generate synthetic content that covers those topics. This is v0.1 of Cosmopedia, with ample room for improvement and for topics to be covered more comprehensively. We hope this dataset will help the community's research efforts in the increasingly intriguing domain of synthetic data.
*Figure: the topic clusters of Cosmopedia.*
You can also find a plot of the file frequency across individual topic clusters in `plots/topic_distpng.png`.
## Code structure
- `prompts`: the code for building the prompts from each `seed_data` source in Cosmopedia. In `web_samples`, you can also find pointers to the topic clustering we did. A hedged sketch of the prompt-building pattern is shown after this list.
- `generation`: the code to run large-scale synthetic generation with [llm-swarm](https://github.com/huggingface/llm-swarm) using the prompts you built. Cosmopedia consists of 25B tokens and was generated in more than 10k H100 GPU hours; a minimal client-side generation sketch also follows the list.
- `deduplication`: the script we used to run MinHash deduplication with [datatrove](https://github.com/huggingface/datatrove); the underlying idea is illustrated after the list.
- `decontamination`: the code we used to run n-gram decontamination against evaluation benchmarks before training models such as [cosmopedian-1b](https://huggingface.co/HuggingFaceTB/cosmopedian-1b) on the dataset; a small sketch of the check appears below.
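
The prompts pair a seed sample or topic with instructions about style and target audience. Below is a minimal, hypothetical sketch of that pattern; the template wording, audience list, and `build_prompt` helper are illustrative assumptions, not the actual templates in `prompts/`.

```python
# Hypothetical sketch of the prompt-building pattern; the real templates live in `prompts/`.
AUDIENCES = ["young children", "high school students", "college students", "researchers"]

TEXTBOOK_TEMPLATE = (
    "Here is an extract from a webpage: \"{seed}\".\n\n"
    "Write a long and very detailed course unit suitable for {audience} "
    "that is related to the extract above. Do not just summarize the extract; "
    "develop the topic in depth with examples."
)

def build_prompt(seed_text: str, audience: str = "college students") -> str:
    """Turn a web seed sample into a textbook-style generation prompt (illustrative only)."""
    return TEXTBOOK_TEMPLATE.format(seed=seed_text[:1000], audience=audience)

if __name__ == "__main__":
    print(build_prompt("Photosynthesis converts light energy into chemical energy...", AUDIENCES[2]))
```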
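On the generation side, llm-swarm launches text-generation-inference (TGI) instances and exposes an endpoint that can be queried asynchronously. The following is a minimal client-side sketch, assuming such an endpoint is already running; the URL, sampling parameters, and helper names are placeholders, not the repository's actual generation script.

```python
import asyncio
from huggingface_hub import AsyncInferenceClient

# Placeholder endpoint: llm-swarm prints the address of the load balancer it starts
# in front of the TGI instances; substitute that address here.
client = AsyncInferenceClient("http://localhost:8000")

async def generate_one(prompt: str) -> str:
    # Sampling parameters are illustrative, not the values used for Cosmopedia.
    return await client.text_generation(prompt, max_new_tokens=2048, temperature=0.6)

async def generate_all(prompts: list[str]) -> list[str]:
    # Fire all requests concurrently; the TGI instances batch them server-side.
    return await asyncio.gather(*(generate_one(p) for p in prompts))

if __name__ == "__main__":
    outputs = asyncio.run(generate_all(["Write a detailed course unit about photosynthesis."]))
    print(outputs[0][:300])
```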
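MinHash deduplication hashes each document's word n-grams with many seeded hash functions, keeps the minimum per seed as a signature, and treats documents with near-identical signatures as near-duplicates. The repository runs this at scale with datatrove; the snippet below only illustrates the underlying idea in plain Python and is not datatrove's pipeline API.

```python
import hashlib

def shingles(words: list[str], n: int = 5):
    """Yield word n-grams of a tokenized text as strings."""
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def minhash_signature(text: str, num_perm: int = 64, n: int = 5) -> list[int]:
    """One minimum hash per 'permutation' seed; similar texts share many signature slots."""
    grams = list(shingles(text.lower().split(), n))
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big") for g in grams)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots approximates the n-gram Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the old river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the old river bank again"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))  # high: true Jaccard is 10/12
```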
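N-gram decontamination flags training documents that share long word n-grams with evaluation benchmarks, so that benchmark content does not leak into the training set. A small sketch of the check, assuming in-memory lists of texts; the 13-gram size and the helper names are illustrative, not the repository's exact settings.

```python
def word_ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word n-grams of a text; n=13 is an illustrative choice."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(benchmark_texts: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Collect every benchmark n-gram into one lookup set."""
    index: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= word_ngrams(text, n)
    return index

def is_contaminated(doc: str, index: set[tuple[str, ...]], n: int = 13) -> bool:
    # Flag the document if any of its n-grams also appears in a benchmark.
    return not word_ngrams(doc, n).isdisjoint(index)

if __name__ == "__main__":
    benchmark = ["question text from an evaluation benchmark " * 4]
    training_docs = ["an unrelated synthetic textbook chapter " * 4,
                     "question text from an evaluation benchmark " * 4]
    index = build_benchmark_index(benchmark)
    kept = [d for d in training_docs if not is_contaminated(d, index)]
    print(len(kept))  # 1: the overlapping document is dropped
```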