https://github.com/NovaSky-AI/SkyThought
Sky-T1: Train your own O1 preview model within $450
- Host: GitHub
- URL: https://github.com/NovaSky-AI/SkyThought
- Owner: NovaSky-AI
- License: apache-2.0
- Created: 2025-01-09T21:37:37.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-01-17T04:28:01.000Z (10 months ago)
- Last Synced: 2025-01-18T05:46:13.047Z (10 months ago)
- Language: Python
- Homepage: https://novasky-ai.github.io/
- Size: 8.83 MB
- Stars: 1,830
- Watchers: 28
- Forks: 193
- Open Issues: 7
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- ai-game-devtools - SkyThought - T1: Train your own O1 preview model within $450. (Project List / LLM & Tool)
- awesome-llm-strawberry - NovaSky
- StarryDivineSky - NovaSky-AI/SkyThought - T1. The project uses low-cost GPU clusters, such as four RTX 3090s, for efficient model training. Sky-T1 is built on the Transformer architecture, focuses on long-text understanding and generation, and supports context lengths of up to one million tokens. The project includes detailed training scripts, data preprocessing pipelines, and model evaluation methods. SkyThought emphasizes reproducibility and low cost, lowering the barrier to training AI models. Following the provided guides, users can build their own datasets and train models tailored to their needs. The project aims to democratize large-model training so that more developers and researchers can take part in cutting-edge AI development. The Sky-T1 model is suited to a variety of tasks, including long-document summarization, code generation, and creative writing. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# SkyThought
[GitHub](https://github.com/NovaSky-AI/SkyThought) [X (Twitter)](https://x.com/NovaSkyAI) [Hugging Face](https://huggingface.co/NovaSky-AI) [Discord](https://discord.gg/kexQXy5yA3)
# News
- **[2025/02/21]** 🎉 We released S*: Test time scaling for code generation ([paper](https://arxiv.org/pdf/2502.14382), [code](https://github.com/NovaSky-AI/SkyThought/tree/main/skythought/test-time-scaling)), a simple and extensible test time scaling framework for code generation.
- **[2025/02/11]** 🎉 We released Sky-T1-7B ([model](https://huggingface.co/NovaSky-AI/Sky-T1-7B)) and Sky-T1-mini ([model](https://huggingface.co/NovaSky-AI/Sky-T1-mini)) to demonstrate the potential of RL in further enhancing a model's capability beyond distillation.
- **[2025/01/23]** ⚡️ We released Sky-T1-32B-Flash ([model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Flash), [data](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_preference_data_10k)) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
- **[2025/01/19]** 🎉 The [chat demo](http://164.152.23.196:3000/) for Sky-T1-32B-Preview is live! Please check it out!
- **[2025/01/10]** 🎉 We have released our Sky-T1-32B-Preview [model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Preview) and [data](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k) through [HuggingFace](https://huggingface.co/NovaSky-AI)!
# Links
- 📜 [Sky-T1-7B and Sky-T1-mini Blog Post](https://novasky-ai.github.io/posts/sky-t1-7B/)
- 📜 [Sky-T1-32B-Flash Blog Post](https://novasky-ai.github.io/posts/reduce-overthinking/)
- 📜 [Sky-T1-32B-Preview model Blog Post](https://novasky-ai.github.io/posts/sky-t1/)
- 🤗 [Sky-T1-32B-Preview model](https://huggingface.co/NovaSky-AI)
# Getting Started
We open-source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview; you can find more details in each directory.
- [`recipes`](./recipes/): Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash`, `Sky-T1-32B-Preview` and `Sky-T1-7B` series.
- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library. We provide a convenient CLI for evaluation as well as a `Scorer` API for scoring during data curation and training ([example](./examples/scoring.ipynb)).
- [`skythought/train`](./skythought/train/): Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training.
- [`skythought/skythought-rl`](./skythought/skythought-rl/): RL training code for Sky-T1-7B and Sky-T1-mini.
# Evaluation
## Usage
You can install the latest release from PyPI or from [source](#installing-from-source):
```shell
pip install skythought
```
### Installing from source
```shell
# Clone the repository
git clone https://github.com/NovaSky-AI/SkyThought.git
cd SkyThought
# Create and activate a virtual environment (using uv here)
uv venv --python 3.10
source .venv/bin/activate
# Install the package in editable mode
uv pip install -e .
```
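After installing from source, a quick sanity check can confirm the environment is set up. This is a minimal sketch; it assumes the editable install is importable as `skythought` and exposes the same `skythought` console entry point as the PyPI release:
```bash
# Confirm the package imports and the CLI entry point is on PATH
python -c "import skythought; print('skythought imported OK')"
command -v skythought
```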
Running evaluation is as simple as:
```bash
skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime24
```
We support a wide variety of datasets in mathematics, science and coding:
- AIME'24
- MATH500
- GPQADiamond
- MMLU
- ARC-Challenge
- OlympiadBench
- AMC'23
- TACO
- APPS
- LiveCodeBench
- MMLU Pro
- MinervaMath
- GSM8K
- AIME'25
For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [evaluation README](skythought/evals/README.md).
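To sweep several benchmarks with the same command, a simple shell loop works. This is only a sketch: apart from `aime24`, the task identifiers below are assumptions based on the dataset names, so check the [evaluation README](skythought/evals/README.md) for the exact values.
```bash
# Sketch: run the documented evaluate command over a list of tasks.
# Task names other than "aime24" are assumptions; see the evaluation README.
for task in aime24 math500 amc23; do
  skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task "$task"
done
```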
### Evaluation results
Below, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
|-----------------------|---------------------|--------|-------|------------|
| Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
| AIME2024 | 43.3 | 16.7 | 50.0 | 40.0 |
| LiveCodeBench-Easy | 86.3 | 84.6 | 90.7 | 92.9 |
| LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
| LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
| GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
#### Results on non-reasoning benchmarks
We also evaluate on non-reasoning benchmarks (instruction-following, QA, etc.) to test whether the model has traded off capability in other domains for better performance on reasoning-related benchmarks.
| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---------|-------------------|---------------------|-----------------|-------------------|
| MMLU (0 shot; no CoT) | **78.36** | 74.14 | 71.23 | [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) |
| MMLU (5 shot; no CoT) | 82.46 | **82.62** | 82.32 | [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) |
| ARC-C (0 shot; no CoT) | **49.49** | 49.4 | 49.66 | [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) |
| IFEval | 75.79 | **78.74** | 42.51 | [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) |
| LLM-as-a-Judge | 9.12 | **9.19** | 8.30 | [fastchat](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) |
| MGSM (0 shot; `direct`) | 33 | **42.3** | 19.07 | [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) |
| MGSM (8-shot; `direct`) | 58.4 | **61.47** | 58.5 | [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) |
| BFCL-v3 | 53.18 | **58.92** | 17.41 | [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) |
| Arena-Hard | **74.79** | 66.51 | 52.6 | [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) |
For more details, refer [here](./skythought/evals/base_instruct_evals.md).
# Fully Open-source: Driving Progress Together
We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, code, model weights) to enable the community to replicate and improve on our results *easily*:
| Model | Sky-T1-32B-Preview | STILL-2 | Journey | QwQ | o1 |
|---------------|--------------------|---------|---------|-----|----|
| Data          | ✅ | ✅ | ❌ | ❌ | ❌ |
| Code          | ✅ | ❌ | ❌ | ❌ | ❌ |
| Report        | ✅ | ✅ | ✅ | ❌ | ❌ |
| Math domain   | ✅ | ✅ | ✅ | ✅ | ✅ |
| Coding domain | ✅ | ❌ | ❌ | ✅ | ✅ |
| Model Weights | ✅ | ✅ | ❌ | ✅ | ❌ |
# Citation
Most of the code in this repository is described in the blog post cited below. Please consider citing this work if you find the repository helpful.
```bibtex
@misc{sky_t1_2025,
  author       = {NovaSky Team},
  title        = {Sky-T1: Train your own O1 preview model within $450},
  howpublished = {https://novasky-ai.github.io/posts/sky-t1},
  note         = {Accessed: 2025-01-09},
  year         = {2025}
}
```
# Acknowledgement
This work was done at the [Berkeley Sky Computing Lab](https://sky.cs.berkeley.edu/), with amazing compute support from [Lambda Labs](https://lambdalabs.com/service/gpu-cloud), [Anyscale](https://www.anyscale.com/), and [Databricks](https://www.databricks.com/). We would like to express our gratitude for the valuable academic feedback and support from the [Still-2 Team](https://arxiv.org/pdf/2412.09413) and Junyang Lin from the [Qwen Team](https://qwenlm.github.io/).