# PTS: Pivotal Token Search

A tool for discovering pivotal tokens in large language model generations and for creating DPO datasets and steering vectors from them.

## Features

- Identifies pivotal tokens in language model generations
- Supports various dataset formats, including GSM8k, MATH, and custom datasets
- Handles chain-of-thought reasoning output with `<think></think>` tags
- Extracts answers from common formats, such as GSM8k's `####` pattern and LaTeX's `\boxed{}` notation
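The answer-extraction behavior listed above can be sketched in a few lines. This is a minimal illustration of the two formats mentioned, not PTS's actual implementation; `extract_answer` is a hypothetical name.

```python
import re

def extract_answer(text: str):
    """Pull a final answer out of common formats (illustrative sketch).

    Handles GSM8k's '#### <answer>' convention and LaTeX's \\boxed{...}.
    """
    # GSM8k: the final answer follows '####'.
    m = re.search(r"####\s*(.+?)\s*$", text, flags=re.MULTILINE)
    if m:
        return m.group(1)
    # LaTeX: answer wrapped as \boxed{...} (nested braces not handled here).
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m:
        return m.group(1)
    return None

print(extract_answer("Natalia sold 48 clips...\n#### 72"))  # -> 72
```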
## What is Pivotal Token Search?

Pivotal Token Search (PTS) is a technique described in the [Phi-4 Technical Report](https://arxiv.org/abs/2412.08905) that identifies tokens in a language model's generation which significantly impact the probability of success on the task at hand. These "pivotal tokens" are decision points where the model's choice can dramatically alter the course of the solution.

Key features:
- Identifies tokens that significantly increase or decrease the probability of a successful generation
- Generates DPO (Direct Preference Optimization) pairs for fine-tuning
- Creates steering vectors for activation-based steering during inference

## Installation

```bash
git clone https://github.com/codelion/pts.git
cd pts
pip install -e .
```

## Quick Start

```bash
# Find pivotal tokens in a dataset and save them to a file
pts run --model="Qwen/Qwen3-0.6B" --dataset="codelion/optillmbench" --output-path="pivotal_tokens.jsonl"

# Convert pivotal tokens to a DPO dataset
pts export --input-path="pivotal_tokens.jsonl" --format="dpo" --output-path="dpo_dataset.jsonl" --model="Qwen/Qwen3-0.6B" --find-rejected-tokens

# Convert pivotal tokens to steering vectors
pts export --input-path="pivotal_tokens.jsonl" --format="steering" --output-path="steering_vectors.jsonl" --model="Qwen/Qwen3-0.6B"

# Push a dataset to Hugging Face (creates a README by default)
pts push --input-path="dpo_dataset.jsonl" --hf-repo="codelion/pts-dpo-dataset" --model="Qwen/Qwen3-0.6B"
```

## Try Now

| Use Case | Dataset | Link |
|----------|---------|------|
| Fine-tuning the model | DPO dataset | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1FggA9EQ1eFBjE0Qbsl0-EFzyWIxpdhlH?usp=sharing) |
| Optimizing inference | steering vectors | [optillm](https://github.com/codelion/optillm) |

You can also check out the [datasets](https://huggingface.co/datasets?other=pts) and [models](https://huggingface.co/models?other=pts) created with PTS. PTS was used for the `autothink` approach in [optillm](https://github.com/codelion/optillm), as described in this [paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327).
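The pivotal-token idea above can be illustrated with a toy sketch: estimate p(success) for each growing prefix by sampling completions and scoring them, then flag tokens where the estimate jumps past a threshold. Everything here (`estimate_success`, `find_pivotal_tokens`, the toy sampler) is a hypothetical illustration; the real tool queries an actual LLM.

```python
def estimate_success(prefix, sample_completions, is_correct, num_samples=10):
    """Monte-Carlo estimate of p(success | prefix): sample completions
    of the prefix and score each against the known answer."""
    completions = sample_completions(prefix, num_samples)
    return sum(is_correct(c) for c in completions) / num_samples

def find_pivotal_tokens(tokens, sample_completions, is_correct,
                        prob_threshold=0.2, num_samples=10):
    """Walk a generation token by token; a token is pivotal when appending
    it moves the estimated success probability by more than the threshold."""
    pivotal, prefix = [], ""
    prev = estimate_success(prefix, sample_completions, is_correct, num_samples)
    for tok in tokens:
        prefix += tok
        cur = estimate_success(prefix, sample_completions, is_correct, num_samples)
        if abs(cur - prev) > prob_threshold:
            pivotal.append((tok, prev, cur))  # token plus before/after probabilities
        prev = cur
    return pivotal

# Toy "model": completions succeed once the decisive token "A" is in the prefix.
toy_sampler = lambda prefix, n: ["ok" if "A" in prefix else "bad"] * n
print(find_pivotal_tokens(["x", "A", "y"], toy_sampler, lambda c: c == "ok"))
# -> [('A', 0.0, 1.0)]
```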
## Core Concepts

### Pivotal Tokens

A pivotal token significantly changes the probability of success when it appears in a model's generation. By identifying these tokens, we can:
1. Understand where the model makes critical decisions
2. Create preference pairs for DPO fine-tuning
3. Extract activation vectors for steering during inference

### DPO Datasets

PTS creates high-quality DPO datasets by isolating the specific token-level choices that lead to success or failure. This allows for more targeted and effective fine-tuning than training on entire sequences.

**Important:** When exporting to DPO format, you must provide a model via the `--model` parameter and enable the `--find-rejected-tokens` flag. This is necessary because DPO pairs require both a chosen token (the pivotal token that increases success probability) and a rejected token (an alternative token that decreases success probability).

### Steering Vectors

The activation patterns associated with pivotal tokens can be used to guide models during generation, encouraging them to follow successful reasoning paths.
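Activation steering of this kind is often built as a mean-difference vector over hidden states. The sketch below assumes activations have already been collected (e.g., one layer's hidden state at each pivotal token's position) and shows the general technique, not necessarily PTS's exact method.

```python
def steering_vector(good_acts, bad_acts):
    """Mean activation at success-raising pivotal tokens minus the mean
    at success-lowering ones. Each activation is a list of floats taken
    from one layer's hidden state at the pivotal token's position."""
    dim = len(good_acts[0])
    mean = lambda acts: [sum(a[i] for a in acts) / len(acts) for i in range(dim)]
    g, b = mean(good_acts), mean(bad_acts)
    return [gi - bi for gi, bi in zip(g, b)]

def apply_steering(hidden, steer, alpha=1.0):
    """Nudge a hidden state along the steering direction at inference time."""
    return [h + alpha * s for h, s in zip(hidden, steer)]

steer = steering_vector(good_acts=[[1.0, 0.0], [1.0, 2.0]], bad_acts=[[0.0, 0.0]])
print(steer)  # -> [1.0, 1.0]
```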
## Dataset Field Customization

Different datasets use different field names for questions and answers. PTS automatically detects the appropriate field names for common datasets, but you can also specify them manually:

```bash
pts run --model="Qwen/Qwen3-0.6B" --dataset="your-dataset" --query-key="question" --answer-key="answer"
```

For example:
- `codelion/optillmbench` uses "question" and "answer" fields
- Other datasets may use pairs like:
  - "instruction"/"output"
  - "problem"/"solution"
  - "prompt"/"canonical_solution"

If the keys are not specified, PTS attempts to detect the appropriate fields based on common naming patterns.

## Command Reference

### `pts run`

Find pivotal tokens in a dataset:

```bash
pts run --model="MODEL_NAME" --dataset="DATASET_NAME" [options]
```

Options:
- `--model`: Model to use for generation
- `--dataset`: Dataset to search (default: "codelion/optillmbench")
- `--config`: Dataset configuration name (if applicable, e.g., "main" for openai/gsm8k)
- `--split`: Dataset split to use (e.g., "train")
- `--output-path`: Path to save pivotal tokens (default: "pivotal_tokens.jsonl")
- `--query-key`: Key for the question/instruction field in the dataset (auto-detected if not specified)
- `--answer-key`: Key for the answer/output field in the dataset (auto-detected if not specified)
- `--prob-threshold`: Probability threshold for pivotal tokens (default: 0.2)
- `--temperature`: Sampling temperature (default: 0.6)
- `--top-p`: Top-p (nucleus) sampling parameter (default: 0.95)
- `--top-k`: Top-k sampling parameter (default: 20)
- `--min-p`: Min-p sampling parameter (default: 0.0)
- `--num-samples`: Number of samples for probability estimation (default: 10)
- `--max-examples`: Maximum number of dataset examples to process
- `--max-pairs`: Maximum number of pairs to generate (default: 1000)

### `pts export`

Export pivotal tokens to different formats:

```bash
pts export --input-path="TOKENS_PATH" --format="FORMAT" [options]
```

Options:
- `--input-path`: Path to the pivotal tokens file
- `--format`: Export format ("dpo" or "steering")
- `--output-path`: Path to save the exported data
- `--model`: Model to use for extracting steering vectors (required for the "steering" format, and for DPO export together with `--find-rejected-tokens`)
- `--find-rejected-tokens`: Find a rejected alternative for each pivotal token (required for the "dpo" format)
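To make the DPO export above concrete, here is what a single preference pair could look like, using the common prompt/chosen/rejected convention. The README does not show PTS's actual output schema, so the field names and helper are assumptions for illustration only.

```python
import json

def make_dpo_record(question, prefix, chosen_token, rejected_token):
    """One preference pair: an identical prompt up to the decision point,
    then a single diverging token on each side (hypothetical schema)."""
    return {
        "prompt": question + prefix,   # everything up to the pivotal position
        "chosen": chosen_token,        # token that raises p(success)
        "rejected": rejected_token,    # alternative that lowers it
    }

record = make_dpo_record(
    "Q: What is 6 * 7?\n",
    "Let me compute: 6 * 7 = ",
    chosen_token="42",
    rejected_token="36",
)
print(json.dumps(record))
```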
### `pts push`

Push a dataset to Hugging Face:

```bash
pts push --input-path="FILE_PATH" --hf-repo="USERNAME/REPO_NAME" [options]
```

Options:
- `--input-path`: Path to the file to push
- `--hf-repo`: Hugging Face repository name
- `--private`: Make the repository private (default: false)
- `--no-readme`: Skip creating a README file (a README is created by default)
- `--model`: Model name to include in the README (optional)

## Examples

### Finding Pivotal Tokens with OptillmBench

```bash
pts run --model="Qwen/Qwen3-0.6B" \
    --dataset="codelion/optillmbench" \
    --output-path="optillm_pivotal_tokens.jsonl" \
    --prob-threshold=0.2 \
    --temperature=0.6 \
    --top-p=0.95 \
    --top-k=20 \
    --min-p=0.0
```

### Working with a Custom Dataset

```bash
pts run --model="Qwen/Qwen3-0.6B" \
    --dataset="my-custom-dataset" \
    --query-key="input_text" \
    --answer-key="target_text" \
    --output-path="custom_pivotal_tokens.jsonl" \
    --prob-threshold=0.2 \
    --temperature=0.6 \
    --top-p=0.95 \
    --top-k=20 \
    --min-p=0.0
```

### Working with a Dataset Requiring Configuration

```bash
pts run --model="Qwen/Qwen3-0.6B" \
    --dataset="openai/gsm8k" \
    --config="main" \
    --split="train" \
    --output-path="gsm8k_pivotal_tokens.jsonl" \
    --prob-threshold=0.2 \
    --temperature=0.6 \
    --max-examples=10
```
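When `--query-key`/`--answer-key` are omitted, as in two of the examples above, PTS falls back to auto-detection. That idea can be approximated by checking known field-name pairs in priority order; this is a sketch, and PTS's actual detection logic may differ.

```python
# Common (query, answer) field-name pairs, tried in priority order.
FIELD_PAIRS = [
    ("question", "answer"),
    ("instruction", "output"),
    ("problem", "solution"),
    ("prompt", "canonical_solution"),
]

def detect_fields(example):
    """Guess which keys hold the question and the answer in one dataset row."""
    for query_key, answer_key in FIELD_PAIRS:
        if query_key in example and answer_key in example:
            return query_key, answer_key
    raise ValueError(f"could not detect fields among keys: {sorted(example)}")

print(detect_fields({"problem": "2+2?", "solution": "4", "id": 7}))
# -> ('problem', 'solution')
```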
### Creating a DPO Dataset

```bash
# First, find pivotal tokens
pts run --model="Qwen/Qwen3-0.6B" \
    --dataset="codelion/optillmbench" \
    --output-path="optillm_pivotal_tokens.jsonl" \
    --temperature=0.6 \
    --top-p=0.95 \
    --top-k=20 \
    --min-p=0.0

# Then export to DPO format - you MUST provide a model and the --find-rejected-tokens flag
pts export --input-path="optillm_pivotal_tokens.jsonl" \
    --format="dpo" \
    --output-path="optillm_dpo_dataset.jsonl" \
    --model="Qwen/Qwen3-0.6B" \
    --find-rejected-tokens \
    --min-prob-delta=0.1
```

### Extracting Steering Vectors

```bash
pts export --input-path="pivotal_tokens.jsonl" \
    --format="steering" \
    --output-path="steering_vectors.jsonl" \
    --model="Qwen/Qwen3-0.6B" \
    --layer-nums=19,23,27
```

## Citation

If you use this tool in your research, please cite:

```bibtex
@software{pts,
  title = {PTS: Pivotal Token Search},
  author = {Asankhaya Sharma},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/codelion/pts}
}
```