https://github.com/LostRuins/datasetexplorer
Easily view and modify JSON datasets for large language models
https://github.com/LostRuins/datasetexplorer
Last synced: about 1 month ago
JSON representation
Easily view and modify JSON datasets for large language models
- Host: GitHub
- URL: https://github.com/LostRuins/datasetexplorer
- Owner: LostRuins
- License: agpl-3.0
- Created: 2024-08-20T13:26:48.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-27T13:18:59.000Z (about 1 year ago)
- Last Synced: 2024-08-27T18:48:40.188Z (about 1 year ago)
- Language: HTML
- Size: 72.3 KB
- Stars: 42
- Watchers: 2
- Forks: 6
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Concedo's Dataset Explorer
Easily view and modify JSON and JSONL datasets for training large language models
## Features
- Easily **view and modify JSON and JSONL datasets** for training large language models
- Supports **Alpaca (Instruct)**, **ShareGPT**, and **Text** formats (and more)
- Runs **fully portable** from your web browser, as a **single file with zero other dependencies**
- Browse through your training datasets, with easy search and filter functions to segment your data
- Supports **searching and filtering** with regex search or simple substrings search
- Filter multiple samples by **contents, length, matches, and number of turns**. Allows combining multiple queries for composite results.
- Includes an **N-gram viewer** to inspect selected examples for word frequency and repetition (word cloud)
- Allows **splitting and merging datasets** by selecting desired subsets with different criteria.
- Allows easy **dataset deduplication**
- Includes a simple inline editor to modify individual samples or correct typos.
- Pick individual samples or bulk-combine groups of them to curate your dataset, and **save the results as a new JSON dataset**
- Fast and efficient, comfortably handles small to medium sized datasets of up to 400 MB. For larger datasets, it's recommended to split them first.
- Fully open source, capable of running completely offline (just save the HTML file)**Free and open source. Try now at [https://lostruins.github.io/datasetexplorer](https://lostruins.github.io/datasetexplorer)**
### Tips
- JSON > Parquet
- Alpaca > ChatML
- Kobo > !Kobo