An open API service indexing awesome lists of open source software.

https://github.com/LostRuins/datasetexplorer

Easily view and modify JSON datasets for large language models
https://github.com/LostRuins/datasetexplorer

Last synced: about 1 month ago
JSON representation

Easily view and modify JSON datasets for large language models

Awesome Lists containing this project

README

          

# Concedo's Dataset Explorer
Easily view and modify JSON and JSONL datasets for training large language models

![image](https://github.com/user-attachments/assets/db662879-dc61-4dbd-916e-fa5a9f325db8)

## Features
- Easily **view and modify JSON and JSONL datasets** for training large language models
- Supports **Alpaca (Instruct)**, **ShareGPT**, and **Text** formats (and more)
- Runs **fully portable** from your web browser, as a **single file with zero other dependencies**
- Browse through your training datasets, with easy search and filter functions to segment your data
- Supports **searching and filtering** with regex search or simple substrings search
- Filter multiple samples by **contents, length, matches, and number of turns**. Allows combining multiple queries for composite results.
- Includes an **N-gram viewer** to inspect selected examples for word frequency and repetition (word cloud)
- Allows **splitting and merging datasets** by selecting desired subsets with different criteria.
- Allows easy **dataset deduplication**
- Includes a simple inline editor to modify individual samples or correct typos.
- Pick individual samples or bulk-combine groups of them to curate your dataset, and **save the results as a new JSON dataset**
- Fast and efficient, comfortably handles small to medium sized datasets of up to 400 MB. For larger datasets, it's recommended to split them first.
- Fully open source, capable of running completely offline (just save the HTML file)

**Free and open source. Try now at [https://lostruins.github.io/datasetexplorer](https://lostruins.github.io/datasetexplorer)**

### Tips
- JSON > Parquet
- Alpaca > ChatML
- Kobo > !Kobo