# llm-dataset-converter
For converting large language model (LLM) datasets from one format into another.
Filters can be supplied as well, e.g., for cleaning up the data.

## Installation
Via PyPI:
```bash
pip install llm-dataset-converter
```

The latest code straight from the repository:
```bash
pip install git+https://github.com/waikato-llm/llm-dataset-converter.git
```

## Docker
[Docker](https://github.com/waikato-llm/llm-dataset-converter-all/tree/main/docker) images are available from:
* Docker hub: [waikatodatamining/llm-dataset-converter](https://hub.docker.com/r/waikatodatamining/llm-dataset-converter)
* In-house registry: `public.aml-repo.cms.waikato.ac.nz:443/tools/llm-dataset-converter`

## Datasets
The following repository contains a curated list of datasets for LLMs:
https://github.com/Zjh-819/LLMDataHub
The Hugging Face Hub has an abundance of datasets as well:
https://huggingface.co/datasets
## Dataset formats
The following dataset formats are supported:

| Domain | Format | Read | Write | Compression |
|:---------------|:--------------------------------------------------------------------------|:----------------------------------------------------|:------------------------------------------------| :---: |
| classification | CSV | [from-csv-cl](plugins/from-csv-cl.md) | [to-csv-cl](plugins/to-csv-cl.md) | Y |
| classification | [Jsonlines](https://jsonlines.org/) | [from-jsonlines-cl](plugins/from-jsonlines-cl.md) | [to-jsonlines-cl](plugins/to-jsonlines-cl.md) | Y |
| classification | [Parquet](https://arrow.apache.org/docs/python/parquet.html) | [from-parquet-cl](plugins/from-parquet-cl.md) | [to-parquet-cl](plugins/to-parquet-cl.md) | N |
| classification | TSV | [from-tsv-cl](plugins/from-tsv-cl.md) | [to-tsv-cl](plugins/to-tsv-cl.md) | Y |
| pairs | [Alpaca](https://github.com/tatsu-lab/stanford_alpaca#data-release) | [from-alpaca](plugins/from-alpaca.md) | [to-alpaca](plugins/to-alpaca.md) | Y |
| pairs | CSV | [from-csv-pr](plugins/from-csv-pr.md) | [to-csv-pr](plugins/to-csv-pr.md) | Y |
| pairs | [Jsonlines](https://jsonlines.org/) | [from-jsonlines-pr](plugins/from-jsonlines-pr.md) | [to-jsonlines-pr](plugins/to-jsonlines-pr.md) | Y |
| pairs | [Parquet](https://arrow.apache.org/docs/python/parquet.html) | [from-parquet-pr](plugins/from-parquet-pr.md) | [to-parquet-pr](plugins/to-parquet-pr.md) | N |
| pairs | TSV | [from-tsv-pr](plugins/from-tsv-pr.md) | [to-tsv-pr](plugins/to-tsv-pr.md) | Y |
| pairs | [XTuner](https://github.com/InternLM/xtuner/blob/v0.1.13/docs/en/user_guides/dataset_format.md#single-turn-dialogue-dataset-format) | [from-xtuner](plugins/from-xtuner.md) | [to-xtuner](plugins/to-xtuner.md) | Y |
| pretrain | CSV | [from-csv-pt](plugins/from-csv-pt.md) | [to-csv-pt](plugins/to-csv-pt.md) | Y |
| pretrain | [Jsonlines](https://jsonlines.org/) | [from-jsonlines-pt](plugins/from-jsonlines-pt.md) | [to-jsonlines-pt](plugins/to-jsonlines-pt.md) | Y |
| pretrain | [Parquet](https://arrow.apache.org/docs/python/parquet.html) | [from-parquet-pt](plugins/from-parquet-pt.md) | [to-parquet-pt](plugins/to-parquet-pt.md) | N |
| pretrain | TSV | [from-tsv-pt](plugins/from-tsv-pt.md) | [to-tsv-pt](plugins/to-tsv-pt.md) | Y |
| pretrain       | TXT                                                                         | [from-txt-pt](plugins/from-txt-pt.md)                | [to-txt-pt](plugins/to-txt-pt.md)                | Y ¹ |
| translation | CSV | [from-csv-t9n](plugins/from-csv-t9n.md) | [to-csv-t9n](plugins/to-csv-t9n.md) | Y |
| translation    | [Jsonlines](https://jsonlines.org/) ²                                       | [from-jsonlines-t9n](plugins/from-jsonlines-t9n.md)  | [to-jsonlines-t9n](plugins/to-jsonlines-t9n.md)  | Y |
| translation    | [Parquet](https://arrow.apache.org/docs/python/parquet.html) ³              | [from-parquet-t9n](plugins/from-parquet-t9n.md)      | [to-parquet-t9n](plugins/to-parquet-t9n.md)      | N |
| translation | TSV | [from-tsv-t9n](plugins/from-tsv-t9n.md) | [to-tsv-t9n](plugins/to-tsv-t9n.md) | Y |
| translation    | TXT                                                                         | [from-txt-t9n](plugins/from-txt-t9n.md)              | [to-txt-t9n](plugins/to-txt-t9n.md)              | Y ¹ |

¹ Compression is not available when concatenating the content into a single file.
² Format defined [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/README.md).
³ The translation data itself is stored as a JSON dictionary.
## Compression formats

If a format supports compression, then the following compression formats
are automatically supported for loading/saving files:

* [bzip2](https://en.wikipedia.org/wiki/Bzip2): `.bz2`
* [gzip](https://en.wikipedia.org/wiki/Gzip): `.gz`
* [xz](https://en.wikipedia.org/wiki/XZ_Utils): `.xz`
* [zstd](https://en.wikipedia.org/wiki/Zstd): `.zst`, `.zstd`
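Since the compression is tied to the file extension, using e.g. a `.gz` or `.xz`
suffix on the output file should be enough to produce compressed data. A minimal
sketch, assuming the reader/writer plugins accept `--input`/`--output` options
(check the respective plugin help screens; the file names are made up):

```bash
# read gzip-compressed Alpaca data and write xz-compressed Alpaca data;
# the compression is inferred from the file extensions
llm-convert \
  from-alpaca \
    --input data/train.json.gz \
  to-alpaca \
    --output data/train-copy.json.xz
```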
## File encodings

Most readers offer the `--encoding` option to override the automatically determined
file encoding, as the detection can be wrong since only a fixed number of bytes is
inspected. The number of bytes inspected can be influenced via the following
environment variable:

```
LDC_ENCODING_MAX_CHECK_LENGTH
```

A value of `-1` means the complete file gets inspected. However, that can be very
slow, so a smaller value (less than 1 MB) is recommended.
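For example, to cap the inspection at 64 KB per file (the value here is arbitrary):

```bash
# inspect at most 64 KB per file when auto-detecting encodings
export LDC_ENCODING_MAX_CHECK_LENGTH=65536
```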
## Tools
### Dataset conversion
```
usage: llm-convert [-h|--help|--help-all|-help-plugin NAME] [-u INTERVAL]
[-c {None,bz2,gz,xz,zstd}]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
reader
[filter [filter [...]]]
                   [writer]

Tool for converting between large language model (LLM) dataset formats.

readers (20):
from-alpaca, from-csv-cl, from-csv-pr, from-csv-pt, from-csv-t9n,
from-jsonlines-cl, from-jsonlines-pr, from-jsonlines-pt,
from-jsonlines-t9n, from-parquet-cl, from-parquet-pr,
from-parquet-pt, from-parquet-t9n, from-tsv-cl, from-tsv-pr,
from-tsv-pt, from-tsv-t9n, from-txt-pt, from-txt-t9n, from-xtuner
filters (38):
assemble-sentences, change-case, classification-label-map,
file-filter, find-substr, inspect, keyword, language,
llama2-to-pairs, max-length-pt, max-records, metadata,
metadata-from-name, pairs-to-llama2, pairs-to-pretrain,
pretrain-sentences-to-classification, pretrain-sentences-to-pairs,
randomize-records, record-files, record-window, remove-blocks,
remove-empty, remove-patterns, replace-patterns, require-languages,
reset-ids, sentences-pt, skip-duplicate-ids, skip-duplicate-text,
split-pt, split-records, tee, text-length, text-stats,
to-llama2-format, translation-to-pairs, translation-to-pretrain,
update-pair-data
writers (20):
to-alpaca, to-csv-cl, to-csv-pr, to-csv-pt, to-csv-t9n,
to-jsonlines-cl, to-jsonlines-pr, to-jsonlines-pt, to-jsonlines-t9n,
to-parquet-cl, to-parquet-pr, to-parquet-pt, to-parquet-t9n,
to-tsv-cl, to-tsv-pr, to-tsv-pt, to-tsv-t9n, to-txt-pt, to-txt-t9n,
   to-xtuner

optional arguments:
-h, --help show basic help message and exit
--help-all show basic help message plus help on all plugins and exit
--help-plugin NAME show help message for plugin NAME and exit
-u INTERVAL, --update_interval INTERVAL
outputs the progress every INTERVAL records (default: 1000)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
the logging level to use (default: WARN)
-c {None,bz2,gz,xz,zstd}, --compression {None,bz2,gz,xz,zstd}
the type of compression to use when only providing an output
directory to the writer (default: None)
-b, --force_batch processes the data in batches
-U, --unescape_unicode unescape unicode characters in the command-line
```
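A minimal invocation sketch (the file names are hypothetical and the
reader/writer options are assumptions, check the respective plugin help screens):

```bash
# convert Alpaca-formatted pairs to jsonlines, reporting progress
# every 100 records
llm-convert \
  -u 100 \
  from-alpaca \
    --input alpaca_data.json \
  to-jsonlines-pr \
    --output pairs.jsonl
```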
### Download

```
usage: llm-download [-h|--help|--help-all|-help-plugin NAME]
                    downloader

Tool for downloading data for large language models (LLMs).

downloaders:
   huggingface

optional arguments:
-h, --help show basic help message and exit
--help-all show basic help message plus help on all plugins and exit
--help-plugin NAME show help message for plugin NAME and exit
```

### Combining multiple files (one-after-the-other)
```
usage: llm-append [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]]
[-t {csv,json,jsonlines,plain-text,tsv}] [-o FILE] [-p]
                  [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for combining multiple text files by appending them.

optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to append; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the data files to
append (default: None)
-t {csv,json,jsonlines,plain-text,tsv}, --file_type {csv,json,jsonlines,plain-text,tsv}
The type of files that are being processed. (default:
plain-text)
-o FILE, --output FILE
The path of the file to store the combined data in;
outputs it to stdout if omitted or a directory
(default: None)
-p, --pretty_print Whether to output the JSON in more human-readable
format. (default: False)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
```
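For instance, to concatenate all plain-text files matching a glob into a single
file (the paths are hypothetical):

```bash
# append all .txt files into one file
llm-append \
  -i "corpus/*.txt" \
  -t plain-text \
  -o corpus-combined.txt
```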
### Combining multiple files (side-by-side)

```
usage: llm-paste [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]] [-o FILE]
                 [-s [SEP [SEP ...]]] [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]

Tool for combining multiple text files by placing them side-by-side.

optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to combine; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the data files to
combine (default: None)
-o FILE, --output FILE
The path of the file to store the combined data in;
outputs it to stdout if omitted or a directory
(default: None)
-s [SEP [SEP ...]], --separator [SEP [SEP ...]]
The separators to use between the files; uses TAB if
not supplied; use '{T}' as placeholder for tab
(default: None)
-l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
The logging level to use (default: WARN)
```
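For example, pairing up a source and a target file line by line (the file names
are hypothetical):

```bash
# place the two files side-by-side; a tab is used as the separator
# when -s is omitted ('{T}' would be the explicit placeholder for tab)
llm-paste \
  -i source.txt target.txt \
  -o parallel.tsv
```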
### File encodings

The following tool allows you to determine the encoding of text files.
```
usage: llm-file-encoding [-h] [-i [INPUT [INPUT ...]]]
[-I [INPUT_LIST [INPUT_LIST ...]]]
[-m MAX_CHECK_LENGTH] [-o FILE]
                         [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for determining the file encoding of text files.

optional arguments:
-h, --help show this help message and exit
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
Path to the text file(s) to check; glob syntax is
supported (default: None)
-I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
Path to the text file(s) listing the actual files to
check (default: None)
-m MAX_CHECK_LENGTH, --max_check_length MAX_CHECK_LENGTH
                        The maximum number of bytes to use for checking
(default: None)
-o FILE, --output FILE
The path of the file to store the determined encodings
in; outputs it to stdout if omitted or a directory
(default: None)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
```
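For example (the glob and output file are hypothetical):

```bash
# determine the encodings of all .txt files and store them in a file
llm-file-encoding \
  -i "data/*.txt" \
  -o encodings.txt
```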
### Locating files

Readers tend to support input via file lists. The `llm-find` tool can generate
these.

```
usage: llm-find [-h] -i DIR [DIR ...] [-r] -o FILE [-m [REGEXP [REGEXP ...]]]
[-n [REGEXP [REGEXP ...]]]
[--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]]
[--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]]
[--split_name_separator SPLIT_NAME_SEPARATOR]
                [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]

Tool for locating files in directories that match certain patterns and store
them in files.

optional arguments:
-h, --help show this help message and exit
-i DIR [DIR ...], --input DIR [DIR ...]
The dir(s) to scan for files. (default: None)
-r, --recursive Whether to search the directories recursively
(default: False)
-o FILE, --output FILE
The file to store the located file names in (default:
None)
-m [REGEXP [REGEXP ...]], --match [REGEXP [REGEXP ...]]
The regular expression that the (full) file names must
match to be included (default: None)
-n [REGEXP [REGEXP ...]], --not-match [REGEXP [REGEXP ...]]
The regular expression that the (full) file names must
match to be excluded (default: None)
--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]
The split ratios to use for generating the splits
(int; must sum up to 100) (default: None)
--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]
The split names to use as filename suffixes for the
generated splits (before .ext) (default: None)
--split_name_separator SPLIT_NAME_SEPARATOR
The separator to use between file name and split name
(default: -)
-l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
The logging level to use (default: WARN)
```
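A sketch that generates 80/20 train/test file lists (the directory and file
names are made up):

```bash
# recursively locate .txt files and split them into two file lists;
# with the default separator this should produce files-train.txt
# and files-test.txt
llm-find \
  -i ./data \
  -r \
  -m ".*\.txt$" \
  --split_ratios 80 20 \
  --split_names train test \
  -o files.txt
```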
### Generating help screens for plugins

```
usage: llm-help [-h] [-c [PACKAGE [PACKAGE ...]]] [-e EXCLUDED_CLASS_LISTERS]
[-p NAME] [-f FORMAT] [-L INT] [-o PATH] [-i FILE] [-t TITLE]
                [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for outputting help for plugins in various formats.

optional arguments:
-h, --help show this help message and exit
-c [PACKAGE [PACKAGE ...]], --custom_class_listers [PACKAGE [PACKAGE ...]]
The names of the custom class listers, uses the
default ones if not provided. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
(default: None)
-p NAME, --plugin_name NAME
The name of the plugin to generate the help for,
generates it for all if not specified (default: None)
-f FORMAT, --help_format FORMAT
The output format to generate (default: text)
-L INT, --heading_level INT
The level to use for the heading (default: 1)
-o PATH, --output PATH
The directory or file to store the help in; outputs it
to stdout if not supplied; if pointing to a directory,
automatically generates file name from plugin name and
help format (default: None)
-i FILE, --index_file FILE
The file in the output directory to generate with an
overview of all plugins, grouped by type (in markdown
format, links them to the other generated files)
(default: None)
-t TITLE, --index_title TITLE
The title to use in the index file (default: llm-
dataset-converter plugins)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
```
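For example, to generate one help file per plugin plus an index (assuming
`markdown` is a valid value for `-f`, as the `-i` description suggests):

```bash
# write markdown help files for all plugins into the plugins directory,
# plus a README.md index that groups them by type
llm-help \
  -f markdown \
  -o plugins \
  -i README.md
```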
### Plugin registry

```
usage: llm-registry [-h] [-c CUSTOM_CLASS_LISTERS] [-e EXCLUDED_CLASS_LISTERS]
                    [-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}]

For inspecting/querying the registry.

optional arguments:
-h, --help show this help message and exit
-c CUSTOM_CLASS_LISTERS, --custom_class_listers CUSTOM_CLASS_LISTERS
The comma-separated list of custom class listers to
use. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
(default: None)
-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}, --list {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}
For outputting various lists on stdout. (default:
None)
```
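For example:

```bash
# list all available readers on stdout
llm-registry -l readers
```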
## Plugins

See [here](plugins/README.md) for an overview of all plugins.
## Examples
You can find examples for using the library (command-line and code) here:
https://waikato-llm.github.io/llm-dataset-converter-examples/
## Additional libraries
* [Audio transcription using faster-whisper](https://github.com/waikato-llm/ldc-faster-whisper)
* [Google integration](https://github.com/waikato-llm/ldc-google)
* [HTML handling](https://github.com/waikato-llm/ldc-html)
* [MS Word .doc integration](https://github.com/waikato-llm/ldc-doc)
* [MS Word .docx integration](https://github.com/waikato-llm/ldc-docx)
* [OpenAI integration](https://github.com/waikato-llm/ldc-openai)
* [PDF handling](https://github.com/waikato-llm/ldc-pdf)
* [TinT](https://github.com/waikato-llm/ldc-tint)

## Class listers
The *llm-dataset-converter* uses the *class lister registry* provided
by the [seppl](https://github.com/waikato-datamining/seppl) library.

Each module defines a function, typically called `list_classes`, that returns
a dictionary mapping superclass names to lists of modules that should be
scanned for derived classes. Here is an example:

```python
from typing import List, Dict


def list_classes() -> Dict[str, List[str]]:
return {
"ldc.api.Downloader": [
"mod.ule1",
],
"ldc.api.Reader": [
"mod.ule2",
"mod.ule3",
],
"ldc.api.Filter": [
"mod.ule4",
],
"seppl.io.Writer": [
"mod.ule5",
],
}
```

Such a class lister gets referenced in the `entry_points` section of the `setup.py` file:
```python
entry_points={
"class_lister": [
"unique_string=module_name:function_name",
],
},
```

The `:function_name` suffix can be omitted if the function is called `list_classes`.
The following environment variables can be used to influence the class listers:
* `LDC_CLASS_LISTERS`
* `LDC_CLASS_LISTERS_EXCL`

Each variable is a comma-separated list of `module_name:function_name` entries,
defining the class listers to use or to exclude, respectively.
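A sketch with made-up module names:

```bash
# only scan the class listers defined by this module
export LDC_CLASS_LISTERS=mylib.registry:list_classes

# additionally exclude a specific class lister
export LDC_CLASS_LISTERS_EXCL=otherlib.registry:list_classes
```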