https://github.com/zmedelis/hfds-clj

Access to HuggingFace datasets via Clojure

## 🪣 hfds-clj

[![Clojars Project](https://img.shields.io/clojars/v/io.github.zmedelis/hfds-clj.svg)](https://clojars.org/io.github.zmedelis/hfds-clj)

**hfds-clj** is a library for accessing [HuggingFace datasets](https://huggingface.co/datasets) from Clojure. It provides seamless access to datasets in three steps:
* *downloading* the HF dataset,
* *caching* the downloaded set locally, and
* *serving* it from the cache for subsequent requests.
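
The download, cache, and serve steps above can be sketched in a few lines of plain Clojure. This is only an illustration of the pattern, not the library's implementation: `cached-load`, the `fetch-fn` argument, and the `data.edn` file layout are all hypothetical names chosen for this example.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn cached-load
  "Returns the records for `dataset-name`. Calls `fetch-fn` only when no
  cached copy exists under `cache-dir`. Hypothetical helper, not part of
  hfds-clj."
  [cache-dir dataset-name fetch-fn]
  (let [cache-file (io/file cache-dir dataset-name "data.edn")]
    (when-not (.exists cache-file)
      ;; Steps 1 and 2: download the records and write them to the cache.
      (io/make-parents cache-file)
      (spit cache-file (pr-str (vec (fetch-fn dataset-name)))))
    ;; Step 3: always serve the records from the local cache file.
    (edn/read-string (slurp cache-file))))
```

Calling `cached-load` twice with the same arguments invokes `fetch-fn` only once; the second call reads the cached file.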

It does not aim to replicate the full range of functionality found in the [HuggingFace datasets library](https://huggingface.co/docs/datasets/v2.14.5/en/index). An immediate extension worth adding, though, would be support for [Dataset Features](https://huggingface.co/docs/datasets/v2.14.5/en/about_dataset_features).

## Usage

### CLI

Datasets can be downloaded from the command line:
```
clojure -X:download :dataset "allenai/prosocial-dialog"
```
See the next section for a description of the parameters.

### Code

```clojure
(require '[hfds-clj.core :refer [load-dataset]])
```

Download an HF dataset with this one-liner, whose single parameter is the dataset name as shown on the HF dataset page.

```clojure
(load-dataset "Anthropic/hh-rlhf")
```
A second call with the `Anthropic/hh-rlhf` parameter loads the dataset from the cache and returns a lazy sequence of all the dataset records.
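
Because the result is a lazy sequence, it can be inspected without walking every record. The sketch below uses a hypothetical helper, `preview`, that is not part of hfds-clj; it works on any sequence, so it is shown here without assuming anything about the shape of the records.

```clojure
(defn preview
  "Returns the first `n` items of `records` as a vector, stopping as soon
  as `n` items have been collected. Hypothetical helper for illustration."
  [records n]
  (into [] (take n) records))

;; With hfds-clj on the classpath, usage would look like this
;; (the first call needs network access to download the dataset):
;; (preview (load-dataset "Anthropic/hh-rlhf") 3)
```

The `take` transducer stops the reduction early, so `preview` is safe even on very large or unbounded sequences.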

A more fine-grained data set request is supported via a parameterized call:

```clojure
;; First map: which part of the dataset to request.
(load-dataset {:dataset "allenai/prosocial-dialog"
               :split   "train"
               :config  "default"
               :offset  0
               :length  100}
              ;; Second map: download and cache options.
              {:hfds/download-mode :reuse-dataset-if-exists
               :hfds/cache-dir     "/data"
               :hfds/limit         4000})
```
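
The keys in the first map (`:dataset`, `:config`, `:split`, `:offset`, `:length`) have the same names as the query parameters of the Hugging Face dataset viewer `/rows` endpoint. As a rough sketch, and assuming the request map translates directly into such a query (this README does not confirm how hfds-clj builds its requests), the corresponding URL could be assembled as follows. `rows-url` is a hypothetical helper, not a library function.

```clojure
(require '[clojure.string :as str])

(defn rows-url
  "Builds a dataset viewer /rows URL from a request map.
  Hypothetical helper; the mapping to hfds-clj internals is an assumption."
  [{:keys [dataset config split offset length]}]
  (let [encode #(java.net.URLEncoder/encode (str %) "UTF-8")]
    (str "https://datasets-server.huggingface.co/rows?"
         (str/join "&"
                   (for [[k v] [["dataset" dataset]
                                ["config" config]
                                ["split" split]
                                ["offset" offset]
                                ["length" length]]]
                     ;; URL-encode each value, e.g. "/" becomes "%2F".
                     (str k "=" (encode v)))))))

(rows-url {:dataset "allenai/prosocial-dialog"
           :split   "train"
           :config  "default"
           :offset  0
           :length  100})
```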

## Notes

* This library was extracted from [Bosquet](https://github.com/zmedelis/bosquet), where HuggingFace datasets are used for LLM-related development.
* Thanks to [TrueGrit](https://github.com/KingMob/TrueGrit) for helping to robustly fetch data from the HF API.