Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zmedelis/hfds-clj
Access to HuggingFace datasets via Clojure
- Host: GitHub
- URL: https://github.com/zmedelis/hfds-clj
- Owner: zmedelis
- Created: 2023-11-05T16:21:47.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-11T19:54:41.000Z (about 1 year ago)
- Last Synced: 2023-12-12T15:24:51.632Z (about 1 year ago)
- Topics: clojure, datasets, huggingface
- Language: Clojure
- Homepage:
- Size: 27.3 KB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## 🪣 hfds-clj
[![Clojars Project](https://img.shields.io/clojars/v/io.github.zmedelis/hfds-clj.svg)](https://clojars.org/io.github.zmedelis/hfds-clj)
**hfds-clj** is a library that helps you work with [HuggingFace datasets](https://huggingface.co/datasets) from Clojure. It provides seamless access to datasets via this process:
* *downloading* the HF dataset,
* *caching* the downloaded set locally, and
* *serving* it from there for subsequent requests.

It does not aim to replicate the full range of functionality found in the [HuggingFace datasets library](https://huggingface.co/docs/datasets/v2.14.5/en/index). As an immediate extension, though, it would be great to support [Dataset Features](https://huggingface.co/docs/datasets/v2.14.5/en/about_dataset_features).
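The download, cache, and serve cycle described above can be pictured with a small sketch. This is not hfds-clj's implementation: `cache-file`, `load-cached`, and `fake-fetch` are hypothetical names made up for illustration, the cache is a single EDN file per dataset, and the network download is replaced by a stub so the example runs offline. It also returns a plain vector rather than a lazy sequence.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn cache-file
  "Hypothetical layout: one EDN file per dataset name under cache-dir."
  [cache-dir dataset]
  (io/file cache-dir (str dataset ".edn")))

(defn load-cached
  "Returns the records for dataset. Calls fetch-fn only when no cache file
  exists yet, then always serves the records from the file."
  [cache-dir dataset fetch-fn]
  (let [f (cache-file cache-dir dataset)]
    (when-not (.exists f)
      (io/make-parents f)                        ; dataset names contain a slash
      (spit f (pr-str (vec (fetch-fn dataset))))) ; download step, then cache
    (edn/read-string (slurp f))))                ; serve from the local copy

;; Stand-in for a real download; counts how often it is invoked.
(def calls (atom 0))

(defn fake-fetch [dataset]
  (swap! calls inc)
  [{:dataset dataset :text "hello"}])

;; A fresh directory under the system temp dir, so the first call is a miss.
(def tmp-dir
  (str (io/file (System/getProperty "java.io.tmpdir")
                (str "hfds-sketch-" (System/nanoTime)))))

(def first-load  (load-cached tmp-dir "demo/set" fake-fetch)) ; miss: fetch and write
(def second-load (load-cached tmp-dir "demo/set" fake-fetch)) ; hit: read the file
```

After both calls, `fake-fetch` has run only once, which is the behavior the cache is meant to provide.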
## Usage
### CLI
Datasets can be downloaded from the command line:
```
clojure -X:download :dataset "allenai/prosocial-dialog"
```
See the next section for a description of the parameters.

### Code
```clojure
(require '[hfds-clj.core :refer [load-dataset]])
```

Download HF datasets with this one-liner, where the single parameter is the dataset name as shown on the HF dataset page.
```clojure
(load-dataset "Anthropic/hh-rlhf")
```
A second call with the `Anthropic/hh-rlhf` parameter will load the dataset from the cache and return a lazy sequence of all the dataset records.

A more fine-grained dataset request is supported via a parameterized call:
```clojure
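;; First map: which part of the dataset to request.
(load-dataset {:dataset "allenai/prosocial-dialog"
               :split   "train"
               :config  "default"
               :offset  0
               :length  100}
              ;; Second map: hfds-clj options (download mode, cache dir, limit).
              {:hfds/download-mode :reuse-dataset-if-exists
               :hfds/cache-dir     "/data"
               :hfds/limit         4000})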
```

## Notes
* This is extracted from [Bosquet](https://github.com/zmedelis/bosquet), where HuggingFace datasets are used for LLM-related development.
* Thanks to [TrueGrit](https://github.com/KingMob/TrueGrit) for helping to robustly fetch data from the HF API.