https://github.com/zmedelis/hfds-clj
Access to HuggingFace datasets via Clojure
clojure datasets huggingface
- Host: GitHub
- URL: https://github.com/zmedelis/hfds-clj
- Owner: zmedelis
- Created: 2023-11-05T16:21:47.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-12-30T08:57:11.000Z (6 months ago)
- Last Synced: 2026-01-02T21:21:50.410Z (6 months ago)
- Topics: clojure, datasets, huggingface
- Language: Clojure
- Homepage:
- Size: 43 KB
- Stars: 13
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## 🪣 hfds-clj
[Clojars Project](https://clojars.org/io.github.zmedelis/hfds-clj)
**hfds-clj** is a library for accessing [HuggingFace datasets](https://huggingface.co/datasets) and [HuggingFace models](https://huggingface.co/models) from Clojure.
The lib provides seamless access to datasets via this process:
* *downloading* the HF dataset,
* *caching* the downloaded set locally, and
* *serving* it from there for subsequent requests.
It does not aim to replicate the full range of functionality found in the [HuggingFace datasets library](https://huggingface.co/docs/datasets/v2.14.5/en/index). Though as an immediate extension, it would be great to support [Dataset Features](https://huggingface.co/docs/datasets/v2.14.5/en/about_dataset_features).
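The download, cache, and serve steps listed above follow a common pattern that can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's implementation: `cached-fetch`, the `data.edn` file name, and the injected `fetch-fn` are all hypothetical names chosen for the example.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn cached-fetch
  "Returns the records for `dataset-name`. Calls `fetch-fn` only when no
  cached copy exists under `cache-dir`; otherwise reads the cached file.
  Hypothetical helper, not part of hfds-clj."
  [cache-dir dataset-name fetch-fn]
  (let [cache-file (io/file cache-dir dataset-name "data.edn")]
    (if (.exists cache-file)
      ;; serve: a cached copy exists, so read it back from disk
      (edn/read-string (slurp cache-file))
      ;; download: fetch the records, then cache them for later calls
      (let [records (vec (fetch-fn dataset-name))]
        (io/make-parents cache-file)
        (spit cache-file (pr-str records))
        records))))
```

Passing the fetch step in as a function keeps the sketch independent of any HTTP client; a second call with the same `cache-dir` and `dataset-name` never invokes `fetch-fn`.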
## Download datasets
### CLI
Datasets can be downloaded from the command line:
```
clojure -X:download-dataset :dataset "allenai/prosocial-dialog"
```
See the next section for a description of the parameters.
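The `-X:download-dataset` form relies on an alias defined in `deps.edn`. To run a similar command from another project, an alias along these lines might work. This is a sketch under assumptions: the alias name is copied from the command above, the `:exec-fn` is taken from the tool invocation shown later in this README, and the Git coordinate reuses the SHA from the tool installation command. None of it is confirmed against the repository's own `deps.edn`.

```clojure
;; deps.edn fragment (assumed layout, not copied from the repository)
{:aliases
 {:download-dataset
  {:extra-deps {io.github.zmedelis/hfds-clj
                {:git/url "https://github.com/zmedelis/hfds-clj"
                 :git/sha "4a84254030fceca8bf3f5e8dce4226b4b8cdf48a"}}
   ;; assumed entry point, taken from the tool example in this README
   :exec-fn hfds-clj.datasets/download-cli}}}
```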
### Code
```clojure
(require '[hfds-clj.datasets :refer [load-dataset]])
```
Download an HF dataset with this one-liner, whose single parameter is the dataset name as shown on the HF dataset page.
```clojure
(load-dataset "Anthropic/hh-rlhf")
```
A second call with the `Anthropic/hh-rlhf` parameter loads the dataset from the cache and returns a lazy sequence of all its records.
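Because the result is described as a lazy sequence, it can be inspected without realizing every record. The helper below is a minimal sketch: `preview` is a hypothetical name, and the example runs on a stand-in infinite sequence so that it works without the library or network access. It assumes the sequence returned by `load-dataset` behaves like an ordinary Clojure lazy seq.

```clojure
(defn preview
  "Returns the first `n` items of `records` as a vector, realizing only
  as much of a lazy sequence as needed. Hypothetical helper."
  [records n]
  (into [] (take n) records))

;; Stand-in for dataset records: an infinite lazy sequence of maps.
(preview (map #(hash-map :id %) (range)) 3)
;; => [{:id 0} {:id 1} {:id 2}]

;; With hfds-clj on the classpath the same helper could be applied as:
;; (preview (load-dataset "Anthropic/hh-rlhf") 3)
```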
A more fine-grained data set request is supported via a parameterized call:
```clojure
(load-dataset {:dataset "allenai/prosocial-dialog"
               :split   "train"
               :config  "default"
               :offset  0
               :length  100}
              {:hfds/download-mode :reuse-dataset-if-exists
               :hfds/cache-dir     "/data"
               :hfds/limit         4000})
```
## Download models
Models can be downloaded and stored on disk via this CLI call:
```
clojure -X:download-model :model '"nvidia/Gemma-2b-it-ONNX-INT4"' :hf-token "" :models-base-dir '"/tmp/models"'
```
## Usage as a Clojure tool
`hfds-clj` can also be used as a Clojure tool.
Install it as a tool (latest Git version):
```bash
clojure -Ttools install io.github.zmedelis/hfds-clj '{:git/url "https://github.com/zmedelis/hfds-clj" :git/sha "4a84254030fceca8bf3f5e8dce4226b4b8cdf48a"}' :as hfds-clj
```
Example of calling it as a tool to download a dataset:
```bash
clojure -Thfds-clj hfds-clj.datasets/download-cli :dataset "allenai/prosocial-dialog"
```
and a model:
```bash
clojure -Thfds-clj hfds-clj.models/download-cli \
  :models-base-dir '"/tmp/models"' \
  :model '"nvidia/Gemma-2b-it-ONNX-INT4"' \
  :revision '"4fe167cca69847b5218e7cc37c7e1984056cf340"' \
  :hf-token ""
```
* `:revision` is optional. If not specified, "main" (= latest) is used.
* `:hf-token` is optional. Most models are public, so no HuggingFace token is needed.
## Notes
* This is extracted from [Bosquet](https://github.com/zmedelis/bosquet) where HuggingFace datasets are used for LLM related developments.
* Thanks to [TrueGrit](https://github.com/KingMob/TrueGrit) for helping to fetch data robustly from the HF API.