https://github.com/xtdb/xtdb-kaggle

A small XTDB utility to download CSV datasets from Kaggle and turn them into XTDB transaction operations.
clojure kaggle xtdb

Host: GitHub
URL: https://github.com/xtdb/xtdb-kaggle
Owner: xtdb
License: mit
Created: 2020-06-03T11:34:21.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2021-10-03T13:04:17.000Z (over 4 years ago)
Last Synced: 2025-04-17T01:36:55.305Z (10 months ago)
Topics: clojure, kaggle, xtdb
Language: Clojure
Homepage:
Size: 13.7 KB
Stars: 3
Watchers: 14
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.adoc
- License: LICENSE

README

= XTDB Kaggle

A small XTDB utility to download CSV datasets from https://kaggle.com[Kaggle] and turn them into XTDB transaction operations.

At the moment, it's only got a transformer for one dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata.

If you do implement transformers for others, please do submit them as PRs!

== Setup

`xtdb-kaggle` is a REPL-based tool at the moment. To get set up:

* Clone the repo

* Get yourself a Kaggle API key file - create an account, head to your account settings and download an API key JSON file

* Set a `KAGGLE_KEY_FILE` environment variable pointing to the key file

* Start a REPL and connect to it in your usual way.
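Once the REPL is up, it can help to confirm that the process actually sees the key file. The following is a minimal sketch; `key-file-status` is a hypothetical helper written for this check and is not part of `xtdb-kaggle`:

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper, not part of xtdb-kaggle.
(defn key-file-status
  "Returns :unset when no path is given, :missing when the path does not
  point at a regular file, and :ok otherwise."
  [path]
  (cond
    (nil? path)              :unset
    (.isFile (io/file path)) :ok
    :else                    :missing))

;; Check the environment variable that the setup steps ask for.
(key-file-status (System/getenv "KAGGLE_KEY_FILE"))
```

Anything other than `:ok` suggests the variable was not exported in the shell that launched the REPL, or that it points at the wrong path.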

Then, find yourself an interesting dataset on https://kaggle.com[Kaggle].

You need to tell `xtdb-kaggle` which files you'd like to download, and then how to turn each file into XTDB operations - this is done using multimethods.

Using the movie dataset above as an example, we have an `:owner-slug` of `"tmdb"`, a `:dataset-slug` of `"tmdb-movie-metadata"`, and two files: `"tmdb_5000_movies.csv"` and `"tmdb_5000_credits.csv"`.

We define `dataset-file-names` to specify the files, and one instance of `csv-row->ops-fn` for each file:

[source,clojure]
----
(defmethod dataset-file-names ["tmdb" "tmdb-movie-metadata"] [_]
  #{"tmdb_5000_movies.csv" "tmdb_5000_credits.csv"})

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_movies.csv"] [_]
  (fn [{:strs [id title runtime budget revenue keywords genres] :as row}]
    [[::xt/put {:xt/id (keyword (name 'tmdb.movie) id)
                :tmdb/type :movie
                :tmdb.movie/id (Long/parseLong id)
                :tmdb.movie/title title
                :tmdb.movie/budget (some-> budget Long/parseLong)
                :tmdb.movie/revenue (some-> revenue Long/parseLong)
                :tmdb.movie/keywords (->> (json/read-value keywords)
                                          (into #{} (map #(get % "name"))))
                :tmdb.movie/genres (->> (json/read-value genres)
                                        (into #{} (map #(get % "name"))))}]]))

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_credits.csv"] [_]
  (fn [{:strs [movie_id cast] :as row}]
    (let [movie-id (Long/parseLong movie_id)]
      (->> (for [{cast-name "name", :strs [credit_id id character]} (json/read-value cast)]
             [[::xt/put {:xt/id (keyword (name 'tmdb.cast) (str id))
                         :tmdb/type :cast
                         :tmdb.cast/id id
                         :tmdb.cast/name cast-name}]
              [::xt/put {:xt/id (keyword (name 'tmdb.credit) credit_id)
                         :tmdb/type :credit
                         :tmdb.movie/id movie-id
                         :tmdb.cast/id id
                         :tmdb.cast/character character}]])
           (apply concat)))))
----
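To try the same pattern without the library loaded, the sketch below defines stand-in multimethods and a transformer for a made-up dataset. Everything in it is an assumption for illustration: the names `example-file-names` and `example-row->ops-fn`, the map-based dispatch functions, the `:file-name` key, and the `"acme"`/`"cities"` dataset are all hypothetical, and the real `defmulti` definitions in `xtdb-kaggle` may dispatch differently.

```clojure
;; Stand-in multimethods with hypothetical names; the real definitions
;; live in xtdb-kaggle and may differ.
(defmulti example-file-names
  (fn [{:keys [owner-slug dataset-slug]}]
    [owner-slug dataset-slug]))

(defmulti example-row->ops-fn
  (fn [{:keys [owner-slug dataset-slug file-name]}]
    [owner-slug dataset-slug file-name]))

;; A made-up dataset "acme/cities" with a single file, "cities.csv".
(defmethod example-file-names ["acme" "cities"] [_]
  #{"cities.csv"})

(defmethod example-row->ops-fn ["acme" "cities" "cities.csv"] [_]
  ;; Each CSV row arrives as a map of column name -> string value.
  (fn [{:strs [id population], city-name "name"}]
    [[:xtdb.api/put {:xt/id (keyword "acme.city" id)
                     :acme.city/name city-name
                     :acme.city/population (some-> population Long/parseLong)}]]))

(def dataset {:owner-slug "acme", :dataset-slug "cities"})

(defn rows->ops
  "Looks up the row transformer for `file-name` and applies it to every row,
  flattening the per-row op vectors into one sequence of ops."
  [dataset file-name rows]
  (let [row->ops (example-row->ops-fn (assoc dataset :file-name file-name))]
    (mapcat row->ops rows)))

(rows->ops dataset "cities.csv"
           [{"id" "1", "name" "Springfield", "population" "30720"}])
```

The final call yields a single put operation whose `:xt/id` is `:acme.city/1`. Returning a vector of ops per row, as the credits transformer does, lets one CSV row expand into several documents.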

Then, we can stream the dataset to a local file of XTDB transaction ops using:

[source,clojure]
----
(->> (dataset->ops {:owner-slug "tmdb", :dataset-slug "tmdb-movie-metadata"})
     (ops->stream (io/output-stream (io/file "/tmp/movies.edn"))))
----
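To read such a file back, something like the following may work. It is a sketch under a stated assumption: the exact format written by `ops->stream` is not described here, so the code assumes the file is a sequence of top-level EDN forms, one per operation. `read-edn-forms` and the sample file are hypothetical and exist only for this example.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '(java.io File PushbackReader))

;; Hypothetical helper; assumes one top-level EDN form per operation.
(defn read-edn-forms
  "Eagerly reads every top-level EDN form from `file` into a vector."
  [file]
  (with-open [rdr (PushbackReader. (io/reader file))]
    (let [eof (Object.)] ; unique sentinel returned at end of input
      (loop [forms []]
        (let [form (edn/read {:eof eof} rdr)]
          (if (identical? form eof)
            forms
            (recur (conj forms form))))))))

;; A tiny stand-in file, so the sketch runs without a Kaggle download.
(def sample-file (File/createTempFile "ops" ".edn"))

(spit sample-file
      (str "[:xtdb.api/put {:xt/id :tmdb.movie/1}]\n"
           "[:xtdb.api/put {:xt/id :tmdb.movie/2}]\n"))

;; Group the ops into batches of at most 1000. With an XTDB node at hand,
;; each batch could be submitted with (xtdb.api/submit-tx node (vec batch)).
(partition-all 1000 (read-edn-forms sample-file))
```

Reading eagerly keeps the helper simple. For very large files, a lazy or streaming reader would use less memory.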

Have fun!
