https://github.com/xtdb/xtdb-kaggle
A small XTDB utility to download CSV datasets from Kaggle and turn them into XTDB transaction operations.
clojure kaggle xtdb
- Host: GitHub
- URL: https://github.com/xtdb/xtdb-kaggle
- Owner: xtdb
- License: mit
- Created: 2020-06-03T11:34:21.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-10-03T13:04:17.000Z (over 4 years ago)
- Last Synced: 2025-04-17T01:36:55.305Z (10 months ago)
- Topics: clojure, kaggle, xtdb
- Language: Clojure
- Homepage:
- Size: 13.7 KB
- Stars: 3
- Watchers: 14
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.adoc
- License: LICENSE
README
= XTDB Kaggle
A small XTDB utility to download CSV datasets from https://kaggle.com[Kaggle] and turn them into XTDB transaction operations.
At the moment, it only has a transformer for one dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata.
If you do implement transformers for others, please do submit them as PRs!
== Setup
`xtdb-kaggle` is a REPL-based tool at the moment. To get set up:
* Clone the repo
* Get yourself a Kaggle API key file - create an account, head to your account settings and download an API key JSON file
* Set a `KAGGLE_KEY_FILE` environment variable pointing to the key file
* Start a REPL and connect to it in your usual way.
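Before going further, it can help to confirm from the REPL that the key file is visible to the JVM. The following is a minimal sketch, not part of `xtdb-kaggle`: `key-file-status` is a hypothetical helper that only checks that the path is set and that a file exists there.

[source,clojure]
----
(require '[clojure.java.io :as io])

(defn key-file-status
  "Hypothetical helper: returns :unset, :missing or :ok for a key file path."
  [path]
  (cond
    (nil? path)                    :unset    ; env var not set in this JVM
    (not (.isFile (io/file path))) :missing  ; set, but no file at that path
    :else                          :ok))

(key-file-status (System/getenv "KAGGLE_KEY_FILE"))
----

If this returns `:unset`, the variable was probably exported after the REPL process started, so restart the REPL from a shell where it is set.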
Then, find yourself an interesting dataset on https://kaggle.com[Kaggle].
You need to tell `xtdb-kaggle` which files you'd like to download, and then how to turn each file into XTDB operations - this is done using multimethods.
Using that movie dataset as an example - we have an `:owner-slug` of `"tmdb"`, a `:dataset-slug` of `"tmdb-movie-metadata"`, and two files: `"tmdb_5000_movies.csv"` and `"tmdb_5000_credits.csv"`.
We define a `dataset-file-names` method to specify the files, and one `csv-row->ops-fn` method for each file:
[source,clojure]
----
;; Assumes `xt` aliases `xtdb.api` and `json` aliases a JSON library that
;; provides `read-value` (jsonista.core has one); check the repo's namespace.

;; The set of CSV files to download for this dataset.
(defmethod dataset-file-names ["tmdb" "tmdb-movie-metadata"] [_]
  #{"tmdb_5000_movies.csv" "tmdb_5000_credits.csv"})

;; One movie row becomes a single put. The `keywords` and `genres` columns
;; hold JSON arrays of objects; only each object's "name" is kept.
(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_movies.csv"] [_]
  (fn [{:strs [id title runtime budget revenue keywords genres] :as row}]
    [[::xt/put {:xt/id (keyword (name 'tmdb.movie) id)
                :tmdb/type :movie
                :tmdb.movie/id (Long/parseLong id)
                :tmdb.movie/title title
                :tmdb.movie/budget (some-> budget Long/parseLong)
                :tmdb.movie/revenue (some-> revenue Long/parseLong)
                :tmdb.movie/keywords (->> (json/read-value keywords)
                                          (into #{} (map #(get % "name"))))
                :tmdb.movie/genres (->> (json/read-value genres)
                                        (into #{} (map #(get % "name"))))}]]))

;; One credits row lists the whole cast as JSON. Each cast entry yields two
;; puts: the cast member, and the credit linking them to the movie.
(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_credits.csv"] [_]
  (fn [{:strs [movie_id cast] :as row}]
    (let [movie-id (Long/parseLong movie_id)]
      (->> (for [{cast-name "name", :strs [credit_id id character]} (json/read-value cast)]
             [[::xt/put {:xt/id (keyword (name 'tmdb.cast) (str id))
                         :tmdb/type :cast
                         :tmdb.cast/id id
                         :tmdb.cast/name cast-name}]
              [::xt/put {:xt/id (keyword (name 'tmdb.credit) credit_id)
                         :tmdb/type :credit
                         :tmdb.movie/id movie-id
                         :tmdb.cast/id id
                         :tmdb.cast/character character}]])
           (apply concat)))))
----
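To see the shape of this pattern without downloading anything, here is a self-contained sketch you can paste into a bare REPL. Everything in it is illustrative: the `example-*` multimethods, their dispatch functions, the slugs and the `people.csv` columns are assumptions, not the definitions in this repo. It writes `:xtdb.api/put` in full, which is what `::xt/put` resolves to when `xt` aliases `xtdb.api`.

[source,clojure]
----
;; Stand-in multimethods so the sketch runs on its own. The dispatch
;; functions are assumptions; the real ones live in the repo.
(defmulti example-file-names
  (fn [{:keys [owner-slug dataset-slug]}]
    [owner-slug dataset-slug]))

(defmulti example-row->ops-fn
  (fn [{:keys [owner-slug dataset-slug file-name]}]
    [owner-slug dataset-slug file-name]))

(defmethod example-file-names ["example-owner" "example-dataset"] [_]
  #{"people.csv"})

;; Returns a function from one CSV row (a map of column name -> string)
;; to a vector of transaction operations.
(defmethod example-row->ops-fn ["example-owner" "example-dataset" "people.csv"] [_]
  (fn [{:strs [id full_name age]}]
    [[:xtdb.api/put {:xt/id (keyword "example.person" id)
                     :example.person/full-name full_name
                     :example.person/age (some-> age Long/parseLong)}]]))

(def row->ops
  (example-row->ops-fn {:owner-slug "example-owner"
                        :dataset-slug "example-dataset"
                        :file-name "people.csv"}))

(row->ops {"id" "1", "full_name" "Ada", "age" "36"})
;; => [[:xtdb.api/put {:xt/id :example.person/1,
;;                     :example.person/full-name "Ada",
;;                     :example.person/age 36}]]
----

Returning a vector of operations per row, rather than a single document, is what lets one CSV row fan out into several puts, as the credits transformer above does.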
Then, we can stream the dataset to a local file of XTDB transaction ops using:
[source,clojure]
----
(->> (dataset->ops {:owner-slug "tmdb", :dataset-slug "tmdb-movie-metadata"})
     (ops->stream (io/output-stream (io/file "/tmp/movies.edn"))))
----
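If you want to peek at the result, something like the following could read the file back. This is a sketch that assumes `ops->stream` writes plain EDN forms one after another; the README doesn't spell out the format, so check the output first. `read-edn-forms` is a hypothetical helper, and it reads the whole file eagerly, which is fine for a quick look but not for very large datasets.

[source,clojure]
----
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import 'java.io.PushbackReader)

(defn read-edn-forms
  "Hypothetical helper: eagerly reads every top-level EDN form from `path`."
  [path]
  (with-open [rdr (PushbackReader. (io/reader path))]
    (let [eof (Object.)]                       ; unique end-of-file sentinel
      (loop [forms []]
        (let [form (edn/read {:eof eof} rdr)]
          (if (identical? form eof)
            forms
            (recur (conj forms form))))))))

;; Assumed usage: look at the first couple of forms in the file.
;; (take 2 (read-edn-forms "/tmp/movies.edn"))
----

If the file turns out to hold a single collection of operations rather than one form per operation, this returns a one-element vector containing that collection.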
Have fun!