{"id":18994409,"url":"https://github.com/xtdb/xtdb-kaggle","last_synced_at":"2025-04-22T12:50:26.322Z","repository":{"id":108185100,"uuid":"269071130","full_name":"xtdb/xtdb-kaggle","owner":"xtdb","description":"A small XTDB utility to download CSV datasets from Kaggle and turn them into XTDB transaction operations.","archived":false,"fork":false,"pushed_at":"2021-10-03T13:04:17.000Z","size":14,"stargazers_count":3,"open_issues_count":1,"forks_count":2,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-04-17T01:36:55.305Z","etag":null,"topics":["clojure","kaggle","xtdb"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xtdb.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-03T11:34:21.000Z","updated_at":"2023-05-13T15:59:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"c090da28-ce7b-49e4-84b0-0be6dbdc5e33","html_url":"https://github.com/xtdb/xtdb-kaggle","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtdb%2Fxtdb-kaggle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtdb%2Fxtdb-kaggle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtdb%2Fxtdb-kaggle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtdb%2Fxtdb-kaggle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xtdb","download_url":"ht
tps://codeload.github.com/xtdb/xtdb-kaggle/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250243760,"owners_count":21398397,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","kaggle","xtdb"],"created_at":"2024-11-08T17:25:29.622Z","updated_at":"2025-04-22T12:50:26.303Z","avatar_url":"https://github.com/xtdb.png","language":"Clojure","readme":"= XTDB Kaggle\n\nA small XTDB utility to download CSV datasets from https://kaggle.com[Kaggle] and turn them into XTDB transaction operations.\n\nAt the moment, it's only got a transformer for one dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata.\nIf you do implement transformers for others, please do submit them as PRs!\n\n== Setup\n\n`xtdb-kaggle` is a REPL-based tool at the moment. 
To get set up:\n\n* Clone the repo\n* Get yourself a Kaggle API key file - create an account, head to your account settings and download an API key JSON file\n* Set a `KAGGLE_KEY_FILE` environment variable pointing to the key file\n* Start a REPL and connect to it in your usual way.\n\nThen, find yourself an interesting dataset on https://kaggle.com[Kaggle].\n\nYou need to tell `xtdb-kaggle` which files you'd like to download, and then how to turn each file into XTDB operations - this is done using multimethods.\n\nUsing that movie dataset as an example - we have an `:owner-slug` of `\"tmdb\"`, a `:dataset-slug` of `\"tmdb-movie-metadata\"`, and two files: `\"tmdb_5000_movies.csv\"` and `\"tmdb_5000_credits.csv\"`.\n\nWe define `dataset-file-names` to specify the files, and one instance of `csv-row-\u003eops-fn` for each file:\n\n[source,clojure]\n----\n(defmethod dataset-file-names [\"tmdb\" \"tmdb-movie-metadata\"] [_]\n  #{\"tmdb_5000_movies.csv\" \"tmdb_5000_credits.csv\"})\n\n(defmethod csv-row-\u003eops-fn [\"tmdb\" \"tmdb-movie-metadata\" \"tmdb_5000_movies.csv\"] [_]\n  (fn [{:strs [id title runtime budget revenue keywords genres] :as row}]\n    [[::xt/put {:xt/id (keyword (name 'tmdb.movie) id)\n                :tmdb/type :movie\n                :tmdb.movie/id (Long/parseLong id)\n                :tmdb.movie/title title\n                :tmdb.movie/budget (some-\u003e budget Long/parseLong)\n                :tmdb.movie/revenue (some-\u003e revenue Long/parseLong)\n                :tmdb.movie/keywords (-\u003e\u003e (json/read-value keywords)\n                                          (into #{} (map #(get % \"name\"))))\n                :tmdb.movie/genres (-\u003e\u003e (json/read-value genres)\n                                        (into #{} (map #(get % \"name\"))))}]]))\n\n(defmethod csv-row-\u003eops-fn [\"tmdb\" \"tmdb-movie-metadata\" \"tmdb_5000_credits.csv\"] [_]\n  (fn [{:strs [movie_id cast] :as row}]\n    (let [movie-id (Long/parseLong 
movie_id)]\n      (-\u003e\u003e (for [{cast-name \"name\", :strs [credit_id id character]} (json/read-value cast)]\n             [[::xt/put {:xt/id (keyword (name 'tmdb.cast) (str id))\n                         :tmdb/type :cast\n                         :tmdb.cast/id id\n                         :tmdb.cast/name cast-name}]\n              [::xt/put {:xt/id (keyword (name 'tmdb.credit) credit_id)\n                         :tmdb/type :credit\n                         :tmdb.movie/id movie-id\n                         :tmdb.cast/id id\n                         :tmdb.cast/character character}]])\n           (apply concat)))))\n----\n\nThen, we can stream the dataset to a local file of XTDB transaction ops using:\n\n[source,clojure]\n----\n(-\u003e\u003e (dataset-\u003eops {:owner-slug \"tmdb\", :dataset-slug \"tmdb-movie-metadata\"})\n     (ops-\u003estream (io/output-stream (io/file \"/tmp/movies.edn\"))))\n----\n\nHave fun!\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtdb%2Fxtdb-kaggle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxtdb%2Fxtdb-kaggle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtdb%2Fxtdb-kaggle/lists"}