{"id":32191305,"url":"https://github.com/emiruz/dataset-tools","last_synced_at":"2025-10-22T01:37:05.296Z","repository":{"id":62432253,"uuid":"97402450","full_name":"emiruz/dataset-tools","owner":"emiruz","description":"Easy to use library for working with core.matrix datasets in Clojure: select, where, aggregate, join, order, cross-tab, from/to-dataset, etc","archived":false,"fork":false,"pushed_at":"2018-08-03T12:03:46.000Z","size":26,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-22T01:36:44.056Z","etag":null,"topics":["data-mining","data-science","dataset","dsl","matrix","sql"],"latest_commit_sha":null,"homepage":null,"language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/emiruz.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-07-16T18:24:56.000Z","updated_at":"2019-07-15T13:13:10.000Z","dependencies_parsed_at":"2022-11-01T20:47:09.218Z","dependency_job_id":null,"html_url":"https://github.com/emiruz/dataset-tools","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/emiruz/dataset-tools","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emiruz%2Fdataset-tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emiruz%2Fdataset-tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emiruz%2Fdataset-tools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emiruz%2Fdataset-tools/manifests","owner_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/owners/emiruz","download_url":"https://codeload.github.com/emiruz/dataset-tools/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emiruz%2Fdataset-tools/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280365575,"owners_count":26318385,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-21T02:00:06.614Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-mining","data-science","dataset","dsl","matrix","sql"],"created_at":"2025-10-22T01:37:03.264Z","updated_at":"2025-10-22T01:37:05.288Z","avatar_url":"https://github.com/emiruz.png","language":"Clojure","readme":"# dataset-tools\n\n[![Clojars Project](https://img.shields.io/clojars/v/dataset-tools.svg)](https://clojars.org/dataset-tools)\n\nAPI documentation is [here](https://emiruz.github.io/dataset-tools/index.html).\n\nAn easy to use library for working with [core.matrix.dataset](https://mikera.github.io/core.matrix/doc/clojure.core.matrix.dataset.html)\ndatasets in Clojure. 
The library includes the following functions:\n\n* select (column selection)\n* order (multi-field sorting)\n* where (filtering)\n* join (inner and left join datasets on arbitrary criteria)\n* aggregate (group by aggregates)\n* cross-tab (pivot-tables)\n* order-columns (column ordering)\n* capply, rapply (column and row apply)\n* to-dataset (list of maps to dataset)\n* from-dataset (dataset to list of maps)\n* from-csv (dataset from CSV file)\n* reduce-dimensions (dimension reduction of a dataset)\n* various auxiliary functions\n\nPlease report issues and contribute if you can.\n\n## Getting Started\n\n1. Add the library to your project file:\n\n[![Clojars Project](https://img.shields.io/clojars/v/dataset-tools.svg)](https://clojars.org/dataset-tools)\n\n2. Either *use* or *require* the library in your code:\n\n```clojure\n(require '[dataset-tools.core :as dt])\n```\n\n## Common Tasks\n\nEvery function other than *from-dataset* returns a dataset. The function parameters\nare ordered so that calls are easy to thread together, which lets you compose complex\nprocessing tasks elegantly. 
Here are some examples of common tasks.\n\n```clojure\n(def test-data\n  [{:a 1 :b 4 :c \"X\" :d \"A\"}\n   {:a 41 :b 33 :c \"Y\" :d \"A\"}\n   {:a 12 :b 19 :c \"X\" :d \"B\"}])\n\n(def test-data2\n  [{:a 1 :e 9 :x \"X\"}\n   {:a 1 :e 9 :x \"X2\"}\n   {:a 41 :e 99 :x \"A\"}\n   {:a 13 :e 999 :x \"X\"}])\n```\n\n### Convert a list of maps to a dataset\n\n```clojure\n(-\u003e\u003e test-data\n     (dt/to-dataset [:a :b :c :d]))\n```\n\n### Convert a dataset to a lazy sequence of maps\n\n```clojure\n(def test-dataset (dt/to-dataset [:a :b :c :d] test-data))\n\n(-\u003e\u003e test-dataset\n     dt/from-dataset)\n```\n\n### Get a column vector\n\n```clojure\n(dt/select :a test-dataset)\n```\n\n### Filter the dataset to show rows where *c* = \"Y\", order the result by *a*, then only show columns *a* and *b*\n\n```clojure\n(-\u003e\u003e test-dataset\n     (dt/where (comp #(= % \"Y\") :c))\n     (dt/order :a)\n     (dt/select [:a :b]))\n```\n\n### Produce sum(a) and mean(b), grouped by columns *c* and *d*\n\n```clojure\n(-\u003e\u003e test-dataset\n     (dt/aggregate [:c :d]\n                   {:sum  (fn [v] (reduce + 0 (map :a v)))\n                    :mean (fn [v] (/ (reduce + 0 (map :b v)) (count v)))}))\n```\n\n### Inner join two datasets on column *a*\n\n```clojure\n(def test-dataset2 (dt/to-dataset [:a :e :x] test-data2))\n\n(dt/join :a\n         :a\n         test-dataset\n         test-dataset2)\n```\n\n### Produce a pivot of *c* versus *d* with sum(b) at the intersections\n\n```clojure\n(dt/cross-tab :c :d\n\t      #(if (nil? 
%) nil (reduce + 0 (map :b %)))\n\t      test-dataset)\n```\n\n## License\n\nDistributed under the Eclipse Public License either version 1.0 or (at\nyour option) any later version.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femiruz%2Fdataset-tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femiruz%2Fdataset-tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femiruz%2Fdataset-tools/lists"}