{"id":13600730,"url":"https://github.com/techascent/tech.ml.dataset","last_synced_at":"2025-05-15T07:03:44.069Z","repository":{"id":37506380,"uuid":"170727713","full_name":"techascent/tech.ml.dataset","owner":"techascent","description":"A Clojure high performance data processing system","archived":false,"fork":false,"pushed_at":"2025-05-05T21:38:59.000Z","size":9891,"stargazers_count":703,"open_issues_count":27,"forks_count":34,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-05-05T22:36:41.314Z","etag":null,"topics":["clojure","csv","dataframe","datascience","dataset","etl-pipeline","java","machine-learning","xlsx"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/techascent.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"docs/supported-datatypes.html","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":"cnuernber"}},"created_at":"2019-02-14T17:07:02.000Z","updated_at":"2025-05-05T21:39:03.000Z","dependencies_parsed_at":"2023-12-19T02:31:29.597Z","dependency_job_id":"9c933b3b-5099-42b5-b999-f431d782a005","html_url":"https://github.com/techascent/tech.ml.dataset","commit_stats":{"total_commits":1663,"total_committers":22,"mean_commits":75.5909090909091,"dds":0.08238123872519543,"last_synced_commit":"471b526552bf3bc3f58de9ebba860008b8d44431"},"previous_names":[],"tags_count":401,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftech.ml.dataset","tags_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftech.ml.dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftech.ml.dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftech.ml.dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/techascent","download_url":"https://codeload.github.com/techascent/tech.ml.dataset/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254291961,"owners_count":22046424,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","csv","dataframe","datascience","dataset","etl-pipeline","java","machine-learning","xlsx"],"created_at":"2024-08-01T18:00:47.397Z","updated_at":"2025-05-15T07:03:40.581Z","avatar_url":"https://github.com/techascent.png","language":"Clojure","readme":"[![Clojars Project](https://img.shields.io/clojars/v/techascent/tech.ml.dataset.svg)](https://clojars.org/techascent/tech.ml.dataset)\n![CI](https://github.com/techascent/tech.ml.dataset/actions/workflows/test.yml/badge.svg)\n![CI devcontainer](https://github.com/techascent/tech.ml.dataset/actions/workflows/test-devcontainer.yml/badge.svg)\n\n# tech.ml.dataset\n\n![TMD Logo](logo.png \"TMD\")\n\n`tech.ml.dataset` (TMD) is a Clojure library for tabular data processing similar to Python's Pandas, or R's `data.table`. It supports pragmatic data-intensive work on the JVM by providing powerful abstractions that simplify implementing efficient solutions to real problems. 
Datasets [shrink in memory](https://gist.github.com/cnuernber/26b88ed259dd1d0dc6ac2aa138eecf37) through columnar storage and the use of primitive arrays, packed datetime types, and string tables.\n\nUnlike in Python or R, TMD datasets are _functional_, which means they're easier to reason about.\n\n## Installing\n\nInstallation instructions for your favorite build system (lein, deps.edn, etc...) can be found at Clojars, where the library is hosted:\n\n[![Clojars Project](https://img.shields.io/clojars/v/techascent/tech.ml.dataset.svg)](https://clojars.org/techascent/tech.ml.dataset)\n\n - [https://clojars.org/techascent/tech.ml.dataset](https://clojars.org/techascent/tech.ml.dataset)\n\n## Verifying Installation\n\n```clojure\nuser\u003e (require 'tech.v3.dataset)\nnil\nuser\u003e (-\u003e\u003e (System/getProperties)\n           (map (fn [[k v]] {:k k :v (apply str (take 40 (str v)))}))\n           (tech.v3.dataset/-\u003e\u003edataset {:dataset-name \"My Truncated System Properties\"}))\n\nMy Truncated System Properties [53 2]:\n\n|                         :k |                                       :v |\n|----------------------------|------------------------------------------|\n|                sun.desktop |                                    gnome |\n|                awt.toolkit |                     sun.awt.X11.XToolkit |\n| java.specification.version |                                       11 |\n|            sun.cpu.isalist |                                          |\n|           sun.jnu.encoding |                                    UTF-8 |\n|            java.class.path | src:resources:target/classes:/home/harol |\n|             java.vm.vendor |                                   Ubuntu |\n|        sun.arch.data.model |                                       64 |\n|            java.vendor.url |                      https://ubuntu.com/ |\n|              user.timezone |                           America/Denver |\n|                        ... 
|                                      ... |\n|                    os.arch |                                    amd64 |\n| java.vm.specification.name |       Java Virtual Machine Specification |\n|        java.awt.printerjob |                   sun.print.PSPrinterJob |\n|         sun.os.patch.level |                                  unknown |\n|          java.library.path | /usr/java/packages/lib:/usr/lib/x86_64-l |\n|               java.vm.info |                      mixed mode, sharing |\n|                java.vendor |                                   Ubuntu |\n|            java.vm.version |      11.0.17+8-post-Ubuntu-1ubuntu222.04 |\n|    sun.io.unicode.encoding |                            UnicodeLittle |\n|        apple.awt.UIElement |                                     true |\n|         java.class.version |                                     55.0 |\n```\n\n## 📚 Documentation 📚\n\nThe best place to start is the \"Getting Started\" topic in the documentation: [https://techascent.github.io/tech.ml.dataset/000-getting-started.html](https://techascent.github.io/tech.ml.dataset/000-getting-started.html)\n\nThe \"Walkthrough\" topic provides long-form examples of processing real data: [https://techascent.github.io/tech.ml.dataset/100-walkthrough.html](https://techascent.github.io/tech.ml.dataset/100-walkthrough.html)\n\nThe \"Quick Reference\" topic summarizes many of the most frequently used functions: [https://techascent.github.io/tech.ml.dataset/200-quick-reference.html](https://techascent.github.io/tech.ml.dataset/200-quick-reference.html)\n\nThe API docs document every available function: [https://techascent.github.io/tech.ml.dataset/](https://techascent.github.io/tech.ml.dataset/)\n\nThe provided Java API ([javadoc](https://techascent.github.io/tech.ml.dataset/javadoc/tech/v3/TMD.html) / [with frames](https://techascent.github.io/tech.ml.dataset/javadoc/index.html)) and sample program ([source](java_test/java/jtest/TMDDemo.java)) show how to use TMD from 
Java.\n\n## Questions / Community\n\n* Log an [issue](https://github.com/techascent/tech.ml.dataset/issues)!\n* Visit the [Zulip stream](https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev).\n* Or the [Slack data science channel](https://clojurians.slack.com/archives/C0BQDEJ8M).\n\n-----\n\n### Related Projects and Notes\n\n* An alternative cutting-edge API with some important extra features is available via [tablecloth](https://github.com/scicloj/tablecloth).\n* [tech.v3.datatype](https://github.com/cnuernber/dtype-next) provides the underlying numeric subsystem to TMD.\n* Simple regression/classification machine learning pathways are available in [tech.ml](https://github.com/techascent/tech.ml).\n* Some [independent benchmarks](https://github.com/zero-one-group/geni-performance-benchmark/) indicate TMD's _speed_.\n* Bindings to a [high performance in-process SQL database](https://github.com/techascent/tmducken).\n* A Graal native [example project](https://github.com/cnuernber/ds-graal).\n* The [scicloj.ml tutorials](https://github.com/scicloj/scicloj.ml-tutorials) may be a good way to jump straight into data science.\n* [Comparison](https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj) between R's `data.table`, R's `dplyr`, and an older version of TMD.\n* Another overview of some of the available functions from genme: [Some Functions](https://github.com/genmeblog/techtest/wiki/Summary-of-functions)\n\n### License\n\nCopyright © 2023 Compliments of TechAscent, LLC\n\nDistributed under the Eclipse Public License either version 1.0 or (at\nyour option) any later 
version.\n","funding_links":["https://github.com/sponsors/cnuernber"],"categories":["Clojure","数据科学","Libraries"],"sub_categories":["[Tools](#tools-1)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechascent%2Ftech.ml.dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftechascent%2Ftech.ml.dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechascent%2Ftech.ml.dataset/lists"}