{"id":19288347,"url":"https://github.com/techascent/tmducken","last_synced_at":"2025-03-16T09:07:34.921Z","repository":{"id":66590137,"uuid":"439114800","full_name":"techascent/tmducken","owner":"techascent","description":"tech.ml.dataset integration with duckdb","archived":false,"fork":false,"pushed_at":"2024-09-15T17:01:03.000Z","size":216,"stargazers_count":68,"open_issues_count":5,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-03T05:39:45.949Z","etag":null,"topics":["clojure","dataanalytics","duckdb"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/techascent.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-16T20:10:36.000Z","updated_at":"2025-02-22T01:48:29.000Z","dependencies_parsed_at":"2024-02-18T01:22:41.786Z","dependency_job_id":"0fb3358e-a812-4b43-9080-f00621af6d73","html_url":"https://github.com/techascent/tmducken","commit_stats":{"total_commits":82,"total_committers":7,"mean_commits":"11.714285714285714","dds":"0.20731707317073167","last_synced_commit":"ef26b95652fa03fd09671feb3b5d5a6ede775c20"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftmducken","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftmducken/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftmducken/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/techascent%2Ftmducken/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/techascent","download_url":"https://codeload.github.com/techascent/tmducken/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243847062,"owners_count":20357317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","dataanalytics","duckdb"],"created_at":"2024-11-09T22:08:46.222Z","updated_at":"2025-03-16T09:07:34.881Z","avatar_url":"https://github.com/techascent.png","language":"Clojure","readme":"# tech.ml.dataset Integration with DuckDB\n\n[![Clojars Project](https://clojars.org/com.techascent/tmducken/latest-version.svg)](https://clojars.org/com.techascent/tmducken)\n\n\n[DuckDB](https://duckdb.org/) is a high performance in-process database system.  
It is a\nnatural pairing for [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) which is\na high performance column-major in-memory dataframe system similar to pandas or R's data table.\nDuckDB provides perhaps [cleaner pathways](https://duckdb.org/docs/data/overview) to load/save\nparquet files as you don't need to navigate the\n[minefield of dependencies](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html)\nrequired to use parquet from Clojure.\n\n```clojure\nuser\u003e (require '[tech.v3.dataset :as ds])\nnil\nuser\u003e (def stocks\n        (ds/-\u003edataset \"https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv\"\n                      {:key-fn keyword\n                       :dataset-name :stocks}))\n#'user/stocks\nuser\u003e (ds/head stocks)\n:stocks [5 3]:\n\n| :symbol |      :date | :price |\n|---------|------------|-------:|\n|    MSFT | 2000-01-01 |  39.81 |\n|    MSFT | 2000-02-01 |  36.35 |\n|    MSFT | 2000-03-01 |  43.22 |\n|    MSFT | 2000-04-01 |  28.37 |\n|    MSFT | 2000-05-01 |  25.45 |\nuser\u003e (require '[tmducken.duckdb :as duckdb])\nnil\nuser\u003e (duckdb/initialize!)\nAug 31, 2023 8:50:21 AM clojure.tools.logging$eval5800$fn__5803 invoke\nINFO: Attempting to load duckdb from \"./binaries/libduckdb.so\"\ntrue\nuser\u003e (def db (duckdb/open-db))\n#'user/db\nuser\u003e (def conn (duckdb/connect db))\n#'user/conn\nuser\u003e (duckdb/create-table! conn stocks)\n\"stocks\"\nuser\u003e (duckdb/insert-dataset! 
conn stocks)\n560\nuser\u003e (ds/head (duckdb/sql-\u003edataset conn \"select * from stocks\"))\n\n_unnamed [5 3]:\n\n| symbol |       date | price |\n|--------|------------|------:|\n|   MSFT | 2000-01-01 | 39.81 |\n|   MSFT | 2000-02-01 | 36.35 |\n|   MSFT | 2000-03-01 | 43.22 |\n|   MSFT | 2000-04-01 | 28.37 |\n|   MSFT | 2000-05-01 | 25.45 |\nuser\u003e (def stmt (duckdb/prepare conn \"select * from stocks\"))\nAug 31, 2023 8:52:25 AM clojure.tools.logging$eval5800$fn__5803 invoke\nINFO: Reference thread starting\n#'user/stmt\nuser\u003e stmt\n#duckdb-prepared-statement-0[\\\"select * from stocks\\\"]\nuser\u003e (stmt)\n#duckdb-streaming-result[\\\"select * from stocks\\\"]\nuser\u003e (ds/head (first *1))\n_unnamed [5 3]:\n\n| symbol |       date | price |\n|--------|------------|------:|\n|   MSFT | 2000-01-01 | 39.81 |\n|   MSFT | 2000-02-01 | 36.35 |\n|   MSFT | 2000-03-01 | 43.22 |\n|   MSFT | 2000-04-01 | 28.37 |\n|   MSFT | 2000-05-01 | 25.45 |\n```\n\n\n* [API Documentation](https://techascent.github.io/tmducken/)\n\n\nThis system requires the user to install/manage the duckdb binary dependency.  DuckDB also supplies a\n[JDBC driver](https://search.maven.org/artifact/org.duckdb/duckdb_jdbc) ([documentation](https://duckdb.org/docs/api/java))\nso this library is entirely optional, as you can use the JDBC driver with tmd's existing\n[JDBC integration](https://github.com/techascent/tech.ml.dataset.sql).\n\n\nWhat this library provides beyond the JDBC pathway is very fast transfer from the\nduckdb query result, which is already stored in column-major form, to a dataset.\nThe JDBC driver forces you to go row by row and essentially use the\nsequence-of-maps-\u003edataset pathway.  
This library is also another example of how to\nuse [dtype-next's ffi system to bind to a C library](src/tmducken/duckdb/ffi.clj).\n\n\nOne thing to note is that, because duckdb already supplies the query result columns to the\nuser in data form, there is no advantage to using Apache Arrow as a query result.\n\n\n## Usage\n\n### Preferred method (Nix)\n\nYou can get a jar for your own OS (Linux/macOS) with duckdb included by [installing Nix](https://github.com/DeterminateSystems/nix-installer),\nthen running:\n\n``` console\n$ nix build github:techascent/tmducken\n\n$ uname -s -p\nDarwin arm\n\n$ jar tf result/tmducken.jar | grep libduckdb.dylib\ndarwin-aarch64/libduckdb.dylib\n```\n\nYou can include this in your deps.edn via: `{:local/root \"result/tmducken.jar\"}`\nWith this method, you can invoke `(duckdb/initialize!)` normally and it will load the shared library automatically.\n\n### Manual method\n\nFirst, download the binaries and either install them into your\nsystem path or set the DUCKDB_HOME environment variable to where\nthe shared library is installed - see the [enable-duckdb](scripts/enable-duckdb)\nscript.  Linux users can simply type:\n\n```console\nsource scripts/enable-duckdb\n```\n\nNext, you should be able to call [initialize!](https://techascent.github.io/tmducken/tmducken.duckdb.html#var-initialize.21)\nin the duckdb namespace.  
Be sure to read the [namespace documentation](https://techascent.github.io/tmducken/tmducken.duckdb.html)\nand perhaps peruse the [unit tests](test/tmducken/duckdb_test.clj).\n\n## Developing\n\nWhenever the `deps.edn` file changes, you have to run `deps-lock`, as provided by the `nix shell`.\nFor more details see: https://jlesquembre.github.io/clj-nix/lock-file/\n\nNot doing so fails local Nix builds (`nix build`) by trying to fetch dependencies in a non-network build environment.\n\nAlternative CI-based solution to the above: https://jlesquembre.github.io/clj-nix/github-action/\n\n## License\n\n * [MIT License](LICENSE)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechascent%2Ftmducken","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftechascent%2Ftmducken","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechascent%2Ftmducken/lists"}