{"id":20135747,"url":"https://github.com/replikativ/mesalog","last_synced_at":"2025-12-12T01:07:01.591Z","repository":{"id":58405580,"uuid":"531248683","full_name":"replikativ/mesalog","owner":"replikativ","description":"CSV data loader for Datalog databases","archived":false,"fork":false,"pushed_at":"2024-06-24T14:12:59.000Z","size":1222,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-23T19:39:14.007Z","etag":null,"topics":["clojure","csv","csv-import","csv-parser","datahike","datalog"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/replikativ.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-31T20:30:33.000Z","updated_at":"2024-11-06T22:58:57.000Z","dependencies_parsed_at":"2023-07-20T05:52:25.496Z","dependency_job_id":"bd491e32-c4fd-40c9-9f50-f55a57236059","html_url":"https://github.com/replikativ/mesalog","commit_stats":{"total_commits":71,"total_committers":2,"mean_commits":35.5,"dds":"0.014084507042253502","last_synced_commit":"bf68367994b11621d95765cef57a4ad322b5268a"},"previous_names":["replikativ/datahike-csv-loader","replikativ/mesalog","replikativ/tablehike"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/replikativ%2Fmesalog","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/replikativ%2Fmesalog/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/replikativ%2Fmesalog/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/replikativ%2Fmesalog/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/replikativ","download_url":"https://codeload.github.com/replikativ/mesalog/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248077607,"owners_count":21043992,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","csv","csv-import","csv-parser","datahike","datalog"],"created_at":"2024-11-13T21:16:23.299Z","updated_at":"2025-12-12T01:07:01.519Z","avatar_url":"https://github.com/replikativ.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"## TL;DR\n\nLoads CSV data into Datalog databases with (for now) a single function\ncall.\n\n- Handles arbitrarily large files by default\n- Offers both automatic inference and user specification of\n  - Parsers (types)\n  - Schema\n- Automatic type inference and parsing of relatively simple datetime\n  values\n- Automatic vector/tuple value detection and parsing\n  - E.g. 
`\"[1,2,3]\"` -\\\u003e `[1 2 3]`\n- Not too slow (improvements soon with any luck): ~45s per million rows\n  to parse and infer schema.\n  - This is mostly for the record; it likely still leaves database\n    transactions of the data as the performance bottleneck for most\n    backends (though only\n    [Datahike](https://github.com/replikativ/datahike) is currently\n    supported).\n  - See `mesalog.demo` namespace for details.\n- Scalable: parser and schema inference on [35.5-million-row,\n  41-column](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/data)\n  dataset in 15 minutes, and load in 10 hours 20 minutes with consistent\n  resource usage and performance (see note above on DB transaction speed\n  as bottleneck), on up to 10 3.0GHz cores.\n\nPlease note that the API is not yet fully stable. Breaking changes may\noccur, though they will be communicated through the changelog and\nrelease notes.\n\n### Presentations\n\nNote: substantial overlap in content, though the earlier talk is\nsomewhat more conceptually comprehensive.  \n[Clojure Berlin lightning talk, December\n2023](https://docs.google.com/presentation/d/10mCViOX9Lkmxi8t0V7vnTLNIdoToPfVMZIUxW48onUQ/edit?usp=sharing)  \n[Austin Clojure Meetup, November\n2023](https://docs.google.com/presentation/d/1LotuOmUVs5bVAhMiCt8xHyQoI-CfsB2gCaYkPmvZx4k/edit?usp=sharing)\n\n## Acknowledgements\n\nMuch of the code in `src/mesalog/parse` is adapted from the library\n[tech.ml.dataset](https://github.com/techascent/tech.ml.dataset).\n\n## Quickstart\n\n[![Clojars Project](https://img.shields.io/clojars/v/io.replikativ/mesalog.svg)](https://clojars.org/io.replikativ/mesalog)\n[![cljdoc badge](https://cljdoc.org/badge/io.replikativ/mesalog)](https://cljdoc.org/d/io.replikativ/mesalog)\n\nReads, parses, and loads data from `filename` into a Datahike database\nvia connection `conn`. 
The remaining arguments are optional: parser\ndescriptions, schema description, and options for other relevant\nspecifications.\n\n``` clojure\n(require '[datahike.api :as d]\n         '[mesalog.api :as m])\n\n(def cfg (d/create-database))\n(def conn (d/connect cfg))\n\n;; 2-ary (basic)\n(m/load-csv filename conn)\n;; 3-ary\n(m/load-csv filename conn parser-descriptions)\n;; 4-ary\n(m/load-csv filename conn parser-descriptions schema-description)\n;; 5-ary\n(m/load-csv filename conn parser-descriptions schema-description options)\n```\n\nWhere `parser-descriptions` can be:\n\n``` clojure\n{}\n;; or\n[]\n;; or a map of valid column identifiers (nonnegative indices, strings, or keywords) to\n;; default parser function identifiers (generally though not always database type identifiers)\n;; or two-element tuples of database type identifier and parser function.\n;; Example, referring to `data/stops-sample.csv`:\n{0 :db.type/string\n ; 1. Would default to `:db.type/double` otherwise. 2. Maps to default parser for floats.\n 4 :db.type/float\n 5 :db.type/float\n 8 [:db.type/boolean #(identical? % 0)]}\n;; or equivalently\n{\"stop/id\" :db.type/string\n \"stop/lat\" :db.type/float\n \"stop/lon\" :db.type/float\n \"stop/wheelchair-boarding\" [:db.type/boolean #(identical? % 0)]}\n;; or equivalently (since each column corresponds to an attribute in this case)\n{:stop/id :db.type/string\n :stop/lat :db.type/float\n :stop/lon :db.type/float\n :stop/wheelchair-boarding [:db.type/boolean #(identical? % 0)]}\n;; or a vector specifying parsers for consecutive columns, starting from the 1st\n;; (though not necessarily ending at the last)\n;; E.g. 
based on `data/shapes.csv` with header and first row as follows:\n;; shape/id,shape/pt-lat,shape/pt-lon,shape/pt-sequence\n;; 185,52.296719,13.631534,0\n(let [parse-fn #(-\u003e (.setScale (bigdec %) 3 java.math.RoundingMode/HALF_EVEN)\n                    float)\n      dtype-parser [:db.type/float parse-fn]]\n  [:db.type/long dtype-parser dtype-parser :db.type/bigint])\n```\n\nAnd `schema-description` can be (again referring to the same sample\ndata):\n\n``` clojure\n{}\n;; or\n[]\n;; or\n{; `:stop/name` has a schema `:db/index` value of `true`\n :db/index #{:stop/name}\n ; note: redundant / synonymous with parser specifications above\n :db.type/float #{:stop/lat :stop/lon}\n :db.unique/identity #{:stop/id}\n ; parent-station attr references id attr\n :db.type/ref {:stop/parent-station :stop/id}}\n```\n\nAnd `options` can be:\n\n``` clojure\n{}\n;; or (context-free example)\n{:batch-size 50000\n :num-rows 1000000\n :separator \\tab\n :parser-sample-size 20000\n :include-cols #{0 2 3} ; can also be (valid, column-referencing) strings or keywords\n :header-row? false}\n```\n\n## Columns and attributes\n\nEach column represents either of the following:\n\n1.  An attribute, with keywordized column name as default attribute\n    ident.\n2.  An element in a heterogeneous or homogeneous tuple.\n\n## Column identifiers\n\nColumns can be identified by (nonnegative, 0-based) index, name\n(string-valued), or keyword (\"ident\").\n\n- String-valued name: Defaults to the value at the same index of the\n  column header if present, otherwise `(str \"column-\" index)`. A custom\n  index-to-name function can be specified via the option\n  `:idx-\u003ecolname`.\n- Keyword: Based on the convention of each column representing an\n  attribute, and keywordized column name as default attribute ident.\n  Defaults to the keywordized column name, with consecutive spaces\n  replaced by a single hyphen. 
A custom name-to-keyword function can be\n  specified via the option `:colname-\u003eident`.\n\nAll three forms of identifier are supported in parser descriptions and\nthe `:include-cols` option. Unfortunately, that isn't yet the case for\nthe schema description; apologies.\n\n## Including and excluding columns\n\nBy default, data from all columns is loaded. To load only some columns,\nspecify which ones to include via a predicate in the\n`:include-cols` option.\n\n## Supported column data types\n\n``` clojure\nmesalog.parse.parser/supported-dtypes\n;; i.e.\n#{:db.type/number\n  :db.type/instant\n  :db.type/tuple\n  :db.type/boolean\n  :db.type/uuid\n  :db.type/string\n  :db.type/keyword\n  :db.type/ref\n  :db.type/bigdec\n  :db.type/float\n  :db.type/bigint\n  :db.type/double\n  :db.type/long\n  :db.type/symbol\n  :local-date-time\n  :zoned-date-time\n  :instant\n  :offset-date-time\n  :local-date}\n```\n\n## Parsers vs. schema\n\n**Parser**: Interprets the values in a CSV column (field). Each included\ncolumn has a parser, whether specified or inferred.\n\n**Schema** (on write): Explicitly defines the data model.\n\nNote that some databases (including Datahike) support both\n*schema-on-read* (no explicitly defined data model) and\n*schema-on-write* (the default, described above). The schema description\n(4th) argument to `load-csv` is only relevant with schema-on-write, and\nirrelevant to schema-on-read.\n\n## Parser descriptions\n\nColumn data types (and their corresponding parsers) can be automatically\ninferred, except where the column:\n\n- Is not self-contained, and corresponds to an attribute with\n  `:db/valueType` being one of these:\n  - `:db.type/ref`: column values belong to another attribute\n    - E.g. each value in column `\"station/parent-station\"` references\n      another (parent) station via the latter's `:station/id` attribute\n      value\n  - `:db.type/tuple`: column values belong to a tuple\n    - E.g. 
attribute `:abc` is tuple-valued, with the elements of each\n      tuple coming from columns `\\\"a\\\"`, `\\\"b\\\"`, and `\\\"c\\\"`\n- Has values that are otherwise too non-standard for automatic type\n  inference.\n\n`load-csv` accepts parser descriptions as its 3rd argument, with the\ndescription for each column containing its data type(s) as well as\nparser function(s). For a scalar-valued column, this takes the form\n`[dtype fn]`, which can (currently) be specified in one of these two\nways:\n\n- A default data type, say `d`, as shorthand for\n  `[d (d mesalog.parse.parser/default-coercers)]`, with the 2nd element\n  being its corresponding default parser function. The value of `d` must\n  come from:\n\n  ``` clojure\n  (set (keys mesalog.parse.parser/default-coercers))\n  ;; i.e.\n  #{:db.type/number\n    :db.type/instant\n    :db.type/boolean\n    :db.type/uuid\n    :db.type/string\n    :db.type/keyword\n    :db.type/float\n    :db.type/bigint\n    :db.type/double\n    :db.type/long\n    :db.type/symbol\n    :local-date-time\n    :zoned-date-time\n    :instant\n    :offset-date-time\n    :local-date}\n  ```\n\n- In full, as a two-element tuple of type and (custom) parser, e.g.\n  `[:db.type/long #(long (Float/parseFloat %))]`.\n\nParser descriptions can be specified as:\n\n- A map with each element consisting of the following:\n  - Key: a valid column identifier (see above)\n  - Value: a parser description taking the form described above.\n- A vector specifying parsers for consecutive columns, starting from the\n  1st (though not necessarily ending at the last), with each element\n  again being a parser description taking the form above, just like one\n  given as a map value.\n\nSee the section [Vector-valued\ncolumns](https://github.com/replikativ/mesalog#vector-valued-columns)\nfor details on specifying parser descriptions for vector-valued columns.\n\n## Schema description\n\nSchema can be fully or partially specified for attributes introduced by\nthe 
input CSV, via the 4th argument to `load-csv`. (It can also be\nspecified for existing attributes, but any conflict with the existing\nschema, whether specified or inferred, will currently result in an\nerror, even if the connected database supports the corresponding\nupdates.)\n\nThe primary form currently supported for providing a schema description\nis a map, with each key-value pair having the following possible forms:\n\n1.  **Key:** Schema attribute, e.g. `:db/index` **Value:** Set of\n    attribute idents **E.g.:** `{:db/index #{:name}}`\n2.  **Key:** Schema attribute value, e.g. `:db.type/keyword`,\n    `:db.cardinality/many` **Value:** Set of attribute idents **E.g.:**\n    `{:db.cardinality/many #{:orders}}`\n3.  **Key:** `:db.type/ref` **Value:** Map of ref-type attribute idents\n    to referenced attribute idents **E.g.**:\n    `{:db.type/ref {:station/parent-station :station/id}}`\n4.  **Key:** `:db.type/tuple` **Value:** Map of tuple attribute ident to\n    sequence of keywordized column names **E.g.:**\n    `{:db.type/tuple {:abc [:a :b :c]}}`\n5.  
**Key:** `:db.type/compositeTuple` (a keyword not used in Datahike,\n    but that serves here as a shorthand to distinguish composite and\n    ordinary tuples) **Value:** Map of composite tuple attribute ident\n    to constituent attribute idents (keywordized column names) **E.g.:**\n    `{:db.type/compositeTuple {:abc [:a :b :c]}}`\n\n(3), (4), and (5) are specifically type-related, but seem more easily\nspecified as part of the schema description instead of parser\ndescriptions.\n\nPlease see `load-csv` docstring for further detail.\n\n## Schema-on-read\n\nMesalog supports schema-on-read databases, though not thoroughly, as\nnoted in [Current\nlimitations](https://github.com/replikativ/mesalog#current-limitations)\nbelow.\n\n## Cardinality inference\n\nNote that cardinality many can only be inferred in the presence of a\nseparate attribute marked as unique (`:db.unique/identity` or\n`:db.unique/value`).\n\n## Attributes already in schema\n\nMesalog currently supports loading data for existing attributes, as long\nas their schema remains the same; unfortunately, it doesn't yet support\nschema updates even where allowed by the connected database. As stated\nabove, any conflict with the existing schema, whether specified or\ninferred, will currently result in an error.\n\n## Reference-type attributes (with `:db/valueType` `:db.type/ref`)\n\nExamples above illustrate one way reference-type attributes can be\nrepresented in CSV. Another way is possible, via a tuple-valued field\n(column), e.g. the column `\"station/parent-station\"` could have values\nlike `[:station/id 12345]` instead of `12345`. 
In this case, the column\nwould be self-contained, and assuming valid tuple-valued references\nthroughout the parser inference row sample:\n\n- `:db.type/ref` would be inferred as its `:db/valueType`.\n- Type specification is unnecessary:\n  `{:db.type/ref {:station/parent-station :station/id}}` can be dropped.\n\n## Vector-valued columns\n\nThe parser description for a vector-valued column (whatever the\n`:db/valueType` of its corresponding attribute, if any) can be specified\nin one of a few ways:\n\n- `[dtype parse-fn]` (not supported for tuples)\n- `[[dt1 dt2 ...]]`, if `dt1` etc. are all data types having default\n  parsers\n- `[[dt1 dt2 ...] [pfn1 pfn2 ...]]`, to specify custom parser functions.\n\nA shorthand form for homogeneous vectors, e.g. `[[dt] [pfn]]`, `[[dt]]`,\nor maybe even `[dt]`, isn't yet supported.\n\n## Tuples\n\nFor the uninitiated: an\n[introduction](https://docs.datomic.com/on-prem/schema/schema.html#tuples)\nto tuples.\n\nInstead of being represented across columns as illustrated above,\n(homogeneous and heterogeneous, but not composite) tuples can also be\nrepresented by vector values. For example, a value of `[1 2 3]` for\ntuple `:abc` can be represented as such within a single column, say\n`\"abc\"`, instead of across 3 columns, 1 for each element. In this case:\n\n1.  Its specification as tuple, e.g.\n    `{:db.type/tuple {:abc [:a :b :c]}}`, can be dropped from the schema\n    description.\n2.  
Its type and parser may be inferred or specified:\n    - If `:abc` is a homogeneous tuple of uniform length, its type and\n      parser can be automatically inferred.\n    - The parser description for `\"abc\"` can take one of the forms\n      described above for [Vector-valued\n      columns](https://github.com/replikativ/mesalog#vector-valued-columns),\n      except `[dtype parse-fn]` as noted.\n\nNote: Type and parser can also be inferred for heterogeneous tuples, but\nthey must have uniform length (regardless of type inference needs).\n\n## Options\n\nSupported options: `:batch-size`, `:num-rows`, `:separator`,\n`:parser-sample-size`, `:include-cols`, and `:header-row?`. See\n`load-csv` docstring for more, including `:idx-\u003ecolname`,\n`:colname-\u003eident`, and vector-related options.\n\n## More examples\n\nSee test namespaces and the `mesalog.demo` namespace for more examples.\n\n## Current limitations\n\nMany if not most of the remaining major limitations of Mesalog are due\nto the continuing (even if much decreased) presence of coupling between\nparsers and schema, and current lack of a clean separation and coherent\ninterface between them. 
For example:\n\n- The parser descriptions argument to `load-csv` still requires column\n  type specification, even when it is irrelevant because the connected\n  database has schema-on-read.\n- More importantly:\n  - *Consistency between the parsers and schema ultimately used for data\n    load and transaction is not checked*.\n  - The current API only supports a single-step workflow; there is no\n    multi-step option that would allow verification of inferred\n    parsers and schema before data transaction.\n\nHowever, at least one major limitation is not attributable to the missing\nparser-schema interface: currently, only\n[Datahike](https://datahike.io) (see also\n[GitHub](https://github.com/replikativ/datahike)) is supported, though\nsupport will be extended to other databases once the API and\nimplementation have matured.\n\n## License\n\nCopyright © 2022-2023 Yee Fay Lim\n\nDistributed under the Eclipse Public License version 1.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freplikativ%2Fmesalog","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Freplikativ%2Fmesalog","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freplikativ%2Fmesalog/lists"}