{"id":16003073,"url":"https://github.com/luposlip/nd-db","last_synced_at":"2025-12-12T01:16:10.028Z","repository":{"id":47259026,"uuid":"173439112","full_name":"luposlip/nd-db","owner":"luposlip","description":"Clojure library exposing newline delimited files as lightning fast databases","archived":false,"fork":false,"pushed_at":"2025-02-27T09:18:21.000Z","size":222,"stargazers_count":14,"open_issues_count":14,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-11T05:02:23.006Z","etag":null,"topics":["clojure","csv","database","document","edn","json","ndnippy","newline-delimited"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luposlip.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-02T11:22:54.000Z","updated_at":"2025-01-06T13:30:47.000Z","dependencies_parsed_at":"2023-09-26T12:03:17.775Z","dependency_job_id":"de247ddf-6c64-4f74-9f14-c7125da6a009","html_url":"https://github.com/luposlip/nd-db","commit_stats":{"total_commits":64,"total_committers":1,"mean_commits":64.0,"dds":0.0,"last_synced_commit":"7941ecad0cca5704113a038c7789bd623357dc9f"},"previous_names":["luposlip/ndjson-db"],"tags_count":39,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luposlip%2Fnd-db","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luposlip%2Fnd-db/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luposlip%2Fnd
-db/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luposlip%2Fnd-db/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luposlip","download_url":"https://codeload.github.com/luposlip/nd-db/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243713357,"owners_count":20335564,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","csv","database","document","edn","json","ndnippy","newline-delimited"],"created_at":"2024-10-08T10:06:10.366Z","updated_at":"2025-12-12T01:16:09.982Z","avatar_url":"https://github.com/luposlip.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Clojure CI](https://github.com/luposlip/nd-db/workflows/Clojure%20CI/badge.svg?branch=main) [![Clojars Project](https://img.shields.io/clojars/v/com.luposlip/nd-db.svg)](https://clojars.org/com.luposlip/nd-db) [![Dependencies Status](https://versions.deps.co/luposlip/nd-db/status.svg)](https://versions.deps.co/luposlip/nd-db) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n# nd-db\n\n```clojure\n[com.luposlip/nd-db \"0.9.0-beta20\"]\n```\n\n_Newline Delimited (read-only) Databases!_\n\nClojure library that treats lines in newline delimited (potentially humongous) files as simple (thus lightning fast) databases.\n\n`nd-db` currently works with JSON documents in [.ndjson](http://ndjson.org/) files, and EDN documents in `.ndedn`. 
It also supports binary [nippy](https://github.com/ptaoussanis/nippy) encoded EDN documents.\n\n## Usage\n\nA very tiny *test database* resides in `resources/test/test.ndjson`.\n\nIt contains the following 3 documents, each of which has `\"id\"` as its unique ID:\n\n``` json\n{\"id\":1, \"data\": [\"some\", \"semi-random\", \"data\"]}\n{\"id\":222, \"data\": 42}\n{\"id\":333333,\"data\": {\"datakey\": \"datavalue\"}}\n```\n\n### Create Database\n\nSince version `0.2.0` you need to create a database var before you can use the\ndatabase. Behind the scenes this creates an index for you in a background thread.\n\n```clojure\n(def db\n  (nd-db.core/db\n    :id-fn #(Integer. ^String (second (re-find #\"^\\{\\\"id\\\":(\\d+)\" %)))\n    :filename \"resources/test/test.ndjson\"))\n```\n\nIf you want a default `:id-fn` created for you, use `:id-name` together with `:id-type` and/or `:source-type`. Both `:id-type` and `:source-type` can be `:string` or `:integer`. `:id-type` is the target type of the indexed ID, whereas `:source-type` is the type in the source `.ndjson` database file. `:source-type` defaults to `:id-type`, and `:id-type` defaults to `:string`:\n\n```clojure\n(def db\n  (nd-db.core/db\n    :id-name \"id\"\n    :id-type :integer\n    :filename \"resources/test/test.ndjson\"))\n```\n\n### EDN\n\nIf you want to read a database of EDN documents, just use `:doc-type :edn`. Please note that the standard `:id-name` and `:id-type` parameters don't (as of v0.3.0) work with EDN, hence you need to implement the `:id-fn` accordingly.\n\n### Nippy\n\nNippy can also be used. Since this is a binary standard, you'd probably start out with a `.ndjson` or `.ndedn` file, and convert it to `.ndnippy` via the `nd-db.convert` namespace. \"Why?\" you ask. Because of speed. Especially for big documents (10-100s of KBs) the parsing makes a huge difference.\n\nBecause the nippy-serialized documents are \"just\" EDN, you can simply give a path for the ID with the `:id-path` parameter. 
Or of course use the mighty `:id-fn` instead.\n\nThere's a sample `resources/test/test.ndnippy` database representing the same data as the `.ndjson` and `.ndedn` samples.\n\n### Query Single Document\n\nWith a reference to the `db` you can query the database.\n\nTo find the data for the document with ID `222`, you can perform a `query-single`:\n\n```clojure\n(nd-db.core/q db 222)\n```\n\n### Query Multiple Documents\n\nYou can also perform multiple queries at once. This is ideal for a pipelined scenario,\nsince the return value is a lazy seq:\n\n```clojure\n(nd-db.core/q db [333333 1 77])\n```\n\n### It keeps!\n\nNB: The above query for multiple documents returns only 2 documents, since there is\nno document with ID 77. This is a design decision, as the documents themselves still\ncontain the ID.\n\nIn a pipeline you'll be able to give lots of IDs to `q`, and filter down on documents\nthat are actually represented in the database.\n\nIf you want to have an option to return `nil` in this case, let me know by\ncreating an issue (or a PR).\n\n### The ID function\n\nThe ID function adds \"unlimited\" flexibility in how to uniquely identify each\ndocument. You can choose a single attribute, or a composite. It's entirely up to\nyou when you implement your ID function.\n\nIn the example above, the value of `\"id\"` is used as a unique ID to\nbuild up the database index.\n\n#### Parsing JSON documents\n\nIf you use very large databases, it makes sense to think about performance in\nyour ID function. In the above example a regular expression is used to find\nthe value of `\"id\"`, since this is faster than parsing JSON objects to EDN and\nquerying them as maps.\n\n#### Return value\n\nFurthermore the return value of the function is (almost) the only thing being\nstored in memory. Because of that you should opt for as simple data values\nas possible. 
In the above example this is the reason for the parsing to `Integer`\ninstead of keeping the `String` value.\n\nAlso note that the return value is the same value you should use to query the\ndatabase, which is why the inputs to `q` are `Integer` instances.\n\nRefer to the tests for more details.\n\n## Laziness!\n\nSince v0.9.0 both the index and the documents can be retrieved in a truly lazy fashion.\n\nGetting those lazy seqs is possible because v0.9.0 introduces a new meta+index file format\nthat makes lazy traversal of the index possible.\n\nThe downside to the support for laziness is the size of the meta+index files,\nwhich in my tested scenarios has grown by 100%. This means the meta+index file for a `.ndnippy`\ndatabase of 16.8GB containing ~300k huge documents (of 200-300KB each in raw\nJSON/EDN form) has grown from ~5MB to ~10MB.\n\nThis is not a problem at all in real life, since when you need the realized\nin-memory index (for ad-hoc querying by ID), it still consumes the same amount\nof memory as before (in the above example ~3MB).\n\n### Lazy IDs\n\nHere's an example of how to get and use the lazy IDs:\n\n``` clojure\n(with-open [r (nd-db.index/reader my-db)]\n  (-\u003e\u003e r\n       nd-db.core/lazy-ids\n       (drop 100000)\n       (take 100)\n       (sort \u003e)\n       first))\n```\n\n### Lazy Documents\n\nNB: This currently (v0.9.0) only works for `.ndnippy` databases!\n\nGetting the documents contained in a `nippy` document based database is just as\nsimple as getting the IDs:\n\n``` clojure\n(with-open [r (nd-db.index/reader my-db)]\n  (-\u003e\u003e r\n       (nd-db.core/lazy-docs my-db)\n       (drop 100000)\n       (take 100)\n       (sort-by :some-value \u003e)\n       first))\n```\n\nThe above example spends ~1ms on my laptop (MacBook Pro M1 Pro) per dropped document, which isn't a lot.\nBut if you don't actually need (most of) the documents, it's much faster to use the IDs as entry,\nlike in this example:\n\n``` clojure\n(with-open [r (nd-db.index/reader my-db)]\n 
 (-\u003e\u003e r\n       nd-db.core/lazy-ids\n       (drop 100000)\n       (take 100)\n       (nd-db.core/q my-db)\n       (sort-by :some-value \u003e)\n       first))\n```\n\n## Persisting the database index\n\nFrom v0.5.0 the generated index will be persisted to the temporary system folder on disk.\nThis is a huge benefit if you need to use the same database multiple times, after throwing\naway the reference to the parsed database, since it takes much less time to read in the index\nas compared to parsing the database file.\n\nFor small files (like the sample databases found in this repository) it doesn't really make a difference.\nBut for huge files, it makes an immense difference. The bigger the databases, the bigger the individual\ndocuments and the more complex the parsing of these documents is (to find the unique ID), the bigger\nthe difference. For a database file of 4.7GB the difference is 47s vs 90ms, or **~500 times faster**!!\n\nIf you want to keep the serialized meta+index file (`*.nddbmeta`) between system reboots, you should move it\nto another folder. You do that by passing `:index-folder` to the `db` function.\n\nIf for some reason you don't want to persist the index - e.g. there's no storage attached to a docker\ncontainer or serverless system - you can inhibit the persistence by setting param `:index-persist?`\nto `false`.\n\nFrom `v0.9.0` onwards the index is generated in parallel. This cuts two thirds of the processing time.\n\nFor more information on these and other parameters, see the source code for the `db` function in the\n`core` namespace.\n\n## Real use case: Verified Twitter Accounts\n\nTo test with a real database, download all verified Twitter users from here:\nhttps://files.pushshift.io/twitter/TU_verified.ndjson.xz\n\nPut the file somewhere, e.g. 
`path/to/TU_verified.ndjson`, and run the\nfollowing in a repl:\n\n```clojure\n(time\n   (def katy-gaga-gates-et-al\n     (doall\n      (nd-db.core/q\n       (nd-db.core/db {:id-name \"screen_name\"\n                       :id-type :string\n                       :filename \"path/to/TU_verified.ndjson\"})\n       [\"katyperry\" \"ladygaga\" \"BillGates\" \"ByMikeWilson\"]))))\n```\n\n## Performance\n\nThe extracted .ndjson file is 513 MB (297,878 records).\n\nOn my laptop the initial build of the index takes around 3 seconds, and the subsequent\nquery of the above 3 verified Twitter users takes around 1 millisecond\n(specs: Intel® Core™ i7-8750H CPU @ 2.20GHz × 6 cores with 31.2 GB RAM, SSD).\n\nIn real usage scenarios, I've used 2 databases simultaneously of sizes 1.6 GB and\n43.0 GB, with no problem or performance penalties at all (except for the relatively small\nsize of the in-memory indexes of course). Indexing the biggest database of 43GB took less\nthan 2 minutes (NB: This is with a single core, and BEFORE 0.4.0).\n\nSince the database uses random disk access, an SSD speeds up the database significantly.\n\n**Update**\n\nOn a MacBook Pro M1 Pro with 32 GB memory, the querying takes around 0.5 ms!\n\n**Another update with version `0.9.0-beta4`**\n\nOn the same M1 Pro as mentioned above, the index creation takes around 1 second, and querying the 3 documents, 1000 times, takes less than 0.3ms!\n\n## Copyright \u0026 License\n\nCopyright (C) 2019-2023 Henrik Mohr\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n            http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing 
permissions and\nlimitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluposlip%2Fnd-db","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluposlip%2Fnd-db","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluposlip%2Fnd-db/lists"}