{"id":19345924,"url":"https://github.com/tolitius/cbass","last_synced_at":"2025-04-23T04:36:36.937Z","repository":{"id":47865197,"uuid":"41490817","full_name":"tolitius/cbass","owner":"tolitius","description":"adding \"simple\" to HBase","archived":false,"fork":false,"pushed_at":"2021-08-12T15:03:44.000Z","size":98,"stargazers_count":24,"open_issues_count":4,"forks_count":11,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-15T08:08:51.407Z","etag":null,"topics":["clojure","hbase"],"latest_commit_sha":null,"homepage":null,"language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tolitius.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-27T14:16:22.000Z","updated_at":"2023-01-10T22:24:13.000Z","dependencies_parsed_at":"2022-09-14T09:40:17.137Z","dependency_job_id":null,"html_url":"https://github.com/tolitius/cbass","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tolitius%2Fcbass","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tolitius%2Fcbass/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tolitius%2Fcbass/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tolitius%2Fcbass/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tolitius","download_url":"https://codeload.github.com/tolitius/cbass/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250372501,"owners_count":21419719,"i
con_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","hbase"],"created_at":"2024-11-10T04:08:22.215Z","updated_at":"2025-04-23T04:36:36.590Z","avatar_url":"https://github.com/tolitius.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cbass\n\n* Databases are for storing and finding data \n* HBase is great at that\n* Clojure is great at \"simple\"\n\n---\n\n[![Clojars Project](http://clojars.org/cbass/latest-version.svg)](http://clojars.org/cbass)\n\n\u003c!-- START doctoc generated TOC please keep comment here to allow auto update --\u003e\n\u003c!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --\u003e\n**Table of Contents**  *generated with [DocToc](https://github.com/thlorenz/doctoc)*\n\n- [Show me](#show-me)\n- [Connecting to HBase](#connecting-to-hbase)\n  - [Custom Serializers](#custom-serializers)\n- [Storing data](#storing-data)\n  - [Storing a single row](#storing-a-single-row)\n  - [Storing multiple rows](#storing-multiple-rows)\n- [Finding it](#finding-it)\n  - [Finding by the row key](#finding-by-the-row-key)\n  - [Finding by \"anything\"](#finding-by-anything)\n    - [Scanning the whole table](#scanning-the-whole-table)\n    - [Scanning with a row key function](#scanning-with-a-row-key-function)\n    - [Scanning families and columns](#scanning-families-and-columns)\n    - [Scanning by row key prefix](#scanning-by-row-key-prefix)\n      - [:starts-with](#starts-with)\n    - [Scanning by time range](#scanning-by-time-range)\n    - [Scanning in reverse](#scanning-in-reverse)\n    - [Scanning with the 
limit](#scanning-with-the-limit)\n    - [Scanning with filter](#scanning-with-filter)\n    - [Scanning with the last updated](#scanning-with-the-last-updated)\n    - [Getting Lazy](#getting-lazy)\n    - [Scanning by \"anything\"](#scanning-by-anything)\n- [Deleting it](#deleting-it)\n    - [Deleting specific columns](#deleting-specific-columns)\n    - [Deleting a column family](#deleting-a-column-family)\n    - [Deleting a whole row](#deleting-a-whole-row)\n  - [Deleting by anything](#deleting-by-anything)\n  - [Delete row key function](#delete-row-key-function)\n- [Serialization](#serialization)\n  - [When Connecting](#when-connecting)\n- [Using increment mutations](#using-increment-mutations)\n- [License](#license)\n\n\u003c!-- END doctoc generated TOC please keep comment here to allow auto update --\u003e\n\n## Show me\n\n```clojure\n(require '[cbass :refer [new-connection store find-by scan delete]])\n```\n\n## Connecting to HBase\n\n```clojure\n(def conf {\"hbase.zookeeper.quorum\" \"127.0.0.1:2181\" \"zookeeper.session.timeout\" 30000})\n(def conn (new-connection conf))\n```\n\n### Custom Serializers\n\nBy default `cbass` uses [nippy](https://github.com/ptaoussanis/nippy) for serialization / deserialization. There are more details about it in the [Serialization](#serialization) section. This can be changed by providing your own, optional, `pack` / `unpack` functions when creating an HBase connection:\n\n```clojure\n(def conn (new-connection conf :pack identity \n                               :unpack identity))\n```\n\nIn this example we are just _muting_ \"packing\" and \"unpacking\" relying on the custom serialization being done _prior_ to calling `cbass`, so the data is a byte array, and deserialization is done _after_ the value is returned from cbass, since it will just return a byte array back in this case (i.e. 
`identity` function for both).\n\n## Storing data\n\n### Storing a single row\n\n```clojure \n;; args:      conn, table, row key, family, data, [timestamp]\n\nuser=\u003e (store conn \"galaxy:planet\" \"earth\" \"galaxy\" {:inhabited? true \n                                                     :population 7125000000 \n                                                     :age \"4.543 billion years\"})\n```\n\nDepending on the key strategy/structure, it sometimes makes sense to store only row keys / families without values:\n\n```clojure\nuser=\u003e (store conn \"galaxy:planet\" \"pluto\" \"galaxy\")\n```\n\nIt is possible to pass a custom timestamp to HBase:\n\n```clojure \nuser=\u003e (store conn \"galaxy:planet\" \"earth\" \"galaxy\" {:inhabited? true \n                                                     :population 7125000000 \n                                                     :age \"4.543 billion years\"}\n                                                     1000)\n```\n\n\n### Storing multiple rows\n\nIn case there are multiple rows to store in the same table, `store-batch` can help out:\n\n```clojure\n(store-batch conn \"galaxy:planet\" \n             [[\"mars\" \"galaxy\" {:inhabited? true :population 3 :age \"4.503 billion years\"}]\n              [\"earth\" \"galaxy\" {:inhabited? true :population 7125000000 :age \"4.543 billion years\"}]\n              [\"pluto\" \"galaxy\"]\n              [\"neptune\" \"galaxy\" {:inhabited? :unknown :age \"4.503 billion years\"}]])\n```\n\nNotice that \"pluto\" has no columns, which is also fine.\n\nYou can pass a custom timestamp on each row:\n\n```clojure\n(store-batch conn \"galaxy:planet\" \n             [[\"mars\" \"galaxy\" {:inhabited? true :population 3 :age \"4.503 billion years\"} 1000]\n              [\"earth\" \"galaxy\" {:inhabited? true :population 7125000000 :age \"4.543 billion years\"} 2000]\n              [\"pluto\" \"galaxy\" nil 3000]\n              [\"neptune\" \"galaxy\" {:inhabited? 
:unknown :age \"4.503 billion years\"} 4000]]))\n```\n\n\n## Finding it\n\nThere are two primary ways data is found in HBase:\n\n* by the row key: [HBase Get](http://hbase.apache.org/book.html#_get)\n* by \"anything\": [HBase Scan](http://hbase.apache.org/book.html#scan)\n\n### Finding by the row key\n\n```clojure\n;; args:        conn, table, row key, [family, columns, [time-range]]\n\nuser=\u003e (find-by conn \"galaxy:planet\" \"earth\")\n{:age \"4.543 billion years\", :inhabited? true, :population 7125000000}\n\nuser=\u003e (find-by conn \"galaxy:planet\" \"earth\" \"galaxy\")\n{:age \"4.543 billion years\", :inhabited? true, :population 7125000000}\n\nuser=\u003e (find-by conn \"galaxy:planet\" \"earth\" \"galaxy\" #{:age :population})\n{:age \"4.543 billion years\", :population 7125000000}\n```\n\n### Finding by \"anything\"\n\nHBase calls them scanners, hence the `scan` function name.\n\nLet's first look directly at HBase (shell) to see the data we are going to scan over:\n\n```clojure\nhbase(main):002:0\u003e scan 'galaxy:planet'\nROW         COLUMN+CELL\n earth      column=galaxy:age, timestamp=1440880021543, value=NPY\\x00i\\x134.543 billion years\n earth      column=galaxy:inhabited?, timestamp=1440880021543, value=NPY\\x00\\x04\\x01\n earth      column=galaxy:population, timestamp=1440880021543, value=NPY\\x00+\\x00\\x00\\x00\\x01\\xA8\\xAE\\xDF@\n mars       column=galaxy:age, timestamp=1440880028315, value=NPY\\x00i\\x134.503 billion years\n mars       column=galaxy:inhabited?, timestamp=1440880028315, value=NPY\\x00\\x04\\x01\n mars       column=galaxy:population, timestamp=1440880028315, value=NPY\\x00d\\x03\n neptune    column=galaxy:age, timestamp=1440880036629, value=NPY\\x00i\\x134.503 billion years\n neptune    column=galaxy:inhabited?, timestamp=1440880036629, value=NPY\\x00j\\x07unknown\n3 row(s) in 0.0230 seconds\n```\n\n### Finding a specific version of a row\n\nBy default, `find-by` returns the latest version of a row. 
If you want to retrieve an earlier version of the cell, you need to pass a `:time-range` to `find-by`:\n\n```clojure\nuser=\u003e (store conn \"galaxy:planet\" \"earth\" \"galaxy\" {:population 3} 1000)\nuser=\u003e (store conn \"galaxy:planet\" \"earth\" \"galaxy\" {:population 7125000000} 2000)\n\nuser=\u003e (find-by conn \"galaxy:planet\" \"earth\" #{:population} :time-range {:from-ms 500 :to-ms 1500})\n{:last-updated 1000, :population 3}\n```\n\n\n\nHBase scanning is pretty flexible: by row key from/to prefixes, by time ranges, by families/columns, etc..\n\nHere are some examples:\n\n#### Scanning the whole table\n\n```clojure\n;; args:        conn, table, {:row-key-fn, :family, :columns, :from, :to, :time-range {:from-ms :to-ms}}\n\nuser=\u003e (scan conn \"galaxy:planet\")\n\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\n#### Scanning with a row key function\n\nBy default cbass will assume row keys are strings, but in practice keys are prefixed and/or hashed.\nHence to read a row key from HBase, a custom row key function may come handy:\n\n```clojure\n;; args:        conn, table, {:row-key-fn, :family, :columns, :from, :to, :time-range {:from-ms :to-ms}}\n\nuser=\u003e (scan conn \"galaxy:planet\" :row-key-fn #(keyword (String. %)))\n\n{:earth\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n :mars {:age \"4.503 billion years\", :inhabited? true, :population 3},\n :neptune {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\n#### Scanning families and columns\n\nby family\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :family \"galaxy\")\n\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? 
true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\nspecifying columns (qualifiers)\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :family \"galaxy\" \n                                  :columns #{:age :inhabited?})\n\n{\"earth\" {:age \"4.543 billion years\", :inhabited? true},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\n#### Scanning by row key prefix\n\nData can be scanned by a row key prefix using `:from` and/or `:to` keys:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :from \"ma\")\n\n{\"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\n`:to` is exclusive:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :from \"ea\" \n                                  :to \"ma\")\n\n{\"earth\" {:age \"4.543 billion years\", :inhabited? true, :population 7125000000}}\n```\n\nnotice, no Neptune:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :to \"nep\")\n\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3}}\n```\n\n##### :starts-with\n\nStarting from hbase-client `0.99.1`, cbass can just do `:starts-with`, in case no `:to` is needed.\n\nNotice, we added `saturday` and `saturn` for a better example:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\")\n\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown},\n \"pluto\" {},\n \"saturday\" {:age \"24 hours\", :inhabited? :sometimes},\n \"saturn\" {:age \"4.503 billion years\", :inhabited? 
:unknown}}\n```\n\nusing `:starts-with`:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :starts-with \"sa\")\n\n{\"saturday\" {:age \"24 hours\", :inhabited? :sometimes},\n \"saturn\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\n\n\n#### Scanning by time range\n\nIf you look at the data from HBase shell (above), you'll see that every cell has a timestamp associated with it.\n\nThese timestamps can be used to scan data within a certain time range:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :time-range {:from-ms 1440880021544 \n                                               :to-ms 1440880036630})\n\n{\"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\nIf `:from-ms` is missing, it defaults to `0`:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :time-range {:to-ms 1440880036629})\n\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3}}\n```\n\nLikewise, if `:to-ms` is missing, it defaults to `Long/MAX_VALUE`:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :time-range {:from-ms 1440880036629})\n\n{\"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\n#### Scanning in reverse\n\nHere is a regular table scan with all the defaults:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\")\n\n{\"earth\" {:age \"4.543 billion years\", :inhabited? true, :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\nOften it makes sense to scan the table in reverse order \nto have access to the latest updates first without scanning the whole search space:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :reverse? 
true)\n\n{\"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"earth\" {:age \"4.543 billion years\", :inhabited? true, :population 7125000000}}\n```\n\n#### Scanning with the limit\n\nSince scanning partially gets its name from a \"table scan\", in many cases it may return quite large result sets. \nOften we'd like to limit the number of rows returned, but HBase does not make it simple for [various reasons](http://www.dotkam.com/2015/10/08/hbase-scan-let-me-cache-it-for-you/).\n\ncbass makes it quite simple to limit the number of rows returned by using a `:limit` key:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :limit 2)\n\n{\"earth\" {:age \"4.543 billion years\", :inhabited? true, :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3}}\n```\n\nFor example to get the latest 3 planets added, we can scan in reverse (latest) with a limit of 3:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :limit 3 :reverse? true)\n```\n\n#### Scanning with filter\n\nFor a maximum flexibility an HBase [Filter](https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html) can be passed directly to `scan` via a `:filter` param.\n\nHere is an example of [ColumnPrefixFilter](https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html), all other HBase filters will work the same.\n\nThe data we work with:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\")\n\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown},\n \"pluto\" {},\n \"saturday\" {:age \"24 hours\", :inhabited? :sometimes},\n \"saturn\" {:age \"4.503 billion years\", :inhabited? 
:unknown}}\n```\n\nCreating a filter that keeps only the columns whose names start with \"ag\", and scanning with it:\n\n```clojure\nuser=\u003e (import 'org.apache.hadoop.hbase.filter.ColumnPrefixFilter)\norg.apache.hadoop.hbase.filter.ColumnPrefixFilter\nuser=\u003e (def f (ColumnPrefixFilter. (.getBytes \"ag\")))\n#'user/f\nuser=\u003e (scan conn \"galaxy:planet\" :filter f)\n\n{\"earth\" {:age \"4.543 billion years\"},\n \"mars\" {:age \"4.503 billion years\"},\n \"neptune\" {:age \"4.503 billion years\"},\n \"saturday\" {:age \"24 hours\"},\n \"saturn\" {:age \"4.503 billion years\"}}\n\n```\n\nSimilarly, creating a filter that keeps only the columns whose names start with \"pop\", and scanning with it:\n\n```clojure\n\nuser=\u003e (def f (ColumnPrefixFilter. (.getBytes \"pop\")))\n#'user/f\nuser=\u003e (scan conn \"galaxy:planet\" :filter f)\n\n{\"earth\" {:population 7125000000}, \n \"mars\" {:population 3}}\n```\n\n#### Scanning with the last updated\n\nTo find out _when_ the results were last updated, you can add `:with-ts? true` to `scan`.\nIt will look at _all_ the cells in the result row, and will return the latest timestamp.\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\")\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown},\n \"pluto\" {:one 1, :three 3, :two 2},\n \"saturday\" {:age \"24 hours\", :inhabited? :sometimes},\n \"saturn\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n```\n\nand this is what the result of `:with-ts? true` will look like:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :with-ts? true)\n{\"earth\"\n {:last-updated 1449681589719,\n  :age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\"\n {:last-updated 1449681589719,\n  :age \"4.503 billion years\",\n  :inhabited? true,\n  :population 3},\n \"neptune\"\n {:last-updated 1449681589719,\n  :age \"4.503 billion years\",\n  :inhabited? 
:unknown},\n \"pluto\" {:last-updated 1449681589719, :one 1, :three 3, :two 2},\n \"saturday\"\n {:last-updated 1449681589719,\n  :age \"24 hours\",\n  :inhabited? :sometimes},\n \"saturn\"\n {:last-updated 1449681589719,\n  :age \"4.503 billion years\",\n  :inhabited? :unknown}}\n```\n\nnot exactly interesting, since all the rows were stored in batch at the same exact millisecond. Let's spice it up.\n\nHave you heard the latest news about life at Saturn? Let's record it:\n\n```clojure\nuser=\u003e (store conn \"galaxy:planet\" \"saturn\" \"galaxy\" {:inhabited? true})\n```\n\nand scan again:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :with-ts? true)\n{\"earth\"\n {:last-updated 1449681589719,\n  :age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"mars\"\n {:last-updated 1449681589719,\n  :age \"4.503 billion years\",\n  :inhabited? true,\n  :population 3},\n \"neptune\"\n {:last-updated 1449681589719,\n  :age \"4.503 billion years\",\n  :inhabited? :unknown},\n \"pluto\" {:last-updated 1449681589719, :one 1, :three 3, :two 2},\n \"saturday\"\n {:last-updated 1449681589719,\n  :age \"24 hours\",\n  :inhabited? :sometimes},\n \"saturn\"\n {:last-updated 1449682282217,\n  :age \"4.503 billion years\",\n  :inhabited? true}}\n```\n\nnotice the Saturn's last update timestamp: it is now `1449682282217`.\n\n\n#### Get only the row keys\nIn some cases, we need only to find the row keys without the associated data.\nIn that case, you pass `:keys-only? true` to `scan`.\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :from \"ea\" \n                                  :to \"ma\"\n                                  :keys-only? 
true)\n\n{\"earth\" {}}\n```\n\n#### Scanning by \"anything\"\n\nOf course _all_ of the above can be combined together, and that's the beauty of scanners:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\" :family \"galaxy\" \n                                  :columns #{:age}\n                                  :from \"ma\" \n                                  :to \"z\" \n                                  :time-range {:to-ms 1440880036630})\n\n{\"mars\" {:age \"4.503 billion years\"},\n \"neptune\" {:age \"4.503 billion years\"}}\n```\n\nThere are lots of other ways to \"scan the cat\", but for now here are several.\n\n#### Getting Lazy\n\nBy default `scan` will return a realized (not lazy) result as a map. In case too much data is expected to\ncome back or the problem is best solved in batches, `lazy-scan` can be called instead to get a lazy sequence \nof results.\n\nIMPORTANT: It's the responsibility of the caller to close the table and the scanner.\n\n```clojure\nuser=\u003e (lazy-scan conn \"galaxy:planet\")\n{:table \u003ctable\u003e\n :scanner \u003cscanner\u003e\n :rows ([\"earth\"\n  {:age \"4.543 billion years\",\n   :inhabited? true,\n   :population 7125000000}]\n [\"mars\" {:age \"4.503 billion years\", :inhabited? true, :population 3}]\n [\"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown}]\n [\"pluto\" {}]\n [\"saturday\" {:age \"24 hours\", :inhabited? :sometimes}]\n [\"saturn\" {:age \"4.503 billion years\", :inhabited? true}])}\n```\n\nit is really a LazySeq:\n\n```clojure\nuser=\u003e (type (:rows (scan conn \"galaxy:planet\" :lazy? 
true)))\nclojure.lang.LazySeq\n```\n\nwhereas by default it is a map:\n\n```clojure\nuser=\u003e (type (scan conn \"galaxy:planet\"))\nclojure.lang.PersistentArrayMap\n```\n\n## Deleting it\n\n#### Deleting specific columns\n\n```clojure\n;; args:       conn, table, row key, [family, columns]\n\nuser=\u003e (delete conn \"galaxy:planet\" \"earth\" \"galaxy\" #{:age :population})\n\nuser=\u003e (find-by conn \"galaxy:planet\" \"earth\")\n{:inhabited? true}\n```\n\n#### Deleting a column family\n\n```clojure\n;; args:       conn, table, row key, [family, columns]\n\nuser=\u003e (delete conn \"galaxy:planet\" \"earth\" \"galaxy\")\n\nuser=\u003e (find-by conn \"galaxy:planet\" \"earth\")\nnil\n```\n\n#### Deleting a whole row\n\n```clojure\n;; args:       conn, table, row key, [family, columns]\n\nuser=\u003e (delete conn \"galaxy:planet\" \"mars\")\n\nuser=\u003e (find-by conn \"galaxy:planet\" \"mars\")\nnil\n```\n\n### Deleting by anything\n\nRows often need to be deleted by a filter similar to the one used in [scan](https://github.com/tolitius/cbass#scanning-by-anything) (i.e. by row key prefix, time range, etc.).\nHBase does not really help there besides providing a [BulkDeleteEndpoint](http://archive.cloudera.com/cdh5/cdh/5/hbase/apidocs/org/apache/hadoop/hbase/coprocessor/example/BulkDeleteEndpoint.html) coprocessor.\n\nThis is not ideal as it delegates work to HBase \"stored procedures\" (effectively that is what coprocessors are).\nIt really pays off during massive data manipulation since it does happen _directly_ on the server,\nbut in simpler cases, which are many, coprocessors are less than ideal.\n\n**cbass** achieves \"deleting by anything\" with a trivial flow: \"scan + multi delete\" packed in a \"delete-by\" function\nwhich preserves the \"scan\" syntax:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\")\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? 
true,\n  :population 7125000000},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown},\n \"pluto\" {},\n \"saturday\" {:age \"24 hours\", :inhabited? :sometimes},\n \"saturn\" {:age \"4.503 billion years\", :inhabited? :unknown}}\n\nuser=\u003e (delete-by conn \"galaxy:planet\" :from \"sat\" :to \"saz\")\n;; deleting [saturday saturn], since they both match the 'from/to' criteria\n```\n\nlook ma, no saturn, no saturday:\n\n```clojure\nuser=\u003e (scan conn \"galaxy:planet\")\n{\"earth\"\n {:age \"4.543 billion years\",\n  :inhabited? true,\n  :population 7125000000},\n \"neptune\" {:age \"4.503 billion years\", :inhabited? :unknown},\n \"pluto\" {}}\n```\n\nand of course any other criteria that is available in \"scan\" is available in \"delete-by\".\n\n### Delete row key function\n\nMost of the time HBase keys are prefixed (salted with a prefix).\nThis is done to avoid [\"RegionServer hotspotting\"](http://hbase.apache.org/book.html#rowkey.design).\n\n\"delete-by\" internally does a \"scan\" and returns keys that matched. 
Hence in order to delete these keys\nthey have to be \"re-salt-ed\" according to the custom key design.\n\n**cbass** addresses this by taking an optional `delete-key-fn`, which allows to \"put some salt back\" on those keys.\n\nHere is a real world example:\n\n```clojure\n;; HBase data\n\nuser=\u003e (scan conn \"table:name\")\n{\"���|8276345793754387439|transfer\" {...},\n \"���|8276345793754387439|match\" {...},\n \"���|8276345793754387439|trade\" {...},\n \"�d\\k^|28768787578329|transfer\" {...},\n \"�d\\k^|28768787578329|match\" {...},\n \"�d\\k^|28768787578329|trade\" {...}}\n```\n\na couple observations about the key:\n\n* it is prefixed with salt\n* it is piped delimited \n\nIn order to delete, say, all keys that start with `8276345793754387439`,\nbesides providing `:from` and `:to`, we would need to provide a `:row-key-fn` \nthat would _de_ salt and split, and then a `delete-key-fn` that can reassemble it back:\n\n```clojure\n(delete-by conn progress :row-key-fn (comp split-key without-salt)\n                         :delete-key-fn (fn [[x p]] (with-salt x p))\n                         :from (-\u003e \"8276345793754387439\" salt-pipe)\n                         :to   (-\u003e \"8276345793754387439\" salt-pipe+))))\n```\n\n`*salt`, `*split` and `*pipe` functions are not from **cbass**, \nthey are here to illustrate the point of how \"delete-by\" can be used to take on the real world.\n\n```clojure\n;; HBase data after the \"delete-by\"\n\nuser=\u003e (scan conn \"table:name\")\n{\"�d\\k^|28768787578329|transfer\" {...},\n \"�d\\k^|28768787578329|match\" {...},\n \"�d\\k^|28768787578329|trade\" {...}}\n```\n\n## Serialization\n\nHBase requires all data to be stored as bytes, i.e. byte arrays. 
Hence some serialization / deserialization _defaults_ are good to have.\n\n### Defaults\n\ncbass uses the great [nippy](https://github.com/ptaoussanis/nippy) serialization library by default, but of course not everyone uses nippy, plus there are cases where the work needs to be done on a pre-existing dataset.\n\n### Plug it in\n\nSerialization in cbass is pluggable via the `pack-un-pack` function, which takes a map of two functions: one to pack (`:p`) and one to unpack (`:u`):\n\n```clojure\n(pack-un-pack {:p identity :u identity})\n```\n\nIn the case above we are just muting packing and unpacking, relying on the custom serialization being done _prior_ to calling cbass, so the data is a byte array, and deserialization is done on the return value from cbass, since it will just return a byte array back in this case (i.e. `identity` for both).\n\nBut of course any other pack/unpack functions can be provided to let cbass know how to serialize and deserialize.\n\ncbass keeps an internal state of pack/unpack functions, so `pack-un-pack` would usually be called just once when an application starts.\n\n### When Connecting\n\nWhile calling `pack-un-pack` works great, it is better to specify serializers locally per connection. The `new-connection` function takes `pack` and `unpack` as optional arguments, and this is the _preferred way_ to plug in serializers vs. `pack-un-pack`:\n\n```clojure\n(def conn (new-connection conf :pack identity \n                               :unpack identity))\n```\n\n## Using increment mutations\n\nHBase offers counters in the form of the mutation API. 
One caveat is that the data isn't serialized with nippy so we have to \nmanage deserialization ourselves:\n\n```clojure\n=\u003e (cbass/pack-un-pack {:p #(cbass.tools/to-bytes %) :u identity})\n=\u003e (require '[cbass.mutate :as cmut])\n```\n\n```clojure\n=\u003e (cmut/increment conn \"galaxy:planet\" \"mars\" \"galaxy\" :landers 7)\n\n#object[org.apache.hadoop.hbase.client.Result\n        0x7017e957\n        \"keyvalues={mars/galaxy:landers/1543441160950/Put/vlen=8/seqid=0}\"\n```\n```clojure\n=\u003e (find-by conn \"galaxy:planet\" \"mars\" \"galaxy\")\n\n{:last-updated 1543441160950,\n :age #object[\"[B\" 0x2207b2e6 \"[B@2207b2e6\"],\n :inhabited? #object[\"[B\" 0x618e78f7 \"[B@618e78f7\"],\n :landers #object[\"[B\" 0xd63e8e6 \"[B@d63e8e6\"],\n :population #object[\"[B\" 0x644599bb \"[B@644599bb\"]}\n \n=\u003e (cbass.tools/bytes-\u003enum (:landers (find-by conn \"galaxy:planet\" \"mars\" \"galaxy\")))\n7\n```\n\nThere's support for batch processing of increments as well as for using the async BufferedMutator for high throughput. See the [source](src/cbass/mutate.clj) for more info.\n\n## License\n\nCopyright © 2018 tolitius\n\nDistributed under the Eclipse Public License either version 1.0 or (at\nyour option) any later version.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftolitius%2Fcbass","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftolitius%2Fcbass","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftolitius%2Fcbass/lists"}