https://github.com/igrishaev/pg-bin

Parse binary Postgres COPY output
Host: GitHub
URL: https://github.com/igrishaev/pg-bin
Owner: igrishaev
License: unlicense
Created: 2025-09-08T15:56:43.000Z (10 months ago)
Default Branch: master
Last Pushed: 2025-09-15T16:14:13.000Z (10 months ago)
Last Synced: 2025-12-13T05:00:19.744Z (7 months ago)
Topics: binary, clojure, copy, postgres
Language: Clojure
Homepage: https://github.com/igrishaev/pg-bin
Size: 68.4 KB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# PG.bin

A library to parse Postgres COPY dumps made in binary format.

Postgres has a great API for transferring data into and out of a database,
called COPY. What makes it special is that it supports three formats: CSV,
text and binary. CSV and text are trivial: values are passed in their text
representation, and only the quoting rules and separators differ.

The binary format is different in that values are not text: they're passed in
Postgres' own binary representation. As a result, the binary format is more
compact: a dump is about 30% smaller than CSV or text. The same applies to
performance: COPY-ing binary data back and forth takes about 15-25% less time.

To parse a binary dump, one must know its structure. This is what the library
does: it knows how to parse such dumps. It supports most of the built-in
Postgres types including json(b). The API is simple and extensible.

## Installation

Add this to your project:

~~~clojure
;; lein
[com.github.igrishaev/pg-bin "0.1.0"]

;; deps
com.github.igrishaev/pg-bin {:mvn/version "0.1.0"}
~~~

## Usage

Let's prepare a binary dump as follows:

~~~sql
create temp table test(
    f_01 int2,
    f_02 int4,
    f_03 int8,
    f_04 boolean,
    f_05 float4,
    f_06 float8,
    f_07 text,
    f_08 varchar(12),
    f_09 time,
    f_10 timetz,
    f_11 date,
    f_12 timestamp,
    f_13 timestamptz,
    f_14 bytea,
    f_15 json,
    f_16 jsonb,
    f_17 uuid,
    f_18 numeric(12,3),
    f_19 text null,
    f_20 decimal
);

insert into test values (
    1,
    2,
    3,
    true,
    123.456,
    654.321,
    'hello',
    'world',
    '10:42:35',
    '10:42:35+0030',
    '2025-11-30',
    '2025-11-30 10:42:35',
    '2025-11-30 10:42:35.123567+0030',
    '\xDEADBEEF',
    '{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
    '{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
    '4bda6037-1c37-4051-9898-13b82f1bd712',
    '123456.123456',
    null,
    '123999.999100500'
);

\copy test to '/Users/ivan/dump.bin' with (format binary);
~~~

Let's peek at what's inside:

~~~text
xxd -d /Users/ivan/dump.bin
00000000: 5047 434f 5059 0aff 0d0a 0000 0000 0000  PGCOPY..........
00000016: 0000 0000 1400 0000 0200 0100 0000 0400  ................
00000032: 0000 0200 0000 0800 0000 0000 0000 0300  ................
00000048: 0000 0101 0000 0004 42f6 e979 0000 0008  ........B..y....
00000064: 4084 7291 6872 b021 0000 0005 6865 6c6c  @.r.hr.!....hell
00000080: 6f00 0000 0577 6f72 6c64 0000 0008 0000  o....world......
00000096: 0008 fa0e 9cc0 0000 000c 0000 0008 fa0e  ................
00000112: 9cc0 ffff f8f8 0000 0004 0000 24f9 0000  ............$...
00000128: 0008 0002 e7cc 4a0a fcc0 0000 0008 0002  ......J.........
00000144: e7cb dec3 0d6f 0000 0004 dead beef 0000  .....o..........
00000160: 0030 7b22 666f 6f22 3a20 5b31 2c20 322c  .0{"foo": [1, 2,
00000176: 2033 2c20 7b22 6b65 6b22 3a20 5b74 7275   3, {"kek": [tru
00000192: 652c 2066 616c 7365 2c20 6e75 6c6c 5d7d  e, false, null]}
00000208: 5d7d 0000 0031 017b 2266 6f6f 223a 205b  ]}...1.{"foo": [
00000224: 312c 2032 2c20 332c 207b 226b 656b 223a  1, 2, 3, {"kek":
00000240: 205b 7472 7565 2c20 6661 6c73 652c 206e   [true, false, n
00000256: 756c 6c5d 7d5d 7d00 0000 104b da60 371c  ull]}]}....K.`7.
00000272: 3740 5198 9813 b82f 1bd7 1200 0000 0e00  7@Q..../........
00000288: 0300 0100 0000 0300 0c0d 8004 ceff ffff  ................
00000304: ff00 0000 1000 0400 0100 0000 0900 0c0f  ................
00000320: 9f27 0700 32ff ff                        .'..2..
~~~
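To make the layout above concrete, here is a minimal sketch that reads the start of the dump by hand, using only JDK classes. It follows the binary COPY layout from the Postgres documentation: an 11-byte signature, a 32-bit flags field, a 32-bit header extension length, then tuples made of a 16-bit field count followed by length-prefixed fields. The names `read-header`, `read-field` and `demo` are hypothetical and are not part of PG.bin; the bytes are the first 35 bytes of the dump shown above.

~~~clojure
(ns example.header
  (:import (java.io ByteArrayInputStream DataInputStream)
           (java.util Arrays)))

;; The 11-byte signature that opens a binary COPY stream: "PGCOPY\n\377\r\n\0".
(def signature
  (byte-array (map unchecked-byte
                   [0x50 0x47 0x43 0x4f 0x50 0x59 0x0a 0xff 0x0d 0x0a 0x00])))

(defn read-header
  "Reads the signature, the flags and the header extension.
  Returns the flags and the total header size in bytes."
  [^DataInputStream in]
  (let [sig (byte-array 11)]
    (.readFully in sig)
    (when-not (Arrays/equals ^bytes sig ^bytes signature)
      (throw (ex-info "Not a binary COPY dump" {})))
    (let [flags   (.readInt in)   ;; 32-bit flags
          ext-len (.readInt in)]  ;; header extension length
      (.skipBytes in ext-len)
      {:flags flags
       :header-size (+ 11 4 4 ext-len)})))

(defn read-field
  "Reads one field: a 32-bit length, then that many bytes.
  A length of -1 stands for NULL and yields nil."
  [^DataInputStream in]
  (let [len (.readInt in)]
    (when-not (neg? len)
      (let [buf (byte-array len)]
        (.readFully in buf)
        buf))))

;; The first 35 bytes of the dump above: the header, the field count
;; and the first two fields (f_01 int2 and f_02 int4).
(def first-bytes
  (byte-array
   (map unchecked-byte
        [0x50 0x47 0x43 0x4f 0x50 0x59 0x0a 0xff 0x0d 0x0a 0x00
         0x00 0x00 0x00 0x00
         0x00 0x00 0x00 0x00
         0x00 0x14
         0x00 0x00 0x00 0x02 0x00 0x01
         0x00 0x00 0x00 0x04 0x00 0x00 0x00 0x02])))

(defn demo []
  (with-open [in (DataInputStream. (ByteArrayInputStream. first-bytes))]
    (let [header (read-header in)
          n      (.readShort in) ;; number of fields in the tuple
          f1     (read-field in)
          f2     (read-field in)]
      [header n (vec f1) (vec f2)])))

(demo)
;; => [{:flags 0, :header-size 19} 20 [0 1] [0 0 0 2]]
~~~

The field count of 20 matches the 20 columns of the `test` table. Note that nothing in these bytes says that the first field is an `int2`: only the length is known.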

Now the library comes into play:

~~~clojure
(ns some.ns
  (:require
   [clojure.java.io :as io]
   [pg-bin.core :as copy]
   taggie.core))

(def FIELDS
  [:int2
   :int4
   :int8
   :boolean
   :float4
   :float8
   :text
   :varchar
   :time
   :timetz
   :date
   :timestamp
   :timestamptz
   :bytea
   :json
   :jsonb
   :uuid
   :numeric
   :text
   :decimal])

(copy/parse "/Users/ivan/dump.bin" FIELDS)

[[1
  2
  3
  true
  (float 123.456)
  654.321
  "hello"
  "world"
  #LocalTime "10:42:35"
  #OffsetTime "10:42:35+00:30"
  #LocalDate "2025-11-30"
  #LocalDateTime "2025-11-30T10:42:35"
  #OffsetDateTime "2025-11-30T10:12:35.123567Z"
  #bytes [-34, -83, -66, -17]
  "{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
  "{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
  #uuid "4bda6037-1c37-4051-9898-13b82f1bd712"
  123456.123M
  nil
  123999.999100500M]]
~~~

[taggie]: https://github.com/igrishaev/taggie

Here and below, I use [Taggie][taggie] to render complex values such as dates
and times, byte arrays and so on. Really useful!

This is what is going on here: we parse a source pointing to a dump using the
`parse` function. A source might be a file, a byte array, an input stream and so
on: anything that can be coerced to an input stream using the
`clojure.java.io/input-stream` function.

Binary files produced by Postgres don't describe their own structure.
Unfortunately, there is no information about types, only data. You have to help
the library traverse a binary dump by specifying a vector of types. The `FIELDS`
variable declares the structure of the file. See below for the supported types.

## API

There are two parsing functions:

- `pg-bin.core/parse` accepts any source and returns a vector of parsed
  lines. This function is eager, meaning it consumes the whole source and
  accumulates lines in a vector.

- `pg-bin.core/parse-seq` accepts an `InputStream` and returns a lazy sequence
  of parsed lines. It must be called inside the `with-open` macro as follows:

~~~clojure
(with-open [in (io/input-stream "/Users/ivan/dump.bin")]
  (let [lines (copy/parse-seq in FIELDS)]
    (doseq [line lines]
      ...)))
~~~

Both functions accept a list of fields as the second argument.
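
Why the `with-open` requirement matters: a lazy sequence reads from its stream only when elements are realized, so it must be consumed before the stream is closed. The sketch below shows the pitfall with `line-seq` from `clojure.core` as a stand-in for `parse-seq`. It assumes `parse-seq` behaves like any other lazy sequence over a stream; the function names `leaky-lines` and `safe-lines` are hypothetical.

~~~clojure
(ns example.lazy
  (:import (java.io BufferedReader IOException StringReader)))

(defn leaky-lines
  "Returns a lazy seq that escapes `with-open`. Only the first line
  has been read by the time the reader is closed."
  []
  (with-open [rdr (BufferedReader. (StringReader. "a\nb\nc"))]
    (line-seq rdr)))

(defn safe-lines
  "Realizes the whole seq with `doall` while the reader is still open."
  []
  (with-open [rdr (BufferedReader. (StringReader. "a\nb\nc"))]
    (doall (line-seq rdr))))

(safe-lines)
;; => ("a" "b" "c")

;; Realizing the leaked seq touches a closed reader and throws.
(try
  (doall (leaky-lines))
  (catch IOException e
    (ex-message e)))
;; => "Stream closed"
~~~

In practice this means doing all the work inside the `with-open` body, as in the `doseq` example above, or forcing the sequence with `doall` or `vec` before leaving it.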

## Skipping fields

When parsing, it's likely that you don't need all fields to be parsed. You may
keep only the leading ones:

~~~clojure
(copy/parse DUMP_PATH [:int2 :int4 :int8])

[[1 2 3]]
~~~

To skip fields located in the middle, use either `:skip` or an underscore:

~~~clojure
(copy/parse DUMP_PATH [:int2 :skip :_ :boolean])

[[1 true]]
~~~

## Raw fields

If, for any reason, you have a type in your dump that the library is not aware
of, or you'd like to examine its binary representation, specify `:raw` or
`:bytes`. Each value will then be a byte array. It's up to you how to deal with
those bytes:

~~~clojure
(copy/parse DUMP_PATH [:raw :raw :bytes])

[[#bytes [0, 1]
  #bytes [0, 0, 0, 2]
  #bytes [0, 0, 0, 0, 0, 0, 0, 3]]]
~~~
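
As an illustration of dealing with raw bytes yourself, here is a small sketch built on `java.nio.ByteBuffer`, which reads big-endian values by default, the same byte order Postgres uses in binary dumps. The helpers `bytes->long` and `bytes->timestamp` are hypothetical names, not part of PG.bin. The timestamp decoder assumes the value is a 64-bit count of microseconds since 2000-01-01 00:00:00, which is what the example dump above contains for the `f_12 timestamp` column.

~~~clojure
(ns example.raw
  (:import (java.nio ByteBuffer)
           (java.time LocalDateTime)
           (java.time.temporal ChronoUnit)))

(defn bytes->long
  "Decodes a 2-, 4- or 8-byte big-endian signed integer.
  Throws on any other length."
  [^bytes buf]
  (let [bb (ByteBuffer/wrap buf)]
    (case (long (alength buf))
      2 (long (.getShort bb))
      4 (long (.getInt bb))
      8 (.getLong bb))))

;; Postgres counts timestamps from this moment, not from the Unix epoch.
(def pg-epoch (LocalDateTime/parse "2000-01-01T00:00:00"))

(defn bytes->timestamp
  "Decodes a raw `timestamp` value, assuming 8 bytes holding
  microseconds since 2000-01-01 00:00:00."
  [^bytes buf]
  (.plus ^LocalDateTime pg-epoch
         (long (bytes->long buf))
         ChronoUnit/MICROS))

(bytes->long (byte-array [0 1]))
;; => 1

(bytes->long (byte-array [0 0 0 0 0 0 0 3]))
;; => 3

;; The eight bytes of f_12 taken from the hex dump above.
(str (bytes->timestamp
      (byte-array (map unchecked-byte
                       [0x00 0x02 0xe7 0xcc 0x4a 0x0a 0xfc 0xc0]))))
;; => "2025-11-30T10:42:35"
~~~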

## Handling JSON

Postgres is well-known for its vast JSON capabilities, and sometimes the tables
we dump have json(b) columns. Above, you saw that by default, they're parsed as
plain strings. This is because there is no built-in JSON parser in Java, and I
don't want to tie this library to a certain JSON implementation.

But the library provides a number of macros to extend the underlying
multimethods. With a line of code, you can enable parsing json(b) types with
Cheshire, Jsonista, clojure.data.json, Charred, and JSam. This is how to do it:

~~~clojure
(ns some.ns
  (:require
   [pg-bin.core :as copy]
   [pg-bin.json :as json]))

(json/set-cheshire keyword) ;; overrides multimethods

(copy/parse DUMP_PATH FIELDS)

[[...
  {:foo [1 2 3 {:kek [true false nil]}]}
  {:foo [1 2 3 {:kek [true false nil]}]}
  ...]]
~~~

The `set-cheshire` macro extends the multimethods, assuming you have Cheshire
installed. Now the `parse` function, when facing json(b) types, will decode them
properly.

The `pg-bin.json` namespace provides the following macros:

- `set-string`: parse json(b) types as strings again;
- `set-cheshire`: parse using Cheshire;
- `set-data-json`: parse using clojure.data.json;
- `set-jsonista`: parse using Jsonista;
- `set-charred`: parse using Charred;
- `set-jsam`: parse using JSam.

All of them accept optional parameters that are passed into the underlying
parsing function.

PG.bin doesn't introduce any JSON-related dependencies. Each macro assumes you
have added the required library to the classpath.

## Metadata

Each parsed line carries metadata: its length in bytes, its offset from the
beginning of the file (or stream), and a unique index:

~~~clojure
(-> (copy/parse DUMP_PATH FIELDS)
    first
    meta)

#:pg{:length 306, :index 0, :offset 19}
~~~

Knowing these values might help with reading a dump in chunks.
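
For instance, the offset and length are enough to pull the bytes of a single line out of a file without reading anything before it. The sketch below does this with `java.io.RandomAccessFile`. The function `read-row-bytes` is a hypothetical helper, not part of PG.bin, and it assumes `:pg/offset` is counted from the very start of the file. That matches the example dump: the header takes the first 19 bytes, and the 306 bytes that follow end right before the 2-byte trailer of the 327-byte file.

~~~clojure
(ns example.chunks
  (:import (java.io File FileOutputStream RandomAccessFile)))

(defn read-row-bytes
  "Reads the raw bytes of one line given the metadata map of that line,
  i.e. a map with the :pg/offset and :pg/length keys."
  [^String path row-meta]
  (let [offset (long (:pg/offset row-meta))
        length (int (:pg/length row-meta))
        buf    (byte-array length)]
    (with-open [raf (RandomAccessFile. path "r")]
      (.seek raf offset)     ;; jump straight to the line
      (.readFully raf buf))  ;; read exactly `length` bytes
    buf))

;; A self-contained check on a temporary 10-byte file holding 0..9.
(defn demo []
  (let [file (File/createTempFile "chunk" ".bin")]
    (try
      (with-open [out (FileOutputStream. file)]
        (.write out ^bytes (byte-array (range 10))))
      (vec (read-row-bytes (.getPath file) {:pg/offset 3 :pg/length 4}))
      (finally
        (.delete file)))))

(demo)
;; => [3 4 5 6]
~~~

With a real dump, the metadata map would come from `(meta line)` for a previously parsed line.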

## Supported types

- `:raw :bytea :bytes` for raw access and `bytea`
- `:skip :_ nil` to skip a certain field
- `:uuid` to parse UUIDs
- `:int2 :short :smallint :smallserial` 2-byte integer (short)
- `:int4 :int :integer :oid :serial` 4-byte integer (integer)
- `:int8 :bigint :long :bigserial` 8-byte integer (long)
- `:numeric :decimal` numeric type (becomes `BigDecimal`)
- `:float4 :float :real` 4-byte float (float)
- `:float8 :double :double-precision` 8-byte float (double)
- `:boolean :bool` boolean
- `:text :varchar :enum :name :string` text values
- `:date` becomes `java.time.LocalDate`
- `:time :time-without-time-zone` becomes `java.time.LocalTime`
- `:timetz :time-with-time-zone` becomes `java.time.OffsetTime`
- `:timestamp :timestamp-without-time-zone` becomes `java.time.LocalDateTime`
- `:timestamptz :timestamp-with-time-zone` becomes `java.time.OffsetDateTime`

Ping me for more types, if needed.

## On Writing

At the moment, the library only parses binary dumps. Writing them is possible
but requires extra work. Ping me if you really need to write binary files.

## Scenarios

Why would you ever use this library? Imagine you have to fetch a mas-s-s-ive
chunk of rows from a database, say 2-3 million, to build a report. That might be
an issue: you don't want to saturate memory, nor do you want to paginate using
LIMIT/OFFSET, as it's slow. A simple solution is to dump the data you need
into a file and process it. You won't keep the database busy while you're
working with the dump! Here is a small demo:

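~~~clojure
(ns some.ns
  (:require
   [clojure.java.io :as io]
   ;; assumption: `jdbc` is next.jdbc; any wrapper that returns
   ;; a java.sql.Connection will do
   [next.jdbc :as jdbc]
   [pg-bin.core :as copy]
   [pg-bin.json :as json])
  (:import
   (java.io PipedInputStream PipedOutputStream)
   (java.sql Connection)
   (org.postgresql.copy CopyManager)
   (org.postgresql.core BaseConnection)))

(defn make-copy-manager
  "
  Build an instance of CopyManager from a connection.
  "
  ^CopyManager [^Connection conn]
  (new CopyManager (.unwrap conn BaseConnection)))

;; step 1: dump the data into a file
(let [conn (jdbc/get-connection data-source)
      mgr (make-copy-manager conn)
      sql "copy table_name(col1, col2...) to stdout with (format binary)"
      ;; you can use a query without parameters as well
      ;; (this binding shadows the one above)
      sql "copy (select... from... where...) to stdout with (format binary)"
      ]
  (with-open [out (io/output-stream "/path/to/dump.bin")]
    (.copyOut mgr sql out)))

;; step 2: process the file lazily, line by line
(with-open [in (io/input-stream "/path/to/dump.bin")]
  (let [lines (copy/parse-seq in [:int2 :text ...])]
    (doseq [line lines]
      ...)))
~~~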

Above, we dump the data into a file and then process it. There is also a way to
process lines on the fly using another thread. The second demo:

~~~clojure
(let [conn
      (jdbc/get-connection data-source)

      mgr
      (make-copy-manager conn)

      sql
      "copy table_name(col1, col2...) to stdout with (format binary)"

      in
      (new PipedInputStream)

      ;; connect the pipe before the reader starts: reading from
      ;; an unconnected PipedInputStream throws "Pipe not connected"
      pipe-out
      (new PipedOutputStream in)

      started? (promise)

      fut ;; a future to process the output
      (future
        (with-open [_ in] ;; must close it afterward
          (deliver started? true) ;; must report we have started
          (let [lines (copy/parse-seq in [:int2 :text ...])]
            (doseq [line lines] ;; process on the fly
              ;; without touching the disk
              ...))))]

  ;; ensure the future has started
  @started?

  ;; drain down to the piped output stream
  (with-open [out pipe-out]
    (.copyOut mgr sql out))

  @fut ;; wait for the future to complete
  )
~~~
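
To try the piping pattern without a database, here is a minimal, self-contained sketch where the calling thread plays the role of `.copyOut` and a future plays the role of the parser. The name `pipe-demo` is hypothetical, and `.readAllBytes` requires Java 9 or later. The pipe is connected before the reader starts, and closing the output side is what signals the end of the stream to the reader.

~~~clojure
(ns example.pipe
  (:import (java.io PipedInputStream PipedOutputStream)))

(defn pipe-demo
  "Writes `payload` into a pipe on the calling thread while a future
  reads it on another thread. Returns the bytes the reader received."
  [^bytes payload]
  (let [in  (PipedInputStream.)
        out (PipedOutputStream. in) ;; connect both ends up front
        fut (future
              (with-open [in in]
                (vec (.readAllBytes in))))]
    ;; the writer blocks whenever the pipe buffer is full
    ;; and resumes as the reader drains it
    (with-open [out out]
      (.write out payload))
    @fut))

(count (pipe-demo (byte-array 5000 (byte 7))))
;; => 5000
~~~

The payload here is larger than the default pipe buffer, so the writer and the reader really do run concurrently, just as they would with a large COPY output.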

## Misc

~~~
©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©
Ivan Grishaev, 2025. © UNLICENSE ©
©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©
~~~