An open API service indexing awesome lists of open source software.

https://github.com/igrishaev/pg-bin

Parse binary Postgres COPY output
https://github.com/igrishaev/pg-bin

binary clojure copy postgres

Last synced: about 1 month ago
JSON representation

Parse binary Postgres COPY output

Awesome Lists containing this project

README

          

# PG.bin

A library to parse Postgres COPY dumps made in binary format.

Postgres has a great API to transfer data into and out from a database called
COPY. What is special about it is that it supports three different formats: CSV,
text and binary. Both CSV and text are trivial: values are passed using their
text representation. Only quoting rules and separating characters differ.

Binary format is special in that direction that values are not text. They're
passed exactly how they're stored in Postgres. Thus, binary format is more
compact: it's 30% less in size than CSV or text. The same applies to
performance: COPY-ing a binary data back and forth takes about 15-25% less time.

To parse a binary dump, one must know its structure. This is what the library
does: it knows how to parse such dumps. It supports most of the built-in
Postgres types including JSON(b). The API is simple an extensible.

## Installation

Add this to your project:

~~~clojure
;; lein
[com.github.igrishaev/pg-bin "0.1.0"]

;; deps
com.github.igrishaev/pg-bin {:mvn/version "0.1.0"}
~~~

## Usage

Let's prepare a binary dump as follows:

~~~sql
create temp table test(
f_01 int2,
f_02 int4,
f_03 int8,
f_04 boolean,
f_05 float4,
f_06 float8,
f_07 text,
f_08 varchar(12),
f_09 time,
f_10 timetz,
f_11 date,
f_12 timestamp,
f_13 timestamptz,
f_14 bytea,
f_15 json,
f_16 jsonb,
f_17 uuid,
f_18 numeric(12,3),
f_19 text null,
f_20 decimal
);

insert into test values (
1,
2,
3,
true,
123.456,
654.321,
'hello',
'world',
'10:42:35',
'10:42:35+0030',
'2025-11-30',
'2025-11-30 10:42:35',
'2025-11-30 10:42:35.123567+0030',
'\xDEADBEEF',
'{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
'{"foo": [1, 2, 3, {"kek": [true, false, null]}]}',
'4bda6037-1c37-4051-9898-13b82f1bd712',
'123456.123456',
null,
'123999.999100500'
);

\copy test to '/Users/ivan/dump.bin' with (format binary);
~~~

Let's peek what's inside:

~~~text
xxd -d /Users/ivan/dump.bin

00000000: 5047 434f 5059 0aff 0d0a 0000 0000 0000 PGCOPY..........
00000016: 0000 0000 1400 0000 0200 0100 0000 0400 ................
00000032: 0000 0200 0000 0800 0000 0000 0000 0300 ................
00000048: 0000 0101 0000 0004 42f6 e979 0000 0008 ........B..y....
00000064: 4084 7291 6872 b021 0000 0005 6865 6c6c @.r.hr.!....hell
00000080: 6f00 0000 0577 6f72 6c64 0000 0008 0000 o....world......
00000096: 0008 fa0e 9cc0 0000 000c 0000 0008 fa0e ................
00000112: 9cc0 ffff f8f8 0000 0004 0000 24f9 0000 ............$...
00000128: 0008 0002 e7cc 4a0a fcc0 0000 0008 0002 ......J.........
00000144: e7cb dec3 0d6f 0000 0004 dead beef 0000 .....o..........
00000160: 0030 7b22 666f 6f22 3a20 5b31 2c20 322c .0{"foo": [1, 2,
00000176: 2033 2c20 7b22 6b65 6b22 3a20 5b74 7275 3, {"kek": [tru
00000192: 652c 2066 616c 7365 2c20 6e75 6c6c 5d7d e, false, null]}
00000208: 5d7d 0000 0031 017b 2266 6f6f 223a 205b ]}...1.{"foo": [
00000224: 312c 2032 2c20 332c 207b 226b 656b 223a 1, 2, 3, {"kek":
00000240: 205b 7472 7565 2c20 6661 6c73 652c 206e [true, false, n
00000256: 756c 6c5d 7d5d 7d00 0000 104b da60 371c ull]}]}....K.`7.
00000272: 3740 5198 9813 b82f 1bd7 1200 0000 0e00 7@Q..../........
00000288: 0300 0100 0000 0300 0c0d 8004 ceff ffff ................
00000304: ff00 0000 1000 0400 0100 0000 0900 0c0f ................
00000320: 9f27 0700 32ff ff .'..2..
~~~

Now the library comes into play:

~~~clojure
(ns some.ns
(:require
[clojure.java.io :as io]
[pg-bin.core :as copy]
taggie.core))

(def FIELDS
[:int2
:int4
:int8
:boolean
:float4
:float8
:text
:varchar
:time
:timetz
:date
:timestamp
:timestamptz
:bytea
:json
:jsonb
:uuid
:numeric
:text
:decimal])

(copy/parse "/Users/ivan/dump.bin" FIELDS)

[[1
2
3
true
(float 123.456)
654.321
"hello"
"world"
#LocalTime "10:42:35"
#OffsetTime "10:42:35+00:30"
#LocalDate "2025-11-30"
#LocalDateTime "2025-11-30T10:42:35"
#OffsetDateTime "2025-11-30T10:12:35.123567Z"
(=bytes [-34, -83, -66, -17])
"{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
"{\"foo\": [1, 2, 3, {\"kek\": [true, false, null]}]}"
#uuid "4bda6037-1c37-4051-9898-13b82f1bd712"
123456.123M
nil
123999.999100500M]]
~~~

[taggie]: https://github.com/igrishaev/taggie

Here and below: I use [Taggie][taggie] to render complex values like date &
time, byte arrays and so on. Really useful!

This is what is going on here: we parse a source pointing to a dump using the
`parse` function. A source might be a file, a byte array, an input stream and so
on -- anything that can be coerced to an input stream using the
`clojure.java.io/input-stream` function.

Binary files produced by Postgres don't know their structure. Unfortunately,
there is no information about types, only data. One should help the library
traverse a binary dump by specifying a vector of types. The `FIELDS` variable
declares the structure of the file. See below what types are supported.

## API

There are two functions to parse, namely:

- `pg-bin.core/parse` accepts any source and returns a vector of parsed
lines. This function is eager meaning it consumes the whole source and
accumulates lines in a vector.

- `pg-bin.core/parse-seq` accepts an `InputStream` and returns a lazy sequence
of parsed lines. It must be called under the `with-open` macro as follows:

~~~clojure
(with-open [in (io/input-stream "/Users/ivan/dump.bin")]
(let [lines (copy/parse-seq in FIELDS)]
(doseq [line lines]
...)))
~~~

Both functions accept a list of fields as the second argument.

## Skipping fields

When parsing, it's likely that you don't need all fields to be parsed. You may
keep only the leading ones:

~~~clojure
(copy/parse DUMP_PATH [:int2 :int4 :int8])
[[1 2 3]]
~~~

To skip fields located in the middle, use either `:skip` or an underscore:

~~~clojure
(copy/parse DUMP_PATH [:int2 :skip :_ :boolean])
[[1 true]]
~~~

## Raw fields

If, for any reason, you have a type in your dump that the library is not aware
about, or you'd like to examine its binary representation, specify `:raw` or
`:bytes`. Each value will be a byte array then. It's up to you how to deal with
those bytes:

~~~clojure
(copy/parse DUMP_PATH [:raw :raw :bytes])
[[#bytes [0, 1]
#bytes [0, 0, 0, 2]
#bytes [0, 0, 0, 0, 0, 0, 0, 3]]]
~~~

## Handling JSON

Postgres is well-known for its vast JSON capabilities, and sometimes tables that
we dump have json(b) columns. Above, you saw that by default, they're parsed as
plain strings. This is because there is no a built-in JSON parser in Java and I
don't want to tie this library to a certain JSON implementation.

But the library provides a number of macros to extend undelrying
multi-methods. With a line of code, you can enable parsing json(b) types with
Chesire, Jsonista, Clojure.data.json, Charred, and JSam. This is how to do it:

~~~clojure
(ns some.ns
(:require
[pg-bin.core :as copy]
[pg-bin.json :as json]))

(json/set-cheshire keyword) ;; overrides multimethods

(copy/parse DUMP_PATH FIELDS)

[[...
{:foo [1 2 3 {:kek [true false nil]}]}
{:foo [1 2 3 {:kek [true false nil]}]}
...]]
~~~

The `set-cheshire` macro extends multimethods assuming you have Cheshire
installed. Now the `parse` function, when facing json(b) types, will decode them
properly.

The `pg-bin.json` namespace provides the following macros:

- `set-string`: parse json(b) types as strings again;
- `set-cheshire`: parse using Cheshire;
- `set-data-json`: parse using clojure.data.json;
- `set-jsonista`: parse using Jsonista;
- `set-charred`: parse using Charred;
- `set-jsam`: parse using JSam.

All of them accept optional parameters that are passed into the underlying
parsing function.

PG.Bin doesn't introduce any JSON-related dependencies. Each macro assumes you
have added a required library into the classpath.

## Metadata

Each parsed line tracks its length in bytes, offset from the beginning of a file
(or a stream) and a unique index:

~~~clojure
(-> (copy/parse DUMP_PATH FIELDS)
first
meta)

#:pg{:length 306, :index 0, :offset 19}
~~~

Knowing these values might help reading a dump by chunks.

## Supported types

- `:raw :bytea :bytes` for raw access and `bytea`
- `:skip :_ nil` to skip a certain field
- `:uuid` to parse UUIDs
- `:int2 :short :smallint :smallserial` 2-byte integer (short)
- `:int4 :int :integer :oid :serial` 4-byte integer (integer)
- `:int8 :bigint :long :bigserial` 8-byte integer (long)
- `:numeric :decimal` numeric type (becomes `BigDecimal`)
- `:float4 :float :real` 4-byte float (float)
- `:float8 :double :double-precision` 8-byte float (double)
- `:boolean :bool` boolean
- `:text :varchar :enum :name :string` text values
- `:date` becomes `java.time.LocalDate`
- `:time :time-without-time-zone` becomes `java.time.LocalTime`
- `:timetz :time-with-time-zone` becomes `java.time.OffsetTime`
- `:timestamp :timestamp-without-time-zone` becomes `java.time.LocalDateTime`
- `:timestamptz :timestamp-with-time-zone` becomes `java.time.OffsetDateTime`

Ping me for more types, if needed.

## On Writing

At the moment, the library only parses binary dumps. Writing them is possible
yet requires extra work. Ping me if you really need writing binary files.

## Scenarios

Why using this library ever? Imagine you have to fetch a mas-s-s-ive chunk of
rows from a database, say 2-3 million to build a report. That might be an issue:
you don't want to saturate memory, neither you want to paginate using
LIMIT/OFFSET as it's slow. A simple solution would be to dump the data you need
into a file and process it. You won't keep the database constantly busy as
you're working with a dump! Here is a small demo:

~~~clojure
(ns some.ns
(:require
[pg-bin.core :as copy]
[pg-bin.json :as json]))

(defn make-copy-manager
"
Build an instance of CopyManager from a connection.
"
^CopyManager [^Connection conn]
(new CopyManager (.unwrap conn BaseConnection)))

(let [conn (jdbc/get-connection data-source)
mgr (make-copy-manager conn)
sql "copy table_name(col1, col2...) to stdout with (format binary)"
;; you can use a query without parameters as well
sql "copy (select... from... where...) to stdout with (format binary)"
]
(with-open [out (io/output-stream "/path/to/dump.bin")]
(.copyOut mgr sql out)))

(with-open [in (io/input-stream "/path/to/dump.bin")]
(let [lines (copy/parse-seq in [:int2 :text ...])]
(doseq [line lines]
...)))
~~~

Above, we dump the data into a file and then process it. There is a way to
process lines on the fly using another thread. The second demo:

~~~clojure
(let [conn
(jdbc/get-connection data-source)

mgr
(make-copy-manager conn)

sql
"copy table_name(col1, col2...) to stdout with (format binary)"

in
(new PipedInputStream)

started? (promise)

fut ;; a future to process the output
(future
(with-open [_ in] ;; must close it afterward
(deliver started? true) ;; must report we have started
(let [lines (copy/parse-seq in [:int2 :text ...])]
(doseq [line lines] ;; process on the fly
;; without touching the disk
...))))]

;; ensure the future has started
@started?

;; drain down to the piped output stream
(with-open [out (new PipedOutputStream in)]
(.copyOut mgr sql out))

@fut ;; wait for the future to complete
)
~~~

## Misc

~~~
©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©
Ivan Grishaev, 2025. © UNLICENSE ©
©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©©
~~~