https://github.com/pgflo/pg_flo
Stream, transform, and route PostgreSQL data in real-time.
- Host: GitHub
- URL: https://github.com/pgflo/pg_flo
- Owner: pgflo
- License: apache-2.0
- Created: 2024-09-02T17:13:01.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-16T14:35:15.000Z (about 1 year ago)
- Last Synced: 2024-11-16T15:22:43.882Z (about 1 year ago)
- Topics: data, database, etl, go, golang, logical-replication, postgres, postgresql, stream
- Language: Go
- Homepage: https://pgflo.io
- Size: 13.9 MB
- Stars: 647
- Watchers: 2
- Forks: 11
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pg_flo
[CI](https://github.com/pgflo/pg_flo/actions/workflows/ci.yml)
[Integration](https://github.com/pgflo/pg_flo/actions/workflows/integration.yml)
[Latest Release](https://github.com/pgflo/pg_flo/releases/latest)
[Docker Hub](https://hub.docker.com/r/pgflo/pg_flo/tags)
> The easiest way to move and transform data between PostgreSQL databases using Logical Replication.
ℹ️ `pg_flo` is in active development. The design and architecture are continuously improving. PRs and issues are very much welcome 🙏
## Key Features
- **Real-time Data Streaming** - Capture inserts, updates, deletes, and DDL changes in near real-time
- **Fast Initial Loads** - Parallel copy of existing data with automatic follow-up continuous replication
- **Powerful Transformations** - Filter and transform data on-the-fly ([see rules](pkg/rules/README.md))
- **Flexible Routing** - Route to different tables and remap columns ([see routing](pkg/routing/README.md))
- **Production Ready** - Supports resumable streaming, DDL tracking, and more
## Common Use Cases
- Real-time data replication between PostgreSQL databases
- ETL pipelines with data transformation
- Data re-routing, masking and filtering
- Database migration with zero downtime
- Event streaming from PostgreSQL
[View detailed examples →](internal/examples/README.md)
## Quick Start
### Prerequisites
- Docker
- PostgreSQL database with `wal_level=logical`
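You can confirm the `wal_level` prerequisite from `psql` before starting; a sketch, assuming a superuser connection to a local source database:

```shell
# Logical replication requires wal_level=logical; check the current setting
psql -h localhost -U postgres -c "SHOW wal_level;"

# If it reports "replica", switch it (takes effect after a PostgreSQL restart)
psql -h localhost -U postgres -c "ALTER SYSTEM SET wal_level = 'logical';"
```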
### 1. Install
```shell
docker pull pgflo/pg_flo:latest
```
### 2. Configure
Choose one:
- Environment variables
- YAML configuration file ([example](internal/pg-flo.yaml))
- CLI flags
### 3. Run
```shell
# Start NATS server
docker run -d --name pg_flo_nats \
  --network host \
  -v /path/to/nats-server.conf:/etc/nats/nats-server.conf \
  nats:latest \
  -c /etc/nats/nats-server.conf

# Start replicator (using config file)
docker run -d --name pg_flo_replicator \
  --network host \
  -v /path/to/config.yaml:/etc/pg_flo/config.yaml \
  pgflo/pg_flo:latest \
  replicator --config /etc/pg_flo/config.yaml

# Start worker
docker run -d --name pg_flo_worker \
  --network host \
  -v /path/to/config.yaml:/etc/pg_flo/config.yaml \
  pgflo/pg_flo:latest \
  worker postgres --config /etc/pg_flo/config.yaml
```
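The NATS container above mounts a `nats-server.conf`. A minimal sketch of that file, raising the payload limit from the NATS default of 1MB to the 8MB noted under Limits and Considerations (adjust the port to your environment):

```
# nats-server.conf — minimal configuration for pg_flo
port: 4222
max_payload: 8MB
```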
#### Example Configuration (config.yaml)
```yaml
# Replicator settings
host: "localhost"
port: 5432
dbname: "myapp"
user: "replicator"
password: "secret"
group: "users"
tables:
  - "users"
# Worker settings (postgres sink)
target-host: "dest-db"
target-dbname: "myapp"
target-user: "writer"
target-password: "secret"
# Common settings
nats-url: "nats://localhost:4222"
```
[View full configuration options →](internal/pg-flo.yaml)
## Core Concepts
### Architecture
pg_flo uses two main components:
- **Replicator**: Captures PostgreSQL changes via logical replication
- **Worker**: Processes and routes changes through NATS
[Learn how it works →](internal/how-it-works.md)
### Groups
Groups are used to:
- Identify replication processes
- Isolate replication slots and publications
- Run multiple instances on same database
- Maintain state for resumability
- Enable parallel processing
```shell
# Example: Separate groups for different tables
pg_flo replicator --group users_orders --tables users,orders
pg_flo replicator --group products --tables products
```
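Because each group gets its own slot and publication, the isolation is visible from `psql` on the source database; a sketch, assuming the connection settings from the Quick Start (the exact slot and publication names depend on pg_flo's internal naming):

```shell
# List replication slots and publications created on the source database
psql -h localhost -U replicator -d myapp \
  -c "SELECT slot_name, active FROM pg_replication_slots;"
psql -h localhost -U replicator -d myapp \
  -c "SELECT pubname FROM pg_publication;"
```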
### Streaming Modes
1. **Stream Only** (default)
- Real-time streaming of changes
```shell
pg_flo replicator --stream
```
2. **Copy Only**
- One-time parallel copy of existing data
```shell
pg_flo replicator --copy --max-copy-workers-per-table 4
```
3. **Copy and Stream**
- Initial parallel copy followed by continuous streaming
```shell
pg_flo replicator --copy-and-stream --max-copy-workers-per-table 4
```
### Destinations
- **stdout**: Console output
- **file**: File writing
- **postgres**: Database replication
- **webhook**: HTTP endpoints
[View destination details →](pkg/sinks/README.md)
## Advanced Features
### Message Routing
Routing configuration is defined in a separate YAML file:
```yaml
# routing.yaml
users:
  source_table: users
  destination_table: customers
  column_mappings:
    - source: id
      destination: customer_id
```
```shell
# Apply routing configuration
pg_flo worker postgres --routing-config /path/to/routing.yaml
```
[Learn about routing →](pkg/routing/README.md)
### Transformation Rules
Rules are defined in a separate YAML file:
```yaml
# rules.yaml
users:
  - type: exclude_columns
    columns: [password, ssn]
  - type: mask_columns
    columns: [email]
```
```shell
# Apply transformation rules
pg_flo worker file --rules-config /path/to/rules.yaml
```
[View transformation options →](pkg/rules/README.md)
### Combined Example
```shell
pg_flo worker postgres \
  --config /etc/pg_flo/config.yaml \
  --routing-config routing.yaml \
  --rules-config rules.yaml
```
## Scaling Guide
Best practices:
- Run one worker per group
- Use groups to replicate different tables independently
- Scale horizontally using multiple groups
Example scaling setup:
```shell
# Group: sales
pg_flo replicator --group sales --tables sales
pg_flo worker postgres --group sales
# Group: inventory
pg_flo replicator --group inventory --tables inventory
pg_flo worker postgres --group inventory
```
## Limits and Considerations
- NATS message size: 8MB (configurable)
- One worker per group recommended
- PostgreSQL logical replication prerequisites required
- Tables must have one of the following for replication:
- Primary key
- Unique constraint with `NOT NULL` columns
- `REPLICA IDENTITY FULL` set
Example table configurations:
```sql
-- Using a primary key (recommended)
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  email TEXT,
  name TEXT
);

-- Using a unique constraint
CREATE TABLE orders (
  order_id TEXT NOT NULL,
  customer_id TEXT NOT NULL,
  data JSONB,
  CONSTRAINT orders_unique UNIQUE (order_id, customer_id)
);
ALTER TABLE orders REPLICA IDENTITY USING INDEX orders_unique;

-- Using all columns (higher performance overhead)
CREATE TABLE audit_logs (
  id SERIAL,
  action TEXT,
  data JSONB
);
ALTER TABLE audit_logs REPLICA IDENTITY FULL;
```
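To verify which replica identity each table ended up with, query the `pg_class` catalog; `relreplident` is `d` (default, i.e. primary key), `i` (index), `f` (full), or `n` (nothing):

```shell
psql -d myapp -c "
  SELECT relname, relreplident
  FROM pg_class
  WHERE relname IN ('users', 'orders', 'audit_logs');"
```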
## Development
```shell
make build
make test
make lint
# E2E tests
./internal/scripts/e2e_local.sh
```
## Contributing
Contributions welcome! Please open an issue or submit a pull request.
## License
Apache License 2.0. [View license →](LICENSE)