{"id":19791368,"url":"https://github.com/BemiHQ/BemiDB","last_synced_at":"2025-05-01T01:32:18.194Z","repository":{"id":261488679,"uuid":"883391427","full_name":"BemiHQ/BemiDB","owner":"BemiHQ","description":"Single-binary Postgres read replica optimized for analytics","archived":false,"fork":false,"pushed_at":"2025-04-24T11:38:30.000Z","size":5417,"stargazers_count":1359,"open_issues_count":7,"forks_count":32,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-26T18:15:10.598Z","etag":null,"topics":["analytics","data-lakehouse","data-movement","data-warehouse","duckdb","iceberg","olap","parquet","postgresql","replication","zero-etl"],"latest_commit_sha":null,"homepage":"https://bemidb.com","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BemiHQ.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-04T22:09:10.000Z","updated_at":"2025-04-26T16:56:48.000Z","dependencies_parsed_at":"2024-11-06T20:45:27.968Z","dependency_job_id":"a4a3e015-e076-47f4-bdf2-58a24e15a621","html_url":"https://github.com/BemiHQ/BemiDB","commit_stats":null,"previous_names":["bemihq/bemidb"],"tags_count":127,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BemiHQ%2FBemiDB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BemiHQ%2FBemiDB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BemiHQ%2FBemiDB/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/BemiHQ%2FBemiDB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BemiHQ","download_url":"https://codeload.github.com/BemiHQ/BemiDB/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251808627,"owners_count":21647314,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","data-lakehouse","data-movement","data-warehouse","duckdb","iceberg","olap","parquet","postgresql","replication","zero-etl"],"created_at":"2024-11-12T07:01:38.086Z","updated_at":"2025-05-01T01:32:18.174Z","avatar_url":"https://github.com/BemiHQ.png","language":"Go","funding_links":[],"categories":["Go","Libraries Powered by DuckDB","analytics"],"sub_categories":[],"readme":"# BemiDB\n\nBemiDB is a Postgres read replica optimized for analytics.\nIt consists of a single binary that seamlessly connects to a Postgres database, replicates the data in a compressed columnar format, and allows you to run complex queries using its Postgres-compatible analytical query engine.\n\n![BemiDB](/img/BemiDB.gif)\n\n## Contents\n\n- [Highlights](#highlights)\n- [Use cases](#use-cases)\n- [Quickstart](#quickstart)\n- [Configuration](#configuration)\n- [Architecture](#architecture)\n- [Benchmark](#benchmark)\n- [Data type mapping](#data-type-mapping)\n- [Alternatives](#alternatives)\n- [Development](#development)\n- [License](#license)\n\n## Highlights\n\n- **Performance**: runs analytical queries up to 2000x faster than Postgres.\n- **Single Binary**: consists of a single binary that can be run on any machine.\n- 
**Postgres Replication**: automatically syncs data from Postgres databases.\n- **Compressed Data**: uses an open columnar format for tables with 4x compression.\n- **Scalable Storage**: storage is separated from compute and supports a local disk or S3.\n- **Query Engine**: leverages a query engine optimized for analytical workloads.\n- **Postgres-Compatible**: integrates with services and tools in the Postgres ecosystem.\n- **Open-Source**: released under an OSI-approved license.\n\n## Use cases\n\n- **Run complex analytical queries like it's your Postgres database**. Without worrying about performance impact and indexing.\n- **Simplify your data stack down to a single binary**. No complex setup, no data movement, no weird acronyms like CDC, ETL, DW.\n- **Integrate with Postgres-compatible tools and services**. Querying and visualizing data with BI tools, notebooks, and ORMs.\n- **Automatically centralize all data in a data lakehouse**. Using Iceberg tables with Parquet data files in object storage.\n- **Continuously archive data from your Postgres database**. 
Keeping and querying historical data without affecting the main database.\n\n## Quickstart\n\nInstall BemiDB:\n\n```sh\ncurl -sSL https://raw.githubusercontent.com/BemiHQ/BemiDB/refs/heads/main/scripts/install.sh | bash\n```\n\nSync data from a Postgres database:\n\n```sh\n./bemidb --pg-database-url postgres://postgres:postgres@localhost:5432/dbname sync\n```\n\nThen run BemiDB database:\n\n```sh\n./bemidb start\n```\n\nRun Postgres queries on top of the BemiDB database:\n\n```sh\n# List all tables\npsql postgres://localhost:54321/bemidb -c \"SELECT table_schema, table_name FROM information_schema.tables\"\n\n# Query a table\npsql postgres://localhost:54321/bemidb -c \"SELECT COUNT(*) FROM [table_name]\"\n```\n\n\u003ca name=\"docker\"\u003e\u003c/a\u003e\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eRunning in a Docker container\u003c/b\u003e\u003c/summary\u003e\n\n```sh\n# Download the latest Docker image\ndocker pull ghcr.io/bemihq/bemidb:latest\n\n# Sync data from a Postgres database\ndocker run \\\n  -e PG_DATABASE_URL=postgres://postgres:postgres@host.docker.internal:5432/dbname \\\n  ghcr.io/bemihq/bemidb:latest sync\n\n# Start the BemiDB database\ndocker run ghcr.io/bemihq/bemidb:latest start\n```\n\u003c/details\u003e\n\n\u003ca name=\"kubernetes\"\u003e\u003c/a\u003e\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eRunning in a Kubernetes cluster\u003c/b\u003e\u003c/summary\u003e\n\nYou can run 2 deployments in parallel, one for data syncing from a Postgres database and another for running the BemiDB database.\n\nIn that case, you'd need to set up a shared volume or a shared S3 bucket between the two deployments for Iceberg data.\nSee the [Configuration](#configuration) section for more details.\n\n```yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: bemidb-sync\n  namespace: default\n  labels:\n    app.kubernetes.io/name: bemidb-sync\nspec:\n  replicas: 1\n  template:\n    metadata:\n      labels:\n        
app.kubernetes.io/name: bemidb-sync\n    spec:\n      containers:\n      - name: bemidb\n        image: ghcr.io/bemihq/bemidb:latest\n        command: [\"sync\"]\n        env:\n        - name: PG_DATABASE_URL\n          value: \"postgres://postgres:postgres@postgres-host:5432/dbname\"\n        - name: PG_SYNC_INTERVAL\n          value: \"1h\"\n        ...\n```\n\n```yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: bemidb-start\n  namespace: default\n  labels:\n    app.kubernetes.io/name: bemidb-start\nspec:\n  replicas: 1\n  template:\n    metadata:\n      labels:\n        app.kubernetes.io/name: bemidb-start\n    spec:\n      containers:\n      - name: bemidb\n        image: ghcr.io/bemihq/bemidb:latest\n        command: [\"start\"]\n        ...\n```\n\n\u003c/details\u003e\n\n## Configuration\n\n### Storage configuration\n\n#### Local disk storage\n\nBy default, BemiDB stores data on the local disk.\nHere is an example of running BemiDB with default settings and storing data in a local `iceberg` directory:\n\n```sh\n./bemidb \\\n  --storage-type LOCAL \\\n  --storage-path ./iceberg \\ # Data stored in ./iceberg/*\n  start\n```\n\n#### S3 object storage\n\nBemiDB natively supports S3 storage. 
You can specify the S3 settings using the following flags:\n\n```sh\n./bemidb \\\n  --storage-type S3 \\\n  --storage-path iceberg \\ # Data stored in s3://[AWS_S3_BUCKET]/iceberg/*\n  --aws-region [AWS_REGION] \\\n  --aws-s3-bucket [AWS_S3_BUCKET] \\\n  --aws-access-key-id [AWS_ACCESS_KEY_ID] \\\n  --aws-secret-access-key [AWS_SECRET_ACCESS_KEY] \\\n  start\n```\n\n\u003ca name=\"iam\"\u003e\u003c/a\u003e\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eAWS IAM policy example\u003c/b\u003e\u003c/summary\u003e\n\nHere is the minimal IAM policy required for BemiDB to work with S3:\n\n```json\n{\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Sid\": \"VisualEditor0\",\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"s3:PutObject\",\n                \"s3:GetObject\",\n                \"s3:ListBucket\",\n                \"s3:DeleteObject\"\n            ],\n            \"Resource\": [\n                \"arn:aws:s3:::[AWS_S3_BUCKET]\",\n                \"arn:aws:s3:::[AWS_S3_BUCKET]/*\"\n            ]\n        }\n    ]\n}\n```\n\u003c/details\u003e\n\n\u003ca name=\"minio\"\u003e\u003c/a\u003e\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eMinIO object storage example\u003c/b\u003e\u003c/summary\u003e\n\nBemiDB can work with various S3-compatible object storage solutions, such as MinIO.\n\n1. You can run MinIO locally:\n\n```sh\nminio server ./minio-data\n# API: http://192.168.68.102:9000  http://127.0.0.1:9000\n#    RootUser: minioadmin\n#    RootPass: minioadmin\n# WebUI: http://192.168.68.102:65218 http://127.0.0.1:65218\n#    RootUser: minioadmin\n#    RootPass: minioadmin\n```\n\n2. Open the MinIO WebUI and create a bucket, for example, `bemidb-bucket`.\n\n3. 
Run BemiDB with the following command:\n\n```sh\n./bemidb \\\n  --storage-type S3 \\\n  --storage-path iceberg \\\n  --aws-s3-bucket bemidb-bucket \\\n  --aws-s3-endpoint 127.0.0.1:9000 \\\n  --aws-region us-east-1 \\\n  --aws-access-key-id minioadmin \\\n  --aws-secret-access-key minioadmin \\\n  sync\n```\n\n\u003c/details\u003e\n\n### Syncing configuration\n\n#### Periodic data syncing\n\nTo sync data periodically from a Postgres database:\n\n```sh\n./bemidb \\\n  --pg-sync-interval 1h \\ # Supported units: h, m, s\n  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \\\n  sync\n```\n\nTo check when a table was last synced, you can use the `bemidb_last_synced_at` function. For example:\n\n```sh\n# Check when a table was last synced\npsql postgres://localhost:54321/bemidb -c \\\n  \"SELECT bemidb_last_synced_at('public.users')\"\n\n# Check how long ago a table was last synced\npsql postgres://localhost:54321/bemidb -c \\\n  \"SELECT ROUND(EXTRACT(EPOCH FROM age(bemidb_last_synced_at('public.users'))) / 60) AS synced_minutes_ago\"\n```\n\n#### Selective table syncing\n\nBy default, BemiDB syncs all tables from the Postgres database. 
To include and sync only specific tables from your Postgres database:\n\n```sh\n./bemidb \\\n  --pg-include-tables public.billing_*,public.users \\ # A comma-separated list of tables to include, supports wildcards (*)\n  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \\\n  sync\n```\n\nTo exclude specific tables during the sync:\n\n```sh\n./bemidb \\\n  --pg-exclude-tables public.*_logs,public.cache \\ # A comma-separated list of tables to exclude, supports wildcards (*)\n  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \\\n  sync\n```\n\nNote: if a table matches both `--pg-include-tables` and `--pg-exclude-tables`, it will be excluded.\n\nFor example, to include all tables in the `public` schema except for the `public.cache` table:\n\n```sh\n./bemidb \\\n  --pg-include-tables public.* \\     # Include all tables in the public schema\n  --pg-exclude-tables public.cache \\ # Except for the public.cache table\n  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \\\n  sync\n```\n\n#### Incremental data syncing\n\nBy default, BemiDB performs a full refresh of the table data during each sync.\nFor large tables, you can enable incremental syncing to only refresh the rows that have been inserted or updated since the last sync:\n\n```sh\n./bemidb \\\n  --pg-include-tables * \\                                   # Sync all tables with a full refresh\n  --pg-incrementally-refreshed-tables public.transactions \\ # Refresh only the public.transactions table incrementally\n  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \\\n  sync\n```\n\nNote: incremental refresh is currently limited to INSERT/UPDATE-modified tables and doesn't detect DELETEd rows.\nI.e., in BemiDB, these tables become append-only.\n\n#### Sharded data syncing\n\nBemiDB allows running multiple sync processes independently, each responsible for its own set of tables (\"shards\") from the same Postgres database:\n\n```sh\n./bemidb 
\\\n  --pg-include-tables public.table1,public.table2 \\\n  --pg-preserve-unsynced \\ # Don't delete the existing tables in BemiDB that are not part of this sync (public.table3 and public.table4)\n  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \\\n  sync\n\n./bemidb \\\n  --pg-include-tables public.table3,public.table4 \\\n  --pg-preserve-unsynced \\ # Don't delete the existing tables in BemiDB that are not part of this sync (public.table1 and public.table2)\n  --pg-database-url postgres://postgres:postgres@localhost:5432/dbname \\\n  sync\n```\n\n#### Syncing from multiple Postgres databases\n\nBemiDB supports syncing data from multiple Postgres databases into the same BemiDB database by allowing prefixing schemas.\n\nFor example, if two Postgres databases `db1` and `db2` contain `public` schemas, you can prefix them as follows:\n\n```sh\n./bemidb \\\n  --pg-schema-prefix db1_ \\ # Prefix all db1 database schemas with db1_\n  --pg-preserve-unsynced \\  # Don't delete db2 schemas in BemiDB\n  --pg-database-url postgres://postgres:postgres@localhost:5432/db1 \\\n  sync\n\n./bemidb \\\n  --pg-schema-prefix db2_ \\ # Prefix all db2 database schemas with db2_\n  --pg-preserve-unsynced \\  # Don't delete db1 schemas in BemiDB\n  --pg-database-url postgres://postgres:postgres@localhost:5432/db2 \\\n  sync\n```\n\nThen you can query and join tables from both Postgres databases in the same BemiDB database:\n\n```sh\n./bemidb start\n\npsql postgres://localhost:54321/bemidb -c \\\n  \"SELECT * FROM db1_public.[TABLE] JOIN db2_public.[TABLE] ON ...\"\n```\n\n### Configuration options\n\n#### `sync` command options\n\n| CLI argument                          | Environment variable                | Default value | Description                                                                                                                                              
|\n|---------------------------------------|-------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `--pg-database-url`                   | `PG_DATABASE_URL`                   | Required      | PostgreSQL database URL to sync                                                                                                                          |\n| `--pg-sync-interval`                  | `PG_SYNC_INTERVAL`                  |               | Interval between syncs. Valid units: `h`, `m`, `s`                                                                                                       |\n| `--pg-exclude-tables`                 | `PG_EXCLUDE_TABLES`                 |               | List of tables to exclude from sync. Comma-separated `schema.table`. May contain wildcards (`*`)                                                         |\n| `--pg-include-tables`                 | `PG_INCLUDE_TABLES`                 |               | List of tables to include in sync. Comma-separated `schema.table`. May contain wildcards (`*`)                                                           |\n| `--pg-incrementally-refreshed-tables` | `PG_INCREMENTALLY_REFRESHED_TABLES` |               | List of tables to refresh incrementally, currently limited to INSERT/UPDATE-modified tables. Comma-separated `schema.table`. 
May contain wildcards (`*`) |\n| `--pg-schema-prefix`                  | `PG_SCHEMA_PREFIX`                  |               | Prefix for PostgreSQL schema names                                                                                                                       |\n| `--pg-preserve-unsynced`              | `PG_PRESERVE_UNSYNCED`              | `false`       | Do not delete the existing tables in BemiDB that are not part of the sync                                                                                |\n\n#### `start` command options\n\n| CLI argument  | Environment variable | Default value | Description                            |\n|---------------|----------------------|---------------|----------------------------------------|\n| `--host`      | `BEMIDB_HOST`        | `127.0.0.1`   | Host for BemiDB to listen on           |\n| `--port`      | `BEMIDB_PORT`        | `54321`       | Port for BemiDB to listen on           |\n| `--database`  | `BEMIDB_DATABASE`    | `bemidb`      | Database name                          |\n| `--user`      | `BEMIDB_USER`        |               | Database user. Allows any if empty     |\n| `--password`  | `BEMIDB_PASSWORD`    |               | Database password. 
Allows any if empty |\n\n#### Storage options\n\n| CLI argument              | Environment variable    | Default value                   | Description                                                                                              |\n|---------------------------|-------------------------|---------------------------------|----------------------------------------------------------------------------------------------------------|\n| `--storage-type`          | `BEMIDB_STORAGE_TYPE`   | `LOCAL`                         | Storage type: `LOCAL` or `S3`                                                                            |\n| `--storage-path`          | `BEMIDB_STORAGE_PATH`   | `iceberg`                       | Path to the storage folder                                                                               |\n| `--aws-s3-endpoint`       | `AWS_S3_ENDPOINT`       | `s3.amazonaws.com`              | AWS S3 endpoint                                                                                          |\n| `--aws-region`            | `AWS_REGION`            | Required with `S3` storage type | AWS region                                                                                               |\n| `--aws-s3-bucket`         | `AWS_S3_BUCKET`         | Required with `S3` storage type | AWS S3 bucket name                                                                                       |\n| `--aws-access-key-id`     | `AWS_ACCESS_KEY_ID`     |                                 | AWS access key ID. If empty, tries to fetch AWS SDK credentials in this order: config file, STS, SSO     |\n| `--aws-secret-access-key` | `AWS_SECRET_ACCESS_KEY` |                                 | AWS secret access key. 
If empty, tries to fetch AWS SDK credentials in this order: config file, STS, SSO |\n\n#### Other options\n\n| CLI argument                   | Environment variable                 | Default value | Description                                                             |\n|--------------------------------|--------------------------------------|---------------|-------------------------------------------------------------------------|\n| `--log-level`                  | `BEMIDB_LOG_LEVEL`                   | `INFO`        | Log level: `ERROR`, `WARN`, `INFO`, `DEBUG`, `TRACE`                    |\n| `--disable-anonymous-analytics`| `BEMIDB_DISABLE_ANONYMOUS_ANALYTICS` | `false`       | Disable collection of anonymous usage metadata (OS type, database host) |\n\nNote: CLI arguments take precedence over environment variables. I.e., you can override the environment variables with CLI arguments.\n\n## Architecture\n\nBemiDB consists of the following main components:\n\n- **Database Server**: implements the [Postgres protocol](https://www.postgresql.org/docs/current/protocol.html) to enable Postgres compatibility.\n- **Query Engine**: embeds the [DuckDB](https://duckdb.org/) query engine to run analytical queries.\n- **Storage Layer**: uses the [Iceberg](https://iceberg.apache.org/) table format to store data in columnar compressed Parquet files.\n- **Postgres Connector**: connects to a Postgres database to sync tables' schema and data.\n\n\u003cimg src=\"/img/architecture.png\" alt=\"Architecture\" width=\"720px\"\u003e\n\n## Benchmark\n\nBemiDB is optimized for analytical workloads and can run complex queries up to 2000x faster than Postgres.\n\nOn the TPC-H benchmark with 22 sequential queries, BemiDB outperforms Postgres by a significant margin:\n\n* Scale factor: 0.1\n  * BemiDB unindexed: 2.3s 👍\n  * Postgres unindexed: 1h23m13s 👎 (2,170x slower)\n  * Postgres indexed: 1.5s 👍 (99.97% bottleneck reduction)\n* Scale factor: 1.0\n  * BemiDB unindexed: 25.6s 👍\n  * 
Postgres unindexed: ∞ 👎 (infinitely slower)\n  * Postgres indexed: 1h34m40s 👎 (220x slower)\n\nSee the [benchmark](/benchmark) directory for more details.\n\n## Data type mapping\n\nPrimitive data types are mapped as follows:\n\n| PostgreSQL                                                  | Parquet                                           | Iceberg                          |\n|-------------------------------------------------------------|---------------------------------------------------|----------------------------------|\n| `bool`                                                      | `BOOLEAN`                                         | `boolean`                        |\n| `varchar`, `text`, `bpchar`, `bit`                          | `BYTE_ARRAY` (`UTF8`)                             | `string`                         |\n| `int2`, `int4`                                              | `INT32`                                           | `int`                            |\n| `int8`                                                      | `INT64`                                           | `long`                           |\n| `xid`                                                       | `INT32` (`UINT_32`)                               | `int`                            |\n| `xid8`                                                      | `INT64` (`UINT_64`)                               | `long`                           |\n| `float4`, `float8`                                          | `FLOAT`                                           | `float`                          |\n| `numeric`                                                   | `FIXED_LEN_BYTE_ARRAY` (`DECIMAL`)                | `decimal(P, S)`                  |\n| `date`                                                      | `INT32` (`DATE`)                                  | `date`                           |\n| `time`, `timetz`                                            | `INT64` (`TIME_MICROS` / 
`TIME_MILLIS`)           | `time`                           |\n| `timestamp`                                                 | `INT64` (`TIMESTAMP_MICROS` / `TIMESTAMP_MILLIS`) | `timestamp` / `timestamp_ns`     |\n| `timestamptz`                                               | `INT64` (`TIMESTAMP_MICROS` / `TIMESTAMP_MILLIS`) | `timestamptz` / `timestamptz_ns` |\n| `uuid`                                                      | `BYTE_ARRAY` (`UTF8`)                             | `uuid`                           |\n| `bytea`                                                     | `BYTE_ARRAY` (`UTF8`)                             | `binary`                         |\n| `interval`                                                  | `BYTE_ARRAY` (`UTF8`)                             | `string`                         |\n| `point`, `line`, `lseg`, `box`, `path`, `polygon`, `circle` | `BYTE_ARRAY` (`UTF8`)                             | `string`                         |\n| `cidr`, `inet`, `macaddr`, `macaddr8`                       | `BYTE_ARRAY` (`UTF8`)                             | `string`                         |\n| `tsvector`, `xml`, `pg_snapshot`                            | `BYTE_ARRAY` (`UTF8`)                             | `string`                         |\n| `json`, `jsonb`                                             | `BYTE_ARRAY` (`UTF8`)                             | `string` (JSON logical type)     |\n| `_*` (array)                                                | `LIST` `*`                                        | `list`                           |\n| `*` (user-defined type)                                     | `BYTE_ARRAY` (`UTF8`)                             | `string`                         |\n\nNote that Postgres `json` and `jsonb` types are implemented as JSON logical types and stored as strings (Parquet and Iceberg don't support unstructured data types).\nYou can query JSON columns using standard operators, for example:\n\n```sql\nSELECT * FROM 
[TABLE] WHERE [JSON_COLUMN]-\u003e\u003e'[JSON_KEY]' = '[JSON_VALUE]';\n```\n\n## Alternatives\n\n#### BemiDB vs PostgreSQL\n\nPostgreSQL pros:\n\n- It is the most loved general-purpose transactional (OLTP) database 💛\n- Capable of running analytical queries at small scale\n\nPostgreSQL cons:\n\n- Slow for analytical (OLAP) queries on medium and large datasets\n- Requires creating indexes for specific analytical queries, which impacts the \"write\" performance for transactional queries\n- Materialized views as a \"cache\" require manual maintenance and become increasingly slow to refresh as the data grows\n- Further tuning may not be possible if executing various ad-hoc analytical queries\n\n#### BemiDB vs PostgreSQL extensions\n\nPostgreSQL extensions pros:\n\n- There is a wide range of extensions available in the PostgreSQL ecosystem\n- Open-source community driven\n\nPostgreSQL extensions cons:\n\n- Performance overhead when running analytical queries affecting transactional queries\n- Limited support for installable extensions in managed PostgreSQL services (for example, AWS Aurora [allowlist](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQLReleaseNotes/AuroraPostgreSQL.Extensions.html#AuroraPostgreSQL.Extensions.16))\n- Increased PostgreSQL maintenance complexity when upgrading versions\n- Require manual data syncing and schema mapping if data is stored in a different format\n\nMain types of extensions for analytics:\n\n- Foreign data wrapper extensions (parquet_fdw, parquet_s3_fdw, etc.)\n  - Pros: allow querying external data sources like columnar Parquet files directly from PostgreSQL\n  - Cons: use query engines that are not optimized for analytics\n- OLAP query engine extensions (pg_duckdb, pg_analytics, etc.)\n  - Pros: integrate an analytical query engine directly into PostgreSQL\n  - Cons: cumbersome to use (creating foreign tables, calling custom functions), data layer is not integrated and optimized\n\n#### BemiDB vs DuckDB\n\nDuckDB pros:\n\n- 
Designed for OLAP use cases\n- Easy to run with a single binary\n\nDuckDB cons:\n\n- Limited support in the data ecosystem like notebooks, BI tools, etc.\n- Requires manual data syncing and schema mapping for best performance\n- Limited features compared to a full-fledged database: no support for writing into Iceberg tables, reading from Iceberg according to the spec, etc.\n\n#### BemiDB vs real-time OLAP databases (ClickHouse, Druid, etc.)\n\nReal-time OLAP databases pros:\n\n- High performance, optimized for real-time analytics\n\nReal-time OLAP databases cons:\n\n- Require expertise to set up and manage distributed systems\n- Limitations on data mutability\n- Steeper learning curve\n- Require manual data syncing and schema mapping\n\n#### BemiDB vs big data query engines (Spark, Trino, etc.)\n\nBig data query engines pros:\n\n- Distributed SQL query engines for big data analytics\n\nBig data query engines cons:\n\n- Complex to set up and manage a distributed query engine (ZooKeeper, JVM, etc.)\n- Don't have a storage layer themselves\n- Require manual data syncing and schema mapping\n\n#### BemiDB vs proprietary solutions (Snowflake, Redshift, BigQuery, Databricks, etc.)\n\nProprietary solutions pros:\n\n- Fully managed cloud data warehouses and lakehouses optimized for OLAP\n\nProprietary solutions cons:\n\n- Can be expensive compared to other alternatives\n- Vendor lock-in and limited control over the data\n- Require separate systems for data syncing and schema mapping\n\n---\n\nFor a more detailed comparison of different approaches to running analytics with PostgreSQL, check out our [blog post](https://blog.bemi.io/analytics-with-postgresql/).\n\n## Development\n\nWe develop BemiDB using [Devbox](https://www.jetify.com/devbox) to ensure a consistent development environment without relying on Docker.\n\nTo start developing BemiDB and run tests, follow these steps:\n\n```sh\ncp .env.sample .env\nmake install\nmake test\n```\n\nTo run BemiDB locally, use the 
following command:\n\n```sh\nmake up\n```\n\nTo sync data from a Postgres database, use the following command:\n\n```sh\nmake sync\n```\n\n## License\n\nDistributed under the terms of the [AGPL-3.0 License](/LICENSE). If you need to modify and distribute the code, please release it to contribute back to the open-source community.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBemiHQ%2FBemiDB","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBemiHQ%2FBemiDB","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBemiHQ%2FBemiDB/lists"}