{"id":19145402,"url":"https://github.com/paradigmxyz/sinker","last_synced_at":"2025-04-06T21:14:47.401Z","repository":{"id":171726740,"uuid":"602806661","full_name":"paradigmxyz/sinker","owner":"paradigmxyz","description":"Synchronize Postgres to Elasticsearch","archived":false,"fork":false,"pushed_at":"2025-02-03T16:21:34.000Z","size":239,"stargazers_count":63,"open_issues_count":0,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-13T07:43:49.678Z","etag":null,"topics":["cdc","elasticsearch","postgres"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/paradigmxyz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-17T01:20:14.000Z","updated_at":"2025-02-03T16:21:34.000Z","dependencies_parsed_at":"2024-01-02T17:25:26.013Z","dependency_job_id":"97c7e9c9-7e30-4b75-90da-05f1893ffdf1","html_url":"https://github.com/paradigmxyz/sinker","commit_stats":null,"previous_names":["paradigmxyz/sinker"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Fsinker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Fsinker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Fsinker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Fsinker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/parad
igmxyz","download_url":"https://codeload.github.com/paradigmxyz/sinker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247550689,"owners_count":20956987,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cdc","elasticsearch","postgres"],"created_at":"2024-11-09T07:40:04.495Z","updated_at":"2025-04-06T21:14:47.370Z","avatar_url":"https://github.com/paradigmxyz.png","language":"Python","readme":"# Sinker: Synchronize Postgres to Elasticsearch\n\nWhat are you [sinking about](https://www.youtube.com/watch?v=yR0lWICH3rY)?\n\n[![](https://img.shields.io/pypi/v/sinker.svg?maxAge=3600)](https://pypi.org/project/sinker/)\n[![ci](https://github.com/paradigmxyz/sinker/actions/workflows/test.yml/badge.svg)](https://github.com/paradigmxyz/sinker/actions/workflows/test.yml)\n[![codecov](https://codecov.io/gh/paradigmxyz/sinker/branch/main/graph/badge.svg?token=AIGMBZR0IG)](https://codecov.io/gh/paradigmxyz/sinker)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n## What is Sinker?\n\nSinker is middleware that synchronizes relational data from a Postgres database to Elasticsearch.\nIt is simple to operate, requires minimal RAM, and handles arbitrarily complex schemas.\nSinker is built by [Paradigm](https://www.paradigm.xyz/), and is licensed under the Apache and MIT licenses.\n\n### For Example\n\nIn Postgres, you might have a normalized schema like 
### For Example

In Postgres, you might have a normalized schema like this:

- A Student and a Teacher refer to a Person
- A Course is taught by a Teacher
- Students have and belong to many Courses through the Enrollment join table

![schema](sinker_schema.png)

In Elasticsearch, you might want to index the Course data in an index called `courses` like this:

```json
{
  "name": "Reth",
  "description": "How to build a modern Ethereum node",
  "teacher": {
    "salary": 100000.0,
    "person": {
      "name": "Prof Georgios"
    }
  },
  "enrollments": [
    {
      "grade": 3.14,
      "student": {
        "gpa": 3.99,
        "person": {
          "name": "Loren"
        }
      }
    },
    {
      "grade": 3.50,
      "student": {
        "gpa": 4.00,
        "person": {
          "name": "Abigail"
        }
      }
    }
  ]
}
```

Now you can easily query Elasticsearch for courses taught by Prof Georgios, or students with high GPAs named Loren.
To do this, you need to do two things reliably:

1. Denormalize the normalized data from the five Postgres tables into a single Elasticsearch document with the Course as
   the parent and the other four tables nested appropriately inside it.
2. Keep the Elasticsearch document in sync with the Postgres data, so that if Abigail changes her name in the database
   to Abby, it's reflected in the `Course->Enrollments->Student->Person.name` field.

## How it Works

![diagram](paradigm_sinker.png)

Sinker transforms the normalized Postgres data into JSON documents stored in a simple key-value materialized view where
the key is the Elasticsearch document ID and the value is the JSON document to be stored in Elasticsearch.

Sinker creates triggers on the Postgres tables that you want to synchronize (e.g., the five tables in the example
above). When a row is inserted, updated, or deleted in any of these tables, the trigger schedules the materialized view
to be refreshed at the next interval.

The changes to the materialized view are sent to a logical replication slot. Sinker reads from this slot and indexes the
documents in Elasticsearch.

You define the query behind the materialized view, so you can denormalize the data however you want, filter out unwanted
documents, transform some fields, etc. If you can express it in SQL, you can build your materialized view around it.

You also configure the Elasticsearch index settings and mappings however you like.
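The cycle described above (refresh the materialized view, drain the replication slot, bulk-index the changes) can be sketched in a few lines of Python. This is a minimal illustration; `refresh_view`, `read_slot_changes`, and `bulk_index` are hypothetical stand-ins, not Sinker's actual API:

```python
def sync_cycle(refresh_view, read_slot_changes, bulk_index):
    """One poll interval: refresh the materialized view, drain the
    logical replication slot, and bulk-index the changed documents."""
    refresh_view()                 # e.g., REFRESH MATERIALIZED VIEW CONCURRENTLY ...
    changes = read_slot_changes()  # (doc_id, json_doc) pairs read from the slot
    if changes:
        bulk_index(changes)        # a single Elasticsearch _bulk request
    return len(changes)
```

In the running service this would repeat in a loop, waiting `SINKER_POLL_INTERVAL` seconds between iterations.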
## Installation

```shell
pip install sinker
```

### Environment Variables

Here are some of the environment variables that you'll want to set:

| Environment Variable    | Value     |
|-------------------------|-----------|
| SINKER_DEFINITIONS_PATH | .         |
| SINKER_SCHEMA           | public    |
| SINKER_POLL_INTERVAL    | 10        |
| SINKER_LOG_LEVEL        | DEBUG     |
| PGPASSWORD              | secret!   |
| PGHOST                  | localhost |
| PGUSER                  | dev       |
| PGDATABASE              | dev_db    |
| ELASTICSEARCH_HOST      | localhost |
| ELASTICSEARCH_SCHEME    | http      |

See sinker/settings.py for the full list.

## Configuration

Sinker's main configuration file `views_to_indices.json` specifies the mapping between the root Postgres materialized
view names and the Elasticsearch indexes that will get populated by them, e.g.:

```json
{
  "person_mv": "people",
  "course_mv": "courses"
}
```

This tells Sinker to define a Postgres materialized view called `person_mv` based on the query in the `person_mv.sql`
file and an Elasticsearch index called `people` based on the settings and mappings in the `people.json` file. It will
then populate the `people` index with the documents from the `person_mv` materialized view. It will then do the same for
the `course_mv` materialized view and the `courses` index.
### Materialized View Configuration

The `person_mv` materialized view is defined by the SQL in the `person_mv.sql` file, e.g.:

```sql
select id, json_build_object('name', "name") as "person"
from "person"
```

The `course_mv` SQL is more complex, but you can see how it denormalizes the data from the five tables into a single
JSON document:

```sql
select id,
       json_build_object(
               'name',
               "name",
               'description',
               "description",
               'teacher',
               (select json_build_object(
                               'salary',
                               "salary",
                               'person',
                               (select json_build_object('name', "name")
                                from person
                                where person.id = person_id)
                           )
                from teacher
                where teacher.id = teacher_id),
               'enrollments',
               (select json_agg(
                               json_build_object(
                                       'grade',
                                       "grade",
                                       'student',
                                       (select json_build_object(
                                                       'gpa',
                                                       "gpa",
                                                       'person',
                                                       (select json_build_object('name', "name")
                                                        from person
                                                        where person.id = person_id)
                                                   )
                                        from student
                                        where student.id = student_id)
                                   )
                           )
                from enrollment
                where enrollment.course_id = course.id)
           ) as "course"
from "course";
```
### Index Configuration

The Elasticsearch index configurations are stored in the `people.json` and `courses.json` files, e.g.:

```json
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  }
}
```

Using `strict` mappings helps ensure the JSON document structure from the materialized view matches what Elasticsearch
expects in the index.

```json
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "description": {
        "type": "text"
      },
      "enrollments": {
        "properties": {
          "grade": {
            "type": "float"
          },
          "student": {
            "properties": {
              "gpa": {
                "type": "float"
              },
              "person": {
                "properties": {
                  "name": {
                    "type": "text"
                  }
                }
              }
            }
          }
        }
      },
      "name": {
        "type": "text"
      },
      "teacher": {
        "properties": {
          "person": {
            "properties": {
              "name": {
                "type": "text"
              }
            }
          },
          "salary": {
            "type": "float"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  }
}
```
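For intuition, here is a rough Python approximation of the check that a `strict` dynamic mapping performs: any field in the document that the mapping's `properties` does not declare is rejected. This is a simplification for illustration, not how Elasticsearch actually implements it:

```python
def find_unmapped(doc, properties, path=""):
    """Return dotted paths of fields in `doc` that `properties` does not
    declare -- the fields a strict index would reject."""
    unmapped = []
    items = doc if isinstance(doc, list) else [doc]
    for item in items:
        for key, value in item.items():
            here = f"{path}.{key}" if path else key
            if key not in properties:
                unmapped.append(here)
            elif isinstance(value, (dict, list)):
                sub = properties[key].get("properties", {})
                unmapped.extend(find_unmapped(value, sub, here))
    return unmapped
```

Running this against a sample materialized-view row and your index config is a quick way to catch mapping drift before indexing.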
### Dry Run

Before running sinker, do a dry run on one of your view-index mappings to verify everything is defined correctly:

1. Execute the SQL query with `LIMIT 1` in the materialized view definition file to see what a sample JSON record will
   look like.
2. Create the destination Elasticsearch index with the settings and mappings in the index configuration file.
3. In Kibana or via cURL, `PUT` the sample JSON record into the Elasticsearch index.
4. Verify the record is indexed as you expect in Elasticsearch.

## Running

Once you have the environment variables and configuration files set up, you can run Sinker with:

```shell
sinker
```

### Performance

Once you have Sinker running, you may well want it to run faster. Here are some things you can do to improve
performance:

1. Decrease `SINKER_POLL_INTERVAL`. This will make Sinker refresh the materialized views more
   frequently (and thus keep Elasticsearch in closer sync), but it will also increase the load on the database. Note
   that the materialized views are only refreshed when one of the underlying tables has changed, so this won't increase
   the load on the database if there are no changes.
2. Increase the `PGCHUNK_SIZE`. This will make Sinker read more rows from the logical replication slot at a time, which
   will reduce the number of round trips to the database. However, it will also increase the memory usage of Sinker.
3. Increase the `ELASTICSEARCH_CHUNK_SIZE`. This will make Sinker index more documents in a single Elasticsearch
   bulk request, which will reduce the number of round trips to Elasticsearch. However, it will also increase the memory
   usage of Sinker and the CPU load on the Elasticsearch cluster.
4. Run `EXPLAIN ANALYZE` on your materialized view queries to see if you can optimize them (e.g., by adding indexes on
   the foreign keys).
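The two chunk-size knobs above are a plain batching trade-off: bigger batches mean fewer round trips but more memory held per batch. A generic sketch of the idea (not Sinker's implementation):

```python
def chunked(rows, chunk_size):
    """Group rows into batches so each network round trip carries up to
    `chunk_size` items: larger batches, fewer round trips, more memory."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```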
## Developing

Before getting started, make sure you have [`poetry`](https://python-poetry.org/docs/)
and [`docker-compose`](https://docs.docker.com/compose/install/) installed.

```shell
% docker-compose --version
Docker Compose version v2.31.0
% poetry --version
Poetry (version 1.8.4)
```

Clone the repo:

```shell
git clone git@github.com:paradigmxyz/sinker.git
cd sinker
```

Install dependencies:

```shell
poetry install
```

Spin up Postgres and Elasticsearch:

```shell
docker-compose --env-file=.env.test up -d
```

Run tests (with the -s option to allow more verbose output):

```shell
cp .env.test .env # copy the test environment variables
poetry run pytest -s
```

## Operations

### Docker

To run Sinker as a Docker container, you can use the `docker/` files in this repo as a starting point.
Bundle your views and indices configuration files into the Docker image, set up your environment variables, and deploy.

### Monitoring and Alerting

Setting up monitoring and alerting will give you confidence that Sinker is functioning properly.
For instance, you can periodically check that the row counts in Postgres match the expected document counts in
Elasticsearch.
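That row-count check can be sketched generically; `get_pg_count` and `get_es_count` are hypothetical stand-ins for whatever client calls you use (say, a `SELECT count(*)` on the materialized view and an Elasticsearch `_count` request):

```python
def check_sync(pairs, get_pg_count, get_es_count, tolerance=0):
    """Compare Postgres MV row counts with Elasticsearch doc counts for
    each (view, index) pair; return the pairs that have drifted apart."""
    drifted = []
    for view, index in pairs:
        pg, es = get_pg_count(view), get_es_count(index)
        if abs(pg - es) > tolerance:
            drifted.append((view, index, pg, es))
    return drifted
```

A small `tolerance` is useful in practice, since Sinker is eventually consistent between poll intervals.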
## Contributing

Contributions are welcome! Please open an issue or submit a pull request.

## Acknowledgements

Sinker was inspired by [pgsync](https://github.com/toluaina/pgsync) and [debezium](https://debezium.io/). Each project
takes a different approach to the problem, so check them out to see which one is best for you.