{"id":26984164,"url":"https://github.com/transferia/transferia","last_synced_at":"2025-04-03T17:35:08.829Z","repository":{"id":254285673,"uuid":"821729041","full_name":"transferia/transferia","owner":"transferia","description":"Open Source Cloud Native Ingestion engine","archived":false,"fork":false,"pushed_at":"2025-04-03T14:40:06.000Z","size":21319,"stargazers_count":104,"open_issues_count":36,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-03T15:41:02.784Z","etag":null,"topics":["bigdata","cdc","clickhouse","elt","go","golang","ingestion-platform","kafka","streaming"],"latest_commit_sha":null,"homepage":"https://transferia.github.io/transferia/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/transferia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap/index.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-29T09:12:50.000Z","updated_at":"2025-04-03T14:40:09.000Z","dependencies_parsed_at":"2024-11-11T12:24:21.130Z","dependency_job_id":"5016d5a8-2562-41b8-bf78-86675ca6857d","html_url":"https://github.com/transferia/transferia","commit_stats":null,"previous_names":["doublecloud/transfer","transferia/transferia"],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/transferia%2Ftransferia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/transferia%2Ftransferia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/transferia%2Ftransferia/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/transferia%2Ftransferia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/transferia","download_url":"https://codeload.github.com/transferia/transferia/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247048780,"owners_count":20875112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","cdc","clickhouse","elt","go","golang","ingestion-platform","kafka","streaming"],"created_at":"2025-04-03T17:35:08.056Z","updated_at":"2025-04-03T17:35:08.815Z","avatar_url":"https://github.com/transferia.png","language":"Go","funding_links":[],"categories":["Integrations"],"sub_categories":["ETL and Data Processing"],"readme":"\u003ch1 align=\"center\"\u003eTransferia: Cloud Native Ingestion engine\u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ch4 align=\"center\"\u003e\n  \u003ca href=\"https://transferia.github.io/transferia/\"\u003eTransferia\u003c/a\u003e  |\n  \u003ca href=\"https://transferia.github.io/transferia/docs/getting_started.html\"\u003eDocumentation\u003c/a\u003e  |\n  \u003ca href=\"https://transferia.github.io/transferia/docs/benchmarks.html\"\u003eBenchmarking\u003c/a\u003e  |\n  \u003ca href=\"https://transferia.github.io/transferia/docs/roadmap\"\u003eRoadmap\u003c/a\u003e\n\u003c/h4\u003e\n\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n## 🦫 Introduction\n\n\u003c/div\u003e\n\n**Transferia**, built in Go, is an open-source cloud native ingestion engine. 
Essentially we are building a no-code (or low-code) EL(T) service that can scale data pipelines from several megabytes of data to dozens of petabytes without hassle.\n\nTransferia provides a convenient way to transfer data between DBMSes, object stores, message brokers or anything that stores data.\nOur ultimate mission is to help you move data from any source to any destination with a fast, effective and easy-to-use tool.\n\n\u003cdiv align=\"center\"\u003e\n\n\n## 🚀 Try Transferia\n\n\u003c/div\u003e\n\n### 1. Using CLI\n\nBuild from sources:\n\n```shell\nmake build\n```\n\n![Made with VHS](https://vhs.charm.sh/vhs-3ETIytnxDtBmrgkcOX3ZBf.gif)\n\n\n### 2. Using docker container\n\n```shell\ndocker pull ghcr.io/transferia/transferia:dev\n```\n\n### 3. Deploy via helm-chart\n\nWe strongly believe in cloud-native technologies, and see **transferia** as a driving force for open-source data platforms built on top of clouds.\n\nDeploy as a helm-chart in your own k8s cluster:\n\n```bash\nhelm upgrade NAME_OF_TRANSFER \\\n  --namespace NAME_OF_NAMESPACE oci://ghcr.io/transferia/transferia-helm/transfer \\\n  --values PATH_TO_VALUES_FILE \\\n  --install\n```\n\nMore details [here](./docs/deploy_k8s.md). 
\n\n\u003cdiv align=\"center\"\u003e\n\n## 🚀 Getting Started\n\n\u003c/div\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eIngestion from OLTP\u003c/summary\u003e\n\n- [Replication MySQL Changes into Clickhouse](./examples/mysql2ch)\n- [Snapshot PostgreSQL Changes into Clickhouse](./examples/pg2ch)\n- [Replication MongoDB into Clickhouse with transformation](./docs/mongodb2ch)\n- [CDC From Postgres into YTSaurus](./examples/pg2yt)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eStreaming Ingestion\u003c/summary\u003e\n\n- [Kafka to Clickhouse](./examples/kafka2ch)\n- [Re-Map Kafka source to other Kafka Target](./examples/kafka2kafka)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eCDC Streaming into Kafka\u003c/summary\u003e\n\n- [MySQL CDC into Kafka](./examples/mysql2kafka)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eSemi-structured Ingestion\u003c/summary\u003e\n\n- [S3 with SQS ingestion to Clickhouse](./examples/s3sqs2ch/README.md)\n\n[//]: # (- [Parquet file to Clickhouse]\u0026#40;./examples/s32ch/parquet.md\u0026#41;)\n[//]: # (- [CSV file to Clickhouse]\u0026#40;./examples/s32ch/csv.md\u0026#41;)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAirbyte compatibility\u003c/summary\u003e\n\n- [Airbyte source](./examples/airbyte_adapter)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eTransformers\u003c/summary\u003e\n\n- [Rename table](./docs/transformers/rename.md)\n- [Hide column](./docs/transformers/hide.md)\n- [Mask column](./docs/transformers/mask.md)\n- [SQL Transformer](./docs/transformers/sql.md)\n- [Lambda Transformer](./docs/transformers/lambda.md)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eData parsers\u003c/summary\u003e\n\n- [How to Parse JSON](./docs/parser/json.md)\n- [How to Parse With Confluent SR](./docs/parser/confluent_sr.md)\n- [How to Parse 
Proto](./docs/parser/proto.md)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eScaling Snapshot\u003c/summary\u003e\n\n- [Vertical scaling](./docs/scale_vertical.md)\n- [Horizontal scaling](./docs/scale_horisontal.md)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eScaling Replication\u003c/summary\u003e\n\n- [Scaling Kafka streaming](./docs/scale_kafka_stream.md)\n- [Scaling Postgres CDC](./docs/scale_postgres_cdc.md)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ePerformance\u003c/summary\u003e\n\n- [Measuring benchmarks with clickbench](./docs/benchmarks.md)\n\n\u003c/details\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n## 🚀 Why Transferia\n\n\u003c/div\u003e\n\n- **Cloud-Native**: Single binary and cloud-native as heck, just drop it into your k8s cluster and be happy.\n\n- **High Performance**: Go-built, with cutting-edge, high-speed vectorized execution. 👉 [Bench](./docs/benchmarks.md).\n\n- **Data Simplification**: Streamlines data ingestion, no code needed. 
👉 [Data Loading](./docs/ingestion.md).\n\n- **Schema inference**: Automatically syncs not just data but also data schemas.\n\n- **Format Flexibility**: Supports multiple data formats and types, including JSON, CSV, Parquet, Proto, and more.\n\n- **ACID Transactions**: Ensures data integrity with atomic, consistent, isolated, and durable operations.\n\n- **Schemafull**: [Type system](./docs/typesystem.md) enabling schema-full data storage with flexible data modeling.\n\n- **Community-Driven**: Join a welcoming community for a user-friendly cloud analytics experience.\n\n\u003cdiv align=\"center\"\u003e\n\n## ⚡ Performance\n\n\n[Naive-s3-vs-airbyte](https://medium.com/@laskoviymishka/transfer-s3-connector-vs-airbyte-s3-connector-360a0da084ae)\n\n\u003c/div\u003e\n\n![Naive-s3-vs-airbyte](./docs/_assets/bench_s3_vs_airbyte.png)\n\n\u003cdiv align=\"center\"\u003e\n\n## 📐 Architecture\n\n\n\u003cimg src=\"./docs/_assets/architecture.png\" alt=\"transfer\" /\u003e\n\n\u003c/div\u003e\n\nA Transferia plugin is a pluggable Go package that is included in the transferia binary and registers itself with it. A plugin can be one of:\n\n1. [Storage](./pkg/abstract/storage.go) - one-time data reader\n2. [Sink](./pkg/abstract/async_sink.go) - data writer\n3. [Source](./pkg/abstract/source.go) - streaming data reader\n4. [Transformer](./pkg/transformer/README.md) - something that makes row-level changes\n\nA data pipeline is composed of two **Endpoint**-s: **Source** and **Destination**.\nEach data pipeline is essentially a link between a **Source** {`Storage`|`Source`} and a **Destination** {`Sink`}.\n**Transferia** is a **LOGICAL** data transfer service. The minimum unit of data is a logical **ROW** (object). Between **source** and **target** we communicate via **ChangeItem**-s.\nThose items are batched, and we may apply stateless **Transformations** to them.\nOverall, this pipeline is called a **Transfer**.\n\nWe can compose our primitives to create the following types of connection:\n\n1. 
{`Storage`} + {`Sink`} = `Snapshot`\n2. {`Source`} + {`Sink`} = `Replication`\n3. {`Storage`} + {`Source`} + {`Sink`} = `Snapshot and Replication`\n\nSnapshot and replication are conceptually different and have different requirements for specific storages.\nSnapshot and Replication threads can follow each other.\nEvent channels are conceptually unaware of the base types they bind.\nWe mainly build cross-system data connections (we call them **Hetero** replications), so we do not add any fine-grained adjustments for them (type fitting or schema adjustment).\nBut for connections between storages of the same type, to improve accuracy, the system can tell `Source`|`Storage`|`Sinks` that they are homogeneous (simply, a **Homo** replication) so that they can make some adjustments and fine-tuning.\nApart from this, cross db-type connections should **NOT** know what type of storage is on the other side.\n\n## Storage / SnapshotProvider\n\nA primitive for reading data in large blocks. It produces a finite stream of events of a single type: the insertion of a row. It can give different levels of read consistency guarantees, depending on the depth of integration into a particular database.\n\n![snapshot image](./assets/transferring-data-1.png)\n\n### ROW level Guarantee\n\nAt the most primitive storage level, it is enough to implement reading all logical rows from the source. In this case, the unit of consistency is the row itself. For example, if one row is one file on disk, then reading the directory gives a guarantee of consistency within one specific file.\n\n\n### Table level Guarantee\n\nRows are logically grouped into groups of homogeneous rows, usually tables. If the source is able to read a consistent snapshot of the rows of one table, then we can guarantee that the data is consistent at the entire table level. 
From the point of view of the contract, consistency at the table / row level is indistinguishable for us.\n\n### Whole Storage\n\nIt can be arranged if we can take a consistent snapshot and reuse it to read several tables (for example, reading in one transaction sequentially or having a transaction pool with one database state).\n\n### Point of replication (Replication Slot)\n\nIf the source can atomically take a snapshot / snapshot mark for reading and a mark for future replication, we can implement a consistent transition between the snapshot and the replica.\n\n### Summary\n\nFrom a contractual point of view, consistency at the table/row level is **indistinguishable** for us. We have no clear signal that tells us with what level of assurance we have read the data from the source.\n\n## Source / ReplicationProvider\n\nA streaming primitive. An endless stream of row-by-row CRUD events. In logical replication, **conceptually** there are only 3 types of events - create / edit / delete. For editing and deleting, we need to somehow identify the object with which we operate, so to support such events, we expect the source itself to be able to give them.\n\n![tx-bounds](./assets/transferring-data-3.png)\n\nFor some storages such events can be grouped into transactions.\n\n![replication-lag](./assets/transferring-data-4.png)\n\nOnce we start the replication process, we apply this stream of actions to the target and try to minimize the data lag between the source database and the target.\n\nAt the replication source level, we maintain different levels of consistency:\n\n### Row\n\nThis is the most basic mechanism: if the source does not link rows to each other, then there is a guarantee only at the row level. For example, with MongoDB in FullDocument mode, each event in the source is one row living in its own timeline. 
Events with this level of assurance do not have a transaction tag, and either lack a source logical timestamp (LSN) **or** are not in a strict order.\n\n### Table\n\nIf the rows begin to live in a single timeline, we can give consistency at the table level: applying the entire stream of events in the same order as we received them gives us a consistent slice of the table **Eventually**. Events with this level of guarantee do not have a transaction stamp in them, but contain a source logical timestamp (LSN) **and** a strict order.\n\n### Transaction\n\nIf the rows live in a single timeline and are attributed with transaction labels, as well as linearized in the transaction log (that is, there is a guarantee that all changes in one transaction are continuous and the transactions themselves are logically ordered) - we can give consistency at the table and transaction levels. Applying the entire stream of events in the same order with the same (or larger) batches of transactions, we will get a consistent slice of the table from the source at **any** moment in time.\n\n## Sink / Target\n\nEach of our Targets is a simple thing that can consume a stream of events; at its level, the target can both support source guarantees and weaken them.\n\n### Primitive\n\nAt the most basic level, the target simply writes everything that comes in (classic examples are a queue, fs, or s3). At this level we do not guarantee anything other than the very fact of writing everything that comes in (while the records may be duplicated).\n\n### Unique Key deduplication\n\nThe Target can de-duplicate the row by the primary key, in which case we give an additional guarantee - there will be no key duplicates in the target.\n\n### Logical clock deduplication\n\nIf the Target can write to 2 tables in a single transaction, we can transactionally store the source logical timestamp in a separate table and discard already written rows. 
In this case, there will be no duplicates in the targets, including in lines without keys.\n\n### Transaction boundaries\n\nIf the receiver can hold transactions for an arbitrarily long time and apply transactions of an arbitrary size, we can implement saving transaction boundaries on writes. In this case, the sink will receive rows in the same or larger transactions, which will give an exact cut of the source at **any** point in time.\n\n## Summary\n\nFor maximum guarantees (exact slice of the source at **any** point in time) both the source and the destination should give maximum guarantee between themselves.\n\nFor current storages, we have approximately the following matrix:\n\n| Storage Type | S/Row |S/Table|S/DB|S/Slot|R/Row|R/Table|R/TX|T/Rows|T/Keys|T/LSN|T/TX|\n|:-------------|:------|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|\n| PG           | \\+    |\\+|\\+|\\+|\\+|\\+|\\+|\\+|\\+|\\+|\\+|\n| Mysql        | \\+    |\\+|\\+||\\+|\\+|\\+|\\+|\\+|\\+|\\+|\n| Mongodb      | \\+    ||||\\+|||\\+|\\+|||\n| Clickhouse   | \\+    |||||||\\+||\\+||\n| Greenplum    | \\+    |\\+|\\+|||||\\+|\\+|\\+|\\+|\n| YDB          | \\+    |\\+||||||\\+|\\+|||\n| YT           | \\+    |\\+||||||\\+|\\+|\\+||\n| Airbyte      | \\+    |\\+/-||||\\+/-||\\+|\\+/-|||\n| Kafka        | \\+    ||||\\+|||\\+||||\n| EventHub     | \\+    ||||\\+|||\\+||||\n| LogBroker    | \\+    ||||\\+|||\\+||\\+||\n\n\n\u003cdiv align=\"center\"\u003e\n\n## 🤝 Contributing\n\n\u003c/div\u003e\n\nTransferia thrives on community contributions! Whether it's through ideas, code, or documentation, every effort helps in enhancing our project. 
As a token of our appreciation, once your code is merged, your name will be eternally preserved in the **system.contributors** table.\n\nHere are some resources to help you get started:\n\n- [Building From Source](./docs/getting_started.md)\n- [Contributing Guidelines](./docs/transfer-faq.md)\n\n\u003cdiv align=\"center\"\u003e\n\n## 👥 Community\n\n\u003c/div\u003e\n\nFor guidance on using Transferia, we recommend starting with the official documentation. If you need further assistance, explore the following community channels:\n\n- [GitHub](https://github.com/transferia/transferia) (Feature/Bug reports, Contributions)\n- [Telegram](https://t.me/andrei_tserakhau) (Get the news fast)\n\n\u003cdiv align=\"center\"\u003e\n\n## 🛣️ Roadmap\n\n\u003c/div\u003e\n\nStay updated with Transferia's development journey. Here are our roadmap milestones:\n\n- [Roadmap 2024](./docs/roadmap/roadmap_2024.md)\n- [Roadmap 2025](./docs/roadmap/roadmap_2025.md)\n\n\u003cdiv align=\"center\"\u003e\n\n## 📜 License\n\n\u003c/div\u003e\n\nTransferia is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).\n\nFor more information, see the [LICENSE](./LICENSE) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftransferia%2Ftransferia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftransferia%2Ftransferia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftransferia%2Ftransferia/lists"}