{"id":17570834,"url":"https://github.com/jitsucom/bulker","last_synced_at":"2025-03-07T22:30:25.172Z","repository":{"id":65470972,"uuid":"506408470","full_name":"jitsucom/bulker","owner":"jitsucom","description":"Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)","archived":false,"fork":false,"pushed_at":"2024-04-16T18:43:51.000Z","size":3755,"stargazers_count":115,"open_issues_count":2,"forks_count":11,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-04-17T05:54:42.400Z","etag":null,"topics":["data-engineering","datawarehouse","etl","etl-pipeline","ingestion","pipeline"],"latest_commit_sha":null,"homepage":"https://github.com/jitsucom/bulker","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jitsucom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-06-22T21:16:59.000Z","updated_at":"2024-04-18T10:44:21.340Z","dependencies_parsed_at":"2023-10-16T18:40:05.465Z","dependency_job_id":"b065fa0c-34bb-48ce-b0ba-86259974480c","html_url":"https://github.com/jitsucom/bulker","commit_stats":null,"previous_names":[],"tags_count":63,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jitsucom%2Fbulker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jitsucom%2Fbulker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jitsucom%2Fbulker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jits
ucom%2Fbulker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jitsucom","download_url":"https://codeload.github.com/jitsucom/bulker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242472676,"owners_count":20134006,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","datawarehouse","etl","etl-pipeline","ingestion","pipeline"],"created_at":"2024-10-21T18:01:18.254Z","updated_at":"2025-03-07T22:30:24.638Z","avatar_url":"https://github.com/jitsucom.png","language":"Go","funding_links":[],"categories":["Integrations"],"sub_categories":["ETL and Data Processing"],"readme":"# 🚚 Bulker\n\nBulker is a tool for streaming and batching large amounts of semi-structured data into data warehouses. It uses Kafka internally.\n\n## How it works?\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"./.docs/assets/bulker-summary.excalidraw.png\" width=\"600\" /\u003e\n\u003c/p\u003e\n\nSend a JSON object to the Bulker HTTP endpoint, and Bulker will make sure it is saved to the data warehouse:\n \n * **JSON flattening**. Your object will be flattened - `{a: {b: 1}}` becomes `{a_b: 1}`\n * **Schema management** for **semi-structured** data. For each field, Bulker will make sure that a corresponding column exists in the destination table. If not, Bulker\nwill create it. The type is inferred from the value, or it can be set explicitly via a type hint, as in `{\"a\": \"test\", \"__sql_type_a\": \"varchar(4)\"}`\n * **Reliability**. 
Bulker puts the object into a Kafka queue immediately, so if the data warehouse is down, data won't be lost\n * **Streaming** or **Batching**. Bulker will send data to the data warehouse either as soon as it becomes available in Kafka (streaming) or after some time (batching). Most\ndata warehouses won't tolerate a large number of inserts, which is why we implemented batching\n\n\nBulker is a 💜 of [Jitsu](https://github.com/jitsucom/jitsu), an open-source data integration platform.\n\nSee the full list of features below.\n\n\nBulker is also available as a Go library if you want to embed it into your application instead of running an HTTP server\n\n## Features\n\n* 🛢️ **Batching** - Bulker sends data in batches in the most efficient way for each database. For example, for Postgres it uses the \nCOPY command, and for BigQuery it uses batch files\n* 🚿 **Streaming** - alternatively, Bulker can stream data to the database. This is useful when the number of records is low: up to 10 records\nper second for most databases\n* 🐫 **Deduplication** - if configured, Bulker will deduplicate records by primary key \n* 📋 **Schema management** - Bulker creates tables and columns on the fly. It also flattens nested JSON objects. For example, if you send `{\"a\": {\"b\": 1}}` to \nBulker, it will make sure that there is a column `a_b` in the table (and will create it)\n* 🦾 **Implicit typing** - Bulker infers types of columns from JSON data.\n* 📌 **Explicit typing** - Explicit types can be set by type hints that are placed in the JSON. Example: for event `{\"a\": \"test\", \"__sql_type_a\": \"varchar(4)\"}`\nBulker will make sure that there is a column `a`, and its type is `varchar(4)`.\n* 📈 **Horizontal Scaling**. Bulker scales horizontally. Too much data? No problem, just add Bulker instances!\n* 📦 **Dockerized** - Bulker is dockerized and can be deployed to any cloud provider and k8s. \n* ☁️ **Cloud Native** - each Bulker instance is stateless and is configured by only a few environment variables. 
\n\n## Supported databases\n\nBulker supports the following databases:\n\n * ✅ PostgreSQL \u003cbr/\u003e\n * ✅ Redshift \u003cbr/\u003e\n * ✅ Snowflake \u003cbr/\u003e\n * ✅ ClickHouse \u003cbr/\u003e\n * ✅ BigQuery \u003cbr/\u003e\n * ✅ MySQL \u003cbr/\u003e\n * ✅ S3 \u003cbr/\u003e\n * ✅ GCS \u003cbr/\u003e\n\nPlease see the [Compatibility Matrix](.docs/db-feature-matrix.md) to learn what Bulker features are supported by each database.\n\n\n## Documentation Links\n\n\u003e **Note**\n\u003e We highly recommend reading [Core Concepts](#core-concepts) below before diving into details\n\n* [How to use Bulker as HTTP Service](./.docs/server-config.md)\n  * [Server Configuration](./.docs/server-config.md)  \n  * [HTTP API](./.docs/http-api.md)\n* How to use Bulker as a Go library *(coming soon)*\n\n## Core Concepts\n\n### Destinations\n\nBulker operates with destinations. A destination is a database or\nstorage service (e.g. S3, GCS). Each destination has an ID and a configuration\nrepresented by a JSON object.\n\nBulker exposes an HTTP API to load data into destinations, where those\ndestinations are referenced by their IDs.\n\nIf the destination is a database, you'll need to provide a destination table name.\n\n### Event\n\nThe main unit of data in Bulker is an *event*. An event is represented as a JSON object.\n\n### Batching and Streaming (aka Destination Mode)\n\nBulker can send data to a database in two ways:\n * **Streaming**. Bulker sends events to the destination one by one. It is useful when the number of events is low (less than 10 events per second for most DBs).\n * **Batching**. Bulker accumulates events in batches and sends them once the batch is full or a timeout is reached. Batching is more efficient for large amounts of events, especially for cloud data warehouses \n(e.g. 
Postgres, ClickHouse, BigQuery).\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"./.docs/assets/stream-batch.excalidraw.png\" width=\"600\" /\u003e\n\u003c/p\u003e\n\n### Primary Keys and Deduplication\n\nOptionally, Bulker can deduplicate events by primary key. It is useful when the same event can be sent to Bulker multiple times.\nIf available, Bulker uses primary keys, but for some data warehouses alternative strategies are used.\n\n\u003e[Read more about deduplication »](./.docs/db-feature-matrix.md)\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjitsucom%2Fbulker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjitsucom%2Fbulker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjitsucom%2Fbulker/lists"}