{"id":13574374,"url":"https://github.com/spotify/dbeam","last_synced_at":"2025-05-16T18:05:04.034Z","repository":{"id":27068343,"uuid":"110117973","full_name":"spotify/dbeam","owner":"spotify","description":"DBeam exports SQL tables into Avro files using JDBC and Apache Beam","archived":false,"fork":false,"pushed_at":"2025-05-08T01:04:43.000Z","size":1246,"stargazers_count":196,"open_issues_count":18,"forks_count":57,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-05-11T10:57:00.670Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spotify.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-11-09T13:21:09.000Z","updated_at":"2025-04-23T09:23:18.000Z","dependencies_parsed_at":"2024-04-10T09:38:08.427Z","dependency_job_id":"8aecf947-8bfa-42cf-b526-64d7b656f798","html_url":"https://github.com/spotify/dbeam","commit_stats":{"total_commits":858,"total_committers":23,"mean_commits":37.30434782608695,"dds":0.4533799533799534,"last_synced_commit":"9c72eac683680aeab7a8ab86138d4c44cb2524a1"},"previous_names":[],"tags_count":79,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fdbeam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fdbeam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fdbeam/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/spotify%2Fdbeam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spotify","download_url":"https://codeload.github.com/spotify/dbeam/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254582902,"owners_count":22095518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T15:00:51.010Z","updated_at":"2025-05-16T18:05:03.956Z","avatar_url":"https://github.com/spotify.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"\nDBeam\n=======\n\n[![Github Actions Build Status](https://github.com/spotify/dbeam/actions/workflows/maven.yml/badge.svg)](https://github.com/spotify/dbeam/actions/workflows/maven.yml)\n[![codecov.io](https://codecov.io/github/spotify/dbeam/coverage.svg?branch=master)](https://codecov.io/github/spotify/dbeam?branch=master)\n[![Apache Licensed](https://img.shields.io/github/license/spotify/dbeam.svg)](https://opensource.org/licenses/Apache-2.0)\n[![GitHub tag](https://img.shields.io/github/tag/spotify/dbeam)](https://github.com/spotify/dbeam/releases/?include_prereleases\u0026sort=semver)\n[![Maven Central](https://img.shields.io/maven-central/v/com.spotify/dbeam-core.svg)](https://maven-badges.herokuapp.com/maven-central/com.spotify/dbeam-core)\n\nA connector tool to extract data from SQL databases and import into [GCS](https://cloud.google.com/storage/) using [Apache Beam](https://beam.apache.org/).\n\nThis tool is runnable locally, or on any other backend supported by Apache 
Beam, e.g. [Cloud Dataflow](https://cloud.google.com/dataflow/).\n\n**DEVELOPMENT STATUS: Mature, maintained and used in production since August 2017. No major features or development planned.**\n\n## Overview\n\nDBeam is a tool that reads all the data from a single SQL database table,\nconverts the data into [Avro](https://avro.apache.org/) and stores it in a\ndesignated location, usually in GCS.\nIt runs as a single-threaded [Apache Beam](https://beam.apache.org/) pipeline.\n\nDBeam requires the database credentials, the database table name to read, and the output location\nto store the extracted data into. DBeam first runs a single select against the target table with\na limit of one row to infer the table schema. After the schema is created, the job is launched, which\nsimply streams the table contents via JDBC into the target location as Avro.\n\n[Generated Avro Schema Type Conversion Details](docs/type-conversion.md)\n\n\n## dbeam-core package features\n\n- Supports PostgreSQL, MySQL, MariaDB, and H2 JDBC connectors\n- Supports [Google CloudSQL](https://cloud.google.com/sql/) managed databases\n- Currently outputs only to Avro format\n- Reads the database password from an external password file (`--passwordFile`) or an external [KMS](https://cloud.google.com/kms/) encrypted password file (`--passwordFileKmsEncrypted`)\n- Can filter only records of the current day with the `--partitionColumn` parameter\n- Checks and fails on partition dates that are too old. Snapshot dumps are not filtered by a given date/partition, so when running for a partition that is too old, the job fails to avoid writing new data into old partitions. 
(can be disabled with `--skipPartitionCheck`)\n- Implemented as [Apache Beam SDK](https://beam.apache.org/) pipeline, supporting any of its [runners](https://beam.apache.org/documentation/runners/capability-matrix/) (tested with `DirectRunner` and `DataflowRunner`)\n\n### DBeam export parameters\n\n```\ncom.spotify.dbeam.options.DBeamPipelineOptions:\n\n  --connectionUrl=\u003cString\u003e\n    The JDBC connection url to perform the export.\n  --password=\u003cString\u003e\n    Plaintext password used by JDBC connection.\n  --passwordFile=\u003cString\u003e\n    A path to a file containing the database password.\n  --passwordFileKmsEncrypted=\u003cString\u003e\n    A path to a file containing the database password, KMS encrypted and base64\n    encoded.\n  --sqlFile=\u003cString\u003e\n    A path to a file containing a SQL query (used instead of --table parameter).\n  --table=\u003cString\u003e\n    The database table to query and perform the export.\n  --username=\u003cString\u003e\n    Default: dbeam-extractor\n    The database user name used by JDBC to authenticate.\n\ncom.spotify.dbeam.options.OutputOptions:\n\n  --output=\u003cString\u003e\n    The path for storing the output.\n  --dataOnly=\u003cBoolean\u003e\n    Default: false\n    Store only the data files in output folder, skip queries, metrics and\n    metadata files.\n\ncom.spotify.dbeam.options.JdbcExportPipelineOptions:\n    Configures the DBeam SQL export\n\n  --avroCodec=\u003cString\u003e\n    Default: deflate6\n    Avro codec (e.g. 
deflate6, deflate9, snappy).\n  --avroDoc=\u003cString\u003e\n    The top-level record doc string of the generated avro schema.\n  --avroSchemaFilePath=\u003cString\u003e\n    Path to file with a target AVRO schema.\n  --avroSchemaName=\u003cString\u003e\n    The name of the generated avro schema, the table name by default.\n  --avroSchemaNamespace=\u003cString\u003e\n    Default: dbeam_generated\n    The namespace of the generated avro schema.\n  --exportTimeout=\u003cString\u003e\n    Default: P7D\n    Export timeout, after this duration the job is cancelled and the export\n    terminated.\n  --fetchSize=\u003cInteger\u003e\n    Default: 10000\n    Configures JDBC Statement fetch size.\n  --limit=\u003cLong\u003e\n    Limit the output number of rows, indefinite by default.\n  --minPartitionPeriod=\u003cString\u003e\n    The minimum partition required for the job not to fail (when partition\n    column is not specified),by default `now() - 2*partitionPeriod`.\n  --minRows=\u003cLong\u003e\n    Default: -1\n    Check that the output has at least this minimum number of rows. Otherwise\n    fail the job.\n  --partition=\u003cString\u003e\n    The date/timestamp of the current partition.\n  --partitionColumn=\u003cString\u003e\n    The name of a date/timestamp column to filter data based on current\n    partition.\n  --partitionPeriod=\u003cString\u003e\n    The period frequency which the export runs, used to filter based on current\n    partition and also to check if exports are running for too old partitions.\n  --preCommand=\u003cList\u003e\n    SQL commands to be executed before query.\n  --queryParallelism=\u003cInteger\u003e\n    Max number of queries to run in parallel for exports. Single query used if\n    nothing specified. 
Should be used with splitColumn.\n  --skipPartitionCheck=\u003cBoolean\u003e\n    Default: false\n    When partition column is not specified, fails if partition is too old; set\n    this flag to ignore this check.\n  --splitColumn=\u003cString\u003e\n    A long/integer column used to create splits for parallel queries. Should be\n    used with queryParallelism.\n  --useAvroLogicalTypes=\u003cBoolean\u003e\n    Default: false\n    Controls whether generated Avro schema will contain logicalTypes or not.\n```\n\n#### Input Avro schema file\n\nIf an input Avro schema file is provided, dbeam reads it and uses some of its\nproperties when the output Avro schema is created.\n\n#### Following fields will be propagated from input into output schema:\n\n* `record.doc`\n* `record.namespace`\n* `record.field.doc`\n\n\n#### DBeam Parallel Mode\n\nThis is a pre-alpha feature currently under development and experimentation.\n\nRead queries used by dbeam to extract data generally don't place any locks, and hence multiple read queries\ncan run in parallel. When running in parallel mode with `--queryParallelism` specified, dbeam uses the\n`--splitColumn` argument to find the max and min values in that column. The max and min are then used\nas range bounds for generating `queryParallelism` queries, which are then run in parallel to read data.\nSince the splitColumn is used to calculate the query bounds, and dbeam needs to calculate intermediate\nbounds for each query, the type of the column must be long / int. It is assumed that the distribution of values on the `splitColumn` is sufficiently random and sequential. For example, if the range between the min and max of the split column is divided equally into `queryParallelism` parts, each part would contain an approximately equal number of records. Skew in this data would result in straggling queries, and hence won't provide much improvement. 
Having the records sequential helps the queries run faster and reduces random disk seeks.\n\nRecommended usage:\nBeam runs each query generated by DBeam on one dedicated vCPU (when running with the Dataflow Runner), so for best performance the total number of vCPUs available for a given job should equal the `queryParallelism` specified. Hence if `workerMachineType` for Dataflow is `n1-standard-w` and `numWorkers` is `n`, then `queryParallelism` `q` should be a multiple of `n*w`, and the job would be fastest if `q = n * w`.\n\nFor an export of a table running from a dedicated PostgreSQL replica, we have seen the best performance over vCPU time and wall time with a `queryParallelism` of 16. Bumping `queryParallelism` further increases the vCPU time without offering much gain on the wall time of the complete export. It is probably good to use a `queryParallelism` of less than 16 for experimenting.\n\n## Building\n\nBuilding and testing can be achieved with `mvn`:\n\n```sh\nmvn verify\n```\n\nIn order to create a jar with all dependencies under `./dbeam-core/target/dbeam-core-shaded.jar` run the following:\n\n```sh\nmvn clean package -Ppack\n```\n\n## Usage examples\n\nUsing Java from the command line:\n\n```sh\njava -cp ./dbeam-core/target/dbeam-core-shaded.jar \\\n  com.spotify.dbeam.jobs.JdbcAvroJob \\\n  --output=gs://my-testing-bucket-name/ \\\n  --username=my_database_username \\\n  --password=secret \\\n  --connectionUrl=jdbc:postgresql://some.database.uri.example.org:5432/my_database \\\n  --table=my_table\n```\n\nFor CloudSQL:\n\n```sh\njava -cp ./dbeam-core/target/dbeam-core-shaded.jar \\\n  com.spotify.dbeam.jobs.JdbcAvroJob \\\n  --output=gs://my-testing-bucket-name/ \\\n  --username=my_database_username \\\n  --password=secret \\\n  --connectionUrl=jdbc:postgresql://google/database?socketFactory=com.google.cloud.sql.postgres.SocketFactory\u0026socketFactoryArg=project:region:cloudsql-instance \\\n  
--table=my_table\n```\n\n- When using MySQL: `--connectionUrl=jdbc:mysql://google/database?socketFactory=com.google.cloud.sql.mysql.SocketFactory\u0026cloudSqlInstance=project:region:cloudsql-instance\u0026useCursorFetch=true`\n- Note `?useCursorFetch=true` is important for MySQL, to avoid early fetching all rows, more details on [MySQL docs](https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html).\n- More details can be found at [CloudSQL JDBC SocketFactory](https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory)\n\nTo run a cheap data extraction, as a way to validate, one can add `--limit=10 --skipPartitionCheck` parameters. It will run the queries, generate the schemas and export only 10 records, which should be done in a few seconds.\n\n### Password configuration\n\nDatabase password can be configured by simply passing `--password=writepasswordhere`, `--passwordFile=/path/to/file/containing/password` or `--passwordFile=gs://gcs-bucket/path/to/file/containing/password`.\n\nA more robust configuration is to point to a [Google KMS](https://cloud.google.com/kms/) encrypted file.\nDBeam will try to decrypt using KMS if the file ends with `.encrypted` (e.g. 
`--passwordFileKmsEncrypted=gs://gcs-bucket/path/to/db-password.encrypted`).\n\nThe file should contain a base64 encoded encrypted content.\nIt can be generated using [`gcloud`](https://cloud.google.com/sdk/gcloud/) like the following:\n\n```sh\necho -n \"super_secret_password\" \\\n  | gcloud kms encrypt \\\n      --location \"global\" \\\n      --keyring \"dbeam\" \\\n      --key \"default\" \\\n      --project \"mygcpproject\" \\\n      --plaintext-file - \\\n      --ciphertext-file - \\\n  | base64 \\\n  | gsutil cp - gs://gcs-bucket/path/to/db-password.encrypted\n```\n\nKMS location, keyring, and key can be configured via Java Properties, defaults are:\n\n\n```sh\njava \\\n  -DKMS_KEYRING=dbeam \\\n  -DKMS_KEY=default \\\n  -DKMS_LOCATION=global \\\n  -DKMS_PROJECT=default_gcp_project \\\n  -cp ./dbeam-core/target/dbeam-core-shaded.jar \\\n  com.spotify.dbeam.jobs.JdbcAvroJob \\\n  ...\n```\n\n### IAM authentication\n\nWhen using Google Cloud, [IAM authentication](https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/blob/main/docs/jdbc.md#iam-authentication) may be used by adding `enableIamAuth=true` to the JDBC URL, providing a dummy password, and following the appropriate rules for user naming with respect to the Service Account name.\n\n## Using as a library\n\n\nTo include DBeam library in a mvn project add the following dependency in `pom.xml`:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.spotify\u003c/groupId\u003e\n  \u003cartifactId\u003edbeam-core\u003c/artifactId\u003e\n  \u003cversion\u003e${dbeam.version}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n\nTo include DBeam library in a SBT project add the following dependency in `build.sbt`:\n\n```sbt\n  libraryDependencies ++= Seq(\n   \"com.spotify\" % \"dbeam-core\" % dbeamVersion\n  )\n```\n\n## Development\n\nMake sure you have [mvn](https://maven.apache.org/) installed.\nFor editor, [IntelliJ IDEA][idea] is recommended.\n\nTo test and verify changes during 
development, run:\n\n```sh\nmvn verify\n```\n\nOr:\n\n\n```sh\nmvn verify -Pcoverage\n```\n\nThis project adheres to the [Open Code of Conduct][code-of-conduct]. By participating, you are\nexpected to honor this code.\n\n## Release\n\nTrigger the [release](https://github.com/spotify/dbeam/actions/workflows/release.yml) workflow manually. This workflow requires a\nsingle input, `version`, which should be set to the desired semantic version in the format `{major_version}.{minor_version}.{patch_version}`.\nIt will update versions in all `pom.xml` files, push a tag `vx.y.z`, package, sign, and deploy artifacts to Sonatype, and finally bump all\n`pom.xml`s to the next development SNAPSHOT version.\n\nYou can check the deployment in the following links:\n\n- https://github.com/spotify/dbeam/actions\n- https://oss.sonatype.org/#nexus-search;quick~dbeam-core\n\nYou can also do a manual release. First, export env variables $SONATYPE_USERNAME, $SONATYPE_PASSWORD (for information on generating a token see [here](https://help.sonatype.com/en/user-tokens.html)), $MAVEN_GPG_KEY_NAME.\nThen, you can run `maven release` to deploy to Sonatype and automatically push commits bumping the project version:\n\n```shell\nmvn -s sonatype-settings.xml -DreleaseVersion={NEW_VERSION} release:prepare release:perform # Run with -DdryRun=true first to validate pom modification\n```\n\n## Future roadmap\n\nDBeam is mature, maintained and used in production since August 2017. No major features or development planned.\nLike Redis/[Redict](https://andrewkelley.me/post/redis-renamed-to-redict.html), DBeam can be considered a finished product.\n\n\u003e It can be maintained for decades to come with minimal effort. 
It can continue to provide a high amount of value for a low amount of labor.\n\n\n---\n\n## License\n\nCopyright 2016-2022 Spotify AB.\n\nLicensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0\n\n---\n\n[code-of-conduct]: https://github.com/spotify/code-of-conduct/blob/master/code-of-conduct.md\n[idea]: https://www.jetbrains.com/idea/download/\n[beam]: https://beam.apache.org/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Fdbeam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspotify%2Fdbeam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Fdbeam/lists"}