{"id":24175854,"url":"https://github.com/iht/elastic2bq","last_synced_at":"2026-06-08T11:31:47.488Z","repository":{"id":74375469,"uuid":"519346700","full_name":"iht/elastic2bq","owner":"iht","description":"A Beam pipeline that takes a ElasticSearch index and creates a BigQuery table with the same contents.","archived":false,"fork":false,"pushed_at":"2023-08-10T09:59:22.000Z","size":190,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-02T14:49:16.702Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iht.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-29T20:52:11.000Z","updated_at":"2023-08-07T16:51:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"97a8cc23-aa12-4bd6-9266-513e6cf4e93b","html_url":"https://github.com/iht/elastic2bq","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/iht/elastic2bq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Felastic2bq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Felastic2bq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Felastic2bq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Felastic2bq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iht","download_url":"https://codelo
ad.github.com/iht/elastic2bq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Felastic2bq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34061121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-13T02:33:25.647Z","updated_at":"2026-06-08T11:31:47.473Z","avatar_url":"https://github.com/iht.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Elastic to BigQuery\n\nThis pipeline takes an Elasticsearch index and creates a table with the contents of that index in\nBigQuery.\n\nThe schema of the index can be inferred using a command line utility provided with the pipeline.\n\nThe input sources for this pipeline are the following:\n\n- An Elasticsearch host and index\n- A file in Google Cloud Storage with the schema of the index, in BigQuery JSON format.\n\nThe outputs of the pipeline are the following:\n\n- A table in BigQuery with the contents of the index\n- An errors table, for those JSON elements that could not be parsed, including information about the specific\nparsing error.\n\n## Building the pipeline and the utility\n\nYou will need Java 17 to compile and run the pipeline and the utility.\n\nFor the 
build process, you need Gradle. Run the following script, and it should install all the\nrequired Gradle dependencies if you don't have them already:\n\n`./gradlew build`\n\nThis will create a package named `elastic2bq-\u003cCOMMIT_HASH\u003e-SNAPSHOT.jar` in the `build/` subdirectory.\n\n## Schema inference utility\n\nThe inference utility relies on BigQuery's automatic schema detection when loading JSON data, so the\nresults may not be perfect and may contain small inconsistencies. The utility is provided mainly to assist\nyou in the creation of a schema file.\n\nOnce you obtain a schema, review it and adjust any type that might not\nhave been inferred properly.\n\n### Data format for the inference utility\n\nThe schema inference utility requires the JSON data to be located in Google Cloud Storage, in the form of\na file, with each JSON element in a single line.\n\nTo transform the data extracted from Elastic into a file with a single JSON element per line, you can use the\n[`jq` utility](https://stedolan.github.io/jq/).\n\n`cat mydata.json | jq -c \u003e oneline_per_element.json`\n\nThen upload the `oneline_per_element.json` file to Google Cloud Storage.\n\nFor an example of this format, have a look at the `data/commits.json` file in this repository.\n\n### Running the schema inference utility\n\nOnce you have built the package, add the location to an environment variable in the shell\n\n`export MYJAR=./target/elastic2bq-bundled-0.1-SNAPSHOT.jar`\n\nand then run with the following options:\n\n```shell\njava -cp $MYJAR dev.herraiz.cli.InferSchemaFromData \\\n--dataset=\u003cBIGQUERY DATASET\u003e \\\n--project=\u003cGCP PROJECT\u003e \\\n--data=\u003cGCS DATA LOCATION\u003e \\\n--output=\u003cLOCAL OUTPUT FILE FOR SCHEMA\u003e\n```\n\nYou need to have a pre-existing BigQuery dataset, and the data already uploaded to Google Cloud Storage. 
Just\na small sample of data (50-100 records) should be enough to have a proper schema inferred.\n\nThe utility will create a temporary table in the dataset, and it will remove the table once the schema has\nbeen inferred. The schema will be written to a local file.\n\nThe utility will refuse to overwrite the local output file for the schema, so the destination file must not\nexist.\n\nThe output file must be local; you will need to upload it to Google Cloud Storage later.\n\n## Running the pipeline locally\n\nBuild the package and export the location of the JAR:\n\n`export MYJAR=./target/elastic2bq-bundled-0.1-SNAPSHOT.jar`\n\nYou can run the pipeline locally with these arguments:\n\n```shell\njava -cp $MYJAR dev.herraiz.beam.pipelines.Elastic2BQ \\\n--runner=DirectRunner \\\n--elasticHost=\"http://localhost:9200\" \\\n--elasticIndex=\u003cYOUR ELASTIC INDEX NAME\u003e \\\n--project=\u003cGCP PROJECT ID\u003e \\\n--tempLocation=\u003cGCS LOCATION FOR TEMPORARY FILES\u003e \\\n--bigQueryDataset=\u003cBIGQUERY DATASET ID\u003e \\\n--bigQueryTable=\u003cTABLE NAME FOR THE DATA\u003e \\\n--bigQueryErrorsTable=\u003cTABLE NAME FOR PARSING ERRORS\u003e \\\n--schema=\u003cGCS LOCATION OF SCHEMA FILE\u003e\n```\n\nWhen reading from Elastic, you can also use the option `--query` to apply a query\nto the index. The output of the query is what will be written to BigQuery.\n\nYou can also optionally set a `--username` and `--password` to connect to Elastic.\n\nFor the BigQuery destination tables, you can also write each table to a different project and dataset, using\nthe options `--bigQueryProject`, `--bigQueryErrorsDataset` and/or `--bigQueryErrorsProject`. The datasets\nmust exist before running the pipeline, and the credentials must have permissions to create tables in those\ndatasets.\n\nHere we assume that you are running with a local Elasticsearch server. 
See below for how to create one and\npopulate it with some data.\n\nThe schema must be located in Google Cloud Storage. If you have used the Schema Inference Utility, make sure\nthat you upload the generated file to GCS.\n\nOnce you have run the pipeline, you should see two new tables in the BigQuery dataset.\n\n## Running the pipeline in Dataflow\n\nThe options are the same as in the case of the direct runner (except `--runner=DataflowRunner`),\nbut you may need to add additional options for networking, so the Dataflow workers can reach the\nElasticsearch server. For instance, the workers and the server may run in the same VPC, or you may need\nto do VPC peering between the VPC where Elasticsearch is located and the workers' VPC. For more details, see:\n\n- https://cloud.google.com/dataflow/docs/guides/specifying-networks\n\n## Google Cloud requirements\n\nBoth the pipeline and the inference utility require access to Google Cloud credentials to use\nBigQuery and Google Cloud Storage.\n\nIf you are using the Google Cloud SDK, make sure you configure it with your user and project id, and that\nyou run both:\n\n`gcloud auth login`\n\nand\n\n`gcloud auth application-default login`\n\nThe user needs permission to run Dataflow jobs, to read and write from Google Cloud Storage, and to create\ntables in the provided dataset in BigQuery.\n\nThe pipeline is intended to be run in Dataflow, although with the corresponding additional runner\ndependencies, it should run in any Beam runner.\n\nIt can also be run with the DirectRunner, but you will still need to have a BigQuery dataset and a\nGoogle Cloud Storage bucket for the pipeline to work.\n\n## Getting some data to play with (for testing)\n\nWith minikube, you can easily install an Elasticsearch server locally, and use it to import some data, run\nthe pipeline locally, etc.\n\n### Install Elastic in minikube\n\nThis is for testing purposes, to have an Elastic instance to run the pipeline against.\n\nInstall minikube and 
helm (e.g. using Homebrew on Mac).\n\nThen run minikube and follow these instructions to add Elastic to the minikube instance.\n\nCreate a namespace for Elastic:\n\n`k create namespace elastic`\n\n`helm repo add elastic https://helm.elastic.co`\n\nIn the `manifests` directory, run:\n\n`helm install elasticsearch elastic/elasticsearch -f ./values.yaml -n elastic`\n\nTo make sure that the pod is running correctly, wait until it is ready. For a while, it will show something\nlike:\n\n```\nNAME                     READY   STATUS    RESTARTS   AGE\nelasticsearch-master-0   0/1     Running   0          103s\nelasticsearch-master-1   0/1     Running   0          103s\n```\n\nBut after a couple of minutes, it should show like this:\n\n```\nNAME                     READY   STATUS    RESTARTS   AGE\nelasticsearch-master-0   1/1     Running   0          2m\nelasticsearch-master-1   1/1     Running   0          2m\n```\n\nRedirect the ports for Elastic to localhost, so you can use Elastic as a local service:\n\n`k port-forward svc/elasticsearch-master 9200 -n elastic`\n\n### Get some data to play with\n\nCreate index in Elastic:\n\n```shell\ncurl --request PUT \\\n--url 'http://localhost:9200/git?pretty=' \\\n--header 'Connection: keep-alive'\n```\n\nThen import some sample data provided in this repository:\n\n```shell\ncat data/commits.json | while read l\ndo\ncurl --request POST \\\n\t--url 'http://localhost:9200/git/_doc/?pretty=' \\\n\t--header 'Content-Type: application/json' \\\n\t--data \"$l\"\ndone\n```\n\nYou can now run the pipeline locally.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Felastic2bq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiht%2Felastic2bq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Felastic2bq/lists"}