Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/apache/gobblin
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
- Host: GitHub
- URL: https://github.com/apache/gobblin
- Owner: apache
- License: apache-2.0
- Created: 2014-12-01T18:10:50.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2024-10-29T04:08:28.000Z (about 2 months ago)
- Last Synced: 2024-10-29T14:54:40.736Z (about 1 month ago)
- Topics: apache, data, ingestion, management, replication
- Language: Java
- Homepage: https://gobblin.apache.org/
- Size: 127 MB
- Stars: 2,224
- Watchers: 165
- Forks: 750
- Open Issues: 124
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
- awesome-dataops - Apache Gobblin - A framework that simplifies common aspects of big data such as data ingestion. (Data Ingestion)
README
# Apache Gobblin
[![Build Status](https://github.com/apache/gobblin/actions/workflows/build_and_test.yaml/badge.svg?branch=master)](https://travis-ci.org/apache/gobblin)
[![Documentation Status](https://readthedocs.org/projects/gobblin/badge/?version=latest)](https://gobblin.readthedocs.org/en/latest/?badge=latest)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.apache.gobblin/gobblin-api/badge.svg)](https://search.maven.org/search?q=g:org.apache.gobblin)
[![Stack Overflow](http://img.shields.io/:stack%20overflow-gobblin-brightgreen.svg)](http://stackoverflow.com/questions/tagged/gobblin)
[![Join us on Slack](https://img.shields.io/badge/slack-apache--gobblin-brightgreen.svg)]( https://join.slack.com/t/apache-gobblin/shared_invite/zt-vqgdztup-UUq8S6gGJqE6L5~9~JelNg)
[![codecov.io](https://codecov.io/github/apache/gobblin/branch/master/graph/badge.svg)](https://codecov.io/github/apache/gobblin)

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.
### Capabilities
- Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
- Data Organization within the lake (e.g. compaction, partitioning, deduplication)
- Lifecycle Management of data within the lake (e.g. data retention)
- Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

### Highlights
- Battle tested at scale: Runs in production at petabyte-scale at companies like LinkedIn, PayPal, Verizon etc.
- Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
- Supports stream and batch execution modes
- Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

### Common Patterns used in production
- Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
- Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
- Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
- Integrate external vendor APIs (e.g. Salesforce, Dynamics) with data stores (e.g. HDFS, Couchbase)
- Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

### Apache Gobblin is NOT
- A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data processing tasks to Spark, Hive etc.
- A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
- A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

# Requirements
* Java >= 1.8

If building the distribution with tests turned on:
* Maven version 3.5.3

# Instructions to download gradle wrapper
If you are building Gobblin from the source distribution, run one of the following commands to download `gradle-wrapper.jar` from the Gobblin git repository into the `gradle/wrapper` directory (replace `GOBBLIN_VERSION` in the URL with the version you downloaded).

```
wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar
```
(or)
```
curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar
```

Alternatively, you can download it manually from:
`https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar`

Make sure that you download it to the `gradle/wrapper` directory.
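
The download commands above can also be wrapped in a small script. What follows is a minimal sketch under stated assumptions, not something shipped with Gobblin: the helper names `wrapper_url` and `fetch_wrapper` are hypothetical, and it assumes `curl` is installed and that you run it from the root of the extracted source tree. Unlike the commands above, it leaves TLS certificate verification enabled; add `--insecure` only if your environment needs it.

```shell
# Sketch only: wrapper_url and fetch_wrapper are hypothetical helper names,
# not part of Gobblin. Run from the root of the extracted source tree.

# Print the download URL for the version (tag or branch) passed as $1.
wrapper_url() {
  printf '%s\n' "https://github.com/apache/gobblin/raw/$1/gradle/wrapper/gradle-wrapper.jar"
}

# Download the wrapper jar for the version passed as $1 into gradle/wrapper.
fetch_wrapper() {
  mkdir -p gradle/wrapper || return 1
  # -f: fail on HTTP errors instead of saving an error page
  # -L: follow the redirect GitHub issues for raw file URLs
  curl -fL -o gradle/wrapper/gradle-wrapper.jar "$(wrapper_url "$1")" || return 1
  # Succeed only if the downloaded file exists and is not empty.
  [ -s gradle/wrapper/gradle-wrapper.jar ]
}
```

To use it, set `GOBBLIN_VERSION` to the version you downloaded and call `fetch_wrapper "$GOBBLIN_VERSION"`. A non-zero exit status means the jar was not downloaded, so check it before running `./gradlew`.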
# Instructions to run Apache RAT (Release Audit Tool)
1. Extract the archive file to your local directory.
2. Run `./gradlew rat`. The report is generated at `build/rat/rat-report.html`.

# Instructions to build the distribution
1. Extract the archive file to your local directory.
2. Skip tests and build the distribution:
Run `./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain`
The distribution will be created in build/gobblin-distribution/distributions directory.
(or)
3. Run tests and build the distribution (requires Maven):
Run `./gradlew build`
The distribution will be created in build/gobblin-distribution/distributions directory.

# Quick Links
* [Gobblin documentation](https://gobblin.apache.org/docs/)
* [Running Gobblin on Docker from your laptop](https://github.com/apache/gobblin/blob/master/gobblin-docs/user-guide/Docker-Integration.md)
* [Getting started guide](https://gobblin.apache.org/docs/Getting-Started/)
* [Gobblin architecture](https://gobblin.apache.org/docs/Gobblin-Architecture/)
* Community Slack: [Get your invite](https://join.slack.com/t/apache-gobblin/shared_invite/zt-1bjgp38mq-ZLozP9rEic6Odvhxoqtbkg)
* [List of companies known to use Gobblin](https://gobblin.apache.org/docs/Powered-By/)
* [Sample project](https://github.com/apache/gobblin/tree/master/gobblin-example)
* [How to build Gobblin from source code](https://gobblin.apache.org/docs/user-guide/Building-Gobblin/)
* [Issue tracker - Apache Jira](https://issues.apache.org/jira/projects/GOBBLIN/issues/)