An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-integration

A curated list of projects in awesome lists tagged with data-integration.

https://github.com/airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake

Last synced: 09 Sep 2025

https://github.com/apache/seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.

apache batch cdc change-data-capture data-ingestion data-integration elt high-performance offline real-time streaming

Last synced: 12 May 2025

https://github.com/apache/hudi

Upserts, Deletes And Incremental Processing on Big Data.

apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing

Last synced: 12 May 2025

https://github.com/jitsucom/jitsu

Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set up a real-time data pipeline in minutes, not days.

bigquery clickhouse data-collection data-connectors data-integration golang postgres redshift snowflake

Last synced: 11 May 2025

https://github.com/dtstack/chunjun

A data integration framework

bigdata data-integration flink framework java

Last synced: 13 May 2025

https://github.com/bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

bigquery copy-database data-ingestion data-integration data-pipeline duckdb ingestion-pipeline mssql postgresql snowflake

Last synced: 13 May 2025

https://github.com/apache/incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.

dashboard-friendly data data-analysis data-engineering data-integration data-transfers devops domain-layer dora etl golang hacktoberfest integration jira open-source user-friendly

Last synced: 14 May 2025

https://github.com/mara/mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

data data-integration etl pipeline postgresql python

Last synced: 14 May 2025

https://github.com/bytedance/bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time

Last synced: 15 May 2025

https://github.com/kuwala-io/kuwala

Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we bring third-party data into data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, c) Google Popular Times.

admin-boundaries data data-integration data-science dbt elt google-trends jupyter kuwala no-code open-data open-source population postgres pyspark python react react-flow scraping spatial-analysis

Last synced: 30 Mar 2025

https://github.com/artie-labs/transfer

Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift, Databricks) in real-time.

apache-kafka bigquery cdc change-data-capture data-integration data-pipelines database debezium elt golang kafka redshift snowflake

Last synced: 28 Dec 2025

https://github.com/apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

apache data-integration data-pipeline etl-framework high-performance offline real-time seatunnel sql-engine

Last synced: 14 May 2025

https://github.com/immunogenomics/harmony

Fast, sensitive and accurate integration of single-cell data with Harmony

algorithm data-integration r scrna-seq

Last synced: 11 May 2025

https://github.com/ConduitIO/conduit

Conduit streams data between data stores. Kafka Connect replacement. No JVM required.

conduit data-engineering data-integration data-pipeline data-stream etl go kafka kafkaconnect

Last synced: 15 Jul 2025

https://github.com/graphform/swim-rust

Self-contained distributed software platform for building stateful, massively real-time streaming applications in Rust.

actor-model async data-integration decentralized-applications distributed-systems framework kafka real-time rust serverless stateful stream-processing streaming streaming-data-pipelines web

Last synced: 29 Jul 2025

https://github.com/gabledata/recap

Work with your web service, database, and streaming schemas in a single format.

data-catalog data-discovery data-engineering data-integration data-pipelines etl metadata recap

Last synced: 13 Dec 2025

https://github.com/CommonCoreOntology/CommonCoreOntologies

The Common Core Ontology Repository holds the current released version of the Common Core Ontology suite.

applied-ontology bfo cco data-integration interoperability ontologies ontology-suite owl-ontology semantic-consistency semantics

Last synced: 16 Nov 2025

https://github.com/hetio/hetionet

Hetionet: an integrative network of disease

data-integration drug-repurposing hetionet hetnet neo4j network rephetio

Last synced: 07 Apr 2025

https://github.com/slowkow/harmonypy

🎼 Integrate multiple high-dimensional datasets with fuzzy k-means and locally linear adjustments.

bioinformatics data-integration data-science single-cell-analysis

Last synced: 07 Oct 2025

https://github.com/dataplane-app/dataplane

Dataplane is an Airflow inspired unified data platform with additional data mesh and RPA capability to automate, schedule and design data pipelines and workflows. Dataplane is written in Golang with a React front end.

airflow data data-analysis data-engineering data-integration data-pipelines data-science dataplane datawarehouse etl finance golang kubernetes pipelines robotics-process-automation rpa scheduler workflow workflow-automation workflows

Last synced: 27 Dec 2025

https://github.com/morph-kgc/morph-kgc

Powerful RDF Knowledge Graph Generation with RML Mappings

data-engineering data-integration database etl knowledge-graph python r2rml rdf rdf-star rml

Last synced: 11 May 2025

https://github.com/opensanctions/nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

data-integration deduplication record-link

Last synced: 30 Dec 2025

https://github.com/mara/mara-example-project-2

An example mini data warehouse for Python project stats, template for new projects

bigquery data-integration etl pypi sql

Last synced: 25 Oct 2025

https://github.com/google/megalista

First Party data integration solution built for marketing teams to enable audience and conversion onboarding into Google Marketing products (Google Ads, Campaign Manager, Google Analytics).

audience-targeting audiences bigquery conversions customermatch data-integration dataflow google googleads googleanalytics python

Last synced: 24 Sep 2025

https://github.com/SDM-TIB/SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction

data-integration knowledge-graph rml

Last synced: 11 May 2025

https://github.com/starlake-ai/starlake

Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.

bigquery data-engineering data-integration data-pipeline etl hdfs redshift snowflake spark synapse

Last synced: 05 Apr 2025

https://github.com/sysbiochalmers/gecko

Toolbox for including enzyme constraints on a genome-scale model.

data-integration enzyme-constraints kinetics matlab proteomics systems-biology toolbox

Last synced: 24 Oct 2025

https://github.com/munchy-bytes/schemamapper

A .NET class library that allows you to import data from different sources into a unified destination

csharp csv data-import data-integration databases excel html json msaccess mysql oracle powerpoint schema-mapping schema-matching sql-server sqlce sqlite tabular-data vcard xml

Last synced: 17 Aug 2025

https://github.com/saezlab/cosmosr

COSMOS (Causal Oriented Search of Multi-Omic Space) is a method that integrates phosphoproteomics, transcriptomics, and metabolomics data sets.

data-integration metabolomic-data network-modelling phosphoproteomics proteomics transcriptomics

Last synced: 09 Apr 2025

https://github.com/buildersoftio/cortex

Cortex | Data Framework—a cutting-edge SDK that simplifies real-time data processing with intuitive operators, robust state management, and seamless telemetry for efficient, scalable pipelines.

ai csharp data-engineering data-integration data-pipeline dotnet event-driven framework machine-learning real-time streaming

Last synced: 30 Aug 2025

https://github.com/umer7/Data-Warehouse-Concepts-Design-and-Data-Integration

Repo for Data Warehouse Concepts, Design, and Data Integration by University of Colorado System (Coursera): notes, assignments, quizzes, and research papers.

data-integration data-warehouse datawarehouse oracle pentaho

Last synced: 20 Jul 2025

https://github.com/azure/data-product-batch

Template to deploy a Data Product for Batch data processing into a Data Landing Zone of the Data Management & Analytics Scenario (formerly Enterprise-Scale Analytics). The Data Product template can be used by cross-functional teams to ingest, provide and create new data assets within the platform.

architecture arm azure bicep data-fabric data-integration data-mesh data-platform data-product enterprise-scale enterprise-scale-analytics policy-driven

Last synced: 23 Jul 2025

https://github.com/mara/mara-etl-tools

Utilities for creating ETL pipelines with mara

data-integration date-dimension etl sql sql-utils

Last synced: 28 Feb 2025

https://github.com/altschulerwu-lab/muse

MUSE is a deep learning approach characterizing tissue composition through combined analysis of morphologies and transcriptional states for spatially resolved transcriptomics data.

clustering data-integration deep-learning multi-modal-analysis single-cell-ananlysis spatial-transcriptomics tensorflow

Last synced: 14 Dec 2025

https://github.com/artie-labs/reader

Perform historical snapshots without database locks and read change data capture logs from databases. Artie Reader is compatible with Debezium and is written in Go.

apache-kafka cdc change-data-capture data-integration database debezium golang kafka

Last synced: 16 May 2025

https://github.com/linkedin/data-integration-library

The Data Integration Library project provides a library of generic components based on a multi-stage architecture for data ingress and egress.

data-egress data-ingest data-ingestion data-integration gobblin

Last synced: 17 Aug 2025

https://github.com/dhimmel/integrate

Scripts and resources to create Hetionet v1.0, a heterogeneous network for drug repurposing

data-integration drug-repurposing hetionet hetnet neo4j network rephetio

Last synced: 12 Apr 2025

https://github.com/jonnytran/openomics

A bioinformatics API to interface with public multi-omics bio databases for wicked fast data integration.

data-integration data-manipulation genomics multi-omics python

Last synced: 16 Mar 2025

https://github.com/zazuko/barnard59

An intuitive and flexible RDF pipeline solution designed to simplify and automate ETL processes for efficient data management.

data-integration data-pipeline data-processing etl json-ld linked-data pipeline rdf semantic-web

Last synced: 06 Apr 2025

https://github.com/dosorio/rpanglaodb

An R package to download and merge labeled single-cell RNA-seq data from the PanglaoDB database into a Seurat object.

data-integration data-mining rna-seq single-cell single-cell-rna-seq

Last synced: 22 Oct 2025

https://github.com/cloudquery/plugin-sdk

CloudQuery Go SDK for source and destination plugins

cloudquery data-integration elt

Last synced: 05 Apr 2025

https://github.com/amine-smahi/r-learning-journey

Some of the projects I made when starting to learn R for Data Science at the university

afc cpa data-cleaning data-integration data-science datascience r r-language

Last synced: 18 Mar 2025

https://github.com/davidfoerster/schema-matching

Match schema attributes of relational databases by value similarity. As a study assignment, this isn't well documented, but you can contact me for questions and I may even add docs, if I sense enough interest.

data-integration python schema-matching

Last synced: 24 Apr 2025

https://github.com/oeg-upm/gtfs-bench

GTFS-Madrid-Bench: A Benchmark for Knowledge Graph Construction Engines

data-integration knowledge-graph obda obdi r2rml rml transport-domain

Last synced: 26 Dec 2025

https://github.com/nyxflower/gripnet

GripNet: Graph Information Propagation on Supergraph for Heterogeneous Graphs (Pattern Recognition, 2023)

data-integration graph-neural-networks heterogeneous-graph interconnected-graph link-prediction node-classification pytorch

Last synced: 23 Mar 2025

https://github.com/alexkychen/assignpop

Population Assignment using Genetic, Non-genetic or Integrated Data in a Machine-learning Framework. Methods in Ecology and Evolution. 2018;9:439–446.

cross-validation data-integration gbs machine-learning population-assignment population-genomics r radseq

Last synced: 14 Apr 2025

https://github.com/karrlab/datanator

Toolkit for discovering and aggregating data for whole-cell modeling

cells data-aggregation data-discovery data-integration mathematical-modeling systems-biology

Last synced: 02 Sep 2025

https://github.com/asmagen/robustsinglecell

Robust single cell clustering and comparison of population compositions across tissues and experimental models via similarity analysis.

clustering data-integration scrnaseq single-cell-genomics single-cell-rna-seq

Last synced: 23 Mar 2025

https://github.com/meltanolabs/singer-working-group

Working group for ongoing development and iteration of the Singer Spec, the de-facto protocol for open source data connectors. Please use "Issues" to create discussion items - or use "Discussions" for general questions.

data-integration dataops elt etl etl-pipeline singer

Last synced: 19 Feb 2025

https://github.com/lisad/phaser

The missing layer for complex data batch integration pipelines

data data-integration etl etl-pipeline

Last synced: 23 Apr 2025

https://github.com/cognitedata/python-extractor-utils

Framework for developing extractors in Python

cognite-data-fusion cognite-extractor data-integration python

Last synced: 05 Jul 2025

https://github.com/shu-hai/D-CCA

A Decomposition-based Canonical Correlation Analysis for High-dimensional Datasets (JASA-20 paper)

data-fusion data-integration high-dimensional-data integrative-analysis multiblock-structures multiview

Last synced: 13 Apr 2025

https://github.com/dachafra/thesis

PhD thesis: "Knowledge Graph Construction from Heterogeneous Data Sources exploiting Declarative Mapping Rules"

benchmarking data-integration knowledge-graph r2rml rml

Last synced: 04 Jan 2026

https://github.com/oeg-upm/morph-graphql

Translate OBDA mappings into GraphQL Servers

data-integration graphql semantic-web

Last synced: 02 Aug 2025

https://github.com/cloudformations/cf.cumulus

A cloud data platform product to accelerate time to insights. Our open-source framework is designed for the real world. Stripping away the complexity, giving you the power to build, scale, and manage your dataflows with ease, accelerating data delivery.

accelerator cfcumulus cloudformations control data-insights data-integration framework ingest metadata pipeline transform

Last synced: 05 Apr 2025

https://github.com/sysbiochalmers/orthomics

Collection of scripts for gene age sorting and multi-omics data mining and analysis

data-integration data-visualization de-analysis orthology proteomics rnaseq

Last synced: 29 Jul 2025

https://github.com/kinto-technologies/springboot3batchstarter

Spring Batch 5 skeleton for Spring Boot 3. Includes DB to CSV and CSV to DB samples for quick customization. This repository demonstrates multi-database setup, efficient batch processing, and GitHub Actions integration for CI/CD pipelines.

chunk ci-cd csv data-integration database-migration datasource docker github-actions h2 java job-configuration jooq multi-database mysql opencsv skeleton-code spring-batch-5 spring-boot-3 spring-framework tasklet

Last synced: 02 Nov 2025

https://github.com/dobraczka/forayer

forayer is a library of first aid utilities for knowledge graph exploration with an entity centric approach.

data-integration entity-resolution knowledge-graph

Last synced: 25 Jun 2025

https://github.com/sbl-sdsc/kg-import

kg-import automates the ingestion of heterogeneous datasets into a Knowledge Graph.

data-ingestion data-integration datasets-preparation knowledge-graph neo4j property-graph

Last synced: 12 Apr 2025

https://github.com/ronpinkas/dbbridge

dbBridge is an 'SQL Migration Tool' - enabling import of SQL Databases from any supported Dialect (MsSql, MySql, Oracle, PostgreSQL, Sqlite) to any of these supported dialects with just three lines of PHP code.

data-integration data-migration data-transfer data-transformation database-conversion db-migrate db-migration etl migration minimal mssql mysql open-source oracle php postgresql simple sql sqlite

Last synced: 30 Apr 2025

https://github.com/tteofili/certa

CERTA - Computing Entity Resolution explanations with TriAngles

data-integration entity-matching entity-resolution explainable-ai machine-learning python record-linkage xai

Last synced: 12 Apr 2025

https://github.com/vida-nyu/magneto-matcher

Repository for developing and evaluating components and algorithms for data integration tasks

bdf-toolbox data-integration schema-matching

Last synced: 14 Dec 2025

https://github.com/drsnowbird/denodo-vnc-docker

Denodo Platform 7 (Express) in VNC / noVNC for Container Platform (OpenShift, Kubernetes, DC/OS, Mesosphere, etc.)

data-integration data-virtualization denodo-express vnc-docker

Last synced: 10 Apr 2025

https://github.com/marcosmarxm/awesome-airbyte

Curated list of resources about Airbyte

airbyte airbytehq connectors data-integration

Last synced: 07 Dec 2025

https://github.com/firelink-sh/evolve-py

A highly efficient, composable, and lightweight ETL and data integration framework

analytics arrow big-data data data-engineering data-integration data-science duckdb elt etl ingestion ingress ml olap pipeline polars postgresql python s3

Last synced: 16 Sep 2025