https://github.com/dr-saad-la/data-engineer-tools

Data Engineer production tools

# Data Engineering Tools

![Data Engineer Tools](https://img.shields.io/badge/Data%20Engineer%20Tools-Resource-blue)
![Forks](https://img.shields.io/github/forks/dr-saad-la/Data-Engineer-Tools?style=social)

This repository lists tools commonly used in data engineering, grouped by the function they serve in a data platform.

## Table of Contents

1. [Data Storage](#data-storage)
2. [Data Integration and ETL](#data-integration-and-etl)
3. [Data Processing](#data-processing)
4. [Data Orchestration](#data-orchestration)
5. [Data Quality and Governance](#data-quality-and-governance)
6. [Data Visualization](#data-visualization)
7. [Big Data Technologies](#big-data-technologies)
8. [Cloud Platforms](#cloud-platforms)
9. [Monitoring and Logging](#monitoring-and-logging)
10. [Development and Version Control](#development-and-version-control)

## Data Storage

- **Relational Databases**
  - MySQL
  - PostgreSQL
  - Oracle Database
  - Microsoft SQL Server

- **NoSQL Databases**
  - MongoDB
  - Apache Cassandra
  - Redis
  - Amazon DynamoDB

- **Data Warehouses**
  - Amazon Redshift
  - Google BigQuery
  - Snowflake
  - Azure Synapse Analytics

- **Data Lakes**
  - Apache Hadoop HDFS
  - Amazon S3
  - Azure Data Lake Storage
  - Google Cloud Storage

## Data Integration and ETL

- **ETL Tools**
  - Apache NiFi
  - Talend
  - Informatica
  - AWS Glue
  - Azure Data Factory
  - Google Cloud Dataflow

- **Data Integration Platforms**
  - Apache Camel
  - MuleSoft
  - Fivetran
  - Stitch

## Data Processing

- **Batch Processing**
  - Apache Spark
  - Apache Hadoop
  - Google Cloud Dataflow
  - Azure Synapse Analytics

- **Stream Processing**
  - Apache Kafka
  - Apache Flink
  - Apache Storm
  - Confluent Platform

- **Data Transformation**
  - dbt (data build tool)
  - SQL
  - pandas (Python library)
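
As a rough illustration of the SQL-based transformation listed above, the sketch below aggregates raw rows into a summary table. It uses Python's built-in `sqlite3` module only so that it runs without extra dependencies; SQLite is not one of the listed tools, and the table and column names (`raw_orders`, `customer_totals`, `amount_cents`) are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical raw table: one row per order, amounts stored in cents.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "alice", 1250), (2, "bob", 800), (3, "alice", 2750)],
)

# The transformation step: aggregate raw rows into a per-customer summary table.
conn.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT customer,
           COUNT(*) AS order_count,
           SUM(amount_cents) / 100.0 AS total_amount
    FROM raw_orders
    GROUP BY customer
    """
)

rows = conn.execute(
    "SELECT customer, order_count, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)
```

The same `SELECT ... GROUP BY` statement could in principle live in a dbt model or run against a warehouse; only the connection setup here is specific to SQLite.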

## Data Orchestration

- **Workflow Orchestration**
  - Apache Airflow
  - Prefect
  - Luigi
  - Dagster

- **Job Scheduling**
  - Apache Oozie
  - Kubernetes CronJobs
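
The core idea shared by the workflow orchestrators above is running tasks in an order that respects their dependencies. The following is a minimal sketch of that idea using Python's standard-library `graphlib` module (Python 3.9+). It is not the API of Airflow, Prefect, Luigi, or Dagster, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform", "quality_check"},
}


def run_pipeline(deps):
    """Visit placeholder tasks in an order that respects their dependencies."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real orchestrator would launch the task here and handle retries,
        # scheduling, and logging; this sketch only records the order.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

Because each task in this example depends on the one before it, only one valid order exists. With independent branches, `static_order()` may return any order that satisfies the dependencies.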

## Data Quality and Governance

- **Data Quality**
  - Great Expectations
  - Deequ (Amazon)
  - Talend Data Quality

- **Data Governance**
  - Apache Atlas
  - Collibra
  - Alation

## Data Visualization

- **Visualization Tools**
  - Tableau
  - Power BI
  - Looker
  - Looker Studio (formerly Google Data Studio)
  - Apache Superset

## Big Data Technologies

- **Big Data Frameworks**
  - Apache Hadoop
  - Apache Spark
  - Apache Flink

- **Data Serialization Formats**
  - Apache Avro
  - Apache Parquet
  - JSON
  - Apache ORC
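
Of the formats above, JSON is the only one that Python can read and write without third-party libraries; Avro, Parquet, and ORC are binary formats that need dedicated packages. As a small sketch, the example below round-trips records through line-delimited JSON (one object per line), a common layout for record-oriented files. The record fields and helper names are hypothetical.

```python
import io
import json

# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "name": "alice", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.0},
]


def write_json_lines(rows, stream):
    """Write one JSON object per line."""
    for row in rows:
        stream.write(json.dumps(row, sort_keys=True) + "\n")


def read_json_lines(stream):
    """Parse each non-empty line back into a dictionary."""
    return [json.loads(line) for line in stream if line.strip()]


buffer = io.StringIO()  # stands in for a file on disk or in object storage
write_json_lines(records, buffer)
buffer.seek(0)
restored = read_json_lines(buffer)
print(restored == records)
```

Text formats like this are easy to inspect but carry no schema. Avro stores a schema alongside the data, and the columnar formats (Parquet, ORC) are generally better suited to analytical scans over large datasets.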

## Cloud Platforms

- **Amazon Web Services (AWS)**
  - S3
  - RDS
  - Redshift
  - Glue
  - EMR

- **Microsoft Azure**
  - Azure Data Lake Storage
  - Azure SQL Database
  - Azure Synapse Analytics
  - Azure Data Factory

- **Google Cloud Platform (GCP)**
  - Google Cloud Storage
  - BigQuery
  - Dataflow
  - Dataproc

## Monitoring and Logging

- **Monitoring Tools**
  - Prometheus
  - Grafana
  - Datadog

- **Logging Tools**
  - ELK Stack (Elasticsearch, Logstash, Kibana)
  - Splunk
  - Fluentd

## Development and Version Control

- **Version Control**
  - Git
  - GitHub
  - GitLab
  - Bitbucket

- **Integrated Development Environments (IDEs)**
  - PyCharm
  - VS Code
  - Jupyter Notebooks
  - IntelliJ IDEA

## Contributing

We welcome contributions! If you have suggestions for additional tools or improvements to this list, please open an issue or submit a pull request.

## License

This repository is licensed under the Creative Commons Attribution 4.0 International License. See the [LICENSE](LICENSE) file for more information.