Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xiaomingx/awesome-data-engineer
This is a repo with links to everything you'd ever want to learn about data engineering
List: awesome-data-engineer
- Host: GitHub
- URL: https://github.com/xiaomingx/awesome-data-engineer
- Owner: XiaomingX
- License: apache-2.0
- Created: 2024-11-19T14:28:03.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-11-19T14:44:41.000Z (about 1 month ago)
- Last Synced: 2024-11-19T15:35:40.341Z (about 1 month ago)
- Topics: awesome, awesome-list, data, data-science, engineering, project
- Language: Makefile
- Homepage: https://twitter.com/seclink
- Size: 11.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: newsletters.md
- License: LICENSE
Awesome Lists containing this project
- ultimate-awesome - awesome-data-engineer - This is a repo with links to everything you'd ever want to learn about data engineering. (Other Lists / PowerShell Lists)
README
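# The Data Engineering Handbook
This repo contains all the resources you need to become a great data engineer!
## Getting started
If you are new to data engineering, start with this [2024 roadmap for breaking into data engineering](https://blog.dataengineer.io/p/the-2024-breaking-into-data-engineering).
If you are here for the [free 6-week YouTube boot camp](https://youtu.be/myhe0LXpCeo), start with:
- [Introduction](bootcamp/introduction.md)
- [Software needed](bootcamp/software.md)

For more applied learning:
- See the [projects](projects.md) section for more hands-on examples!
- See the [interviews](interviews.md) section for advice on passing data engineering interviews!
- See the [books](books.md) section for a list of high-quality data engineering books.
- See the [communities](communities.md) section to join high-quality data engineering communities.
- See the [newsletters](newsletters.md) section to get learning resources by email.

---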
### Curated [list of over 25 classic data engineering books](books.md)
Top 3 must-read books:
- [Fundamentals of Data Engineering](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/)
- [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/)
- [Designing Machine Learning Systems](https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969)

### Curated [list of over 10 data engineering communities worth joining](communities.md)
Recommended data engineering communities:
- [EcZachly Data Engineering Discord](https://discord.gg/JGumAXncAK)
- [Data Talks Club Slack](https://datatalks.club/slack)
- [Data Engineer Things Community](https://www.dataengineerthings.org/aboutus/)

Recommended machine learning communities:
- [AdalFlow Discord](https://discord.com/invite/ezzszrRZvT)
- [Chip Huyen MLOps Discord](https://discord.gg/dzh728c5t3)

### Data engineering companies by category
- Orchestration
  - [Mage](https://www.mage.ai)
  - [Astronomer](https://www.astronomer.io)
  - [Prefect](https://www.prefect.io)
  - [Dagster](https://www.dagster.io)
  - [Airflow](https://airflow.apache.org/)
  - [Kestra](https://kestra.io/)
  - [Shipyard](https://www.shipyardapp.com/)
  - [Hamilton](https://github.com/dagworks-inc/hamilton)
- Data Lake / Cloud
  - [Tabular](https://www.tabular.io)
  - [Microsoft](https://www.microsoft.com)
  - [Databricks](https://www.databricks.com/company/about-us)
  - [Onehouse](https://www.onehouse.ai)
  - [Delta Lake](https://delta.io/)
- Data Warehouse
  - [Snowflake](https://www.snowflake.com/en/)
  - [Firebolt](https://www.firebolt.io/)
- Data Quality
  - [dbt](https://www.getdbt.com/)
  - [Gable](https://www.gable.ai)
  - [Great Expectations](https://www.greatexpectations.io)
  - [Streamdal](https://streamdal.com)
  - [Coalesce](https://coalesce.io/)
  - [Soda](https://www.soda.io/)
  - [DQOps](https://dqops.com/)
  - [HEDDA.IO](https://hedda.io)
- Education
  - [DataExpert.io](https://www.dataexpert.io)
  - [LearnDataEngineering.com](https://www.learndataengineering.com)
  - [AlgoExpert](https://www.algoexpert.io)
  - [ByteByteGo](https://www.bytebytego.com)
- Analytics / Visualization
  - [Preset](https://www.preset.io)
  - [Starburst](https://www.starburst.io)
  - [Metabase](https://www.metabase.com/)
  - [Looker Studio](https://lookerstudio.google.com/overview)
  - [Tableau](https://www.tableau.com/)
  - [Power BI](https://powerbi.microsoft.com/)
  - [Apache Superset](https://superset.apache.org/)
  - [Evidence](https://evidence.dev)
- Data Integration
  - [Cube](https://cube.dev)
  - [Fivetran](https://www.fivetran.com)
  - [Airbyte](https://airbyte.io)
  - [dlt](https://dlthub.com/)
  - [Sling](https://slingdata.io/)
  - [Meltano](https://meltano.com/)
- Modern OLAP
  - [Apache Druid](https://druid.apache.org/)
  - [ClickHouse](https://clickhouse.com/)
  - [Apache Pinot](https://pinot.apache.org/)
  - [Apache Kylin](https://kylin.apache.org/)
  - [DuckDB](https://duckdb.org/)
  - [QuestDB](https://questdb.io/)
- LLM application libraries
  - [AdalFlow](https://github.com/SylphAI-Inc/AdalFlow)
  - [LangChain](https://github.com/langchain-ai/langchain)
  - [LlamaIndex](https://github.com/run-llama/llama_index)
- Real-time data processing
  - [Aggregations.io](https://aggregations.io)
  - [Responsive](https://www.responsive.dev/)
  - [RisingWave](https://risingwave.com/)
  - [Striim](https://www.striim.com/)

### Data engineering content on company blogs:
- [Netflix](https://netflixtechblog.com/tagged/big-data)
- [Uber](https://www.uber.com/blog/houston/data/?uclick_id=b2f43229-f3f4-4bae-bd5d-10a05db2f70c)
- [Databricks](https://www.databricks.com/blog/category/engineering/data-engineering)
- [Airbnb](https://medium.com/airbnb-engineering/data/home)
- [Amazon AWS Blog](https://aws.amazon.com/blogs/big-data/)
- [Microsoft Data Architecture Blog](https://techcommunity.microsoft.com/t5/data-architecture-blog/bg-p/DataArchitectureBlog)
- [Microsoft Fabric Blog](https://blog.fabric.microsoft.com/)
- [Oracle](https://blogs.oracle.com/datawarehousing/)
- [Meta](https://engineering.fb.com/category/data-infrastructure/)
- [Onehouse](https://www.onehouse.ai/blog)

### Data engineering whitepapers:
- [A Five-Layered Business Intelligence Architecture](https://ibimapublishing.com/articles/CIBIMA/2011/695619/695619.pdf)
- [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)
- [Big Data Quality: A Data Quality Profiling Model](https://link.springer.com/chapter/10.1007/978-3-030-23381-5_5)
- [The Data Lakehouse: Data Warehousing and More](https://arxiv.org/abs/2310.08697)
- [Spark: Cluster Computing with Working Sets](https://dl.acm.org/doi/10.5555/1863103.1863113)
- [The Google File System](https://research.google/pubs/the-google-file-system/)
- [Building a Universal Data Lakehouse](https://www.onehouse.ai/whitepaper/onehouse-universal-data-lakehouse-whitepaper)
- [XTable in Action: Seamless Interoperability in Data Lakes](https://arxiv.org/abs/2401.09621)
- [MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/)

## Social media accounts
Here is a nearly complete list of creators in the data engineering space:
**(You need at least 5k followers to be on this list!)**

| Name | YouTube | LinkedIn | X/Twitter | Instagram | TikTok |
|----------------------|---------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Zach Wilson | [Data with Zach](https://www.youtube.com/@eczachly_) (70k+) | [Zach Wilson](https://www.linkedin.com/in/eczachly) (400k+) | [EcZachly](https://www.twitter.com/EcZachly) (30k+) | [eczachly](https://www.instagram.com/eczachly) (150k+) | [@eczachly](https://www.tiktok.com/@eczachly) (70k+) |
| Shashank Mishra | [E-learning Bridge](https://www.youtube.com/@shashank_mishra) (100k+) | [Shashank Mishra](https://www.linkedin.com/in/shashank219/) (100k+) | | | |
| Seattle Data Guy | [Seattle Data Guy](https://www.youtube.com/c/SeattleDataGuy) (100k+) | [Ben Rogojan](https://www.linkedin.com/in/benjaminrogojan) (100k+) | [SeattleDataGuy](https://www.twitter.com/SeattleDataGuy) (10k+) | | |
| TrendyTech | [TrendyTech](https://www.youtube.com/c/TrendytechInsights) (100k+) | [Sumit Mittal](https://www.linkedin.com/in/bigdatabysumit/) (100k+) | | | |
| Darshil Parmar | [Darshil Parmar](https://www.youtube.com/@DarshilParmar) (100k+) | [Darshil Parmar](https://www.linkedin.com/in/darshil-parmar/) (100k+) | | | |
| Andreas Kretz | [Andreas Kretz](https://www.youtube.com/c/andreaskayy) (100k+) | [Andreas Kretz](https://www.linkedin.com/in/andreas-kretz) (100k+) | | [learndataengineering](https://www.instagram.com/learndataengineering) (5k+) | |
| ByteByteGo | [ByteByteGo](https://www.youtube.com/c/ByteByteGo) (1m+) | [Alex Xu](https://www.linkedin.com/in/alexxubyte) (100k+) | [alexxubyte](https://twitter.com/alexxubyte/) (100k+) | | |
| The Ravit Show | [The Ravit Show](https://youtube.com/@theravitshow) (100k+) | | | | |
| Guy in a Cube | [Guy in a Cube](https://www.youtube.com/@GuyInACube) (100k+) | | | | |
| Adam Marczak | [Adam Marczak](https://www.youtube.com/@AdamMarczakYT) (100k+) | | | | |
| nullQueries | [nullQueries](https://www.youtube.com/@nullQueries) (100k+) | | | | |
| TECHTFQ by Thoufiq | [TECHTFQ by Thoufiq](https://www.youtube.com/@techTFQ) (100k+) | | | | |
| SQLBI | [SQLBI](https://www.youtube.com/@SQLBI) (100k+) | [Marco Russo](https://www.linkedin.com/in/sqlbi) (50k+) | [marcorus](https://x.com/marcorus) (10k+) | | |
| Azure Lib | [Azure Lib](https://www.youtube.com/@azurelib-academy) (10k+) | [Deepak Goyal](https://www.linkedin.com/in/deepak-goyal-93805a17/) (100k+) | | | |
| Advancing Analytics | [Advancing Analytics](https://www.youtube.com/@AdvancingAnalytics) (10k+) | [Simon Whiteley](https://www.linkedin.com/in/simon-whiteley-uk/) (10k+) | | | |
| Kahan Data Solutions | [Kahan Data Solutions](https://www.youtube.com/@KahanDataSolutions) (10k+) | | | | |
| Ankit Bansal | [Ankit Bansal](https://youtube.com/@ankitbansal6) (10k+) | [Ankit Bansal](https://www.linkedin.com/in/ankitbansal6/) (50k+) | | | |
| Mr. K Talks Tech | [Mr. K Talks Tech](https://www.youtube.com/channel/UCzdOan4AmF65PmLLks8Lmww) (10k+) | | | | |
| Li Yin | | [Li Yin](https://www.linkedin.com/in/li-yin-ai/) (10k+) | | | |
| Jaco van Gelder | | [Jaco van Gelder](https://www.linkedin.com/in/jwvangelder/) (10k+) | | | |
| Joseph Machado | | [Joseph Machado](https://www.linkedin.com/in/josephmachado1991/) (10k+) | [startdataeng](https://twitter.com/startdataeng) (5k+) | | |
| Eric Roby | | [Eric Roby](https://www.linkedin.com/in/codingwithroby/) (10k+) | | | |
| Simon Späti | | [Simon Späti](https://www.linkedin.com/in/sspaeti/) (10k+) | | | |
| Dipankar Mazumdar | | [Dipankar Mazumdar](https://www.linkedin.com/in/dipankar-mazumdar/) (5k+) | | | |
| Daniel Ciocirlan | | [Daniel Ciocirlan](https://www.linkedin.com/in/danielciocirlan) (5k+) | | | |
| Hugo Lu | | [Hugo Lu](https://www.linkedin.com/in/hugo-lu-confirmed/) (5k+) | | | |
| Tobias Macey | | [Tobias Macey](https://www.linkedin.com/in/tmacey) (5k+) | | | |
| Marcos Ortiz | | [Marcos Ortiz](https://www.linkedin.com/in/mlortiz) (5k+) | | | |
| Julien Hurault | | [Julien Hurault](https://www.linkedin.com/in/julienhuraultanalytics/) (5k+) | | | |
| Alex Freberg | [Alex The Analyst](https://www.youtube.com/@AlexTheAnalyst) (100k+) | [Alex Freberg](https://www.linkedin.com/in/alex-freberg/) (100k+) | | | [@alex_the_analyst](https://www.tiktok.com/@alex_the_analyst) (10k+) |
| Marc Lamberti | | [Marc Lamberti](https://www.linkedin.com/in/marclamberti) (50k+) | | | |
| Chip Huyen | | [Chip Huyen](https://www.linkedin.com/in/chiphuyen/) (250k+) | | | |
| Alex Merced | [Alex Merced Data](https://www.youtube.com/@alexmerceddata_) | [Alex Merced](https://www.linkedin.com/in/alexmerced) (30k+) | [@amdatalakehouse](https://www.twitter.com/amdatalakehouse) | [@alexmercedcoder](https://www.instagram.com/alexmercedcoder) | |
| John Kutay | [John Kutay](https://www.youtube.com/watch?v=7K09lxNbF3Q&list=PL1EXPn7cSg91KLnk2P26Fh8OnGrh6eFmx&pp=iAQB) | [John Kutay](https://www.linkedin.com/in/johnkutay/) (5k+) | [@JohnKutay](https://x.com/JohnKutay) | | |
| Lakshmi Sontenam | | [Lakshmi Sontenam](https://www.linkedin.com/in/shivaga9esh) (9.5k+) | | | |
| Hassaan Akbar | | [Hassaan Akbar](https://www.linkedin.com/in/ehassaan) (5k+) | | | |
| Samuel Focht | [Python Basics](https://www.youtube.com/@PythonBasics) (10k+) | | | | |
| Constantin Lungu | | [Constantin Lungu](https://www.linkedin.com/in/constantin-lungu-668b8756) (10k+) | | | |
| Ijaz Ali | | [Ijaz Ali](https://www.linkedin.com/in/ijaz-ali-6aaa87122/) (24k+) | | | |
| Subhankar | | [Subhankar](https://www.linkedin.com/in/subhankarumass/) (5k+) | | | |
| Ankur Ranjan | [Big Data Show](https://www.youtube.com/@TheBigDataShow) (100k+) | [Ankur Ranjan](https://www.linkedin.com/in/thebigdatashow/) (48k+) | | | |

### Great podcasts
- [The Data Engineering Show](https://www.dataengineeringshow.com/)
- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
- [DataTopics](https://www.datatopics.io/)
- [The Engineering Side of Data](https://podcasts.apple.com/us/podcast/the-engineering-side-of-data/id1566999533)
- [DataAware](https://www.ascend.io/dataaware-podcast/)
- [The Data Coffee Break Podcast](https://www.deezer.com/us/show/5293247)
- [The Data Stack Show](https://datastackshow.com/)
- [Intricity101 Data Sharks Podcast](https://www.intricity.com/learningcenter/podcast)
- [Drill to Detail with Mark Rittman](https://www.rittmananalytics.com/drilltodetail/)
- [Analytics Power Hour](https://analyticshour.io/)
- [Catalog & Cocktails](https://listen.casted.us/public/127/Catalog-%26-Cocktails-2fcf8728)
- [DataTalks.Club](https://datatalks.club/podcast.html)
- [Data Brew by Databricks](https://www.databricks.com/discover/data-brew)
- [The Data Cloud Podcast by Snowflake](https://rise-of-the-data-cloud.simplecast.com/)
- [What's New in Data](https://www.striim.com/podcast/)
- [Open||Source||Data by DataStax](https://www.datastax.com/resources/podcast/open-source-data)
- [Streaming Audio by Confluent](https://developer.confluent.io/podcast/)
- [The Data Scientist Show](https://podcasts.apple.com/us/podcast/the-data-scientist-show/id1584430381)
- [MLOps.community](https://podcast.mlops.community/)
- [Monday Morning Data Chat](https://open.spotify.com/show/3Km3lBNzJpc1nOTJUtbtMh)
- [The Data Chief](https://www.thoughtspot.com/data-chief/podcast)

### Great [list of over 20 newsletters](newsletters.md)
Must-follow newsletters in data engineering:
- [DataEngineer.io Newsletter](https://blog.dataengineer.io)
- [Joe Reis's newsletter](https://joereis.substack.com)
- [Start Data Engineering](https://www.startdataengineering.com)
- [Data Engineering Weekly](https://www.dataengineeringweekly.com)

### Glossaries:
- [Data Engineering Vault](https://www.ssp.sh/brain/data-engineering/)
- [Airbyte Data Glossary](https://glossary.airbyte.com/)
- [Data Engineering Wiki by Reddit](https://dataengineering.wiki/Index)
- [Secoda Glossary](https://www.secoda.co/glossary/)
- [Databricks Glossary](https://www.databricks.com/glossary)
- [Airtable Glossary](https://airtable.com/shrGh8BqZbkfkbrfk/tbluZ3ayLHC3CKsDb)
- [Dagster's Data Engineering Glossary](https://dagster.io/glossary)