https://github.com/DataExpert-io/data-engineer-handbook

This is a repo with links to everything you'd ever want to learn about data engineering

# The Data Engineering Handbook

This repo has all the resources you need to become an amazing data engineer!

Make sure to check out the [projects](projects.md) section for more hands-on examples!

Make sure to check out the [interviews](interviews.md) section for more advice on how to pass data engineering interviews!

## Resources

Great books:

- [Fundamentals of Data Engineering](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/)
- [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/)
- [Designing Machine Learning Systems](https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969)
- [The Hundred-Page Machine Learning Book](https://www.amazon.com/Hundred-Page-Machine-Learning-Book/dp/199957950X)
- [Kimball - The Data Warehouse Toolkit](https://ia801609.us.archive.org/14/items/the-data-warehouse-toolkit-kimball/The%20Data%20Warehouse%20Toolkit%20-%20Kimball.pdf)
- [Data Mesh](https://www.oreilly.com/library/view/data-mesh/9781492092384/)
- [Machine Learning System Design Interview](https://www.amazon.com/Machine-Learning-System-Design-Interview/dp/1736049127)
- [Streaming Systems](https://www.amazon.com/Streaming-Systems-Where-Large-Scale-Processing/dp/1491983876)
- [High Performance Spark](https://www.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203)
- [Building Evolutionary Architectures, 2nd Edition](https://www.oreilly.com/library/view/building-evolutionary-architectures/9781492097532/)
- [Data Management at Scale, 2nd Edition](https://www.oreilly.com/library/view/data-management-at/9781098138851/)
- [Deciphering Data Architectures](https://www.oreilly.com/library/view/deciphering-data-architectures/9781098150754/)
- [97 Things Every Data Engineer Should Know: Collective Wisdom from the Experts](https://www.amazon.com/Things-Every-Data-Engineer-Should/dp/1492062413)
- [Data Governance: The Definitive Guide](https://www.oreilly.com/library/view/data-governance-the/9781492063483/)
- [Trino: The Definitive Guide](https://trino.io/trino-the-definitive-guide.html)
- [Delta Lake: The Definitive Guide](https://www.oreilly.com/library/view/delta-lake-the/9781098151935/)
- [Hadoop: The Definitive Guide](https://www.oreilly.com/library/view/hadoop-the-definitive/9781491901687/)
- [Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications](https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512)
- [Data Engineering with dbt: A practical guide to building a dependable data platform with SQL](https://www.amazon.com/Data-Engineering-dbt-cloud-based-dependable-ebook/dp/B0C4LL19G7)
- [Data Engineering with AWS](https://www.oreilly.com/library/view/data-engineering-with/9781804614426/)
- [Practical DataOps: Delivering Agile Data Science at Scale](https://www.amazon.com/Practical-DataOps-Delivering-Agile-Science/dp/1484251032)
- [Data Engineering Design Patterns](https://www.dedp.online/)
- [Snowflake Data Engineering](https://www.manning.com/books/snowflake-data-engineering)
- [Unlocking dbt](https://www.amazon.com/Unlocking-dbt-Design-Transformations-Warehouse/dp/1484296990/)
- [Learning Spark, Second Edition](https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf)

Communities:

- [Seattle Data Guy Discord](https://discord.gg/ah95MZKkFF)
- [EcZachly Data Engineering Discord](https://discord.gg/JGumAXncAK)
- [AdalFlow Discord (LLM Library)](https://discord.com/invite/ezzszrRZvT)
- [Chip Huyen MLOps Discord](https://discord.gg/dzh728c5t3)
- [Data Engineer Things Community](https://www.dataengineerthings.org/aboutus/)
- [dbt Community](https://www.getdbt.com/community/join-the-community/)
- [r/dataengineering](https://www.reddit.com/r/dataengineering)
- [Microsoft Fabric Community](https://community.fabric.microsoft.com/)
- [r/MicrosoftFabric](https://www.reddit.com/r/MicrosoftFabric/)
- [Data Talks Club Slack](https://datatalks.club/slack)
- [Data Engineering Wiki](https://dataengineering.wiki/)

Companies:

- Orchestration
  - [Mage](https://www.mage.ai)
  - [Astronomer](https://www.astronomer.io)
  - [Prefect](https://www.prefect.io)
  - [Dagster](https://www.dagster.io)
  - [Airbyte](https://airbyte.com)
  - [Kestra](https://kestra.io/)
  - [Shipyard](https://www.shipyardapp.com/)
  - [Hamilton](https://github.com/dagworks-inc/hamilton)
- Data Lake / Cloud
  - [Tabular](https://www.tabular.io)
  - [Microsoft](https://www.microsoft.com)
  - [Databricks](https://www.databricks.com/company/about-us)
  - [Onehouse](https://www.onehouse.ai)
  - [Delta Lake](https://delta.io/)
- Data Warehouse
  - [Snowflake](https://www.snowflake.com/en/)
  - [Firebolt](https://www.firebolt.io/)
- Data Quality
  - [dbt](https://www.getdbt.com/)
  - [Gable](https://www.gable.ai)
  - [Great Expectations](https://www.greatexpectations.io)
  - [Streamdal](https://streamdal.com)
  - [Coalesce](https://coalesce.io/)
  - [Soda](https://www.soda.io/)
  - [DQOps](https://dqops.com/)
- Education Companies
  - [DataExpert.io](https://www.dataexpert.io)
  - [LearnDataEngineering.com](https://www.learndataengineering.com)
  - [AlgoExpert](https://www.algoexpert.io)
  - [ByteByteGo](https://www.bytebytego.com)
- Analytics / Visualization
  - [Preset](https://www.preset.io)
  - [Starburst](https://www.starburst.io)
  - [Metabase](https://www.metabase.com/)
  - [Looker Studio](https://lookerstudio.google.com/overview)
  - [Tableau](https://www.tableau.com/)
  - [Power BI](https://powerbi.microsoft.com/)
  - [Apache Superset](https://superset.apache.org/)
- Data Integration
  - [Cube](https://cube.dev)
  - [Fivetran](https://www.fivetran.com)
  - [Airbyte](https://airbyte.io)
  - [dlt](https://dlthub.com/)
  - [Sling](https://slingdata.io/)
  - [Meltano](https://meltano.com/)
- Modern OLAP
  - [Apache Druid](https://druid.apache.org/)
  - [ClickHouse](https://clickhouse.com/)
  - [Apache Pinot](https://pinot.apache.org/)
  - [Apache Kylin](https://kylin.apache.org/)
  - [DuckDB](https://duckdb.org/)
- LLM application library
  - [AdalFlow](https://github.com/SylphAI-Inc/AdalFlow)

Data Engineering blogs of companies:

- [Netflix](https://netflixtechblog.com/tagged/big-data)
- [Uber](https://www.uber.com/blog/houston/data/?uclick_id=b2f43229-f3f4-4bae-bd5d-10a05db2f70c)
- [Databricks](https://www.databricks.com/blog/category/engineering/data-engineering)
- [Airbnb](https://medium.com/airbnb-engineering/data/home)
- [Amazon AWS Blog](https://aws.amazon.com/blogs/big-data/)
- [Microsoft Data Architecture Blogs](https://techcommunity.microsoft.com/t5/data-architecture-blog/bg-p/DataArchitectureBlog)
- [Microsoft Fabric Blog](https://blog.fabric.microsoft.com/)
- [Oracle](https://blogs.oracle.com/datawarehousing/)
- [Meta](https://engineering.fb.com/category/data-infrastructure/)
- [Onehouse](https://www.onehouse.ai/blog)

Data Engineering Whitepapers:

- [A Five-Layered Business Intelligence Architecture](https://ibimapublishing.com/articles/CIBIMA/2011/695619/695619.pdf)
- [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)
- [Big Data Quality: A Data Quality Profiling Model](https://link.springer.com/chapter/10.1007/978-3-030-23381-5_5)
- [The Data Lakehouse: Data Warehousing and More](https://arxiv.org/abs/2310.08697)
- [Spark: Cluster Computing with Working Sets](https://dl.acm.org/doi/10.5555/1863103.1863113)
- [The Google File System](https://research.google/pubs/the-google-file-system/)
- [Building a Universal Data Lakehouse](https://www.onehouse.ai/whitepaper/onehouse-universal-data-lakehouse-whitepaper)
- [XTable in Action: Seamless Interoperability in Data Lakes](https://arxiv.org/abs/2401.09621)
- [MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/)

Great YouTube Channels:

- 100k+ subscribers
  - [E-learning Bridge](https://www.youtube.com/@shashank_mishra)
  - [TrendyTech](https://www.youtube.com/c/TrendytechInsights)
  - [Darshil Parmar](https://www.youtube.com/@DarshilParmar)
  - [Andreas Kretz](https://www.youtube.com/c/andreaskayy)
  - [ByteByteGo](https://www.youtube.com/c/ByteByteGo)
  - [The Ravit Show](https://youtube.com/@theravitshow)
  - [Guy in a Cube](https://www.youtube.com/@GuyInACube)
  - [Adam Marczak](https://www.youtube.com/@AdamMarczakYT)
  - [nullQueries](https://www.youtube.com/@nullQueries)
  - [TECHTFQ by Thoufiq](https://www.youtube.com/@techTFQ)
- 10k+ subscribers
  - [Data with Zach](https://www.youtube.com/c/datawithzach)
  - [Seattle Data Guy](https://www.youtube.com/c/SeattleDataGuy)
  - [Azure Lib](https://www.youtube.com/@azurelib-academy)
  - [Advancing Analytics](https://www.youtube.com/@AdvancingAnalytics)
  - [Kahan Data Solutions](https://www.youtube.com/@KahanDataSolutions)
  - [Ankit Bansal](https://youtube.com/@ankitbansal6)
  - [Mr. K Talks Tech](https://www.youtube.com/channel/UCzdOan4AmF65PmLLks8Lmww)
- 1k+ subscribers
  - [Eric Roby](https://www.youtube.com/@codingwithroby)

Great Podcasts

- [The Data Engineering Show](https://www.dataengineeringshow.com/)
- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
- [DataTopics](https://www.datatopics.io/)
- [The Engineering Side of Data](https://podcasts.apple.com/us/podcast/the-engineering-side-of-data/id1566999533)
- [DataAware](https://www.ascend.io/dataaware-podcast/)
- [The Data Coffee Break Podcast](https://www.deezer.com/us/show/5293247)
- [The Data Stack Show](https://datastackshow.com/)
- [Intricity101 Data Sharks Podcast](https://www.intricity.com/learningcenter/podcast)
- [Drill to Detail with Mark Rittman](https://www.rittmananalytics.com/drilltodetail/)
- [Analytics Power Hour](https://analyticshour.io/)
- [Catalog & Cocktails](https://listen.casted.us/public/127/Catalog-%26-Cocktails-2fcf8728)
- [DataTalks.Club](https://datatalks.club/podcast.html)
- [Data Brew by Databricks](https://www.databricks.com/discover/data-brew)
- [The Data Cloud Podcast by Snowflake](https://rise-of-the-data-cloud.simplecast.com/)
- [What's New in Data](https://www.striim.com/podcast/)
- [Open||Source||Data by DataStax](https://www.datastax.com/resources/podcast/open-source-data)
- [Streaming Audio by Confluent](https://developer.confluent.io/podcast/)
- [The Data Scientist Show](https://podcasts.apple.com/us/podcast/the-data-scientist-show/id1584430381)
- [MLOps.community](https://podcast.mlops.community/)
- [Monday Morning Data Chat](https://open.spotify.com/show/3Km3lBNzJpc1nOTJUtbtMh)
- [The Data Chief](https://www.thoughtspot.com/data-chief/podcast)

Newsletters:

- [DataEngineer.io Newsletter](https://blog.dataengineer.io)
- [Seattle Data Guy](https://seattledataguy.substack.com)
- [Joe Reis](https://joereis.substack.com)
- [Data Engineering Weekly](https://www.dataengineeringweekly.com)
- [Data Engineering Central](https://dataengineeringcentral.substack.com)
- [Dutch Engineer](https://dutchengineer.substack.com)
- [ByteByteGo](https://blog.bytebytego.com)
- [Start Data Engineering](https://www.startdataengineering.com)
- [Developing Dev](https://www.developing.dev)
- [High Growth Engineer](https://careercutler.substack.com/)
- [Learn Analytics Engineering](https://learnanalyticsengineering.substack.com/)
- [Marvelous MLOps](https://marvelousmlops.substack.com/)
- [Medium Data Engineering Newsletter](https://medium.com/data-engineering-weekly)
- [Benn Stancil](https://benn.substack.com/)
- [Metadata Weekly](https://metadataweekly.substack.com/)
- [Technically](https://technically.substack.com/)
- [Blef.fr Data News](https://www.blef.fr/blog/)
- [All Hands on Data](https://allhandsondata.substack.com/)
- [Modern Data 101](https://moderndata101.substack.com/)
- [SELECT Insights](https://newsletter.ssp.sh/)
- [Interesting Data Gigs](https://newsletter.interestinggigs.com)
- [Ju Data Engineering Weekly](https://juhache.substack.com/)
- [From An Engineer Sight](https://fromanengineersight.substack.com/)

Glossaries:
- [Data Engineering Vault](https://www.ssp.sh/brain/data-engineering/)
- [Airbyte Data Glossary](https://glossary.airbyte.com/)
- [Data Engineering Wiki by Reddit](https://dataengineering.wiki/Index)
- [Secoda Glossary](https://www.secoda.co/glossary/)
- [Databricks Glossary](https://www.databricks.com/glossary)
- [Airtable Glossary](https://airtable.com/shrGh8BqZbkfkbrfk/tbluZ3ayLHC3CKsDb)
- [Data Engineering Glossary by Dagster](https://dagster.io/glossary)

LinkedIn

- 100k+ Followers
  - [Zach Wilson](https://www.linkedin.com/in/eczachly)
  - [Ben Rogojan](https://www.linkedin.com/in/benjaminrogojan)
  - [Sumit Mittal](https://www.linkedin.com/in/bigdatabysumit/)
  - [Shashank Mishra](https://www.linkedin.com/in/shashank219/)
  - [Chip Huyen](https://www.linkedin.com/in/chiphuyen/)
  - [Alex Xu](https://www.linkedin.com/in/alexxubyte)
  - [Deepak Goyal](https://www.linkedin.com/in/deepak-goyal-93805a17/)
  - [Andreas Kretz](https://www.linkedin.com/in/andreas-kretz)
- 50k+ Followers
  - [Joe Reis](https://www.linkedin.com/in/josephreis)
  - [Darshil Parmar](https://www.linkedin.com/in/darshil-parmar/)
  - [Ankit Bansal](https://www.linkedin.com/in/ankitbansal6/)
  - [Marc Lamberti](https://www.linkedin.com/in/marclamberti)
- 10k+ Followers
  - [Li Yin](https://www.linkedin.com/in/li-yin-ai/)
  - [Joseph Machado](https://www.linkedin.com/in/josephmachado1991/)
  - [Eric Roby](https://www.linkedin.com/in/codingwithroby/)
  - [Simon Whiteley](https://www.linkedin.com/in/simon-whiteley-uk/)
  - [Simon Späti](https://www.linkedin.com/in/sspaeti/)
- 5k+ Followers
  - [Dipankar Mazumdar](https://www.linkedin.com/in/dipankar-mazumdar/)
  - [Daniel Ciocirlan](https://www.linkedin.com/in/danielciocirlan)
  - [Hugo Lu](https://www.linkedin.com/in/hugo-lu-confirmed/)
  - [Tobias Macey](https://www.linkedin.com/in/tmacey)
  - [Marcos Ortiz](https://www.linkedin.com/in/mlortiz)
  - [Julien Hurault](https://www.linkedin.com/in/julienhuraultanalytics/)
- 1k+ Followers
  - [Shruti Mantri](https://www.linkedin.com/in/shruti-mantri-88527a67/)
  - [Volker Janz](https://www.linkedin.com/in/vjanz/)
  - [Benoit Pimpaud](https://www.linkedin.com/in/pimpaudben/)

Twitter / X

- [Zach Wilson](https://www.twitter.com/EcZachly)
- [Seattle Data Guy](https://www.twitter.com/SeattleDataGuy)
- [Sumit Mittal](https://www.twitter.com/bigdatasumit)
- [Joseph Machado](https://twitter.com/startdataeng)
- [Alex Xu](https://twitter.com/alexxubyte/)
- [Eric Roby](https://twitter.com/codingwithroby)
- [Andreas Kretz](https://twitter.com/andreaskayy)
- [Marc Lamberti](https://twitter.com/marclambertiml)
- [Dipankar Mazumdar](https://twitter.com/Dipankartnt)
- [Start Data Engineering](https://twitter.com/startdataeng)
- [Data Cyborg](https://twitter.com/data_cyborg)
- [Simon Späti](https://twitter.com/sspaeti)
- [Marcos Ortiz](https://twitter.com/marcosluis2186)

Instagram

- [Zach Wilson](https://www.instagram.com/eczachly)
- [Andreas Kretz](https://www.instagram.com/learndataengineering)
- [Seattle Data Guy](https://www.instagram.com/seattledataguy)

TikTok

- [Zach Wilson](https://www.tiktok.com/@eczachly)
- [Alex The Analyst](https://www.tiktok.com/@alex_the_analyst)
- [Marcos Ortiz](https://www.tiktok.com/@marcosluis2186)

Design Patterns

- [Cumulative Table Design](https://www.github.com/EcZachly/cumulative-table-design)
- [Microbatch Deduplication](https://www.github.com/EcZachly/microbatch-hourly-deduped-tutorial)
- [The Little Book of Pipelines](https://www.github.com/EcZachly/little-book-of-pipelines)
- [Data Developer Platform](https://datadeveloperplatform.org/architecture/)

Courses / Academies

- [DataExpert.io course](https://www.dataexpert.io) Use code **HANDBOOK10** for a discount!
- [LearnDataEngineering.com](https://www.learndataengineering.com)
- [Technical Freelancer Academy](https://www.technicalfreelanceracademy.com/) Use code **zwtech** for a discount!
- [IBM Data Engineering for Everyone](https://www.edx.org/learn/data-engineering/ibm-data-engineering-basics-for-everyone)
- [Qwiklabs](https://www.qwiklabs.com/)
- [DataCamp](https://www.datacamp.com/)
- [Udemy Courses from Shruti Mantri](https://www.udemy.com/user/shruti-mantri-5/)
- [Rock the JVM](https://rockthejvm.com/) teaches Spark (in Scala), Flink and others
- [Data Engineering Zoomcamp by DataTalksClub](https://datatalks.club/)
- [Efficient Data Processing in Spark](https://josephmachado.podia.com/efficient-data-processing-in-spark)
- [Scaler](https://www.scaler.com/)

Certification Courses

- [Google Cloud Certified - Professional Data Engineer](https://cloud.google.com/certification/data-engineer)
- [Databricks - Data Engineer Professional](https://www.databricks.com/learn/certification/data-engineer-professional)
- [Azure Data Engineer Associate](https://learn.microsoft.com/credentials/certifications/azure-data-engineer/)
- [Microsoft Fabric Analytics Engineer Associate](https://learn.microsoft.com/credentials/certifications/fabric-analytics-engineer-associate/)
- [Exam DP-203: Data Engineering on Microsoft Azure](https://learn.microsoft.com/en-us/credentials/certifications/exams/dp-203/?tab=tab-learning-paths)
- [AWS Certified Data Engineer - Associate](https://aws.amazon.com/certification/certified-data-engineer-associate/)

Conferences

- [Trino Summit - December 13-14, 2023 - Virtual](https://www.starburst.io/info/trinosummit2023/)
- [Data Universe - April 10-11, 2024 - New York City](https://www.datauniverseevent.com/)
- [Data Nova @ Data Universe - April 10-11, 2024 - New York City](https://www.starburst.io/datanova/)
- [DataTune Conference - March 8-9, 2024 - Nashville, TN](https://www.datatuneconf.com/)