Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/adilkhash/Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
- Host: GitHub
- URL: https://github.com/adilkhash/Data-Engineering-HowTo
- Owner: adilkhash
- Created: 2019-03-28T07:43:26.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-06-19T08:49:58.000Z (7 months ago)
- Last Synced: 2024-10-29T15:11:52.558Z (2 months ago)
- Topics: cloud-providers, data-engineering, data-pipeline, distributed-systems, scala
- Homepage:
- Size: 56.6 KB
- Stars: 3,512
- Watchers: 102
- Forks: 504
- Open Issues: 6
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-repositories - adilkhash/Data-Engineering-HowTo - A list of useful resources to learn Data Engineering from scratch (Others)
- awesome-ai-data-github-repos - How To Become a Data Engineer
README
# How To Become a Data Engineer
### Useful articles
- [The AI Hierarchy of Needs](https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)
- [The Rise of Data Engineer](https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603)
- [The Downfall of the Data Engineer](https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b)
- A Beginner’s Guide to Data Engineering
  - [Part I](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7)
  - [Part II](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71?source=---------5------------------)
  - [Part III](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-the-series-finale-2cc92ff14b0?source=---------4------------------)
- [Functional Data Engineering — a modern paradigm for batch data processing](https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a)
- How to become a Data Engineer [Ru](https://khashtamov.com/ru/data-engineer/), [En](https://khashtamov.com/en/how-to-become-a-data-engineer/)
- Introduction to Apache Airflow [Ru](https://khashtamov.com/ru/apache-airflow-introduction/?utm_source=github&utm_medium=dataeng-repository&utm_campaign=dataeng), [En](https://khashtamov.com/en/introduction-to-apache-airflow/)
- [Apache Airflow Alternatives](https://airflowmastery.com/apache-airflow-alternatives/)
### Talks
- [Data Engineering Principles - Build frameworks not pipelines](https://www.youtube.com/watch?v=pzfgbSfzhXg) by Gatis Seja
- [Functional Data Engineering - A Set of Best Practices](https://www.youtube.com/watch?v=4Spo2QRTz1k) by Maxime Beauchemin
- [Advanced Data Engineering Patterns with Apache Airflow](https://www.youtube.com/watch?v=Fvu2oFyFCT0) by Maxime Beauchemin
- [Creating a Data Engineering Culture](https://www.youtube.com/watch?v=VkeleGIUSM8) by Jesse Anderson
- [Streaming 101: Hello Streaming](https://www.youtube.com/watch?v=A1YC_AC0qf8) by Josh Fischer
### Algorithms & Data Structures
- [Algorithmic Toolbox](https://stepik.org/course/217) in Russian
- [Data Structures](https://stepik.org/course/1547) in Russian
- [Data Structures & Algorithms Specialization](https://www.coursera.org/specializations/data-structures-algorithms) on Coursera
- [Algorithms Specialization](https://www.coursera.org/specializations/algorithms) from Stanford on Coursera
### SQL
- [Comprehensive SQL Tutorial](https://mode.com/sql-tutorial/introduction-to-sql/) by Mode Analytics
- [SQL Practice](https://leetcode.com/problemset/database/) on Leetcode
- [Modern SQL](https://modern-sql.com/) a website about modern SQL syntax
- Introduction to Window Functions [En](https://khashtamov.com/en/sql-window-functions/), [Ru](https://khashtamov.com/ru/window-functions-sql/)
### Programming
- [Scala School](https://twitter.github.io/scala_school/) by Twitter
- [Fluent Python](https://www.amazon.com/gp/product/1491946008/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1491946008&linkCode=as2&tag=adilkhash-20&linkId=8a663e966770c24874e323133cc7a005), an intermediate-level book about Python
- [Intro to Scala](https://stepik.org/course/16243) in Russian on Stepik by Tinkoff Bank
- [The Hitchhiker’s Guide to Python](https://docs.python-guide.org/) by Kenneth Reitz & Tanya Schlusser
- [Learn Python 3 The Hard Way](https://learnpythonthehardway.org/python3/) by Zed A. Shaw
### Databases
- [Intro to Database Systems](https://www.youtube.com/playlist?list=PLSE8ODhjZXjYutVzTeAds8xUt1rcmyT7x) by Carnegie Mellon University
- [Advanced Database Systems](https://www.youtube.com/playlist?list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U) by Carnegie Mellon University
- On Disk IO
  - I. [Flavors of IO](https://medium.com/databasss/on-disk-io-part-1-flavours-of-io-8e1ace1de017)
  - II. [More Flavours of IO](https://medium.com/databasss/on-disk-io-part-2-more-flavours-of-io-c945db3edb13)
  - III. [LSM Trees](https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2da218496f)
  - IV. [B-Trees and RUM Conjecture](https://medium.com/databasss/on-disk-storage-part-4-b-trees-30791060741)
  - V. [Access Patterns in LSM Trees](https://medium.com/databasss/on-disk-io-access-patterns-in-lsm-trees-2ba8dffc05f9)
### Distributed Systems
- [Distributed systems for fun and profit](http://book.mixu.net/distsys/) by Mikito Takada
- [Distributed Systems](https://www.amazon.com/gp/product/1543057381/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1543057381&linkCode=as2&tag=adilkhash-20&linkId=721aedeb23c313bc46a92c134c5baafa) by Maarten van Steen & Andrew S. Tanenbaum
- [CSE138: Distributed Systems](https://www.youtube.com/playlist?list=PLNPUF5QyWU8O0Wd8QDh9KaM1ggsxspJ31) by Lindsey Kuper
- [CS 436: Distributed Computer Systems](https://www.youtube.com/watch?v=w8KFPWkK0bI&list=PLawkBQ15NDEkDJ5IyLIJUTZ1rRM9YQq6N&index=2) by University of Waterloo
- [MIT 6.824: Distributed Systems](https://www.youtube.com/playlist?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB) by Robert Morris from MIT
- [Distributed consensus reading list](https://github.com/heidi-ann/distributed-consensus-reading-list) maintained by Heidi Howard from University of Cambridge
### Books
- [Designing Data-Intensive Applications](https://www.amazon.com/gp/product/1449373321/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1449373321&linkCode=as2&tag=adilkhash-20&linkId=e7e0e096aa5761066245eb90965ac849) by Martin Kleppmann
- [Fundamentals of Data Engineering: Plan and Build Robust Data Systems](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302) by Joe Reis & Matt Housley
- [Introduction to Algorithms](https://www.amazon.com/gp/product/0262033844/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0262033844&linkCode=as2&tag=adilkhash-20&linkId=74742875db503b1a899ca35159749067) by Thomas Cormen
- [The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling](https://www.amazon.com/gp/product/1118530802/ref=as_li_tl?ie=UTF8&tag=adilkhash-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=1118530802&linkId=6ca865e8e9817dca57718bdbe5e52cd5)
- [Star Schema The Complete Reference](https://www.amazon.com/gp/product/0071744320/ref=as_li_tl?ie=UTF8&tag=adilkhash-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=0071744320&linkId=2abf9ef1d327071f74f59c3659ed6223)
- [Database Internals: A Deep Dive into How Distributed Data Systems Work](https://www.amazon.com/gp/product/1492040347/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1492040347&linkCode=as2&tag=adilkhash-20&linkId=4a23dead1aeb11fd4debffb36487aa14)
- [Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing](https://www.amazon.com/gp/product/1491983876/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1491983876&linkCode=as2&tag=adilkhash-20&linkId=9869047f1ac02b597d8a0e67fd29ad68)
- [A Philosophy of Software Design](https://www.amazon.com/gp/product/1732102201/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1732102201&linkCode=as2&tag=adilkhash-20&linkId=b020fab52fa5f1fed2191ea12e824468)
- [Grokking Streaming Systems](https://www.manning.com/books/grokking-streaming-systems) by Josh Fischer & Ning Wang
- [Guide to High Performance Distributed Computing](https://www.amazon.com/Guide-High-Performance-Distributed-Computing/dp/3319134965) by K.G. Srinivasa & Anil Kumar Muppalla
- [Data Pipelines with Apache Airflow](https://www.manning.com/books/data-pipelines-with-apache-airflow) by Bas P. Harenslak and Julian Rutger de Ruiter
### Courses
- [Data Engineering on Google Cloud Platform Specialization](https://www.coursera.org/specializations/gcp-data-machine-learning) by Google
- [Data Engineer Nanodegree](https://udacity.com/course/data-engineer-nanodegree--nd027) by Udacity
- [Data Engineering with Python](https://www.datacamp.com/tracks/data-engineer-with-python) by DataCamp
### Blogs
- [Martin Kleppmann](https://martin.kleppmann.com/), author of Designing Data-Intensive Applications
- [BaseDS](https://medium.com/baseds) by Vaidehi Joshi about Distributed Systems
### Tools
- [Apache Airflow](https://airflow.apache.org/) is a platform to programmatically author, schedule and monitor workflows in Python
- [Apache Spark](https://spark.apache.org/) is a unified analytics engine for large-scale data processing
- [Apache Kafka](https://kafka.apache.org/) is a distributed streaming platform
- [Luigi](https://luigi.readthedocs.io) is a Python package that helps you build complex pipelines of batch jobs.
- [Dagster.io](https://docs.dagster.io) is a system for building modern data applications.
- [Prefect](https://prefect.io) includes everything you need to create and run data applications.
- [Metaflow](https://github.com/Netflix/metaflow) helps you build and manage real-life data science projects with ease.
- [lakeFS](https://github.com/treeverse/lakeFS) lets you build repeatable, atomic and versioned data lake operations, from complex ETL jobs to data science and analytics.
### Cloud Platforms
- [Amazon Web Services](https://aws.amazon.com/)
- [Google Cloud Platform](https://cloud.google.com/gcp/)
- [Microsoft Azure](https://azure.microsoft.com)
- [Yandex Cloud](https://cloud.yandex.ru/)
- [DigitalOcean](https://m.do.co/c/e92056c9e79b)
- [IBM Cloud](https://www.ibm.com/cloud/)
### Communities
- [Data Engineering](https://t.me/dataeng_chat) - Telegram chat about data engineering
- [Data Engineering Subreddit](https://www.reddit.com/r/dataengineering/) - subreddit about data engineering
### Data Engineering Jobs
- [Data Engineering jobs](https://remotelist.ru/category/data-engineering/)
### Other
- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
### Newsletters & Digests
- [DataEng Telegram channel](https://t.me/dataeng) - Telegram channel about data engineering (rus/eng)
- [Data Engineering Weekly](https://www.dataengineeringweekly.com/)
- [SF Data Weekly](http://weekly.sfdata.io) - A weekly email of useful links for people interested in building data platforms
- [Data Elixir](https://dataelixir.com/) - Data Elixir is an email newsletter that keeps you on top of the tools and trends in Data Science.
- [Data Governance, Privacy and Security](https://dbadmin.news/) - DbAdmin News is a newsletter on the technology behind Data Governance, Security and Privacy