{"id":13492770,"url":"https://github.com/adilkhash/Data-Engineering-HowTo","last_synced_at":"2025-03-28T10:32:54.746Z","repository":{"id":41516235,"uuid":"178151956","full_name":"adilkhash/Data-Engineering-HowTo","owner":"adilkhash","description":"A list of useful resources to learn Data Engineering from scratch","archived":false,"fork":false,"pushed_at":"2024-06-19T08:49:58.000Z","size":58,"stargazers_count":3726,"open_issues_count":8,"forks_count":531,"subscribers_count":102,"default_branch":"master","last_synced_at":"2025-03-27T06:07:03.116Z","etag":null,"topics":["cloud-providers","data-engineering","data-pipeline","distributed-systems","scala"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adilkhash.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-28T07:43:26.000Z","updated_at":"2025-03-27T03:04:35.000Z","dependencies_parsed_at":"2024-12-18T16:12:22.145Z","dependency_job_id":"2dfb0628-2a0b-4e25-b2e2-9fa27fb58ab6","html_url":"https://github.com/adilkhash/Data-Engineering-HowTo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adilkhash%2FData-Engineering-HowTo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adilkhash%2FData-Engineering-HowTo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adilkhash%2FData-Engineering-HowTo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/adilkhash%2FData-Engineering-HowTo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adilkhash","download_url":"https://codeload.github.com/adilkhash/Data-Engineering-HowTo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246012780,"owners_count":20709513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud-providers","data-engineering","data-pipeline","distributed-systems","scala"],"created_at":"2024-07-31T19:01:09.007Z","updated_at":"2025-03-28T10:32:54.713Z","avatar_url":"https://github.com/adilkhash.png","language":null,"readme":"# How To Become a Data Engineer\n\n### Useful articles\n- [The AI Hierarchy of Needs](https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)\n- [The Rise of the Data Engineer](https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603)\n- [The Downfall of the Data Engineer](https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b)\n- A Beginner’s Guide to Data Engineering\n  - [Part I](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7)\n  - [Part II](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71?source=---------5------------------)\n  - [Part III](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-the-series-finale-2cc92ff14b0?source=---------4------------------)\n- [Functional Data Engineering — a modern paradigm for batch data 
processing](https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a)\n- How to become a Data Engineer [Ru](https://khashtamov.com/ru/data-engineer/), [En](https://khashtamov.com/en/how-to-become-a-data-engineer/)\n- Introduction to Apache Airflow [Ru](https://khashtamov.com/ru/apache-airflow-introduction/?utm_source=github\u0026utm_medium=dataeng-repository\u0026utm_campaign=dataeng), [En](https://khashtamov.com/en/introduction-to-apache-airflow/)\n- [Apache Airflow Alternatives](https://airflowmastery.com/apache-airflow-alternatives/)\n\n### Talks\n- [Data Engineering Principles - Build frameworks not pipelines](https://www.youtube.com/watch?v=pzfgbSfzhXg) by Gatis Seja\n- [Functional Data Engineering - A Set of Best Practices](https://www.youtube.com/watch?v=4Spo2QRTz1k) by Maxime Beauchemin\n- [Advanced Data Engineering Patterns with Apache Airflow](https://www.youtube.com/watch?v=Fvu2oFyFCT0) by Maxime Beauchemin\n- [Creating a Data Engineering Culture](https://www.youtube.com/watch?v=VkeleGIUSM8) by Jesse Anderson\n- [Streaming 101: Hello Streaming](https://www.youtube.com/watch?v=A1YC_AC0qf8) by Josh Fischer\n\n### Algorithms \u0026 Data Structures\n- [Algorithmic Toolbox](https://stepik.org/course/217) in Russian\n- [Data Structures](https://stepik.org/course/1547) in Russian\n- [Data Structures \u0026 Algorithms Specialization](https://www.coursera.org/specializations/data-structures-algorithms) on Coursera\n- [Algorithms Specialization](https://www.coursera.org/specializations/algorithms) from Stanford on Coursera\n\n### SQL\n- [Comprehensive SQL Tutorial](https://mode.com/sql-tutorial/introduction-to-sql/) by Mode Analytics\n- [SQL Practice](https://leetcode.com/problemset/database/) on Leetcode\n- [Modern SQL](https://modern-sql.com/) a website about modern SQL syntax\n- Introduction to Window Functions [En](https://khashtamov.com/en/sql-window-functions/), 
[Ru](https://khashtamov.com/ru/window-functions-sql/)\n\n### Programming\n- [Scala School](https://twitter.github.io/scala_school/) by Twitter\n- [Fluent Python](https://www.amazon.com/gp/product/1491946008/ref=as_li_tl?ie=UTF8\u0026camp=1789\u0026creative=9325\u0026creativeASIN=1491946008\u0026linkCode=as2\u0026tag=adilkhash-20\u0026linkId=8a663e966770c24874e323133cc7a005) intermediate level book about Python\n- [Intro to Scala](https://stepik.org/course/16243) in Russian on Stepik by Tinkoff Bank\n- [The Hitchhiker’s Guide to Python](https://docs.python-guide.org/) by Kenneth Reitz \u0026 Tanya Schlusser\n- [Learn Python 3 The Hard Way](https://learnpythonthehardway.org/python3/) by Zed A. Shaw\n\n### Databases\n- [Intro to Database Systems](https://www.youtube.com/playlist?list=PLSE8ODhjZXjYutVzTeAds8xUt1rcmyT7x) by Carnegie Mellon University\n- [Advanced Database Systems](https://www.youtube.com/playlist?list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U) by Carnegie Mellon University\n- On Disk IO\n  - I. [Flavors of IO](https://medium.com/databasss/on-disk-io-part-1-flavours-of-io-8e1ace1de017)\n  - II. [More Flavours of IO](https://medium.com/databasss/on-disk-io-part-2-more-flavours-of-io-c945db3edb13)\n  - III. [LSM Trees](https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2da218496f)\n  - IV. [B-Trees and RUM Conjecture](https://medium.com/databasss/on-disk-storage-part-4-b-trees-30791060741)\n  - V. [Access Patterns in LSM Trees](https://medium.com/databasss/on-disk-io-access-patterns-in-lsm-trees-2ba8dffc05f9)\n\n### Distributed Systems\n- [Distributed systems for fun and profit](http://book.mixu.net/distsys/) by Mikito Takada\n- [Distributed Systems](https://www.amazon.com/gp/product/1543057381/ref=as_li_tl?ie=UTF8\u0026camp=1789\u0026creative=9325\u0026creativeASIN=1543057381\u0026linkCode=as2\u0026tag=adilkhash-20\u0026linkId=721aedeb23c313bc46a92c134c5baafa) by Maarten van Steen \u0026 Andrew S. 
Tanenbaum\n- [CSE138: Distributed Systems](https://www.youtube.com/playlist?list=PLNPUF5QyWU8O0Wd8QDh9KaM1ggsxspJ31) by Lindsey Kuper \n- [CS 436: Distributed Computer Systems](https://www.youtube.com/watch?v=w8KFPWkK0bI\u0026list=PLawkBQ15NDEkDJ5IyLIJUTZ1rRM9YQq6N\u0026index=2) by University of Waterloo \n- [MIT 6.824: Distributed Systems](https://www.youtube.com/playlist?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB) by Robert Morris from MIT\n- [Distributed consensus reading list](https://github.com/heidi-ann/distributed-consensus-reading-list) maintained by Heidi Howard from University of Cambridge\n\n### Books\n- [Designing Data-Intensive Applications](https://www.amazon.com/gp/product/1449373321/ref=as_li_tl?ie=UTF8\u0026camp=1789\u0026creative=9325\u0026creativeASIN=1449373321\u0026linkCode=as2\u0026tag=adilkhash-20\u0026linkId=e7e0e096aa5761066245eb90965ac849) by Martin Kleppmann\n- [Fundamentals of Data Engineering: Plan and Build Robust Data Systems](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302) by Joe Reis \u0026 Matt Housley\n- [Introduction to Algorithms](https://www.amazon.com/gp/product/0262033844/ref=as_li_tl?ie=UTF8\u0026camp=1789\u0026creative=9325\u0026creativeASIN=0262033844\u0026linkCode=as2\u0026tag=adilkhash-20\u0026linkId=74742875db503b1a899ca35159749067) by Thomas Cormen\n- [The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling](https://www.amazon.com/gp/product/1118530802/ref=as_li_tl?ie=UTF8\u0026tag=adilkhash-20\u0026camp=1789\u0026creative=9325\u0026linkCode=as2\u0026creativeASIN=1118530802\u0026linkId=6ca865e8e9817dca57718bdbe5e52cd5)\n- [Star Schema The Complete Reference](https://www.amazon.com/gp/product/0071744320/ref=as_li_tl?ie=UTF8\u0026tag=adilkhash-20\u0026camp=1789\u0026creative=9325\u0026linkCode=as2\u0026creativeASIN=0071744320\u0026linkId=2abf9ef1d327071f74f59c3659ed6223)\n- [Database Internals: A Deep Dive into How Distributed Data Systems 
Work](https://www.amazon.com/gp/product/1492040347/ref=as_li_tl?ie=UTF8\u0026camp=1789\u0026creative=9325\u0026creativeASIN=1492040347\u0026linkCode=as2\u0026tag=adilkhash-20\u0026linkId=4a23dead1aeb11fd4debffb36487aa14)\n- [Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing](https://www.amazon.com/gp/product/1491983876/ref=as_li_tl?ie=UTF8\u0026camp=1789\u0026creative=9325\u0026creativeASIN=1491983876\u0026linkCode=as2\u0026tag=adilkhash-20\u0026linkId=9869047f1ac02b597d8a0e67fd29ad68)\n- [A Philosophy of Software Design](https://www.amazon.com/gp/product/1732102201/ref=as_li_tl?ie=UTF8\u0026camp=1789\u0026creative=9325\u0026creativeASIN=1732102201\u0026linkCode=as2\u0026tag=adilkhash-20\u0026linkId=b020fab52fa5f1fed2191ea12e824468)\n- [Grokking Streaming Systems](https://www.manning.com/books/grokking-streaming-systems) by Josh Fischer \u0026 Ning Wang\n- [Guide to High Performance Distributed Computing](https://www.amazon.com/Guide-High-Performance-Distributed-Computing/dp/3319134965) by K.G. Srinivasa \u0026 Anil Kumar Muppalla\n- [Data Pipelines with Apache Airflow](https://www.manning.com/books/data-pipelines-with-apache-airflow) by Bas P. 
Harenslak and Julian Rutger de Ruiter\n\n### Courses\n- [Data Engineering on Google Cloud Platform Specialization](https://www.coursera.org/specializations/gcp-data-machine-learning) by Google\n- [Data Engineer Nanodegree](https://udacity.com/course/data-engineer-nanodegree--nd027) by Udacity\n- [Data Engineering with Python](https://www.datacamp.com/tracks/data-engineer-with-python) by DataCamp\n\n### Blogs\n- [Martin Kleppmann](https://martin.kleppmann.com/) author of Designing Data-Intensive Applications\n- [BaseDS](https://medium.com/baseds) by Vaidehi Joshi about Distributed Systems\n\n### Tools\n- [Apache Airflow](https://airflow.apache.org/) is a platform to programmatically author, schedule and monitor workflows in Python\n- [Apache Spark](https://spark.apache.org/) is a unified analytics engine for large-scale data processing\n- [Apache Kafka](https://kafka.apache.org/) is a distributed streaming platform\n- [Luigi](https://luigi.readthedocs.io) is a Python package that helps you build complex pipelines of batch jobs. 
\n- [Dagster.io](https://docs.dagster.io) is a system for building modern data applications.\n- [Prefect](https://prefect.io) includes everything you need to create and run data applications.\n- [Metaflow](https://github.com/Netflix/metaflow) build and manage real-life data science projects with ease\n- [lakeFS](https://github.com/treeverse/lakeFS) build repeatable, atomic and versioned data lake operations – from complex ETL jobs to data science and analytics.\n\n### Cloud Platforms\n- [Amazon Web Services](https://aws.amazon.com/)\n- [Google Cloud Platform](https://cloud.google.com/gcp/)\n- [Microsoft Azure](https://azure.microsoft.com)\n- [Yandex Cloud](https://cloud.yandex.ru/)\n- [DigitalOcean](https://m.do.co/c/e92056c9e79b)\n- [IBM Cloud](https://www.ibm.com/cloud/)\n\n### Communities\n- [Data Engineering](https://t.me/dataeng_chat) - Telegram chat about data engineering\n- [Data Engineering Subreddit](https://www.reddit.com/r/dataengineering/) - subreddit about data engineering\n\n### Data Engineering Jobs\n- [Data Engineering jobs](https://remotelist.ru/category/data-engineering/)\n\n### Other\n- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/)\n\n### Newsletters \u0026 Digests\n- [DataEng Telegram channel](https://t.me/dataeng) - Telegram channel about data engineering (rus/eng)\n- [Data Engineering Weekly](https://www.dataengineeringweekly.com/)\n- [SF Data Weekly](http://weekly.sfdata.io) - A weekly email of useful links for people interested in building data platforms\n- [Data Elixir](https://dataelixir.com/) - Data Elixir is an email newsletter that keeps you on top of the tools and trends in Data Science.\n- [Data Governance, Privacy and Security](https://dbadmin.news/) - DbAdmin News is a newsletter on the technology behind Data Governance, Security and Privacy\n","funding_links":[],"categories":["Others","Data Engineering 
##","Uncategorized"],"sub_categories":["Uncategorized"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadilkhash%2FData-Engineering-HowTo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadilkhash%2FData-Engineering-HowTo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadilkhash%2FData-Engineering-HowTo/lists"}