https://github.com/OBenner/data-engineering-interview-questions

More than 2,000 data engineer interview questions.

Host: GitHub
URL: https://github.com/OBenner/data-engineering-interview-questions
Owner: OBenner
Created: 2021-08-08T15:49:45.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2025-01-26T15:28:29.000Z (over 1 year ago)
Last Synced: 2025-04-04T19:00:38.926Z (over 1 year ago)
Topics: airflow, avro, aws, azure, cassandra, data-engineering, data-structures, flink, flume, hadoop, hadoop-hdfs, hbase, hive, impala, interview, interview-questions, kafka, nifi, spark, sql
Homepage:
Size: 938 KB
Stars: 1,298
Watchers: 21
Forks: 463
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-ai-data-github-repos - Data Engineering Interview Questions

More than 2,000 questions for preparing for a Data Engineer interview.

Interview questions for Data Engineer

Databases and Data Warehouses

Columns: GitHub Repo, Official page, Questions, Description, Useful links

Apache Cassandra
Cassandra is a distributed, wide-column NoSQL database management system.
Awesome Cassandra

Greenplum
Greenplum is a big data technology based on a massively parallel processing (MPP) architecture and the open-source PostgreSQL database.
Awesome Greenplum

MongoDB
MongoDB is a document-oriented database.
Awesome MongoDB

Apache HBase
HBase is an open-source non-relational distributed database.
Awesome HBase

Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.
Awesome Hive

Amazon DynamoDB
Amazon DynamoDB is a fully managed proprietary NoSQL database service.
Awesome DynamoDB
Awesome AWS

Amazon Redshift
Amazon Redshift is a cloud data warehouse product that is part of Amazon Web Services.
Amazon Redshift Utilities
Awesome AWS

BigQuery GCP
BigQuery is a fully managed, serverless data warehouse.
Awesome BigQuery

Bigtable GCP
Bigtable is a fully managed wide-column and key-value NoSQL database service.
Awesome Bigtable

Data Formats

Apache Avro
Avro is a row-oriented remote procedure call and data serialization framework.
Awesome Avro

Apache Parquet
Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval.
TODO

Delta
Delta Lake is a storage framework that enables building a Lakehouse architecture with compute engines such as Apache Spark, Flink, and Trino.
Delta examples

Big Data Frameworks

Apache Airflow
Apache Airflow is a workflow management platform for data engineering pipelines.
Awesome Airflow

Apache Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
TODO

Apache Hadoop
Apache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
Awesome Hadoop

Apache Impala
Apache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop.
TODO

Apache Kafka
Apache Kafka is a distributed event store and stream-processing platform.
Awesome Kafka

Apache NiFi
Apache NiFi is a software project designed to automate the flow of data between software systems.
Awesome NiFi

Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing.
Awesome Spark

Apache Flink
Apache Flink is a unified stream-processing and batch-processing framework.
Awesome Flink

Kubernetes
Kubernetes is a system for managing containerized applications across multiple hosts.
Awesome Kubernetes

Cloud providers

Amazon Web Services
Amazon Web Services is an online platform that provides scalable and cost-effective cloud computing solutions.
Awesome AWS

Microsoft Azure
Microsoft Azure is Microsoft's public cloud computing platform.
Awesome Azure

Google Cloud Platform
Google Cloud Platform is a suite of cloud computing services.
Awesome GCP

Theory

DWH Architectures
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise.
Awesome databases

Data Structures
A data structure is a specialized format for organizing, processing, retrieving and storing data.
TODO

SQL
SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).
Awesome SQL

Data visualization tools/BI

Tableau
Tableau is a powerful data visualization tool used in business intelligence.
TODO

Looker
Looker is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time.
TODO

Apache Superset
Superset is a modern data exploration and data visualization platform.
TODO