https://github.com/lefteris-souflas/apache-drill-and-impala
Explore data virtualization and query performance optimization with Apache Drill, Hive, and Impala. Tasks include comparing virtualization precision, proposing solutions for a bookstore's diverse data formats, creating Impala databases, and addressing query performance issues. The report offers practical insights and commands for implementation.
apache-drill apache-hive apache-hue apache-impala
- Host: GitHub
- URL: https://github.com/lefteris-souflas/apache-drill-and-impala
- Owner: Lefteris-Souflas
- License: mit
- Created: 2024-04-09T14:54:18.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-04-18T20:08:35.000Z (7 months ago)
- Last Synced: 2024-04-18T21:27:05.557Z (7 months ago)
- Topics: apache-drill, apache-hive, apache-hue, apache-impala
- Homepage:
- Size: 959 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Apache Drill and Impala Assignment
Assignment 1 for the Advanced Data Engineering Course of AUEB's MSc in Business Analytics
## Data Virtualization and Query Performance Optimization Report
This report examines data virtualization in big data analytics by comparing three tools: Apache Drill, Hive, and Impala. It also covers creating Impala databases and investigating slow Impala queries.
## Task 1 [25 points]
Among Hive, Impala, and Drill, which one implements the concept of data virtualization more precisely? Elaborate.
### Apache Drill: Leading the Data Virtualization Frontier
Apache Drill emerges as the frontrunner in implementing data virtualization with precision. Unlike Hive and Impala, which expect table definitions to be registered in a metastore before querying, Drill discovers schemas at read time and reaches each source through a storage plugin. This allows users to query and analyze data from diverse sources in place, without defining its schema up front or moving it to a single location.
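To make the schema-on-read idea concrete, here is a toy Python sketch. It is not Drill's implementation; the records and function names are hypothetical. It only illustrates that the column set can be discovered from the data at query time, and that fields missing from a record can surface as nulls rather than as errors.

```python
import json

# Hypothetical raw records, as they might sit in a JSON file.
# No table definition exists, and the records do not share the same fields.
RAW_LINES = [
    '{"title": "Dune", "price": 9.99}',
    '{"title": "Emma", "price": 4.50, "format": "epub"}',
    '{"title": "Ulysses", "pages": 730}',
]


def discover_schema(lines):
    """Infer column names and value types by scanning the data at read time."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def select(lines, columns):
    """Project the requested columns; fields absent from a record become None."""
    return [tuple(json.loads(line).get(c) for c in columns) for line in lines]


schema = discover_schema(RAW_LINES)
rows = select(RAW_LINES, ["title", "format"])
```

A schema-on-write system would instead reject or reshape these records at load time, before any query could run.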
## Task 2 [25 points]
You've joined a large bookstore company with diverse data formats: client data in MongoDB, e-books on HDFS, and social media metadata in Hive. They seek to simplify queries for UI elements. What solution would you suggest? Elaborate.
### Practical Application of Apache Drill
Task 2 of the report applies Apache Drill to the bookstore scenario. Drill can connect to MongoDB, HDFS, and Hive through its storage plugins, so the UI can issue a single SQL query across all three sources instead of consolidating them first through convoluted ETL processes. Adopting Drill therefore simplifies operations and speeds up decision-making.
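A minimal sketch of what such a cross-source query could look like, submitted through Drill's REST endpoint (`POST /query.json`). The plugin names `mongo`, `dfs`, and `hive`, along with every database, collection, path, and column name below, are assumptions made for illustration; the report does not specify them. The host and port assume a local Drillbit on Drill's default web port.

```python
import json
import urllib.request

# Hypothetical names throughout: plugins, databases, collections, paths, columns.
FEDERATED_SQL = """
SELECT c.name, b.title, s.likes
FROM mongo.bookstore.clients AS c
JOIN dfs.`/bookstore/ebooks/catalog.json` AS b ON b.client_id = c.client_id
JOIN hive.social.post_metadata AS s ON s.book_id = b.book_id
"""


def build_drill_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url="http://localhost:8047"):
    """Send the query to a running Drillbit and return the result rows."""
    with urllib.request.urlopen(build_drill_request(sql, base_url)) as resp:
        return json.load(resp)["rows"]


request = build_drill_request(FEDERATED_SQL)
```

Calling `run_query(FEDERATED_SQL)` only works against a running Drill instance with the three storage plugins configured and enabled.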
## Task 3 [40 points]
Your client has an Impala database and wants a new database with a specific schema:
![Screenshot 2024-04-09 182509](https://github.com/Lefteris-Souflas/Apache-Drill-and-Impala/assets/143879796/a58bcc70-fba0-4973-8275-dabc6ba878b3)
- 3a) Create the Impala database & required tables.
- 3b) Provide a command to insert an entry into the Student table.
- 3c) Write a statement to retrieve the names of students who attended the "Artificial Intelligence" course during "2021-2022".
- 3d) Write a statement to retrieve course titles and average grades for courses with average student grades below 6.
### Creating Impala Databases
Task 3 provides detailed steps for creating Impala databases and tables, accompanied by sample commands for data insertion and retrieval. This section equips users with the necessary knowledge to set up Impala databases effectively, facilitating seamless data management and analysis.
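The statements for 3a to 3d might look roughly like the sketch below. The required schema appears only in the screenshot, so the table and column names here (`student`, `course`, `enrollment`, and their fields) are hypothetical stand-ins, not the assignment's actual schema. To keep the example runnable without a cluster, an in-memory SQLite database stands in for Impala; the SQL is limited to constructs that should carry over, but it has not been run against Impala here.

```python
import sqlite3

# Hypothetical schema; the real one is defined only in the screenshot.
# STRING is Impala's text type; SQLite accepts the type name as well.
# In Impala you would first run something like:
#   CREATE DATABASE IF NOT EXISTS university; USE university;
DDL = """
CREATE TABLE student    (student_id INT, name STRING);
CREATE TABLE course     (course_id INT, title STRING);
CREATE TABLE enrollment (student_id INT, course_id INT,
                         academic_year STRING, grade DOUBLE);
"""

# 3b) Insert one entry into the student table.
INSERT_STUDENT = "INSERT INTO student VALUES (1, 'Maria')"

# 3c) Students who attended "Artificial Intelligence" during 2021-2022.
STUDENTS_IN_COURSE = """
SELECT s.name
FROM student s
JOIN enrollment e ON e.student_id = s.student_id
JOIN course c ON c.course_id = e.course_id
WHERE c.title = 'Artificial Intelligence'
  AND e.academic_year = '2021-2022'
"""

# 3d) Course titles and average grades where the average is below 6.
LOW_AVERAGE_COURSES = """
SELECT c.title, AVG(e.grade) AS avg_grade
FROM course c
JOIN enrollment e ON e.course_id = c.course_id
GROUP BY c.title
HAVING AVG(e.grade) < 6
"""

conn = sqlite3.connect(":memory:")  # stand-in engine, not Impala
conn.executescript(DDL)
conn.execute(INSERT_STUDENT)
conn.execute("INSERT INTO student VALUES (2, 'Nikos')")
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [(10, "Artificial Intelligence"), (20, "Databases")],
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2021-2022", 8.0),
        (2, 10, "2020-2021", 7.0),
        (1, 20, "2021-2022", 5.0),
        (2, 20, "2021-2022", 6.0),
    ],
)

ai_students = conn.execute(STUDENTS_IN_COURSE).fetchall()
low_courses = conn.execute(LOW_AVERAGE_COURSES).fetchall()
```

The `HAVING` clause filters on the aggregate after grouping, which is why the condition on the average grade cannot go in `WHERE`.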
## Task 4 [10 points]
A query in the Impala database is too slow. Describe your approach to investigate and improve efficiency. Provide relevant commands.
### Addressing Query Performance
In Task 4, the report addresses the prevalent issue of slow queries in Impala databases. Various strategies are proposed to investigate and enhance query performance, encompassing aspects such as analyzing execution plans, optimizing joins and predicates, updating statistics, and considering hardware upgrades. These insights offer practical solutions for organizations striving to maximize the efficiency of their big data analytics infrastructure.
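As a rough illustration of the investigation order, the Python helper below assembles a sequence of statements to paste into an `impala-shell` session: inspect the plan, check and refresh statistics, re-check the plan, then run the query and look at its runtime summary and profile. `EXPLAIN`, `SHOW TABLE STATS`, and `COMPUTE STATS` are Impala statements, while `SUMMARY` and `PROFILE` are `impala-shell` commands that report on the most recently executed query. The helper itself, the sample query, and the table name are hypothetical.

```python
def impala_tuning_script(query, tables):
    """Return a checklist of statements for investigating one slow query."""
    query = query.strip().rstrip(";")
    lines = [f"EXPLAIN {query};"]  # 1. inspect the plan before running anything
    for table in tables:
        lines.append(f"SHOW TABLE STATS {table};")  # 2. are statistics present?
        lines.append(f"COMPUTE STATS {table};")     # 3. gather table and column stats
    lines.append(f"EXPLAIN {query};")  # 4. see whether the plan changed
    lines.append(f"{query};")          # 5. run the query itself
    lines.append("SUMMARY;")           # 6. per-operator timings of that run
    lines.append("PROFILE;")           # 7. full runtime profile of that run
    return "\n".join(lines)


# Hypothetical slow query and table name.
script = impala_tuning_script(
    "SELECT course_id, AVG(grade) FROM enrollment GROUP BY course_id;",
    ["enrollment"],
)
```

Comparing the two `EXPLAIN` outputs shows whether fresh statistics changed the plan, for example the join order or join strategy, before any hardware changes are considered.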
## Conclusion: Unlocking the Potential of Big Data Analytics
In summary, this report offers valuable insights into the practical applications of data virtualization tools and strategies for optimizing query performance within Impala databases. By leveraging Apache Drill's advanced capabilities and implementing effective performance optimization techniques, organizations can unlock the full potential of their big data analytics initiatives, driving informed decision-making and fostering innovation.
## Assignment Submission Requirements
This assignment was completed using the **Cloudera QuickStart VM** running on Red Hat Linux.
![cloudera-quickstart-vm](https://github.com/Lefteris-Souflas/Apache-Drill-and-Impala/assets/143879796/3b4df4f9-0775-41d0-9cba-cf6c8d30f567)