https://github.com/rvandewater/imdb-flink-jobs

Host: GitHub
URL: https://github.com/rvandewater/imdb-flink-jobs
Owner: rvandewater
Created: 2021-10-24T17:27:38.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2021-10-25T08:11:52.000Z (about 3 years ago)
Last Synced: 2024-10-28T14:39:05.308Z (about 2 months ago)
Language: Java
Size: 2.84 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# IMDb-Flink-Jobs
Generated by [DataFarmGUI](https://github.com/agora-ecosystem/data-farm)!
## Original Jobs
10 Flink jobs based on the following [Kaggle IMDb Dataset](https://www.kaggle.com/ashirwadsangwan/imdb-dataset/code).
They are designed to provide a well-rounded use of the complete dataset.
The following are rough descriptions of the semantics of these queries. The second line of each entry lists the tables the query uses:
1. Q1: Get the sorted last name, birth year, death year, and age of all actors aged between 20 and 30.
    1. name.basics
2. Q2: Get all sorted non-original transliterated Greek titles, merged into one list by the number of entries they have.
    1. title.akas
3. Q3: Get all the roles of actors, as well as the number of roles played.
    1. title.principals, name.basics
4. Q4: Get films/series produced before the 1950s with a rating of at least 8.5 and more than 10 reviews.
    1. title.basics, title.ratings
5. Q5: Get German titles after 1950 with a rating of at least 8.5 and more than 10 reviews.
    1. title.basics, title.akas, title.ratings
6. Q6: Get all the actors of German movies, with the movie types, ratings, year, years of birth/death, roles, and jobs. 4 joins involved.
    1. title.akas, title.basics, title.principals, title.ratings, name.basics
7. Q7: Get the movie title, role, rating, name, and year of birth/death of people who participated in a movie with a rating of at least 7.
    1. title.akas, title.principals, title.ratings, name.basics
8. Q8: Get actors who are primarily known for films/series produced before 1970 with a rating of at least 8.5 and more than 10 reviews.
    1. title.basics, title.ratings, name.basics
9. Q9: Get all the roles of actors and the movies in which they played each role, as well as the number of roles played.
    1. title.basics, title.principals, name.basics
10. Q10: Get all the actor details of German movies that contain archive footage, with the ratings, sorted by title.
    1. title.akas, title.basics, title.principals, title.ratings
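
To make the query descriptions more concrete, here is a minimal plain-Java sketch of the Q4 semantics (a filter on `title.ratings` joined with a filter on `title.basics`). It is not the generated Flink job from this repository. The class and method names are hypothetical, the sample rows are made up, and it assumes that "before the 1950s" means `startYear < 1950` and that "reviews" corresponds to the `numVotes` column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of Q4; hypothetical names, not the generated Flink code.
class Q4Sketch {

    // Subset of the title.basics columns needed for Q4.
    static final class TitleBasics {
        final String tconst;
        final String primaryTitle;
        final int startYear;

        TitleBasics(String tconst, String primaryTitle, int startYear) {
            this.tconst = tconst;
            this.primaryTitle = primaryTitle;
            this.startYear = startYear;
        }
    }

    // Columns of title.ratings.
    static final class TitleRating {
        final String tconst;
        final double averageRating;
        final int numVotes;

        TitleRating(String tconst, double averageRating, int numVotes) {
            this.tconst = tconst;
            this.averageRating = averageRating;
            this.numVotes = numVotes;
        }
    }

    // Returns the primary titles that satisfy the Q4 predicates.
    static List<String> select(List<TitleBasics> basics, List<TitleRating> ratings) {
        // Filter the ratings first and index them by tconst (build side of the join).
        Map<String, TitleRating> qualifying = new HashMap<>();
        for (TitleRating r : ratings) {
            if (r.averageRating >= 8.5 && r.numVotes > 10) {
                qualifying.put(r.tconst, r);
            }
        }
        // Probe with the filtered titles (assumption: startYear < 1950).
        List<String> result = new ArrayList<>();
        for (TitleBasics b : basics) {
            if (b.startYear < 1950 && qualifying.containsKey(b.tconst)) {
                result.add(b.primaryTitle);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        List<TitleBasics> basics = Arrays.asList(
                new TitleBasics("tt1", "Old Classic", 1941),
                new TitleBasics("tt2", "Old Obscure", 1935),
                new TitleBasics("tt3", "Modern Hit", 1999));
        List<TitleRating> ratings = Arrays.asList(
                new TitleRating("tt1", 8.6, 5000),
                new TitleRating("tt2", 9.0, 7),
                new TitleRating("tt3", 8.9, 90000));
        System.out.println(select(basics, ratings)); // prints [Old Classic]
    }
}
```

In a Flink job the same shape would typically appear as two filter operators feeding a join on `tconst`; the other queries add further joins and a sort or aggregation on top of this pattern.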

From these handcrafted queries, we generated two datasets to use for testing. Both datasets were generated entirely from the small handcrafted workload defined above. One contains a relatively large number of long unique jobs with relatively few instantiations each, and the other contains a relatively small number of unique jobs with relatively many instantiations each. This gives us two diverse datasets with which we can benchmark and compare the performance of DataFarmGUI in a realistic manner.
## Generated Set A
For dataset A, we generated 60 abstract plans. These plans have a sequence length of 10 and a maximum of 4 join operators (because the IMDb dataset contains 5 tables, a plan will usually have at most 4 joins). From these plans, we used our system's GUI to select the most heterogeneous and information-rich ones. We instantiated 20 jobs for each selected abstract plan, which leaves us with a dataset of about 1,100 jobs.
## Generated Set B
For dataset B, we generated 25 abstract plans. These plans have a sequence length of 6 and a maximum of 4 join operators (identical reasoning to dataset A). Again, from these plans, we selected the most heterogeneous and information-rich ones. We instantiated 50 jobs for each selected abstract plan, which leaves us with a dataset of about 1,100 jobs.
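
The dataset sizes above follow from the number of plans multiplied by the instantiations per plan. The sketch below works through that arithmetic. The class and method names are hypothetical. The number of plans actually selected is not stated here, so the "implied" values are only rough estimates derived from the approximate total of 1,100 jobs.

```java
// Arithmetic sketch for the sizes of generated sets A and B; hypothetical names.
class DatasetSizeSketch {

    // Upper bound on the job count if every generated plan were instantiated.
    static int upperBound(int generatedPlans, int jobsPerPlan) {
        return generatedPlans * jobsPerPlan;
    }

    // Rough number of selected plans implied by an approximate job total
    // (assumption: every selected plan yields exactly jobsPerPlan jobs).
    static int impliedPlans(int approxJobs, int jobsPerPlan) {
        return approxJobs / jobsPerPlan;
    }

    public static void main(String[] args) {
        // Set A: 60 generated plans, 20 instantiations each.
        System.out.println("Set A upper bound: " + upperBound(60, 20));
        System.out.println("Set A implied plans: " + impliedPlans(1100, 20));
        // Set B: 25 generated plans, 50 instantiations each.
        System.out.println("Set B upper bound: " + upperBound(25, 50));
        System.out.println("Set B implied plans: " + impliedPlans(1100, 50));
    }
}
```

Both upper bounds (1,200 and 1,250) lie above the reported total of about 1,100 jobs, which is consistent with only a subset of the generated plans having been selected for instantiation.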