https://github.com/randikabanura/bdat_assignment



# BDAT Assignment (MapReduce vs Spark)

## Dataset

According to a 2010 report by the US Federal Aviation Administration,
domestic flight delays cost passengers, airlines, and other parts of the economy
32.9 billion dollars a year. More than half of that amount comes
from the pockets of passengers, who not only lose time waiting for their planes to leave but
also miss connecting flights, spend money on food, and have to sleep in hotel rooms while they are stranded.

## MapReduce via HiveQL

With Hive installed, the HQL script can be run with the following arguments:

```shell
--hivevar delay_type_col_name= // (CarrierDelay, etc.)
--hiveconf hive.session.id=calculate-flight-delay--1 // (CarrierDelay, etc.)
--hiveconf hive.execution.engine=mr
```
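
As a rough illustration of how these arguments fit together, the sketch below assembles a `hive` command line in Python. The helper `build_hive_command` and the script name `flight_delay.hql` are hypothetical, since the repository's actual file name is not stated here. The `-f`, `--hivevar`, and `--hiveconf` flags are standard Hive CLI options.

```python
import shlex


def build_hive_command(delay_type, session_id, script="flight_delay.hql"):
    """Return the argument list for one Hive run on the MapReduce engine.

    `script` is a hypothetical file name; substitute the repository's HQL script.
    """
    return [
        "hive",
        "-f", script,
        # Column to aggregate, e.g. CarrierDelay
        "--hivevar", f"delay_type_col_name={delay_type}",
        # Session id used to label the run
        "--hiveconf", f"hive.session.id={session_id}",
        # Force the MapReduce execution engine
        "--hiveconf", "hive.execution.engine=mr",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so this works without Hive installed.
    print(shlex.join(build_hive_command("CarrierDelay", "calculate-flight-delay--1")))
```

Passing the resulting list to `subprocess.run` would launch the job on a machine where Hive is available.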

## Spark

With Spark installed, the Python script can be run with the following arguments:

```shell
--data_source
--output_uri
--delay_type_col_name // (CarrierDelay, etc.)
--iterations 1
```
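
To make the role of each argument concrete, here is a minimal sketch of how a script could accept them and time repeated runs. It is not the repository's actual script: the function names, the CSV input format, the `Year` column, and the average-delay query are all assumptions made for illustration.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the four arguments listed above."""
    parser = argparse.ArgumentParser(description="Flight delay query on Spark")
    parser.add_argument("--data_source", required=True, help="Input data location")
    parser.add_argument("--output_uri", required=True, help="Where to write results")
    parser.add_argument("--delay_type_col_name", default="CarrierDelay")
    parser.add_argument("--iterations", type=int, default=1)
    return parser.parse_args(argv)


def time_iterations(run_once, iterations):
    """Call run_once() `iterations` times and return each run's wall-clock seconds."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def run_query(args):
    """Hypothetical query: average delay per year (assumes CSV input with a Year column)."""
    # Imported lazily so argument parsing works without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flight-delay").getOrCreate()
    df = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(args.data_source)
    )
    df.createOrReplaceTempView("delay_flights")
    result = spark.sql(
        f"SELECT Year, avg(`{args.delay_type_col_name}`) AS avg_delay "
        "FROM delay_flights GROUP BY Year ORDER BY Year"
    )
    result.write.mode("overwrite").option("header", "true").csv(args.output_uri)


def main(argv=None):
    """Run the query `--iterations` times and report how long each run took."""
    args = parse_args(argv)
    durations = time_iterations(lambda: run_query(args), args.iterations)
    for i, seconds in enumerate(durations, 1):
        print(f"iteration {i}: {seconds:.2f}s")
```

Calling `main()` from an `if __name__ == "__main__":` guard and launching the file with `spark-submit` would run the query once per iteration and print each run's duration.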

## Comparison

See the table below for the time taken by each query across iterations.

HiveQL vs Spark-SQL Performance Comparison Table

## Presentation and Demo

The following video presents the MapReduce vs Spark comparison and shows how each task is executed.

[Watch the video](https://drive.google.com/file/d/10x7jTuetRrKrgC8gFRyjz__U_6FlX7qn/view?usp=share_link)

## Author

Name: [Banura Randika Perera](https://github.com/randikabanura)

LinkedIn: [randika-banura](https://www.linkedin.com/in/randika-banura/)

Email: [[email protected]](mailto:[email protected])

## Show your support

Please ⭐️ this repository if this project helped you!

## License

See [LICENSE](LICENSE) © [randikabanura](https://github.com/randikabanura/)