https://github.com/rsangole/oman-rusers-arrow
Explore the nuances of handling large datasets in R through the Arrow package.
https://github.com/rsangole/oman-rusers-arrow
arrow meetups r
Last synced: 5 months ago
JSON representation
Explore the nuances of handling large datasets in R through the Arrow package.
- Host: GitHub
- URL: https://github.com/rsangole/oman-rusers-arrow
- Owner: rsangole
- Created: 2023-11-03T01:50:00.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-09T17:11:49.000Z (almost 2 years ago)
- Last Synced: 2024-12-13T13:13:37.185Z (10 months ago)
- Topics: arrow, meetups, r
- Language: JavaScript
- Homepage: https://rsangole.github.io/oman-rusers-arrow/
- Size: 10.5 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# [Oman R Users] Unlocking Big Data in R Using Arrow
This repository contains the materials for the [meetup on "Unlocking Big Data in R Using Arrow"](https://www.meetup.com/oman-r-user/events/297135613/) I presented to Oman R Users on Nov 8 2023.
## Abstract
Explore the nuances of handling large datasets in R through the Arrow package. This session aims to provide an understanding of Arrow's capabilities, detailing its application in real-world scenarios. It's a package that's not only easy to adopt, but one that will drastically improve your capability to handle massive datasets in R.
## Data Sources
You'll need to download the datasets from the sources and place them in the `data` folder to run the code.
### NYC Taxi Data
- [NYC Taxi and Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
To replicate the dataset locally, run the following code:
```r
library(arrow)
library(dplyr)local_folder <- here::here("data/nyc_part")
fs::dir_create(local_folder)
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
filter(year %in% 2012:2021) |>
group_by(year, month) |>
write_dataset(local_folder)
```
### Airlines DataCreate a folder called `airlines` in the `data` folder.
Download the `Combined_Flights_2021` CSV and parquet files from the [Flight Status Prediction](https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022) dataset from Kaggle.
Links:
1. [CSV file](https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022?select=Combined_Flights_2021.csv)
2. [Parquet file](https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022?select=Combined_Flights_2021.parquet)## Slides
📽️ The slides are created using quarto, in the `presentation.qmd` file. The slide deck is published on [GitHub pages here](https://rsangole.github.io/oman-rusers-arrow/presentation.html).
## Code
The examples shown in my talk are stored in the `code` folder.