# spark-http-rdd

[![Scala CI](https://github.com/fsanaulla/spark-http-rdd/actions/workflows/scala.yml/badge.svg)](https://github.com/fsanaulla/spark-http-rdd/actions/workflows/scala.yml)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.github.fsanaulla/spark2-http-rdd_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/com.github.fsanaulla/spark2-http-rdd_2.12)
[![Scala Steward badge](https://img.shields.io/badge/Scala_Steward-helping-blue.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA4AAAAQCAMAAAARSr4IAAAAVFBMVEUAAACHjojlOy5NWlrKzcYRKjGFjIbp293YycuLa3pYY2LSqql4f3pCUFTgSjNodYRmcXUsPD/NTTbjRS+2jomhgnzNc223cGvZS0HaSD0XLjbaSjElhIr+AAAAAXRSTlMAQObYZgAAAHlJREFUCNdNyosOwyAIhWHAQS1Vt7a77/3fcxxdmv0xwmckutAR1nkm4ggbyEcg/wWmlGLDAA3oL50xi6fk5ffZ3E2E3QfZDCcCN2YtbEWZt+Drc6u6rlqv7Uk0LdKqqr5rk2UCRXOk0vmQKGfc94nOJyQjouF9H/wCc9gECEYfONoAAAAASUVORK5CYII=)](https://scala-steward.org)

## Installation

Add it to your `build.sbt`:

### Spark 3

Compiled for Scala 2.12:

```scala
// replace <version> with the latest release listed on Maven Central
libraryDependencies += "com.github.fsanaulla" %% "spark3-http-rdd" % "<version>"
```

### Spark 2

Cross-compiled for Scala 2.11 and 2.12:

```scala
// replace <version> with the latest release listed on Maven Central
libraryDependencies += "com.github.fsanaulla" %% "spark2-http-rdd" % "<version>"
```

## Usage

Let's define our source URI:

```scala
val baseUri: URI = ???
```
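For instance, the base URI could point at a service that returns line-separated rows; the host and path below are placeholders, not part of the library:

```scala
import java.net.URI

// Hypothetical endpoint serving line-separated rows
val baseUri: URI = URI.create("http://data-service:8080/rows")
```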

We build the partitions on top of it using an array of `URIModifier`s, which looks like this:

```scala
val uriPartitioner: Array[URIModifier] = Array(
  URIModifier.fromFunction { uri =>
    // URI modification logic, e.g. appending a path segment
    // or adding query parameters; return the modified URI
    uri
  },
  ...
)
```

**Important**: the number of `URIModifier`s must equal the desired number of partitions. Each resulting URI is used as the base URI for a separate partition.
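As a sketch, assuming the endpoint supports a hypothetical `page` query parameter, one modifier per desired partition could be generated like this:

```scala
val numPartitions = 4

// One URIModifier per partition: each appends a hypothetical `page`
// query parameter so every partition fetches a different slice of data
val uriPartitioner: Array[URIModifier] =
  (0 until numPartitions).map { i =>
    URIModifier.fromFunction(uri => URI.create(s"$uri?page=$i"))
  }.toArray
```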

Next, define how responses from the HTTP endpoint are handled. By default, the endpoint is expected to return line-separated rows, and each row is processed as a separate entity during response mapping:

```scala
val mapping: String => T = ???
```
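For example, if each line were a simple comma-separated pair (a hypothetical format), the mapping could parse it into a small case class:

```scala
// Hypothetical row format: "<id>,<value>" per line
final case class Record(id: Long, value: String)

val mapping: String => Record = { line =>
  val Array(id, value) = line.split(",", 2)
  Record(id.toLong, value)
}
```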

Then you can create the RDD:

```scala
val rdd: RDD[T] =
  HttpRDD.create(
    sc,
    baseUri,
    uriPartitioner,
    mapping
  )
```

More details are available in the source code. The integration tests can also serve as usage examples.
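As a rough end-to-end sketch, reusing the hypothetical endpoint, partitioner, and `Record` mapping from above (the library's own package imports for `HttpRDD` and `URIModifier` are omitted):

```scala
import java.net.URI

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Local SparkContext for experimentation
val sc = new SparkContext("local[*]", "http-rdd-example")

// One partition per URIModifier; each partition fetches and maps its own slice
val records: RDD[Record] =
  HttpRDD.create(
    sc,
    URI.create("http://data-service:8080/rows"),
    uriPartitioner,
    mapping
  )

println(records.count())
sc.stop()
```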