https://github.com/nwtgck/spark-wikipedia-dump-loader
Wikipedia Dump Loader for Spark
scala spark wikipedia-dump
- Host: GitHub
- URL: https://github.com/nwtgck/spark-wikipedia-dump-loader
- Owner: nwtgck
- Created: 2018-01-21T02:04:18.000Z (about 7 years ago)
- Default Branch: develop
- Last Pushed: 2018-10-10T02:56:09.000Z (over 6 years ago)
- Last Synced: 2024-12-13T10:35:10.567Z (2 months ago)
- Topics: scala, spark, wikipedia-dump
- Language: Scala
- Homepage:
- Size: 430 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# spark-wikipedia-dump-loader
[![Build Status](https://travis-ci.org/nwtgck/spark-wikipedia-dump-loader.svg?branch=develop)](https://travis-ci.org/nwtgck/spark-wikipedia-dump-loader)

A [Wikipedia Dump](https://dumps.wikimedia.org/) Loader for Spark in Scala
## How to import `spark-wikipedia-dump-loader`
Add the following to your `build.sbt`.
```scala
// Depend on `spark-wikipedia-dump-loader` directly from its GitHub repository
dependsOn(RootProject(uri("https://github.com/nwtgck/spark-wikipedia-dump-loader.git#e6e358dd8cdd5b6200b89f5d2aa76c74b5c1d0d7")))
```
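As a sketch of how the same dependency might look in a build that defines its root project explicitly, the `RootProject` can be bound to a named value first. The names `wikipediaDumpLoader` and `root` are assumptions made for illustration; the URI is the one shown above, and as far as I know sbt treats the part after `#` as the Git revision to check out, so the build stays pinned to that commit.

```scala
// build.sbt (sketch; `wikipediaDumpLoader` and `root` are illustrative names)

// Source dependency on the GitHub repository.
// The fragment after '#' is assumed to select the Git commit to build from.
lazy val wikipediaDumpLoader = RootProject(
  uri("https://github.com/nwtgck/spark-wikipedia-dump-loader.git#e6e358dd8cdd5b6200b89f5d2aa76c74b5c1d0d7")
)

// Your own project, depending on the loader's root project.
lazy val root = (project in file("."))
  .dependsOn(wikipediaDumpLoader)
```

Pinning to a commit hash rather than a branch name should keep builds reproducible even if the `develop` branch changes.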
## Example Usage
Here is a complete example that uses `spark-wikipedia-dump-loader`.
```scala
package io.github.nwtgck.spark_wikipedia_dump_loader_example

import org.apache.spark.sql.{Dataset, SparkSession}
import io.github.nwtgck.spark_wikipedia_dump_loader.{Page, Redirect, Revision, WikipediaDumpLoader}

object Main {
  def main(args: Array[String]): Unit = {
    // Create spark session
    val sparkSession: SparkSession = SparkSession
      .builder()
      .appName("Wikipedia Dump Loader Test [Spark session]")
      .master("local[*]")
      .config("spark.executor.memory", "1g")
      .getOrCreate()

    // Create page Dataset
    val pageDs: Dataset[Page] = WikipediaDumpLoader.readXmlFilePath(
      sparkSession,
      filePath = "./wikidump.xml"
    )

    // Print all pages
    for (page <- pageDs) {
      println(page)
    }
  }
}
```

The `wikidump.xml` used above can be found [here](https://raw.githubusercontent.com/nwtgck/spark-wikipedia-dump-loader-example/master/wikidump.xml).
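
To make the shape of the input more concrete, the following is a minimal sketch that reads a tiny hand-written sample in the MediaWiki export layout (`<page>`, `<title>`, `<redirect title="..."/>`, `<revision><text>`) using only the JDK's StAX reader. It does not use Spark or this library, and it is not how the library is implemented. `SimplePage`, `DumpSketch`, `parsePages`, and `sampleXml` are hypothetical names introduced for illustration; the library's own `Page`, `Redirect`, and `Revision` classes may have different fields.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ListBuffer

// Hypothetical record for illustration; NOT the library's `Page` class.
final case class SimplePage(title: String, redirectTo: Option[String], text: String)

object DumpSketch {
  // A tiny hand-written sample in the MediaWiki export layout (assumed, not real data).
  val sampleXml: String =
    """<mediawiki>
      |  <page>
      |    <title>Scala</title>
      |    <revision><text>A programming language.</text></revision>
      |  </page>
      |  <page>
      |    <title>Scala lang</title>
      |    <redirect title="Scala" />
      |    <revision><text>#REDIRECT [[Scala]]</text></revision>
      |  </page>
      |</mediawiki>""".stripMargin

  // Streams through the XML and emits one SimplePage per closing </page> tag.
  def parsePages(xml: String): List[SimplePage] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    val pages = ListBuffer.empty[SimplePage]
    var title = ""
    var redirect: Option[String] = None
    var text = ""
    while (reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          reader.getLocalName match {
            // Reset the per-page state when a new page starts
            case "page"     => title = ""; redirect = None; text = ""
            case "title"    => title = reader.getElementText
            case "redirect" => redirect = Option(reader.getAttributeValue(null, "title"))
            case "text"     => text = reader.getElementText
            case _          => ()
          }
        case XMLStreamConstants.END_ELEMENT if reader.getLocalName == "page" =>
          pages += SimplePage(title, redirect, text)
        case _ => ()
      }
    }
    reader.close()
    pages.toList
  }
}
```

Calling `DumpSketch.parsePages(DumpSketch.sampleXml)` yields one `SimplePage` per `<page>` element, with `redirectTo` set only for the page that carries a `<redirect>` tag.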
## Example Repositories
* [spark-wikipedia-dump-loader-example](https://github.com/nwtgck/spark-wikipedia-dump-loader-example)
* [wikipedia-word2vec-playground-spark](https://github.com/nwtgck/wikipedia-word2vec-playground-spark)