https://github.com/tzolov/crunch-xmlsource

Xml Source for Apache Crunch

Apache Crunch XML Source
===================

The [Apache Crunch](https://crunch.apache.org) Java library provides a framework for writing, testing, and running MapReduce and Apache Spark pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Crunch supports various input formats through its [Source](https://crunch.apache.org/user-guide.html#sources) abstraction. XmlSource uses Mahout's [XmlInputFormat](https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java) to add XML reading capabilities to Crunch.
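To illustrate what tag-delimited record reading means, here is a minimal plain-Java sketch that cuts a document into records, each running from a start tag to the next end tag. It is not the Mahout or XmlSource implementation, which work on byte streams and input splits. The class name `XmlRecordSketch`, the method `extractRecords`, and the element names in `main` are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordSketch {

    // Returns every substring that begins with startTag and ends with the next endTag.
    public static List<String> extractRecords(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: drop it
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end; // continue scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        // Illustrative input; the element names are assumptions.
        String xml = "<CATALOG><PLANT><COMMON>Bloodroot</COMMON></PLANT>"
                   + "<PLANT><COMMON>Columbine</COMMON></PLANT></CATALOG>";
        for (String record : extractRecords(xml, "<PLANT", "</PLANT>")) {
            System.out.println(record);
        }
    }
}
```

In this sketch, leaving the closing `>` off the start tag lets elements that carry attributes match as well.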

## Build and Installation

```
mvn clean install
```

## Usage

Add tzolov's GitHub-hosted Maven repository to the POM file of the project:

```xml
<repository>
    <id>git-tzolov</id>
    <name>tzolov's Git based repo</name>
    <url>https://github.com/tzolov/maven-repo/raw/master/</url>
</repository>
```

Add the Crunch XML source dependency:

```xml
<dependency>
    <groupId>org.apache.crunch.io.xml</groupId>
    <artifactId>crunch-xmlsource</artifactId>
    <version>0.0.1</version>
</dependency>
```

Sample data:

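```xml
<CATALOG>
    <PLANT>
        <COMMON>Bloodroot</COMMON>
        <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    </PLANT>
    .......
    <PLANT>
        <COMMON>Columbine</COMMON>
        <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    </PLANT>
</CATALOG>
```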

Sample code:

```java
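import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.types.writable.Writables;

// The start and end tags are an assumption: use the element that
// delimits one record in your data.
XmlSource xmlSource = new XmlSource(xmlInFile, "<PLANT", "</PLANT>");

MRPipeline pipeline = new MRPipeline(XmlSourceIT.class);

// Every element of the collection is one XML record, as a string.
PCollection<String> in = pipeline.read(xmlSource);

// Key each record by itself (identity mapping).
PTable<String, String> out = in.by(new MapFn<String, String>() {
    @Override
    public String map(String input) {
        return input;
    }
}, Writables.strings());

out.write(To.textFile(outFile));

// Run the pipeline and release its resources.
pipeline.done();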
```
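The sample keys each record by the record itself. A more selective key could be derived from a field inside the record. The following is a minimal plain-Java sketch of such a key function, assuming records are simple, flat XML fragments. `RecordKeySketch` and `textOf` are hypothetical names, not part of Crunch or this project, and the element names are illustrative.

```java
public class RecordKeySketch {

    // Returns the text between <element> and </element>, or null if absent.
    // Plain string search: it does not handle attributes, nesting, or entities.
    public static String textOf(String record, String element) {
        String open = "<" + element + ">";
        String close = "</" + element + ">";
        int start = record.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = record.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return record.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Illustrative record; the element names are assumptions.
        String record = "<PLANT><COMMON>Bloodroot</COMMON>"
                      + "<BOTANICAL>Sanguinaria canadensis</BOTANICAL></PLANT>";
        System.out.println(textOf(record, "COMMON")); // prints Bloodroot
    }
}
```

A function like this could serve as the body of the `map` method in place of `return input;`. For anything beyond flat records, a real XML parser is the safer choice.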