https://github.com/tzolov/crunch-xmlsource
Xml Source for Apache Crunch
- Host: GitHub
- URL: https://github.com/tzolov/crunch-xmlsource
- Owner: tzolov
- Created: 2014-10-23T15:25:12.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2015-02-06T15:28:19.000Z (almost 10 years ago)
- Last Synced: 2023-03-23T02:46:45.385Z (over 1 year ago)
- Language: Java
- Homepage:
- Size: 293 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Apache Crunch XML Source
===================

The [Apache Crunch](https://crunch.apache.org) Java library provides a framework for writing, testing, and running MapReduce and Apache Spark pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Crunch supports various input formats via its [Source](https://crunch.apache.org/user-guide.html#sources) abstraction. XmlSource uses Mahout's [XmlInputFormat](https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java) to add XML read capabilities to Crunch.
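
To make the idea concrete: Mahout's XmlInputFormat is configured with a start marker and an end marker and emits the text between them (markers included) as one record. The following is a minimal, plain-Java sketch of that record-splitting idea only. The class `XmlRecordSketch`, the method `extractRecords`, and the `<item>` sample are hypothetical names invented for illustration; the real input format works on byte streams and Hadoop input splits, which this sketch does not attempt to model.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only (hypothetical class, not part of Crunch or Mahout). */
public class XmlRecordSketch {

    /**
     * Returns every substring that begins with startTag and ends with endTag.
     * Markers are matched as plain text, with no XML parsing involved.
     */
    public static List<String> extractRecords(String text, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) {
                break; // no further record
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: drop it
            }
            end += endTag.length();
            records.add(text.substring(start, end));
            from = end; // continue scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<items><item>a</item><item>b</item></items>";
        for (String record : extractRecords(xml, "<item>", "</item>")) {
            System.out.println(record);
        }
    }
}
```

Because the markers are matched as plain text, a start marker without the closing `>` (for example `<item`) would also match longer element names such as `<items>`, so the markers need to be chosen with the document structure in mind.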
## Build and Installation
```
mvn clean install
```

## Usage
Add the tzolov GitHub Maven repository to the project's POM file:
```xml
<repository>
  <id>git-tzolov</id>
  <name>tzolov's Git based repo</name>
  <url>https://github.com/tzolov/maven-repo/raw/master/</url>
</repository>
```

Add the Crunch XML source dependency:
```xml
<dependency>
  <groupId>org.apache.crunch.io.xml</groupId>
  <artifactId>crunch-xmlsource</artifactId>
  <version>0.0.1</version>
</dependency>
```

Sample data:
```xml
Bloodroot
Sanguinaria canadensis
.......
Columbine
Aquilegia canadensis
```
Sample code:
```java
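// Source that reads XML records from xmlInFile.
XmlSource xmlSource = new XmlSource(xmlInFile, "");

MRPipeline pipeline = new MRPipeline(XmlSourceIT.class);
PCollection<String> in = pipeline.read(xmlSource);

// Key every record by itself (identity mapping) to obtain a PTable.
PTable<String, String> out = in.by(new MapFn<String, String>() {
  @Override
  public String map(String input) {
    return input;
  }
}, Writables.strings());

// Write the table as text and run the pipeline.
out.write(To.textFile(outFile));
pipeline.done();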
```
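
The `by` call in the sample may be easier to follow with a plain-Java analogue: it applies a function to each element to derive a key and pairs that key with the original element. The sketch below mimics this with standard collections only. `KeyBySketch` and its `by` method are hypothetical illustrations, not Crunch classes, and they ignore everything that makes Crunch distributed (serialization types, lazy planning, MapReduce execution).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only (hypothetical class, not part of Crunch). */
public class KeyBySketch {

    /** Pairs each value with the key computed from it, preserving input order. */
    public static <K, V> List<Map.Entry<K, V>> by(List<V> values, Function<V, K> keyFn) {
        List<Map.Entry<K, V>> table = new ArrayList<>();
        for (V value : values) {
            table.add(new SimpleEntry<>(keyFn.apply(value), value));
        }
        return table;
    }

    public static void main(String[] args) {
        // With an identity key function, key and value are the same string,
        // which mirrors the MapFn used in the sample above.
        List<Map.Entry<String, String>> table = by(List.of("first", "second"), s -> s);
        for (Map.Entry<String, String> row : table) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```

In a real pipeline the key function would typically extract something more useful than the record itself, such as an identifier parsed from the XML element.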