Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ashwanthkumar/anna
General purpose Data Ingestion system
- Host: GitHub
- URL: https://github.com/ashwanthkumar/anna
- Owner: ashwanthkumar
- Created: 2015-05-31T08:54:58.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-05-31T16:48:08.000Z (over 9 years ago)
- Last Synced: 2024-04-14T09:19:00.647Z (7 months ago)
- Language: Scala
- Homepage:
- Size: 121 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
[![Build Status](https://snap-ci.com/ashwanthkumar/anna/branch/master/build_image)](https://snap-ci.com/ashwanthkumar/anna/branch/master)
# anna
Hobby project inspired by [Gobblin](https://github.com/linkedin/gobblin).

## Introduction
Anna is a data ingestion system that aims to make ingesting data from various data sources a solved problem. It has an
easy-to-use DSL for defining your job, along with a configuration file that accompanies it. With the same job, you can
change the configuration at runtime for different environments / systems.

## Job Specification
You can define an ingestion job like this:
```
JOB "readTsvAndParse"
SOURCE USING "a.b.c.FromTSV" AS "tsvSource"
TRANSFORM USING "a.b.c.TSVToHtml"
TRANSFORM USING "a.b.c.HtmlToParseResult" AS "parser"
VALIDATE USING "d.e.f.ParseResultValidator"
SINK USING "x.y.z.PailWriter" AS "parseResult"
END
```
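The quoted strings in the spec are fully qualified class names, so each operator is presumably a small class implementing a fixed interface. A minimal sketch of what those interfaces could look like (the trait names, method signatures, and the use of Typesafe Config are assumptions, not anna's actual API):

```scala
import com.typesafe.config.Config

// Hypothetical operator contracts -- names and signatures are assumed,
// not taken from anna's source.
trait Source[O]       { def read(conf: Config): Iterator[O] }
trait Transform[I, O] { def apply(record: I): O }
trait Validate[I]     { def isValid(record: I): Boolean }
trait Sink[I]         { def write(records: Iterator[I], conf: Config): Unit }

// "a.b.c.FromTSV" from the spec above could then be something like:
class FromTSV extends Source[Seq[String]] {
  def read(conf: Config): Iterator[Seq[String]] =
    scala.io.Source.fromFile(conf.getString("input.path"))
      .getLines()
      .map(_.split('\t').toSeq)
}
```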
You can also have `FORK`s in your processing pipeline, as in the following example (`readTsvAndParse.ingest`). Forks are
useful when you want to sink the data to multiple locations on HDFS or to multiple destinations.
```
JOB "readTsvAndParse"
SOURCE USING "a.b.c.FromTSV" AS "tsvSource"
TRANSFORM USING "a.b.c.TSVToHtml"
TRANSFORM USING "a.b.c.HtmlToParseResult" AS "parser"
VALIDATE USING "d.e.f.ParseResultValidator"
FORK
TRANSFORM USING "a.b.c.ParseResultToProduct"
VALIDATE USING "d.e.f.ProductValidator"
SINK USING "x.y.z.PailWriter" AS "product"
FORK
TRANSFORM USING "a.b.c.ParseResultToPrice"
VALIDATE USING "d.e.f.PriceValidator"
SINK USING "x.y.z.PailWriter" AS "price"
END
```
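Here each `FORK` branch appears to receive the validated parse results from the shared upstream operators, so `product` and `price` records are derived and sunk independently from the same stream.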
Every job specification is associated with a job configuration that's used for passing parameters to the specification.
```hocon
tsvSource {
  input.path = "/path/to/input"
  input.frequency = "hourly"
}

product {
  output.path = "/path/to/product/output"
  output.frequency = "hourly"
}

price {
  output.path = "/path/to/price/output"
  output.frequency = "hourly"
}
```
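The configuration is HOCON, so on the JVM it would typically be loaded with Typesafe Config — an assumption here, since the README doesn't name the parser, and the config file name below is illustrative:

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Load the job conf (file name assumed) and pick the namespace matching
// an operator's identifier from the job spec: "tsvSource", "product", "price".
val jobConf: Config = ConfigFactory.parseFile(new java.io.File("readTsvAndParse.conf"))
val sourceConf: Config = jobConf.getConfig("tsvSource")

println(sourceConf.getString("input.path"))      // "/path/to/input"
println(sourceConf.getString("input.frequency")) // "hourly"
```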
In the specification, `SOURCE`, `TRANSFORM`, `VALIDATE` and `SINK` are called operators. An operator can be marked
with an identifier, which is later used as a reference to pick the right configuration from the job conf. For example,
the `SOURCE` above is identified as `tsvSource`, so the configuration for that `SOURCE` implementation is namespaced
under `tsvSource` in the job conf. This keeps the job specification separate from the values it runs with: for the same
job specification you can swap in different configurations for various environments / users / teams.
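For example, a production deployment could reuse the same spec and override only the values — HOCON's `include` plus object merging makes this a one-file change (file and path names below are illustrative):

```hocon
# prod.conf -- same "readTsvAndParse" spec, production values (illustrative)
include "readTsvAndParse.conf"

tsvSource {
  input.path = "hdfs:///data/prod/input"
}
product {
  output.path = "hdfs:///data/prod/product/output"
}
price {
  output.path = "hdfs:///data/prod/price/output"
}
```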
## TODOs
- Add more descriptive documentation
- Add notes on API endpoints