https://github.com/andr83/parsek

Library for parsing, validating, and transforming log files in different formats.

Host: GitHub
URL: https://github.com/andr83/parsek
Owner: andr83
License: mit
Created: 2015-10-05T15:03:41.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2018-10-30T15:38:33.000Z (over 7 years ago)
Last Synced: 2025-03-30T07:22:26.931Z (over 1 year ago)
Language: Scala
Size: 241 KB
Stars: 2
Watchers: 2
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


# Parsek

[![Build Status](https://travis-ci.org/andr83/parsek.svg)](https://travis-ci.org/andr83/parsek)

Parsek is designed to parse, validate, and transform log files in different formats. It can be used as a library or as a standalone [Apache Spark](https://spark.apache.org) application.

### [Documentation](https://github.com/andr83/parsek/wiki)

## Overview

![Parsek workflow](https://lh3.googleusercontent.com/tVOLaNiDTV6RKignddM5IKlRcw7HZyvBekm4lQwjbHJqxMJEohDsdfvjmc-Sjc-znE36AobGqhDptfTWm3uPf1r-r9xopEizubPAhIbGmwMAHrCjJ7jlHOQ0aKS080uvg8EyeuKlVhtmJLGfr0wG8-QQ3zttdk023eQGY37BRVCId72PbZ-tQ6VxlErXKvAvsQUlEJ5UdoMONo3bqmHhC3N2e62Drf7Jf2idFsBoERUQHNle3MHTHBgserE-EishfgS9M8svif1o1vr969haL7soJ9_NtdJS7-Ba3llf7ET_HgTCygnCqkuI2smSVRiFI9HRfk3rPy2mD4KDPbpI7NgzDrLJj6pwsgkL6Um6EafVo0w0GC_l7DS2wTSeLs8ZoX8FXLaIIzL_Mpqf5MCiZOvRez64n-auTep6xx3YSDTHJJIAVQeLl4-naEAPvIIfh_-wan2EtksHeJ1Q0BbLeYWYAW4i3b9F_AJZjUY-NaVLLKCPmjvcO0HHWyQZSeaS26w4vGh71wr9_Bff4Jwco8GBeRaa-Veped4RGl0h-9l7SsZT8eQwbMlaxMMYeJ4db-Qrew=w929-h266-no)

Parsek organizes work into pipes. Each pipe is a unit of work, and multiple pipes can be joined into a pipeline.

In Parsek, data is represented internally as a JSON-like [AST](https://github.com/andr83/parsek/blob/master/core/src/main/scala/com/github/andr83/parsek/PValue.scala). At every step, a pipe accepts a `PValue` and must transform it into another `PValue`.

Examples of pipes: parseJson, parseCsv, flatten, merge, validate, etc.
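The pipe and pipeline idea can be sketched in a few lines of Scala. This is only an illustration under stated assumptions: `Value`, `Str`, `Lst`, `Rec`, `Pipe`, `trim`, `upperKeys`, and `pipeline` are hypothetical names invented here, not Parsek's actual `PValue` API (see the AST link above for the real definitions).

```scala
// Hypothetical mini-AST for illustration only; Parsek's real AST is PValue.
sealed trait Value
final case class Str(value: String) extends Value
final case class Lst(items: List[Value]) extends Value
final case class Rec(fields: Map[String, Value]) extends Value

object PipelineSketch {
  // A pipe is a unit of work: it takes one value and returns another.
  type Pipe = Value => Value

  // Illustrative pipe: trims every string, recursing into lists and records.
  def trim(v: Value): Value = v match {
    case Str(s)  => Str(s.trim)
    case Lst(xs) => Lst(xs.map(trim))
    case Rec(fs) => Rec(fs.map { case (k, x) => k -> trim(x) })
  }

  // Illustrative pipe: upper-cases the top-level field names of a record.
  def upperKeys(v: Value): Value = v match {
    case Rec(fs) => Rec(fs.map { case (k, x) => k.toUpperCase -> x })
    case other   => other
  }

  // A pipeline applies its pipes from left to right.
  def pipeline(pipes: Pipe*): Pipe =
    pipes.foldLeft((v: Value) => v)(_ andThen _)

  def main(args: Array[String]): Unit = {
    val run = pipeline(trim, upperKeys)
    println(run(Rec(Map("ip" -> Str(" 10.0.0.1 ")))))
  }
}
```

Modelling a pipe as a plain function keeps composition trivial: joining pipes is function composition, and an empty pipeline is the identity.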

A source reads data from one of several source types and converts it to the Parsek AST. Currently supported sources:

- Local text files
- Hadoop text/sequence* files
- Kafka stream*

> Items marked with * are not implemented yet.

A sink writes data from the AST format to external destinations. Supported sinks:

- Local text files with csv/json serialization.
- Hadoop files with csv/json/avro* serialization.

> Items marked with * are not implemented yet.

## Spark application usage

To run the assembly jar, type:

    java -jar parsek-assembly-xx-SNAPSHOT.jar --config /path/to/config_file.conf

The Parsek Spark application uses a configuration file to define the job. For details on the config format, [read here](https://github.com/typesafehub/config).

Example of configuration file:
```yaml
sources: [{
  type: textFile
  path: "events.log"
}]

pipes: [
  {
    type: parseRegex
    pattern: ".*\\[(?<body>[\\w\\d-_=]+)\\].*"
  },{
    type: parseJson
    field: body
  },{
    type: validate
    fields: [{
      type: Date
      name: time
      format: "dd-MMM-yyyy HH:mm:ss Z"
      toTimeZone: UTC
    },{
      type: String
      name: ip
      pattern: ${patterns.ip}
    },{
      type: Record
      name: body
      fields: [{
        type: Date
        format: timestamp
        name: timestamp
        isRequired: true
      },{
        type: List
        name: events
        field: {
          type: Map
          name: event
          field: [{
            type: String
            name: name
            as: event_name
          }]
        }
      }]
    }]
  },{
    type: flatten
    field: body.events
  }
]

sinks: [{
  type: textFile
  path: /output
  serializer: {
    type: csv
    fields: [time,ip,timestamp,event_name]
  }
}]
```

This example configuration defines the following job:

1. Read lines from the `events.log` file.
2. Parse each line with a regular expression and extract the `body` field.
3. Parse the `body` field as JSON.
4. Validate the JSON value.
5. Flatten the embedded list in the `body.events` field.
6. Save the result as CSV to the `/output` directory.
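To make the steps above more concrete, here is a rough sketch of steps 2, 4 (date handling only), 5, and 6 using only the JDK. It does not use Parsek and makes several assumptions: the regex capture group is named `body` (as step 2 implies), dates are handled with `java.time`, and flattening emits one row per list element with the parent fields copied in. `Event`, `Parsed`, `Row`, and all function names are hypothetical.

```scala
import java.time.{OffsetDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.util.Locale
import java.util.regex.Pattern

object StepsSketch {
  // Hypothetical record shapes for illustration; these are not Parsek types.
  final case class Event(name: String)
  final case class Parsed(time: String, ip: String, timestamp: Long, events: List[Event])
  final case class Row(time: String, ip: String, timestamp: Long, eventName: String)

  // Step 2: extract the bracketed part of a line (assumes a group named `body`).
  private val linePattern = Pattern.compile(".*\\[(?<body>[\\w\\d-_=]+)\\].*")

  def extractBody(line: String): Option[String] = {
    val m = linePattern.matcher(line)
    if (m.matches()) Some(m.group("body")) else None
  }

  // Step 4 (date part): parse with the configured format, then convert to UTC.
  private val timeFormat =
    DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss Z", Locale.ENGLISH)

  def toUtc(time: String): OffsetDateTime =
    OffsetDateTime.parse(time, timeFormat).withOffsetSameInstant(ZoneOffset.UTC)

  // Step 5: one output row per embedded event, with parent fields copied in.
  def flatten(p: Parsed): List[Row] =
    p.events.map(e => Row(p.time, p.ip, p.timestamp, e.name))

  // Step 6: serialize in the configured column order (no quoting or escaping).
  def toCsv(r: Row): String =
    List(r.time, r.ip, r.timestamp.toString, r.eventName).mkString(",")

  def main(args: Array[String]): Unit = {
    val parsed = Parsed(toUtc("05-Oct-2015 18:03:41 +0300").toString, "10.0.0.1",
      1444057421L, List(Event("open"), Event("close")))
    flatten(parsed).map(toCsv).foreach(println)
  }
}
```

A real CSV serializer would also need to quote values that contain commas or newlines; that is left out to keep the sketch short.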
