https://github.com/andr83/parsek
Library for parse, validate and transform log files in different formats.
- Host: GitHub
- URL: https://github.com/andr83/parsek
- Owner: andr83
- License: mit
- Created: 2015-10-05T15:03:41.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2018-10-30T15:38:33.000Z (over 7 years ago)
- Last Synced: 2025-03-30T07:22:26.931Z (over 1 year ago)
- Language: Scala
- Size: 241 KB
- Stars: 2
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Parsek
Parsek is designed to parse, validate, and transform log files in different formats. It can be used as a library or as a standalone [Apache Spark](https://spark.apache.org) application.
### [Documentation](https://github.com/andr83/parsek/wiki)
## Overview

Parsek organises work into pipes. Each pipe is a unit of work, and multiple pipes can be joined into a pipeline.
Internally, Parsek represents data as a JSON-like [AST](https://github.com/andr83/parsek/blob/master/core/src/main/scala/com/github/andr83/parsek/PValue.scala). At every step, a pipe accepts a `PValue` and must transform it into another `PValue`.
Examples of pipes: `parseJson`, `parseCsv`, `flatten`, `merge`, `validate`, etc.
A source reads data from one of several source types and converts it to the Parsek AST. Currently supported sources:
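
The pipe and pipeline idea can be sketched in a few lines of Scala. This is only an illustration: the `PValue` cases below are a simplified, hypothetical stand-in for the real AST in `PValue.scala`, and `Pipe`, `rename`, `toLong`, and `pipeline` are names invented for this sketch, not Parsek's API.

```scala
// Hypothetical, simplified stand-in for Parsek's PValue AST.
// The real definitions live in PValue.scala and may differ.
sealed trait PValue
final case class PString(value: String) extends PValue
final case class PLong(value: Long) extends PValue
final case class PList(values: List[PValue]) extends PValue
final case class PMap(fields: Map[String, PValue]) extends PValue

object PipeSketch {
  // A pipe accepts one PValue and returns another PValue.
  type Pipe = PValue => PValue

  // Rename a top-level field of a record; other values pass through unchanged.
  def rename(from: String, to: String): Pipe = {
    case PMap(fields) if fields.contains(from) =>
      PMap(fields - from + (to -> fields(from)))
    case other => other
  }

  // Convert a top-level string field into a number; other values pass through.
  def toLong(field: String): Pipe = {
    case PMap(fields) =>
      fields.get(field) match {
        case Some(PString(s)) => PMap(fields + (field -> PLong(s.trim.toLong)))
        case _                => PMap(fields)
      }
    case other => other
  }

  // A pipeline applies its pipes from left to right.
  def pipeline(pipes: Pipe*): Pipe =
    input => pipes.foldLeft(input)((value, pipe) => pipe(value))
}
```

Because every pipe has the same input and output type, a pipeline is itself a pipe, so pipelines built this way can be nested or reused freely.
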
- Local text files
- Hadoop text/sequence* files
- Kafka stream*
> Items marked with * are not implemented yet.

A sink outputs data from the AST format to an external destination. Supported sinks:
- Local text files with csv/json serialization.
- Hadoop files with csv/json/avro* serialization.
> Items marked with * are not implemented yet.
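
How a source, a chain of pipes, and a sink fit together can be sketched as follows. Everything here is an assumption made for illustration: `Record` stands in for the Parsek AST, and `Source`, `Sink`, `LinesSource`, `CsvSink`, and `run` are hypothetical names, not the library's classes.

```scala
import java.io.Writer

object SourceSinkSketch {
  // Stand-in for a Parsek AST record; the real type is PValue.
  type Record = Map[String, String]

  // A source yields records converted from some external input.
  trait Source { def read(): Iterator[Record] }

  // A sink serializes records to an external destination.
  trait Sink { def write(records: Iterator[Record]): Unit }

  // Source over in-memory text lines; each line becomes a one-field record.
  final class LinesSource(lines: Seq[String]) extends Source {
    def read(): Iterator[Record] = lines.iterator.map(line => Map("line" -> line))
  }

  // Sink writing one CSV line per record, with columns in the given field order.
  final class CsvSink(out: Writer, fields: Seq[String]) extends Sink {
    // Quote a value only when it contains a comma, a quote, or a newline.
    private def quote(s: String): String =
      if (s.exists(c => c == ',' || c == '"' || c == '\n'))
        "\"" + s.replace("\"", "\"\"") + "\""
      else s

    def write(records: Iterator[Record]): Unit =
      records.foreach { r =>
        out.write(fields.map(f => quote(r.getOrElse(f, ""))).mkString(","))
        out.write("\n")
      }
  }

  // Wire source -> pipes -> sink; pipes are applied from left to right.
  def run(source: Source, pipes: Seq[Record => Record], sink: Sink): Unit =
    sink.write(source.read().map(r => pipes.foldLeft(r)((v, p) => p(v))))
}
```

Keeping the source and sink behind small traits means the same pipes can run unchanged against local files, Hadoop files, or a test fixture like the in-memory source above.
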
## Spark application usage
To run the assembly jar, type:

    java -jar parsek-assembly-xx-SNAPSHOT.jar --config /path/to/config_file.conf

The Parsek Spark application uses a configuration file to define the job. The file is written in the HOCON format; see the [Typesafe Config documentation](https://github.com/typesafehub/config) for details.
Example of configuration file:
```yaml
sources: [{
  type: textFile
  path: "events.log"
}]

pipes: [
  {
    type: parseRegex
    pattern: ".*\\[(?<body>[\\w\\d-_=]+)\\].*"
  },{
    type: parseJson
    field: body
  },{
    type: validate
    fields: [{
      type: Date
      name: time
      format: "dd-MMM-yyyy HH:mm:ss Z"
      toTimeZone: UTC
    },{
      type: String
      name: ip
      pattern: ${patterns.ip}
    },{
      type: Record
      name: body
      fields: [{
        type: Date
        format: timestamp
        name: timestamp
        isRequired: true
      },{
        type: List
        name: events
        field: {
          type: Map
          name: event
          field: [{
            type: String
            name: name
            as: event_name
          }]
        }
      }]
    }]
  },{
    type: flatten
    field: body.events
  }
]

sinks: [{
  type: textFile
  path: /output
  serializer: {
    type: csv
    fields: [time,ip,timestamp,event_name]
  }
}]
```
This example configuration defines the following job:
1. Read lines from the `events.log` file.
2. Parse each line with a regular expression and extract the `body` field.
3. Parse the `body` field as JSON.
4. Validate the JSON value.
5. Flatten the embedded list in the `body.events` field.
6. Save the result as CSV to the `/output` directory.
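
Steps 2 and 5 can be sketched in plain Scala to show what each one does to the data. The sketch rests on assumptions: the regular expression is a simplified one that captures a named group `body`, and flattening is taken to mean "one output record per list element, with the parent's fields copied onto each", which may not match Parsek's exact semantics. `StepsSketch`, `Parsed`, `extractBody`, and `flatten` are hypothetical names, and JSON parsing (step 3) is left out because the Scala standard library has no JSON parser.

```scala
import java.util.regex.Pattern

object StepsSketch {
  // Step 2: a simplified pattern with a named group `body` between brackets.
  private val linePattern = Pattern.compile(".*\\[(?<body>[\\w=-]+)\\].*")

  // Return the captured `body` group, or None when the line does not match.
  def extractBody(line: String): Option[String] = {
    val m = linePattern.matcher(line)
    if (m.matches()) Some(m.group("body")) else None
  }

  // Stand-in for a validated record whose body holds a nested list of events.
  final case class Parsed(time: String, ip: String, events: List[Map[String, String]])

  // Step 5 (assumed semantics): emit one flat record per nested event,
  // copying the parent's fields onto each.
  def flatten(p: Parsed): List[Map[String, String]] =
    p.events.map(event => Map("time" -> p.time, "ip" -> p.ip) ++ event)
}
```

A record with two events therefore yields two flat rows, each of which a CSV serializer can write as one line.
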