{"id":20743490,"url":"https://github.com/andr83/parsek","last_synced_at":"2025-04-24T05:33:01.233Z","repository":{"id":86492907,"uuid":"43691537","full_name":"andr83/parsek","owner":"andr83","description":"Library for parse, validate and transform log files in different formats.","archived":false,"fork":false,"pushed_at":"2018-10-30T15:38:33.000Z","size":247,"stargazers_count":2,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-30T07:22:26.931Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andr83.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-10-05T15:03:41.000Z","updated_at":"2018-10-30T15:38:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"639cd8b1-dc51-42b3-97b7-819c05ca7487","html_url":"https://github.com/andr83/parsek","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andr83%2Fparsek","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andr83%2Fparsek/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andr83%2Fparsek/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andr83%2Fparsek/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andr83","download_url":"https://codeload.github.com/andr83/parsek/tar.gz/refs/heads/master","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":250572638,"owners_count":21452334,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-17T07:11:10.048Z","updated_at":"2025-04-24T05:33:01.220Z","avatar_url":"https://github.com/andr83.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Parsek\n\n[![Build Status](https://travis-ci.org/andr83/parsek.svg)](https://travis-ci.org/andr83/parsek)\n\nParsek is designed to parse, validate, and transform log files in different formats. It can be used as a library or as a standalone [Apache Spark](https://spark.apache.org) application.\n\n### [Documentation](https://github.com/andr83/parsek/wiki)\n\n## Overview\n\n![Parsek workflow](https://lh3.googleusercontent.com/tVOLaNiDTV6RKignddM5IKlRcw7HZyvBekm4lQwjbHJqxMJEohDsdfvjmc-Sjc-znE36AobGqhDptfTWm3uPf1r-r9xopEizubPAhIbGmwMAHrCjJ7jlHOQ0aKS080uvg8EyeuKlVhtmJLGfr0wG8-QQ3zttdk023eQGY37BRVCId72PbZ-tQ6VxlErXKvAvsQUlEJ5UdoMONo3bqmHhC3N2e62Drf7Jf2idFsBoERUQHNle3MHTHBgserE-EishfgS9M8svif1o1vr969haL7soJ9_NtdJS7-Ba3llf7ET_HgTCygnCqkuI2smSVRiFI9HRfk3rPy2mD4KDPbpI7NgzDrLJj6pwsgkL6Um6EafVo0w0GC_l7DS2wTSeLs8ZoX8FXLaIIzL_Mpqf5MCiZOvRez64n-auTep6xx3YSDTHJJIAVQeLl4-naEAPvIIfh_-wan2EtksHeJ1Q0BbLeYWYAW4i3b9F_AJZjUY-NaVLLKCPmjvcO0HHWyQZSeaS26w4vGh71wr9_Bff4Jwco8GBeRaa-Veped4RGl0h-9l7SsZT8eQwbMlaxMMYeJ4db-Qrew=w929-h266-no)\n\nParsek organises work into pipes: each pipe is a unit of work, and multiple pipes can be joined into a pipeline.
\n\nInternally, Parsek represents data as a JSON-like [AST](https://github.com/andr83/parsek/blob/master/core/src/main/scala/com/github/andr83/parsek/PValue.scala). At every step, a pipe accepts a PValue and must transform it into another PValue.\n\nExamples of pipes: parseJson, parseCsv, flatten, merge, validate, etc.\n\nA source reads data from one of several source types and converts it to the Parsek AST. Currently supported sources:\n\n - Local text files\n - Hadoop text/sequence* files\n - Kafka stream*\n\n\u003e Items marked with * are not implemented yet.\n\nA sink writes data in AST format to external storage. Supported sinks:\n\n- Local text files with csv/json serialization.\n- Hadoop files with csv/json/avro* serialization.\n\n\u003e Items marked with * are not implemented yet.\n\n## Spark application usage\n\nTo run the assembly jar, type:\n\n    java -jar parsek-assembly-xx-SNAPSHOT.jar --config /path/to/config_file.conf\n\nThe Parsek Spark application uses a configuration file to define the job. For more about the config format, [read here](https://github.com/typesafehub/config).\n\nExample configuration file:\n```yaml\nsources: [{\n\ttype: textFile\n\tpath: \"events.log\"\n}]\n\npipes: [\n{\n\ttype: parseRegex\n\tpattern: \".*\\\\[(?\u003cbody\u003e[\\\\w\\\\d-_=]+)\\\\].*\"\n},{\n\ttype: parseJson\n\tfield: body\n},{\n\ttype: validate\n\tfields: [{\n\t\t\t\ttype: Date\n\t\t\t\tname: time\n\t\t\t\tformat: \"dd-MMM-yyyy HH:mm:ss Z\"\n\t\t\t\ttoTimeZone: UTC\n\t\t\t},{\n\t\t\t\ttype: String\n\t\t\t\tname: ip\n\t\t\t\tpattern: ${patterns.ip}\n\t\t\t},{\n\t\t\t\ttype: Record\n\t\t\t\tname: body\n\t\t\t\tfields: [{\n\t\t\t\t\ttype: Date\n\t\t\t\t\tformat: timestamp\n\t\t\t\t\tname: timestamp\n\t\t\t\t\tisRequired: true\n\t\t\t\t},{\n\t\t\t\t\ttype: List\n\t\t\t\t\tname: events\n\t\t\t\t\tfield: {\n\t\t\t\t\t\ttype: Map\n\t\t\t\t\t\tname: event\n\t\t\t\t\t\tfield: [{\n\t\t\t\t\t\t\ttype: String\n\t\t\t\t\t\t\tname: name\n\t\t\t\t\t\t\tas: 
event_name\n\t\t\t\t\t\t}]\n\t\t\t\t\t}\n\t\t\t\t}]\n\t\t\t}]\n},{\n\ttype: flatten\n\tfield: body.events\n}\n]\n\nsinks: [{\n\ttype: textFile\n\tpath: /output\n\tserializer: {\n\t\ttype: csv\n\t\tfields: [time,ip,timestamp,event_name]\n\t}\n}]\n```\n\nThis example configuration defines the following steps:\n\n1. Read lines from the `events.log` file.\n2. Parse each line with a regular expression and extract the `body` field.\n3. Parse the `body` field as JSON.\n4. Validate the JSON value.\n5. Flatten the embedded list in the `body.events` field.\n6. Save the result as CSV to the `/output` directory.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandr83%2Fparsek","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandr83%2Fparsek","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandr83%2Fparsek/lists"}