{"id":15908099,"url":"https://github.com/ignalina/shredder","last_synced_at":"2025-03-22T00:31:33.902Z","repository":{"id":46559470,"uuid":"413636105","full_name":"Ignalina/shredder","owner":"Ignalina","description":"Fixed column file to avro/kafka ","archived":false,"fork":false,"pushed_at":"2021-12-19T21:40:41.000Z","size":231,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-13T14:33:10.750Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ignalina.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-05T01:29:07.000Z","updated_at":"2021-12-20T11:58:03.000Z","dependencies_parsed_at":"2022-08-31T12:04:08.439Z","dependency_job_id":null,"html_url":"https://github.com/Ignalina/shredder","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ignalina%2Fshredder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ignalina%2Fshredder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ignalina%2Fshredder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ignalina%2Fshredder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ignalina","download_url":"https://codeload.github.com/Ignalina/shredder/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221820725,"owners_count":16886224,"i
con_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T14:09:40.052Z","updated_at":"2024-10-28T11:21:06.926Z","avatar_url":"https://github.com/Ignalina.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# shredder\nShreds fixed-column files to Avro/Kafka.  \nThe implementation uses an Avro schema and multiple cores.  \nSpeed is around 220mb/sec per core using 4 cores on a 1Gb/s Kafka connection.\n\nCurrent features/limitations:\n* The number of Kafka partitions must equal the number of cores used.  \n* Multicore implementation.\n* Each goroutine sends to its corresponding partition, i.e. 8 cores -\u003e 8 goroutines -\u003e 8 partitions.\n* The supported input and output encoding is UTF-8 (ISO 8859-1 etc. will be supported).\n\n# syntax\n```console\nshredder.exe \u003ckafka broker\u003e \u003cschemaregistry\u003e \u003cschema file url\u003e \u003cschema id\u003e \u003ctopic\u003e \u003ccores=partitions\u003e \u003cdata file\u003e\n```\n\n# Performance example 1 using 12 cores (output: Snappy Avro files, 30 columns)\nHardware: 12 cores (AMD Threadripper 5960X), 1Gb Kafka connection, Samsung 980 Pro 7/5 Gb r/w sec.  \nDatafile: 1.3Gb, 30 columns, total row width 528 chars (runes).\n\n```console\nrickard@Oden-Threadripper:~/GolandProjects/shredder2$ ./shredder /tmp/avrofiles 10.1.1.90:8081 schema1.json 2 table_x14 12 test.last111\nSchema = {\n\"type\": \"record\",\n\"name\": \"weblog\",\n\"fields\" : [\n... 
\u003c30 columns removed from readme \u003e\nTime spend in total     : 1.845484353s  parsing  4960143  lines from  2620609413  bytes\nTroughput bytes/s total : 1.32GB /s\nTroughput lines/s total : 2.56M  Lines/s\nTroughput lines/s toAvro: 3.28M  Lines/s\nTime spent toReadChunks : 0.0282819985 s\nTime spent toAvro       : 1.4421747465 s\nTime spent WaitDoneExport      : 0.042694734 s\n```\n\n# Performance example 2 using 6 cores (output: Avro to Kafka, 30 columns)\nHardware: 6 cores (AMD Threadripper 5960X), 1Gb Kafka connection, Samsung 980 Pro 7/5 Gb r/w sec.  \nDatafile: 1.3Gb, 30 columns, total row width 528 chars (runes).\n```console\n\nrickard@Oden-Threadripper:~/GolandProjects/shredder2$ ./shredder 10.1.1.90:9092 10.1.1.90:8081 schema1.json 2 table_x14 8 test.last111\nSchema = {\n\"type\": \"record\",\n\"name\": \"weblog\",\n\"fields\" : [\n... \u003c30 columns removed from readme \u003e\nskipping footer\nTime spend in total     : 1.713205915s  parsing  2590562  lines from  1367816800  bytes\nTroughput bytes/s total : 761.41MB /s\nTroughput lines/s total : 1.44M  Lines/s\nTroughput lines/s toAvro: 2.82M  Lines/s\nTime spent toReadChunks : 0.0215806745 s\nTime spent toAvro       : 0.87478208175 s\nTime spent toKafka      : 0.59487903675 s\n```\n\n# Performance example 3 using 48 cores (output: Avro to Kafka, 30 columns)\nHardware: 48 cores (AMD Threadripper 5960X), 1Gb Kafka connection, Samsung 980 Pro 7/5 Gb r/w sec.  \nDatafile: 1.3Gb, 30 columns, total row width 528 chars (runes).\n```console\n\nrickard@Oden-Threadripper:~/GolandProjects/shredder2$ ./shredder /tmp/avrofiles 10.1.1.90:8081 schema1.json 2 table_x14 48 test.last111\nSchema = {\n\"type\": \"record\",\n\"name\": \"weblog\",\n\"fields\" : [\n... 
\u003c30 columns removed from readme \u003e\nskipping footer\nTime spend in total     : 954.359385ms  parsing  4960143  lines from  2620609413  bytes\nTroughput bytes/s total : 2.56GB /s\nTroughput lines/s total : 4.96M  Lines/s\nTroughput lines/s toAvro: 9.03M  Lines/s\nTime spent toReadChunks : 0.0084119796875 s\nTime spent toAvro       : 0.5240676388958333 s\nTime spent WaitDoneExport      : 0.012015946 s\n```\nNOTE: Time spent toKafka is the transfer time from \"Shredder\" to librd (the underlying Kafka client library).\n\n# Example schema\nNote that each column name must start with a capital letter.\n```console\n\n{\n\"type\": \"record\",\n\"name\": \"weblog\",\n\"fields\" : [\n    {\"name\": \"Idnr\", \"type\":{\"type\": \"long\",\"name\": \"Idnr\", \"len\":8}},\n    {\"name\": \"Event_time\", \"type\":{\"type\" : \"long\", \"logicalType\" : \"timestamp-micros\",\"name\":\"Event_time\", \"len\":26}},\n    {\"name\": \"Idnr2\", \"type\":{\"type\": \"int\",\"name\": \"Idnr2\", \"len\":6}},\n    {\"name\": \"Ok\", \"type\":{\"type\": \"boolean\",\"name\": \"Ok\", \"len\":1}},\n    {\"name\": \"Some_text1\", \"type\":{\"type\": \"string\",\"name\": \"Some_text1\", \"len\":30}},\n    {\"name\": \"Some_text2\", \"type\":{\"type\": \"string\",\"name\": \"Some_text2\", \"len\":30}}\n   ]\n}\n```\n\n# Credits\n* The included Kafka/Avro client code originates from https://github.com/mycujoo/go-kafka-avro by mycujoo.tv \"Democratizing football broadcasting.\"  \n* The imported Go module hamba/avro gives excellent speed, and their team has been helpful with upcoming optimizations: https://github.com/hamba/avro  \n\n# Future\n* Improve speed by taking inspiration from https://teivah.medium.com/go-and-cpu-caches-af5d32cc5592\n* Further speed improvements are possible from a slight correction of Shredder's usage of hamba/avro \n* Once the Go \"port\" of Apache Arrow / Parquet is done (jira ARROW-7905), merge in the Apache Arrow based shredder, which adds Parquet as 
output.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fignalina%2Fshredder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fignalina%2Fshredder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fignalina%2Fshredder/lists"}