{"id":25948765,"url":"https://github.com/soerlemans/pcap-parser-simba2json-processor","last_synced_at":"2026-04-29T06:38:54.514Z","repository":{"id":279427815,"uuid":"938775115","full_name":"soerlemans/pcap-parser-simba2json-processor","owner":"soerlemans","description":"Highly performant C++23 PCAP parser that converts SIMBA messages to JSON. Done as a coding challenge for a High Frequency Trading (HFT) firm.","archived":false,"fork":false,"pushed_at":"2025-02-26T19:50:34.000Z","size":4438,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-24T19:46:32.950Z","etag":null,"topics":["callgrind","cpp","cpp23","high-frequency-trading","high-performance-computing","json","optimization","parser","pcap","performance","simba"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soerlemans.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-25T13:37:26.000Z","updated_at":"2025-02-26T19:50:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"35f60ae9-ba21-48d8-9854-6ef6b829feaf","html_url":"https://github.com/soerlemans/pcap-parser-simba2json-processor","commit_stats":null,"previous_names":["soerlemans/pcap-parser-simba2json-processor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/soerlemans/pcap-parser-simba2json-processor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soerlemans%2Fpcap-parser-simba2json-processor","tags_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soerlemans%2Fpcap-parser-simba2json-processor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soerlemans%2Fpcap-parser-simba2json-processor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soerlemans%2Fpcap-parser-simba2json-processor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soerlemans","download_url":"https://codeload.github.com/soerlemans/pcap-parser-simba2json-processor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soerlemans%2Fpcap-parser-simba2json-processor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32414422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T06:29:02.080Z","status":"ssl_error","status_checked_at":"2026-04-29T06:29:00.631Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["callgrind","cpp","cpp23","high-frequency-trading","high-performance-computing","json","optimization","parser","pcap","performance","simba"],"created_at":"2025-03-04T11:22:16.276Z","updated_at":"2026-04-29T06:38:54.510Z","avatar_url":"https://github.com/soerlemans.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"README\n======\nI did this project a while ago as the take-home exercise for a High Frequency Trading (HFT) firm.\nI had to write a PCAP parser that extracts SIMBA messages and converts them to JSON.\nSince this was for an HFT firm, I paid special attention to performance.\n\n## Implementation\nSince we are dealing with large blobs of binary data and performance is of the essence,\nwe use packed struct punning to quickly make sense of the data.\n\nWe start by reading the entire PCAP file into memory in a single file read.\nWe then process/parse it by keeping non-owning pointers into this large binary blob\nand masking/punning packed structs over the data.\n\n### Mixed endianness\nA big issue with this approach is that some networking protocols use big endian.\nSince Intel, AMD and ARM CPUs are (typically) little endian, we need to account for this.\n\nWhen we use struct punning, the byte order of some fields might be reversed.\nSo I wrote a function called `be2native()`:\n```cpp\n/*!\n * Network protocols are usually big endian, convert to native endian if needed.\n * @note: Function name is short for `Big 
Endian to Native`.\n */\ntemplate \u003ctypename T\u003e\ninline constexpr auto be2native(const T t_int) -\u003e T {\n  using std::endian;\n\n  T result{t_int};\n\n  if constexpr (endian::native == endian::little) {\n    result = std::byteswap(t_int);\n  } else if constexpr (endian::native != endian::big) {\n    // TODO: Error handle, unhandled endianness.\n  }\n\n  return result;\n}\n```\n\nI more or less assume that the native endian format is little endian,\nsince big endian is only really common in more niche systems,\nwhich I do not expect to support C++23 at all.\n\nBut I do assert at compile time that the system we are compiling for is little endian:\n```cpp\n  // Edge case but it is good to account for this.\n  constexpr bool is_little_endian{endian::native == endian::little};\n  static_assert(\n      is_little_endian,\n      \"The parser currently only supports systems using little endian.\");\n```\n\nThe usage is quite simple: any big endian field must go through this function.\n```cpp\nauto payload_is_ipv4(const Ethernet2Frame\u0026 t_frame) -\u003e bool {\n  bool result{false};\n\n  // Normally I would make this an enumeration but I have time constraints.\n  constexpr u16 ipv4_ether_type{0x0800};\n\n  // Must convert from big endian.\n  const u16 ether_type{be2native(t_frame.m_header-\u003em_ether_type)};\n  result = (ipv4_ether_type == ether_type);\n\n  return result;\n}\n```\n\n### Big endian system support\nTo support big endian systems I would just need to write a `le2native()` function,\nwhich does the exact opposite: it byte swaps little endian fields when the native format is big endian.\nSince I use `if constexpr()` statements, the compiler decides which branch to take at compile time.\n\n## How to compile\nHow to build and run the project:\n```bash\nbash test/download.sh\ncd src/ \u0026\u0026 make\n./parser.out\n```\n\nThe `parser.out` binary runs by default on the `.pcap` files downloaded by `test/download.sh`.\nPlease compile and run on a little endian system.\n\nI validated the JSON files by running 
them through `jq`, with the basic filter:\n```\n$ jq . \u003cJSON file\u003e\n\u003cNo errors\u003e\n```\n\nThis is just to confirm that I did not make any mistakes in the formatting of the JSON.\n\n## Performance and profiling\n### Lazy profiling\nI did some lazy profiling using just the `time` command:\n\n```\n$ time ./parser.out\nNo arguments given running defaults.\nfile: ../test/2023-10-09.1849-1906.pcap, packets captured: 2637416\nfile: ../test/2023-10-09.2349-2355.pcap, packets captured: 1133958\nfile: ../test/2023-10-10.0439-0450.pcap, packets captured: 1783642\nfile: ../test/2023-10-10.0845-0905.pcap, packets captured: 4294773\nfile: ../test/2023-10-10.0959-1005.pcap, packets captured: 1855725\nfile: ../test/2023-10-10.1359-1406.pcap, packets captured: 1388173\nfile: ../test/2023-10-10.1849-1906.pcap, packets captured: 2656226\n./parser.out  10.31s user 3.98s system 88% cpu 16.137 total\n```\n\nOverall the program spent about 10 seconds of user CPU time (16.137 seconds wall-clock).\nThis is not bad, as the input is 7.470G of data:\n\n```\n$ ls -lh test/\n-rw-rw-r-- 1 user user 1.4G Oct 11  2023 ../test/2023-10-09.1849-1906.pcap\n-rw-rw-r-- 1 user user 556M Oct 11  2023 ../test/2023-10-09.2349-2355.pcap\n-rw-rw-r-- 1 user user 843M Oct 11  2023 ../test/2023-10-10.0439-0450.pcap\n-rw-rw-r-- 1 user user 1.9G Oct 11  2023 ../test/2023-10-10.0845-0905.pcap\n-rw-rw-r-- 1 user user 696M Oct 11  2023 ../test/2023-10-10.0959-1005.pcap\n-rw-rw-r-- 1 user user 675M Oct 11  2023 ../test/2023-10-10.1359-1406.pcap\n-rw-rw-r-- 1 user user 1.4G Oct 11  2023 ../test/2023-10-10.1849-1906.pcap\n```\n\nBased on the 10.31s of user time, this means we process roughly 724.539M of data per second (about 463M per second measured against wall-clock time).\nThis is really good performance, but note that we do not extensively process most of the fields.\nIn the networking part, we just skip through those as quickly as possible to start processing/parsing SIMBA messages.\n\n## Perf profiling\nLet's also do some more extensive profiling, using the sampling profiler `perf`.\n```\n# Sample profile the program.\n$ sudo perf record -F 
999 --call-graph dwarf -- ./parser.out\n\n# Create a flamegraph from the perf.data.\n$ perf script | ~/Projects/Git/Public/FlameGraph/stackcollapse-perf.pl | ~/Projects/Git/Public/FlameGraph/flamegraph.pl \u003e perf.svg\n\n# Text report of the profiling.\n$ perf report --stdio \u003e perf.report\n```\n\nThis generates the following flamegraph:\n\n![Regular perf](prof/perf/perf.svg)\n\nThis flamegraph also shows us how much time is spent in various system level functions.\nTo create a flamegraph of just the time spent in functions within the binary,\nwe can add the `--all-user` flag to `perf record`.\nThis generates the following svg:\n\n![Only user perf](prof/perf/perf-only-user.svg)\n\n## Callgrind analysis\nLet's also run `callgrind` on the binary to get a report of how much each function costs (by default Callgrind counts executed instructions rather than actual clock cycles).\n\n```\n$ valgrind --tool=callgrind  ./parser.out\n```\n\nNow let's view the callgrind output using `kcachegrind`:\n```\n$ kcachegrind callgrind.out.198304\n```\n\nHere we get a nice graph of all the calls and how much each call costs.\n\n![Callgrind overview of global time spent](prof/callgrind/callgrind_overview.png)\n\nWe spend by far most of our time dealing with disk IO.\n\nLet's analyze where we spend the most time in our program.\nWe do this by grouping on `ELF Object`, selecting `parser.out` and then sorting on self.\nThis shows us where we spend the most time in each function itself, discounting callees.\n\n![Callgrind overview of self spent time in parser.out](prof/callgrind/callgrind_overview_self.png)\n\nWe spend most of our time in the SIMBA extraction/parsing functions.\nIt is not really a surprise that we spend a ton of time here,\nas this is the main domain logic of our 
project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoerlemans%2Fpcap-parser-simba2json-processor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoerlemans%2Fpcap-parser-simba2json-processor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoerlemans%2Fpcap-parser-simba2json-processor/lists"}