{"id":14062654,"url":"https://github.com/hrbrmstr/ndjson","last_synced_at":"2025-05-09T01:33:45.857Z","repository":{"id":62459169,"uuid":"67601145","full_name":"hrbrmstr/ndjson","owner":"hrbrmstr","description":":hotsprings: Wicked-Fast Streaming 'JSON' ('ndjson') Reader in R","archived":false,"fork":false,"pushed_at":"2022-10-16T17:26:31.000Z","size":726,"stargazers_count":56,"open_issues_count":5,"forks_count":10,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-05-03T17:47:13.220Z","etag":null,"topics":["json","ndjson","r","r-cyber","rstats"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-09-07T11:38:38.000Z","updated_at":"2025-03-22T11:21:33.000Z","dependencies_parsed_at":"2022-11-02T00:45:26.125Z","dependency_job_id":null,"html_url":"https://github.com/hrbrmstr/ndjson","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fndjson","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fndjson/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fndjson/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fndjson/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hrbrmstr","download_url":"https://codeload.github.com/hrbrmstr/ndjson/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"gith
ub","repositories_count":253174827,"owners_count":21865918,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["json","ndjson","r","r-cyber","rstats"],"created_at":"2024-08-13T07:01:37.993Z","updated_at":"2025-05-09T01:33:45.833Z","avatar_url":"https://github.com/hrbrmstr.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"---\noutput: rmarkdown::github_document\n---\n```{r pkg-knitr-opts, include=FALSE}\nhrbrpkghelpr::global_opts()\n```\n\n```{r badges, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::stinking_badges()\n```\n\n```{r description, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::yank_title_and_description()\n```\n\nPretty much an Rcpp/C++17 wrapper for \u003chttps://github.com/nlohmann/json\u003e\n\nThe goal is to create a completely \"flat\" `data.frame`-like structure from ndjson records in plain text ndjson files or gzip'd ndjson files.\n\n### Installation guidance for Linux/BSD-ish systems\n\nCRAN has binaries for Windows and macOS. To build this on UNIX-like\nsystems, you need at least g++4.9 or clang++. 
This is a forced requirement by the ndjson library.\n\nThe least painful way to do this is to install gcc \u003e= 4.9 (and you should install `ccache` while you're at it) and modify `~/.R/Makevars` thusly:\n\n    # Use whatever version of (g++ \u003e=4.9 or clang++) that you downloaded\n    VER=-4.9\n    CC=ccache gcc$(VER)\n    CXX=ccache g++$(VER)\n    SHLIB_CXXLD=g++$(VER)\n    FC=ccache gfortran\n    F77=ccache gfortran\n\n## Why `ndjson` + Examples\n\nAn example of such files is the output from Rapid7 internet-wide scans, such as their [HTTPS study](https://opendata.rapid7.com/sonar.https/). A gzip'd extract of 100,000 records from one of those scans weighs in at about 171MB. The records sometimes contain heavily nested JSON elements depending on how comprehensive the certificate data and other fields were. A typical record will look like this:\n\n    {\n      \"vhost\": \"teamchat.buzzpoints.com\",\n      \"host\": \"52.87.143.83\",\n      \"certsubject\": {\n        \"CN\": \"teamchat.buzzpoints.com\"\n      },\n      \"ip\": \"52.87.143.83\",\n      \"data\": 
\"SFRUUC8xLjEgMjAwIE9LDQpTZXJ2ZXI6IG5naW54LzEuNC42IChVYnVudHUpDQpEYXRlOiBNb24sIDIyIEF1ZyAyMDE2IDE3OjE3OjAwIEdNVA0KQ29udGVudC1UeXBlOiB0ZXh0L2h0bWw7IGNoYXJzZXQ9dXRmLTgNClRyYW5zZmVyLUVuY29kaW5nOiBjaHVua2VkDQpDb25uZWN0aW9uOiBjbG9zZQ0KVmFyeTogQWNjZXB0LUVuY29kaW5nDQpYLVBvd2VyZWQtQnk6IEV4cHJlc3MNClN0cmljdC1UcmFuc3BvcnQtU2VjdXJpdHk6IG1heC1hZ2U9NjMwNzIwMDA7IGluY2x1ZGVTdWJkb21haW5zOyBwcmVsb2FkDQpYLUZyYW1lLU9wdGlvbnM6IERFTlkNClgtQ29udGVudC1UeXBlLU9wdGlvbnM6IG5vc25pZmYNCkNvbnRlbnQtRW5jb2Rpbmc6IGd6aXANCg0KNTVjDQofiwgAAAAAAAADrVdbb5tIFH5ufgVlFanVFsNwZ2u7ap10N6tuE7nOqvtkDcPBngQYFsYu6a/fA/iCE8ci0j5gi5nvfOd+Zhi+vriezP65uVSWMk3GZ8PtH9BofKYow4Rn90oByUgt5UMC5RJAqop8yGGkSqikzspSVVhCy3KkzucpSBCFhovzuaosC4hHqg4GiV3f9v3IsYAZcewRMCJiBZZNTMsKPBITEofxAMU+tAzzmqGAUqwKBiNZrEAd/0/WyCWk0KgipgchGhISJ7ZjYtHANwzDsplrOyGzqG0Rw6CUquOzs7NhyQqey67rd3RN21V1vHV9XqwyyVOYM5HFfDGfKyPlz2/XXwc5LUp4EwETEdxOryYizUUGmXyjnnufzk2z9XsKCdAS8P3c+oi/f13OLq+n57ZBBuaA1MvmBH9vbj99uZrMv13OZldff/+2gSOPd9ECptfXs/nt9MtmxzSXUuZlw/n53PwsgaZsSeUgXP38mQueyXLARLrj34rPbz7O/pjfTC8/X33fUe1QlDGB3paTxtUJTRKIWlSdsNYQupJilUdUwt9QlFxkOxov8GLDZrFtOUEMvmEDlgSYnu+5rmfEAXguJczdO/2EagoxlsiShsk+YK7nYOKAmMyyXcNzAz8w/TCwfZdaoUsj0/SsyAncvROPDZyIIhJrurOTxq4RmBHWbhg74GLJeKFDQjfyIKCmFWFhh07oxbWAd6G+fft+qLdVgWWDNXuybpSyYHWH+L4PIWMmw5o0DDPwTayUmFBKbMNFLa6JHhFncLdrkLsn/dFRi+UquUxgPBXsHuRggrke6u3S2ash1hpVMP9YkXKkrmSs+aqij7c7da1o8O+Kr0cqlrHEKtXqjsc+b982rV/Pivc7npM0UOUcc9Vh0MhzKr9rtx+1uj+o5JjajszV5QiiBY6CraUZTXEOxQVdpGhkB/m6S96iIl7KgocriUXYQS4SEdLkKbxA7dmiC4QMimPINYcfuSi66n/wSC5HEaw5A615eafwjEtOE61kNIEReaektOLpKt0vrEoomre6okeZeGpUWtI8TzhD20SmzXgCE5GIomPlL4ZtW249sjZpbp1/KniVUozkPqM6rxdKPRELoaelRG6N2HaFzyDPFh/WI+s0aTvwnmMMC/ED3WtBgypNjhE2o1ljJ1zaH0cpzXgMJUZ9c8oc2L/ZxH4R2U7TXpgtC5FiZiAspShA4xLSLVEzKX/T9RYzWAixSPC8EKm+hesR9g9P9EywOMyy9C7LovuQ5/c0FFGGJ+QdhwVjcROvvVKOzqtKyX8CHpU0e9geo43herle/Iph2Vqh44EKstRjijUksgFuH/HjgNJ03AqfQ1pM3TOUc8QeZPYZS0lgVvj0pkVsL1rTr4iJc6e9S7RBOGEtYvvQBm4V9A9B0CsCrl25dm9D3cN+ea1p2IrPxNb2K/tECLolvSkErRHpEwnLrKwTWTvG3Yj04SZuRU5E+Rh3I9Ll1r
R6Ru0DUy5xhrKVVA6KuhFTGsOUI9GqtBa9mQGPmgb3mqZpz7a9qnqIgoYXE7bcyG+60vEqx9v1S9eNxyJaA+3603HlMXjX9a5RuUY//gb6Un7PrDzM+ZGJ+NgkrYG+mN+tPMx7L/4a+lJ+QvDAIdhrfTRswC/WYRo4eHpmAYE1+MU62oOzpx9HTtketUocnMtOz2xvwC/2w0f3/b6xasEdHUN92XxHDvFcfMDb8KthxNfbj8UcrxtaImhUX7NwFBxsljnP8LrVrB9sFMAkUcdDHZlqoSeb5qlNvMI8L2mf2nQ6m1uKzT9etvXWQfS3+Yr+DxJBEERWDwAADQowDQoNCg==\",\n      \"port\": \"443\"\n    }\n\nA `system.time(df \u003c- stream_in(\"https-extract.json.gz\"))` results in:\n\n       user  system elapsed \n     14.822   0.224  15.189 \n\non a 13\" MacBook Pro and produces:\n\n    Classes ‘data.table’ and 'data.frame': 100000 obs. of  36 variables:\n     $ certsubject.CN                 : chr  \"*.tio.ch\" \"*.starwoodhotels.com\" \"a.ssl.fastly.net\" \"a.ssl.fastly.net\" ...\n     $ data                           : chr  \"SFRUUC8xLjEgNDAzIEZvcmJpZGRlbg0KU2VydmVyOiBjbG91ZGZsYXJlLW5naW54DQpEYXRlOiBNb24sIDIyIEF1ZyAyMDE2IDE3OjE2OjE2IEdNVA0KQ29udGVudC1\"| __truncated__ \"SFRUUC8xLjAgNDAwIEJhZCBSZXF1ZXN0DQpTZXJ2ZXI6IEFrYW1haUdIb3N0DQpNaW1lLVZlcnNpb246IDEuMA0KQ29udGVudC1UeXBlOiB0ZXh0L2h0bWwNCkNvbnR\"| __truncated__ \"SFRUUC8xLjEgNTAwIERvbWFpbiBOb3QgRm91bmQNClNlcnZlcjogVmFybmlzaA0KUmV0cnktQWZ0ZXI6IDANCmNvbnRlbnQtdHlwZTogdGV4dC9odG1sDQpDYWNoZS1\"| __truncated__ \"SFRUUC8xLjEgNTAwIERvbWFpbiBOb3QgRm91bmQNClNlcnZlcjogVmFybmlzaA0KUmV0cnktQWZ0ZXI6IDANCmNvbnRlbnQtdHlwZTogdGV4dC9odG1sDQpDYWNoZS1\"| __truncated__ ...\n     $ host                           : chr  \"104.20.28.6\" \"104.80.186.186\" \"151.101.255.54\" \"151.101.158.15\" ...\n     $ ip                             : chr  \"104.20.28.6\" \"104.80.186.186\" \"151.101.255.54\" \"151.101.158.15\" ...\n     $ port                           : chr  \"443\" \"443\" \"443\" \"443\" ...\n     $ vhost                          : chr  \"104.20.28.6\" \"104.80.186.186\" \"a.ssl.fastly.net\" \"a.ssl.fastly.net\" ...\n     $ certsubject.C                  : chr  NA \"US\" \"US\" \"US\" ...\n     $ certsubject.L                  : chr  NA \"Stamford\" 
\"San Francisco\" \"San Francisco\" ...\n     $ certsubject.O                  : chr  NA \"STARWOOD HOTELS AND RESORTS WORLDWIDE, INC.\" \"Fastly, Inc.\" \"Fastly, Inc.\" ...\n     $ certsubject.OU                 : chr  NA \"IT Solutions\" NA NA ...\n     $ certsubject.ST                 : chr  NA \"Connecticut\" \"California\" \"California\" ...\n     $ certsubject.emailAddress       : chr  NA NA NA NA ...\n     $ certsubject.UNDEF              : chr  NA NA NA NA ...\n     $ certsubject.businessCategory   : chr  NA NA NA NA ...\n     $ certsubject.postalCode         : chr  NA NA NA NA ...\n     $ certsubject.serialNumber       : chr  NA NA NA NA ...\n     $ certsubject.street             : chr  NA NA NA NA ...\n     $ certsubject.SN                 : chr  NA NA NA NA ...\n     $ certsubject.unstructuredName   : chr  NA NA NA NA ...\n     $ certsubject.ITU-T              : chr  NA NA NA NA ...\n     $ certsubject.GN                 : chr  NA NA NA NA ...\n     $ certsubject.description        : chr  NA NA NA NA ...\n     $ certsubject.subjectAltName     : chr  NA NA NA NA ...\n     $ certsubject.name               : chr  NA NA NA NA ...\n     $ certsubject.DC                 : chr  NA NA NA NA ...\n     $ certsubject.postOfficeBox      : chr  NA NA NA NA ...\n     $ certsubject.dnQualifier        : chr  NA NA NA NA ...\n     $ certsubject.generationQualifier: chr  NA NA NA NA ...\n     $ certsubject.initials           : chr  NA NA NA NA ...\n     $ certsubject.pseudonym          : chr  NA NA NA NA ...\n     $ certsubject.title              : chr  NA NA NA NA ...\n     $ certsubject                    : int  NA NA NA NA NA NA NA NA NA NA ...\n     $ certsubject.unstructuredAddress: chr  NA NA NA NA ...\n     $ certsubject.UID                : chr  NA NA NA NA ...\n     $ certsubject.mail               : chr  NA NA NA NA ...\n     $ certsubject.Mail               : chr  NA NA NA NA ...\n     - attr(*, \".internal.selfref\")=\u003cexternalptr\u003e \n\nAll of the 
certificate sub-field data elements have been expanded and we have a highly performant `data.table` to work with. Just go see what you have to do in `jsonlite` to get a similar output (and how long it will take).\n\n`pryr::object_size(df)` shows that this object consumes `394 MB`, which means we can read in many more extracts comfortably on a reasonably configured system and most (if not all) of it on a well-configured AWS box.\n\nHowever, if you do end up trying to work with that scan data, it's highly recommended that you use `jq` to filter out the fields or records you want into a more compact ndjson file.\n\n## What's inside the tin?\n\nThe following functions are implemented:\n\n- `stream_in`:\tStream in ndjson from a file (handles `.gz` files)\n- `validate`:\tValidate JSON records in an ndjson file (handles `.gz` files)\n- `flatten`: Flatten a character vector of individual JSON lines\n\nThere are no current plans for a `stream_out()` function since `jsonlite::stream_out()` does a great job tossing `data.frame`-like structures out to an ndjson file.\n\n## What's Inside The Tin\n\nThe following functions are implemented:\n\n```{r ingredients, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::describe_ingredients()\n```\n\n## Installation\n\n```{r install-ex, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::install_block()\n```\n\n## Usage\n\n```{r vers, message=FALSE, warning=FALSE, error=FALSE, cache=FALSE}\nlibrary(ndjson)\n\n# current version\npackageVersion(\"ndjson\")\n```\n\n```{r ex1}\nflatten('{\"top\":{\"next\":{\"final\":1,\"end\":true},\"another\":\"yes\"},\"more\":\"no\"}')\n\nf \u003c- system.file(\"extdata\", \"test.json\", package=\"ndjson\")\ngzf \u003c- system.file(\"extdata\", \"testgz.json.gz\", package=\"ndjson\")\n\ndplyr::glimpse(ndjson::stream_in(f))\ndplyr::glimpse(ndjson::stream_in(gzf))\n\ndplyr::glimpse(jsonlite::stream_in(file(f), flatten=TRUE, verbose=FALSE))\ndplyr::glimpse(jsonlite::stream_in(gzfile(gzf), 
flatten=TRUE, verbose=FALSE))\n```\n\n## ndjson Metrics\n\n```{r cloc, echo=FALSE}\ncloc::cloc_pkg_md()\n```\n\n## Code of Conduct\n\nPlease note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fndjson","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Fndjson","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fndjson/lists"}