{"id":26160702,"url":"https://github.com/mrizaln/octave-ndjson","last_synced_at":"2026-04-20T17:33:07.059Z","repository":{"id":279364483,"uuid":"938557313","full_name":"mrizaln/octave-ndjson","owner":"mrizaln","description":"Newline Delimited JSON (ndjson) or JSON Lines (jsonl) parser for Octave","archived":false,"fork":false,"pushed_at":"2025-08-20T08:57:15.000Z","size":148,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-20T10:36:08.890Z","etag":null,"topics":["json","jsonl","multithreading","ndjson","octave","parser"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrizaln.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-25T06:22:22.000Z","updated_at":"2025-08-20T08:57:19.000Z","dependencies_parsed_at":"2025-08-22T01:31:26.669Z","dependency_job_id":null,"html_url":"https://github.com/mrizaln/octave-ndjson","commit_stats":null,"previous_names":["mrizaln/octave-ndjson"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mrizaln/octave-ndjson","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrizaln%2Foctave-ndjson","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrizaln%2Foctave-ndjson/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrizaln%2Foctave-ndjson/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrizaln%2Foctave-ndjson/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrizaln","download_url":"https://codeload.github.com/mrizaln/octave-ndjson/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrizaln%2Foctave-ndjson/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32057651,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["json","jsonl","multithreading","ndjson","octave","parser"],"created_at":"2025-03-11T12:19:21.592Z","updated_at":"2026-04-20T17:33:07.054Z","avatar_url":"https://github.com/mrizaln.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# octave-ndjson\n\nMultithreaded Newline Delimited JSON (`.ndjson`) or JSON Lines (`.jsonl`) parser for Octave, powered by [`simdjson`](https://github.com/simdjson/simdjson).\n\nThis library is inspired by [`octave-rapidjson`](https://github.com/Andy1978/octave-rapidjson) but the design follows [`jsondecode`](https://docs.octave.org/latest/JSON-data-encoding_002fdecoding.html#index-jsondecode) function.\n\n## Motivation\n\nI have a simulation that emits data every few seconds. 
The data is encoded in [Newline Delimited JSON](https://github.com/ndjson/ndjson-spec)/[JSON Lines](https://jsonlines.org/).\n\nBefore I created this, I always had to convert my JSONL into JSON using `jq`. For a while it was okay, but the more data the JSONL contained, the longer the conversion took and the higher the memory consumption became. Once my JSONL file reached about `100k` JSON document entries and about `200MiB` in size, `jq` consumed so much memory that the OOM killer was triggered and killed it.\n\nAfter looking around on the internet for a while, I found no parser for this JSON format for Octave, so I decided to create my own.\n\n## Usage\n\n\u003e For more comprehensive usage information, read the [Help](#help) section.\n\nYou can think of JSONL as a JSON document with an array as its root, written one element per line:\n\n\u003e JSON\n\n```json\n[\n  { \"a\": 1, \"b\": 2 },\n  { \"a\": 3, \"b\": 4 }\n]\n```\n\n\u003e JSONL\n\n```json\n{ \"a\": 1, \"b\": 2 }\n{ \"a\": 3, \"b\": 4 }\n```\n\nThis format is used primarily for streaming data (like my simulation).\n\nWhile the `jq` tool can convert data from `json` to `jsonl` (and vice versa), the conversion adds overhead that takes quite a while if the data is big (and may use too many resources, as illustrated above). 
Using this format is also just easier to handle, since we don't need to add and remove the square brackets every time new data is appended to the data already received.\n\n### Parsing JSONL\n\nThere are two functions that you can use to parse a JSONL document.\n\n- `ndjson_load_string`, which takes a JSONL string as its parameter;\n\n  ```\n  octave:2\u003e x = ndjson_load_string(\"{ \\\"a\\\": 1, \\\"b\\\": 3.14 }\\n{ \\\"a\\\": 2, \\\"b\\\": 2.71 }\")\n  x =\n\n    2x1 struct array containing the fields:\n\n      a\n      b\n\n  octave:3\u003e x(1)\n  ans =\n\n    scalar structure containing the fields:\n\n      a = 1\n      b = 3.1400\n\n  octave:4\u003e x(2)\n  ans =\n\n    scalar structure containing the fields:\n\n      a = 2\n      b = 2.7100\n\n  octave:5\u003e\n\n  ```\n\n- and `ndjson_load_file`, which takes a filepath (string) as its parameter.\n\n  Given a `data.jsonl` file in the Octave working directory with the following content\n\n  ```json\n  { \"a\": 1, \"b\": 3.14 }\n  { \"a\": 2, \"b\": 2.71 }\n  ```\n\n  you can load it with the function like so:\n\n  ```\n  octave:5\u003e x = ndjson_load_file('data.jsonl')\n  x =\n\n    2x1 struct array containing the fields:\n\n      a\n      b\n\n  octave:6\u003e x(1)\n  ans =\n\n    scalar structure containing the fields:\n\n      a = 1\n      b = 3.1400\n\n  octave:7\u003e x(2)\n  ans =\n\n    scalar structure containing the fields:\n\n      a = 2\n      b = 2.7100\n\n  ```\n\n### Parsing JSON\n\nSince a JSON file is just a JSONL file with a single document, you can also use this library to parse it.\n\n\u003e A JSON file named `data.json`\n\n```json\n{ \"a\": 1, \"b\": 3.14 }\n```\n\n```\noctave:3\u003e ndjson_load_file('data.json')\nans =\n\n  scalar structure containing the fields:\n\n    a = 1\n    b = 3.1400\n\n```\n\n## Building\n\nThis is C++ code, so you need to compile it before using it.\n\n### Build dependencies\n\n- C++20 capable compiler\n- CMake 3.16+\n\n### Library 
dependencies\n\n- Octave headers\n- simdjson\n\nThe simdjson library is fetched directly by CMake, so there is no need to prepare it, but the Octave headers must be installed on your system first:\n\n- Fedora\n\n  `sudo dnf install octave-devel`\n\n- Ubuntu\n\n  `sudo apt install octave-dev`\n\nFor other distros, search your distro's repository for the equivalent package and adjust accordingly.\n\n\u003e For Windows... idk\n\n### Compiling\n\nThe compile step is quite simple; you just need to run these commands one after another:\n\n```sh\ncmake -S . -B build -DCMAKE_BUILD_TYPE=Release      # you can use -G Ninja if you want to use Ninja instead of Make\ncmake --build build\n```\n\nThe resulting `.oct` files will be in the `./build` directory relative to the project root. You can move them to your Octave working directory and use them!\n\n### Generate code documentation\n\n\u003e - The documentation is only needed if you want to understand how the code works. Do use it if you want to contribute back to the repo :D\n\u003e - The codebase is short enough that you should be able to understand everything by reading the comments in the source code directly, so this step might not be necessary for you\n\nTo generate the documentation, you need to have `doxygen` installed on your system.\n\n```sh\ndoxygen docs/Doxygen\n```\n\nThe documentation will be generated inside the `docs/doxygen/html` directory. Use your favorite browser (or HTML viewer) to view it.\n\n```sh\nfirefox docs/doxygen/html/index.html\n```\n\n## Note\n\n- The multithreaded parser is sensitive to newlines. 
Please reserve newlines for separating documents only (like a good NDJSON/JSONL file).\n- If you have a prettified JSON file, you probably want to unprettify it first if you want to use the multithreading capability. If you don't want to, you can always fall back to the single-thread mode by setting the `'threading'` parameter to `'single'`.\n\n  \u003e A JSON file named `data.json`\n\n  ```json\n  {\n    \"a\": 1,\n    \"b\": 3.14\n  }\n  ```\n\n  ```\n  octave:12\u003e a = ndjson_load_file('data.json', 'threading', 'single')\n  a =\n\n    scalar structure containing the fields:\n\n      a = 1\n      b = 3.1400\n  ```\n\n## Benchmark\n\nThe benchmark was done on an Intel(R) Core(TM) i5-10500H (6 cores/12 threads) with the frequency locked to 2.5GHz. The file used to benchmark the functions is a JSON/JSONL file with `199034` document entries (array elements if JSON). Each document is `1969.17 ± 269.162` bytes long, which amounts to a `374 MiB` file.\n\n| function             | note                   |        time | speedup |\n| :------------------- | :--------------------- | ----------: | ------: |\n| `jsondecode`         | native octave function | `16.35900s` | `1.00x` |\n| `ndjson_load_string` | single thread, relaxed |  `9.17047s` | `1.78x` |\n| `ndjson_load_string` | multi thread, relaxed  |  `3.22453s` | `5.07x` |\n| `ndjson_load_file`   | single thread, relaxed |  `8.01613s` | `2.04x` |\n| `ndjson_load_file`   | multi thread, relaxed  |  `2.54787s` | `6.42x` |\n| `load_json`          | octave-rapidjson       | `27.85150s` | `0.59x` |\n\n\u003e - I'm actually quite disappointed with the result. The speedup from single thread to multi thread is only `3.15x` for `ndjson_load_file` (and about `2.84x` for `ndjson_load_string`). This is not a very good value, considering the number of cores and threads my test computer has. 
But, the increase is not marginal either, so it's still a win.\n\u003e - `simdjson`'s dom parser on `ndjson` is multithreaded by default (2 threads: main thread and worker thread--It is detailed [here](https://github.com/simdjson/simdjson/blob/f3b034ac38060303c856c83f51f4156a4d1da8c1/doc/parse_many.md#threads)). So even when `ndjson_load_string` or `ndjson_load_file` ran in single thread mode, it may spawn two threads (it can be disabled when compiling, refer to `simdjson` documentation).\n\n## Help\n\nThis is the full usage information of the two functions\n\n\u003e `ndjson_load_string`\n\n````\n============================== ndjson_load_string help page ==============================\nsignature:\n    ndjson_load_string(\n        json_string: string,         % positional\n        [mode      : enum_string],   % optional property\n        [threading : enum_string]    % optional property\n    )\n\nparameters:\n    \u003e json_string : An NDJSON/JSON Lines string.\n\n    \u003e mode : Enumeration that specifies the strictness of the schema comparison.\n        - strict   : Documents must have the same schema.\n        - dynarray : Documents have the same schema but the number of elements in array\n                     and its types can vary.\n        - relaxed  : Documents can have different schemas.\n\n    \u003e threading : Threading mode.\n        - single : Run in single-thread mode.\n        - multi  : Run in multi-thread mode.\n\nbehavior:\n    By default the [ndjson_load_string] function will parse NDJSON/JSON Lines ([jsonl] from\n    hereon) in strict mode i.e. all the documents on the [jsonl] must have the same JSON\n    structure (the number of elements of an array, the type of each element, type type\n    of object value, and the order of the occurence of the key in the document).\n\n    The [ndjson_load_string] function will run in multithreaded mode by default. The only\n    caveat is that you must have each JSON document at each line (don't prettify). 
So, the\n    input must be like this:\n\n    ```\n        { \"a\": 1, \"b\": [4, 5] }\n        { \"a\": 2, \"b\": [6, 7] }\n    ```\n\n    This one will result in an error:\n\n    ```\n        {                           // \u003c- parsing ends here: incomplete object\n            \"a\": 1,\n            \"b\": [4, 5]\n        }\n        {\n            \"a\": 2,\n            \"b\": [6, 7]\n        }\n    ```\n\n    The single-thread mode don't have this constraint.\n\nexample:\n    For example, a variable [data] which is a string with content:\n    ```\n        { \"a\": 1, \"b\": [4, 5] }\n        { \"a\": 2, \"b\": [6, 7, 8] }\n    ```\n\n    if parsed with default parameters will return an error with message:\n\n    ```\n        octave\u003e ndjson_load_string(data)\n\n        error: Parsing error\n            \u003e Mismatched schema, all documents must have same schema (dynamic_array: false)\n\n        % rest of the message...\n    ```\n\n    You can relax the schema comparison by setting the `mode` parameter to 'dynarray'\n    (or 'relaxed' if you want to ignore the schema comparison entirely):\n\n    ```\n        octave\u003e a = ndjson_load_string(data, 'mode', 'dynarray');\n        octave\u003e % success!\n    ```\n==========================================================================================\n````\n\n\u003e `ndjson_load_file`\n\n````\n=============================== ndjson_load_file help page ===============================\nsignature:\n    ndjson_load_file(\n        filepath  : string,         % positional\n        [mode     : enum_string],   % optional property\n        [threading: enum_string]    % optional property\n    )\n\nparameters:\n    \u003e filepath : Must be a string that points to a file.\n\n    \u003e mode : Enumeration that specifies the strictness of the schema comparison.\n        - strict   : Documents must have the same schema.\n        - dynarray : Documents have the same schema but the number of elements in array\n          
           and its types can vary.\n        - relaxed  : Documents can have different schemas.\n\n    \u003e threading : Threading mode.\n        - single : Run in single-thread mode.\n        - multi  : Run in multi-thread mode.\n\nbehavior:\n    By default the [ndjson_load_file] function will parse NDJSON/JSON Lines ([jsonl] from\n    hereon) in strict mode i.e. all the documents on the [jsonl] must have the same JSON\n    structure (the number of elements of an array, the type of each element, type type\n    of object value, and the order of the occurence of the key in the document).\n\n    The [ndjson_load_file] function will run in multithreaded mode by default. The only\n    caveat is that you must have each JSON document at each line (don't prettify). So, the\n    input must be like this:\n\n    ```\n        { \"a\": 1, \"b\": [4, 5] }\n        { \"a\": 2, \"b\": [6, 7] }\n    ```\n\n    This one will result in an error:\n\n    ```\n        {                           // \u003c- parsing ends here: incomplete object\n            \"a\": 1,\n            \"b\": [4, 5]\n        }\n        {\n            \"a\": 2,\n            \"b\": [6, 7]\n        }\n    ```\n\n    The single-thread mode don't have this constraint.\n\nexample:\n    For example, a [data.jsonl] file with content:\n    ```\n        { \"a\": 1, \"b\": [4, 5] }\n        { \"a\": 2, \"b\": [6, 7, 8] }\n    ```\n\n    if parsed with default parameters will return an error with message:\n\n    ```\n        octave\u003e ndjson_load_file('data.jsonl')\n\n        error: Parsing error\n            \u003e Mismatched schema, all documents must have same schema (dynamic_array: false)\n\n        % rest of the message...\n    ```\n\n    You can relax the schema comparison by setting the `mode` parameter to 'dynarray'\n    (or 'relaxed' if you want to ignore the schema comparison entirely):\n\n    ```\n        octave\u003e a = ndjson_load_file('data.jsonl', 'mode', 'dynarray');\n        octave\u003e % success!\n  
  ```\n==========================================================================================\n````\n\n## TODO\n\n- [ ] ~~Eliminate the constraint of each JSON document needed to be separated by newline.~~\n  \u003e I essentially need to create a simpler JSON parser for this, not worth it (I've tried).\n- [x] Optimize ~~`parse_json_value`~~ `parse_octave_value` function.\n  \u003e Using dom parser is better apparently. Also, I kinda copied `jsondecode` source code, so that's that.\n- [ ] Add on-demand file read approach.\n  \u003e Line-by-line buffering mechanism is the best approach I guess.\n- [ ] Add the ability to set number of threads at runtime.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrizaln%2Foctave-ndjson","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrizaln%2Foctave-ndjson","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrizaln%2Foctave-ndjson/lists"}