{"id":27624301,"url":"https://github.com/ffdev-info/jsonid","last_synced_at":"2026-01-04T19:09:48.715Z","repository":{"id":287783885,"uuid":"964720703","full_name":"ffdev-info/jsonid","owner":"ffdev-info","description":"Identification of JSON (JSONL, YAML, and TOML) objects: JSONID","archived":false,"fork":false,"pushed_at":"2025-12-15T16:07:52.000Z","size":623,"stargazers_count":8,"open_issues_count":46,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-18T21:48:31.732Z","etag":null,"topics":["archives","code4lib","digipres","digital-preservation","file-formats","format-identification","glam","json","jsonl","toml","yaml"],"latest_commit_sha":null,"homepage":"https://ffdev-info.github.io/jsonid/registry/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ffdev-info.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-04-11T17:13:00.000Z","updated_at":"2025-12-15T16:06:47.000Z","dependencies_parsed_at":"2025-06-09T21:24:59.051Z","dependency_job_id":"6e9e954f-a6b9-4203-b813-4820822050d8","html_url":"https://github.com/ffdev-info/jsonid","commit_stats":null,"previous_names":["ffdev-info/jsonid"],"tags_count":27,"template":false,"template_full_name":null,"purl":"pkg:github/ffdev-info/jsonid","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ffdev-info%2Fjsonid","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ffd
ev-info%2Fjsonid/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ffdev-info%2Fjsonid/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ffdev-info%2Fjsonid/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ffdev-info","download_url":"https://codeload.github.com/ffdev-info/jsonid/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ffdev-info%2Fjsonid/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28206372,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2026-01-04T02:00:06.065Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archives","code4lib","digipres","digital-preservation","file-formats","format-identification","glam","json","jsonl","toml","yaml"],"created_at":"2025-04-23T11:28:28.855Z","updated_at":"2026-01-04T19:09:48.707Z","avatar_url":"https://github.com/ffdev-info.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# JSONID\n\n\u003c!-- markdownlint-disable --\u003e\n\u003cimg\n    src=\"https://github.com/ffdev-info/jsonid/blob/main/static/images/JSON_logo-crockford.png?raw=true\"\n    alt=\"JSON ID logo based on JSON Logo by Douglas Crockford\"\n    width=\"200px\" /\u003e\n\n\u003c!-- markdownlint-enable 
--\u003e\n\n[**JSONID**][json-1]entification tool and ruleset. JSONID can be downloaded\nfrom pypi.org.\n\n\u003c!-- markdownlint-disable--\u003e\n\n[![PyPI - Version](https://img.shields.io/pypi/v/jsonid?style=plastic\u0026color=purple)][pypi-json-id-1]\u0026nbsp;[![Static Badge][shield-io]][coptr-1]\n\n[json-1]: https://www.json.org/json-en.html\n[pypi-json-id-1]: https://pypi.org/project/jsonid/\n[shield-io]: https://img.shields.io/badge/COPTR-JSONID-purple\n[coptr-1]: https://coptr.digipres.org/JSONID\n\n## Contents\n\n\u003c!-- via: https://luciopaiva.com/markdown-toc/ --\u003e\n\n- [Before you begin](#before-you-begin)\n  - [MacOS](#macos)\n  - [Windows](#windows)\n  - [Linux](#linux)\n- [Introduction to JSONID](#introduction-to-jsonid)\n- [Why?](#why)\n  - [Encodings](#encodings)\n  - [Enxodings explained](#enxodings-explained)\n- [What does JSONID get you?](#what-does-jsonid-get-you)\n- [Ruleset](#ruleset)\n  - [Backed by tests](#backed-by-tests)\n- [Sample files](#sample-files)\n  - [Integration files](#integration-files)\n  - [Fundamental examples](#fundamental-examples)\n- [Registry](#registry)\n  - [Registry examples](#registry-examples)\n  - [Local rules](#local-rules)\n- [PRONOM](#pronom)\n- [Output format](#output-format)\n  - [Agent out](#agent-out)\n- [Lookup](#lookup)\n  - [Core formats](#core-formats)\n  - [Doctype formats](#doctype-formats)\n- [JSONL](#jsonl)\n  - [Handling JSONL](#handling-jsonl)\n- [Analysis](#analysis)\n  - [Example analysis](#example-analysis)\n  - [JSONL technical metadata](#jsonl-technical-metadata)\n- [Utils](#utils)\n  - [json2json](#json2json)\n- [Docs](#docs)\n- [Developer install](#developer-install)\n  - [pip](#pip)\n  - [tox](#tox)\n  - [pre-commit](#pre-commit)\n- [Packaging](#packaging)\n  - [pyproject.toml](#pyprojecttoml)\n  - [Versioning](#versioning)\n  - [Local packaging](#local-packaging)\n  - [Publishing](#publishing)\n\n## Before you begin\n\nJSONID should run out of the box but it can be difficult to get 
everything\nrunning correctly across all platforms without some effort. Additional\ninstall instructions are listed below before you dig into everything in\nmore detail.\n\n### MacOS\n\n* MacOS users may need to run `brew install libmagic` to install libmagic\ndependencies associated with compressed JSONL.\n\n### Windows\n\n* There are no known exceptions for Windows at present. That said, the\nabsence of libmagic on Windows means compressed JSONL cannot be\nidentified just yet.\n\n### Linux\n\n* There are no known exceptions for Linux users.\n\n## Introduction to JSONID\n\nJSONID is designed to identify data we recognize as JSON, YAML, TOML,\nand JSONL (currently a non-exhaustive list that can easily be extended).\n\nJSON, YAML, TOML, and JSONL are all serialization formats; that is, when a\ncomputer program wants to persist data from memory to disk, it serializes\nthe data into a serialization format. When the same program wants to read\nthe data back into memory, it deserializes the structures from disk. These\nformats are also called \"serde,\" a combination of the words\nser(ialize) and de(serialize).\n\nJSONID borrows from the Python approach to ask forgiveness rather than\npermission (EAFP) to attempt to open every object it scans and see if it\nparses as JSON. If it doesn't, we move along. 
If it does, we then have an\nopportunity to identify the characteristics of the JSON we have opened.\n\nPython being high-level also provides an easier path to processing files\nand parsing JSON quickly with very little other knowledge required\nof the underlying data structure.\n\n## Why?\n\nConsider these equivalent forms:\n\n```json\n{\n    \"key 1\": \"value\",\n    \"key 2\": \"value\"\n}\n```\n\n```json\n{\n    \"key 2\": \"value\",\n    \"key 1\": \"value\"\n}\n```\n\nPRONOM signatures are not expressive enough for complicated JSON objects.\n\nIf I want DROID to find `key 1` I have to use a wildcard, so I would write\nsomething like:\n\n```text\nBOF: \"7B*226B6579203122\"\nEOF: \"7D\"\n```\n\nBut if I then want to match on `key 2` as well as `key 1` things start getting\ncomplicated as they aren't guaranteed by the JSON specification to be in the\nsame \"position\" (if we think about order visually). When other keys are used in\nthe object they aren't even guaranteed to be next to one another.\n\nThis particular example is a 'map' object whose most important property\nis consistent retrieval of information through its \"keys\". Further\ncomplexity can be added when we are dealing with maps embedded in a \"list\" or\n\"array\", or simply just maps of arbitrary depth.\n\nJSONID tries to compensate for JSON's complexities by using the format's\nown strengths to parse binary data as JSON and then, if that is successful,\nuse a JSON-inspired grammar to describe keys and key-value pairs as \"markers\"\nthat can potentially identify the JSON objects that we are looking at, or\nat least narrow down the potential instances of JSON objects that we might\nbe looking at.\n\n### Encodings\n\nA better reason might appear if we look at encodings. Look at the following\nhexdumps:\n\n#### UTF-16LE\n\n```text\n$ hexdump -C UTF-16LE-map.json\n00000000  7b 00 22 00 61 00 22 00  3a 00 20 00 22 00 62 00  |{.\".a.\".:. 
.\".b.|\n00000010  22 00 7d 00 0a 00                                 |\".}...|\n00000016\n```\n\n#### UTF-16BE\n\n```text\n$ hexdump -C UTF-16BE-map.json\n00000000  00 7b 00 22 00 61 00 22  00 3a 00 20 00 22 00 62  |.{.\".a.\".:. .\".b|\n00000010  00 22 00 7d 00 0a                                 |.\".}..|\n00000016\n```\n\n#### UTF-32BE\n\n```text\n$ hexdump -C UTF-32BE-map.json\n00000000  00 00 00 7b 00 00 00 22  00 00 00 61 00 00 00 22  |...{...\"...a...\"|\n00000010  00 00 00 3a 00 00 00 20  00 00 00 22 00 00 00 62  |...:... ...\"...b|\n00000020  00 00 00 22 00 00 00 7d  00 00 00 0a              |...\"...}....|\n0000002c\n```\n\n#### UTF-32LE\n\n```text\n$ hexdump -C UTF-32LE-map.json\n00000000  7b 00 00 00 22 00 00 00  61 00 00 00 22 00 00 00  |{...\"...a...\"...|\n00000010  3a 00 00 00 20 00 00 00  22 00 00 00 62 00 00 00  |:... ...\"...b...|\n00000020  22 00 00 00 7d 00 00 00  0a 00 00 00              |\"...}.......|\n0000002c\n```\n\n#### UTF-8\n\n```text\n$ hexdump -C UTF-8-map.json\n00000000  7b 22 61 22 3a 20 22 62  22 7d 0a                 |{\"a\": \"b\"}.|\n0000000b\n```\n\n### Enxodings explained\n\nEach of the hexdumps shown are the equivalent of different files on disk, but\neach of those files encodes exactly the same information. The difference is\nhow the information was encoded using a character set that could *potentially*\nstore more information, but ultimately does not.\n\nIt is entirely possible to deserialize these files to exactly the same\nstructure in memory, at which point JSONID can begin to assert equivalence\nand output the underlying document type as well as other information.\n\nIf we want to understand the complexity required to write a PRONOM signature\nto account for different encodings, look at how the `00` bytes are\nplaced between examples. We have to be able to encode the bytes we know about,\nas well as the bytes that are output as part of the encoding. 
If we treat\nthe UTF-8 version of the file above as the primary object, but, say, we want\nto find all versions if they exist, we need to write four other signatures\n(at least, before we even talk about whitespace). This is difficult for\na human.\n\nI hope with JSONID we will soon be able to use the JSONID declarative\nlanguage to automatically output PRONOM-compatible signatures.\n\nThis means, in the example above, we can encode one ruleset in JSONID which\nwill work for all variants given to JSONID. We can then output five or more\nPRONOM-compatible signatures that can be given to the PRONOM registry\nto improve their capabilities in the future. That's five signatures for\nthe price of one and it gives JSONID a unique way of contributing back to\na core resource in digital preservation.\n\n## What does JSONID get you?\n\nTo begin, JSONID should identify JSON files on your system as JSON.\nThat's already a pretty good position to be in.\n\nThe ruleset should then allow you to identify a decent number of JSON objects,\nespecially those that have a well-defined structure. Examples we have in the\n[registry data][registry-htm-1] include things like ActivityPub streams,\nRO-CRATE metadata, IIIF API data and so on.\n\nIf the ruleset works for JSON we might be able to apply it to other formats\nthat can represent equivalent data structures in the future\nsuch as [YAML][yaml-spec] and [TOML][toml-spec].\n\n[yaml-spec]: https://yaml.org/\n[toml-spec]: https://toml.io/en/\n\n## Ruleset\n\nJSONID currently defines a small set of rules that help us to identify JSON\ndocuments.\n\nThe rules are described in their own data-structures. The structures are\nprocessed as a list (they need not necessarily be in order) and each must\nmatch for a given set of rules to determine what kind of JSON document we might\nbe looking at.\n\nJSONID can identify the existence of information but you can also use\nwildcards and provide some negation as required, e.g. 
to remove\nfalse-positives between similar JSON entities.\n\n| rule       | meaning                                               |\n|------------|-------------------------------------------------------|\n| INDEX      | index (from which to read when structure is an array) |\n| GOTO       | goto key (read key at given key)                      |\n| KEY        | key to read                                           |\n| CONTAINS   | value contains string                                 |\n| STARTSWITH | value startswith string                               |\n| ENDSWITH   | value endswith string                                 |\n| IS         | value matches exactly                                 |\n| REGEX      | value matches a regex pattern                         |\n| EXISTS     | key exists                                            |\n| NOEXIST    | key doesn't exist                                     |\n| ISTYPE     | key is a specific type (string, number, dict, array)  |\n\nStored in a list within a `RegistryEntry` object, they are then processed\nin order.\n\nFor example:\n\n```json\n    [\n        { \"KEY\": \"name\", \"IS\": \"value\" },\n        { \"KEY\": \"schema\", \"CONTAINS\": \"/schema/version/1.1/\" },\n        { \"KEY\": \"data\", \"IS\": { \"more\": \"data\" } }\n    ]\n```\n\nAll rules need to match for a positive ID.\n\n\u003e **NB.**: JSONID is a\nwork-in-progress and requires community input to help determine the grammar\nin its fullness and so there is a lot of opportunity to add to or remove from\nthese methods as its development continues. Additionally, help formalizing the\ngrammar/ruleset would be greatly appreciated 🙏.\n\n### Backed by tests\n\nThe ruleset has been developed using test-driven-development practices (TDD)\nand the current set of tests can be reviewed in the repository's\n[test folder][testing-1]. 
More tests should be added, in general, and over\ntime.\n\nRun `just coverage` to see JSONID's current level of test coverage.\n\n[testing-1]: https://github.com/ffdev-info/jsonid/tree/main/tests\n\n## Sample files\n\n### Integration files\n\nFiles used in the development of JSONID are available in their\n[own repository][integration-1].\n\n[integration-1]: https://github.com/ffdev-info/jsonid-integration-files\n\n### Fundamental examples\n\nThere is a small [samples directory][samples-1] included with this\nrepository demonstrating some fundamental differences in encoding and\nJSON types.\n\n[samples-1]: samples/\n\n## Registry\n\nA temporary \"registry\" module is used to store JSON markers.\nThe registry is a work in progress and must be exported and\nrewritten somewhere more centralized (and easier to manage) if JSONID can\nprove useful to the communities that might use it (*see notes on PRONOM below*).\n\nThe registry web-page is here:\n\n* [JSONID registry][registry-htm-1].\n\n[registry-htm-1]: https://ffdev-info.github.io/jsonid/registry/\n\nThe registry's source is here:\n\n* [Registry](https://github.com/ffdev-info/jsonid/blob/main/src/jsonid/registry_data.py).\n\n### Registry examples\n\n#### Identifying JSON-LD Generic\n\n```python\n    RegistryEntry(\n        identifier=\"id0009\",\n        name=[{\"@en\": \"JSON-LD (generic)\"}],\n        markers=[\n            {\"KEY\": \"@context\", \"EXISTS\": None},\n            {\"KEY\": \"id\", \"EXISTS\": None},\n        ],\n    ),\n```\n\n\u003e **Pseudo code**:\nTest for the existence of keys: `@context` and `id` in the primary JSON object.\n\n#### Identifying Tika Recursive Metadata\n\n```python\n    RegistryEntry(\n        identifier=\"id0024\",\n        name=[{\"@en\": \"tika recursive metadata\"}],\n        markers=[\n            {\"INDEX\": 0, \"KEY\": \"Content-Length\", \"EXISTS\": None},\n            {\"INDEX\": 0, \"KEY\": \"Content-Type\", \"EXISTS\": None},\n            {\"INDEX\": 0, \"KEY\": 
\"X-TIKA:Parsed-By\", \"EXISTS\": None},\n            {\"INDEX\": 0, \"KEY\": \"X-TIKA:parse_time_millis\", \"EXISTS\": None},\n        ],\n    ),\n```\n\n\u003e **Pseudo code**:\nTest for the existence of keys: `Content-Length`, `Content-Type`,\n`X-TIKA:Parsed-By` and `X-TIKA:parse_time_millis` in the `zeroth` (first)\nJSON object where the primary document is a list of JSON objects.\n\n#### Identifying SOPS encrypted secrets file\n\n```python\n    RegistryEntry(\n        identifier=\"id0012\",\n        name=[{\"@en\": \"sops encrypted secrets file\"}],\n        markers=[\n            {\"KEY\": \"sops\", \"EXISTS\": None},\n            {\"GOTO\": \"sops\", \"KEY\": \"kms\", \"EXISTS\": None},\n            {\"GOTO\": \"sops\", \"KEY\": \"pgp\", \"EXISTS\": None},\n        ],\n    ),\n```\n\n\u003e **Pseudo code**:\nTest for the existence of the key `sops` in the primary JSON object.\n\u003e\n\u003e Goto the `sops` key and test for the existence of keys: `kms` and `pgp`\nwithin the `sops` object/value.\n\n### Local rules\n\nThe plan is to allow local rules to be run alongside the global ruleset. I\nexpect this will be a bit further down the line when the ruleset and\nmetadata are more stable.\n\n## PRONOM\n\nIdeally JSON can generate evidence enough to warrant the creation of\nPRONOM IDs that can then be referenced in the JSONID output.\n\nEventually, PRONOM or a PRONOM-like tool might host an authoritative version\nof the JSONID registry.\n\n### JSONID for PRONOM Signature Development\n\nJSONID provides a high-level language for output of PRONOM compatible\nsignatures. The feature set is still in its BETA phase but JSONID provides\ntwo distinct capabilities:\n\n#### 1. Registry output\n\nJSONID's registry can be output using the `--pronom` flag. 
A signature file\nwill be created under `jsonid_pronom.xml` which can be imported into DROID\nfor identification of document types registered with JSONID.\n\nJSONID's registry is output alongside a handful of baseline JSON signatures\ndesigned to capture \"plain\"-JSON that is not yet encoded in the registry.\n\n#### 2. Signature development\n\nA standalone `json2pronom` utility is provided for creation of potentially\nrobust DROID compatible signatures.\n\nAs a high-level language, signatures can be defined in easy-to-understand\nsyntax and then output consistently via the `json2pronom` utility. Signatures\ninclude sensible defaults for whitespace and other aspects that are\ndifficult for signature developers to consistently anticipate when writing\nJSON based signatures. One particular benefit of using `json2pronom` is that\nit can automatically output JSON signatures using different\ncharacter encodings which use a lot of excess characters difficult for\nhumans to format correctly into a DROID compatible format.\n\nGiven a [sample pattern file](./pronom_example/patterns_example.json) a DROID\ncompatible snippet can be output as follows (UTF-8 shown for brevity):\n\n\u003c!--markdownlint-disable--\u003e\n\n```xml\n\u003c?xml version=\"1.0\" ?\u003e\n\u003cFFSignatureFile xmlns=\"http://www.nationalarchives.gov.uk/pronom/SignatureFile\" Version=\"1\" DateCreated=\"2026-01-04T16:14:16Z\"\u003e\n  \u003cInternalSignatureCollection\u003e\n    \u003cInternalSignature ID=\"1\" Specificity=\"Specific\"\u003e\n      \u003cByteSequence Reference=\"BOF\" Sequence=\"{0-4095}7B\" MinOffset=\"0\" MaxOffset=\"4095\"/\u003e\n      \u003cByteSequence Reference=\"VAR\" Sequence=\"226B65793122{0-16}3A\" MinOffset=\"\" MaxOffset=\"\"/\u003e\n      \u003cByteSequence Reference=\"VAR\" Sequence=\"226B65793222{0-16}3A\" MinOffset=\"\" MaxOffset=\"\"/\u003e\n      \u003cByteSequence Reference=\"EOF\" Sequence=\"7D{0-4095}\" MinOffset=\"0\" MaxOffset=\"4095\"/\u003e\n    
\u003c/InternalSignature\u003e\n  \u003c/InternalSignatureCollection\u003e\n  \u003cFileFormatCollection\u003e\n    \u003cFileFormat ID=\"1\" Name=\"JSONID2PRONOM Conversion (UTF-8)\" PUID=\"jsonid2pronom/1\" Version=\"\" MIMEType=\"application/json\" FormatType=\"structured text\"\u003e\n      \u003cInternalSignatureID\u003e1\u003c/InternalSignatureID\u003e\n      \u003cExtension\u003ejson\u003c/Extension\u003e\n    \u003c/FileFormat\u003e\n\u003c/FFSignatureFile\u003e\n```\n\n\u003c!--markdownlint-enable--\u003e\n\nFeedback on this utility is welcome. Hopefully we can build on this\napproach for other structured formats such as XML.\n\n## Output format\n\nPreviously JSONID output YAML containing all result object metadata. It has\nsince coalesced on a MIME based output approximating that of `$file --mime`.\nExceptions include the ability to output multiple IDs separated by `|` and\na count describing how many identifiers were returned.\n\n\u003e NB. it is still a goal of JSONID to avoid multiple IDs but serde formats\nare as flexible as they need to be and will not always behave well.\n\nAn example output looks as follows.\n\n\u003c!--markdownlint-disable--\u003e\n\n```text\nsamples/encoding/UTF-16LE-map.json\t[1]\tapplication/json; charset=UTF-16; doctype=\"JavaScript Object Notation (JSON)\"; ref=jrid:JSON\nsamples/encoding/UTF-32LE-list.json\t[1]\tapplication/json; charset=UTF-32; doctype=\"JavaScript Object Notation (JSON)\"; ref=jrid:JSON\n```\n\n```text\nintegration_files/json/mame/mame-hiscore-plugin.json    [1]\tapplication/json; charset=UTF-8; doctype=\"MAME Plugin (JSON)\"; ref=jrid:0073\nintegration_files/jsonl/asciicast/asciicast-v2.jsonl    [1]\tapplication/jsonl; charset=UTF-8; doctype=\"asciicast (asciinema.org) v2\"; ref=jrid:0079\n```\n\nYou can see a demonstration of multiple identification output in the\nintegration tests.\n\n```text\ntest_file.json  [2] application/json; charset=UTF-8; doctype=\"MULTI_ID_1\"; ref=jrid:0001 | application/json; 
charset=UTF-8; doctype=\"MULTI_ID_2\"; ref=jrid:0002\n```\n\n### Agent out\n\nThe `--agentout` arg makes JSONID output a full JSON snippet complete with\nversion information for integration into workflows in digital preservation\nsystems. Example output:\n\n```json\n{\n  \"path\": \"samples/encoding/UTF-16-map_whitespace.json\",\n  \"results\": [\n    \"application/json; charset=UTF-16; doctype=\\\"JavaScript Object Notation (JSON)\\\"; ref=jrid:JSON\"\n  ],\n  \"count\": 1,\n  \"agent\": \"jsonid/0.0.0 (ffdev-info)\"\n}\n```\n\n\u003c!--markdownlint-enable--\u003e\n\n## Lookup\n\nRegistry metadata is no longer output in results. As such, a `lookup` function\nis provided to return that information. See for example:\n\n### Core formats\n\n```text\npython jsonid.py core JSON\n```\n\n```yaml\nname:\n- '@en': JavaScript Object Notation (JSON)\nmime:\n- application/json\ndocumentation:\n- archive_team: http://fileformats.archiveteam.org/wiki/JSON\nidentifiers:\n- rfc: https://datatracker.ietf.org/doc/html/rfc8259\n- pronom: http://www.nationalarchives.gov.uk/PRONOM/fmt/817\n- loc: https://www.loc.gov/preservation/digital/formats/fdd/fdd000381.shtml\n- wikidata: https://www.wikidata.org/entity/Q2063\n```\n\n### Doctype formats\n\n```text\npython jsonid.py lookup jrid:0055\n```\n\n```yaml\nname:\n- '@en': Lottie vector graphics\nmime: []\ndescription:\n- '@en': a animated file format using JSON also known as Bodymovin JSON\ndocumentation:\n- archive_team: http://fileformats.archiveteam.org/wiki/Lottie\nidentifiers:\n- rfc: ''\n- pronom: ''\n- loc: ''\n- wikidata: http://www.wikidata.org/entity/Q98855048\n```\n\n## JSONL\n\n[JSONL][jsonl-1] aka JSON Lines is a format that requires some special\nhandling in the code, first to detect whether content is in an\n\"archive format\" (archive in computer science terms) or aggregate (in\nPRONOM terms); and then process the content reliably.\n\n### Handling JSONL\n\nJSONL will be treated as follows:\n\n1. 
If a file is identified as JSONL a JSONL identification will always be\nreturned. This will always be reliable.\n1. The first line of the JSONL file is treated as the authoritative object,\nthat is, all other lines are expected to conform to the same schema. If\nthe object can be matched against a ruleset the ID will be returned. If the\nobject cannot be matched against a ruleset then an identification of\nJSONL will be returned. All other lines are ignored.\n\n[jsonl-1]: https://jsonlines.org/\n\n## Analysis\n\nJSONID provides an analysis mechanism to help developers of identifiers. It\nmight also help users talk about interesting properties of the objects\nbeing analysed, and provide consistent fingerprinting for data that has\ndifferent byte-alignment but is otherwise identical.\n\n\u003e **NB.**: Comments on existing statistics or ideas for new ones are\nappreciated.\n\n### Example analysis\n\n```json\n{\n  \"content_length\": 329,\n  \"number_of_lines\": 32,\n  \"line_warning\": false,\n  \"top_level_keys_count\": 4,\n  \"top_level_keys\": [\n    \"key1\",\n    \"key2\",\n    \"key3\",\n    \"key4\"\n  ],\n  \"top_level_types\": [\n    \"list\",\n    \"map\",\n    \"list\",\n    \"list\"\n  ],\n  \"depth\": 8,\n  \"heterogeneous_list_types\": true,\n  \"fingerprint\": {\n    \"unf\": \"UNF:6:sAsKNmjOtnpJtXi3Q6jVrQ==\",\n    \"cid\": \"bafkreibho6naw5r7j23gxu6rzocrud4pc6fjsnteyjveirtnbs3uxemv2u\"\n  },\n  \"encoding\": \"UTF-8\"\n}\n```\n\n### JSONL technical metadata\n\nAnalysing JSONL should yield some useful information. Like many of the\nanalyses output by this tool this information is a work in progress and\ntime will tell if it's useful.\n\n#### Line length\n\nLine length might not be a useful output for JSONL as the specification\nitself determines JSONL files are very likely to have long lines. 
The\noutput is therefore disabled.\n\n#### Fingerprinting\n\nFingerprinting JSONL versus standard JSON is done by treating the\nJSONL file as a list of objects in memory. An important distinction to make is\nthat while this is technically correct, it's not structurally correct, i.e.\na JSONL file is not serialized as a list, nor need it be deserialized into\nmemory as a `list` object. That being said, using a list structure in JSONID\nas a small concession enabling fingerprinting makes it a convenient choice\nand I hope it will prove beneficial.\n\n#### Example JSONL analysis\n\nJSONL analysis output is, therefore, a little more sparse than the\nstandard JSONID output. An example, at present, looks as follows:\n\n```json\n{\n  \"number_of_lines\": 4,\n  \"fingerprint\": {\n    \"unf\": \"UNF:6:iBedoWLhyVzfXOM0OcXWBg==\",\n    \"cid\": \"bafkreigjgec7pbdao3ilk2pqe3tp3qg5bu426wyebnrelbm34ebhcbxs6q\"\n  },\n  \"doctype\": \"JSONL\",\n  \"encoding\": \"UTF-8\",\n  \"compression\": \"application/gzip\"\n}\n```\n\n## Utils\n\n### json2json\n\nUTF-16 can be difficult to read as UTF-16 uses at least two bytes per character, e.g.\n`..{.\".a.\".:. .\".b.\".}.` is simply `{\"a\": \"b\"}`. The utility `json2json.py`\nin the utils folder will output UTF-16 as UTF-8 so that signatures can be\nmore easily derived. 
A signature derived for UTF-16 looks exactly the same\nas UTF-8.\n\n`json2json` can be called from the command line when installed via pip, or\nfind it in [src.utils][json2json-1].\n\n[json2json-1]: src/utils/json2json.py\n\n## Docs\n\nDev docs are [available][dev-docs-1].\n\n[dev-docs-1]: https://ffdev-info.github.io/jsonid/jsonid/\n\n----\n\n## Developer install\n\n### pip\n\nSet up a virtual environment `venv` and install the local development\nrequirements as follows:\n\n```bash\npython3 -m venv venv\nsource venv/bin/activate\npython -m pip install -r requirements/local.txt\n```\n\n### tox\n\n#### Run tests (all)\n\n```bash\npython -m tox\n```\n\n#### Run tests-only\n\n```bash\npython -m tox -e py3\n```\n\n#### Run linting-only\n\n```bash\npython -m tox -e linting\n```\n\n\u003c!--markdownlint-disable--\u003e\n\n### pre-commit\n\nPre-commit can be used to provide more feedback before committing code. This\nreduces the number of commits you might want to make when working on\ncode. It's also an alternative to running tox manually.\n\nTo set up pre-commit, providing `pip install` has been run above:\n\n* `pre-commit install`\n\nThis repository contains a default number of pre-commit hooks, but there may\nbe others suited to different projects. A list of other pre-commit hooks can be\nfound [here][pre-commit-1].\n\n[pre-commit-1]: https://pre-commit.com/hooks.html\n\n## Packaging\n\nThe [`justfile`][just-1] contains helper functions for packaging and release.\nRun `just help` for more information.\n\n[just-1]: https://github.com/casey/just\n\n### pyproject.toml\n\nPackaging consumes the metadata in `pyproject.toml` which helps to describe\nthe project on the official [pypi.org][pypi-2] repository. Have a look at the\ndocumentation and comments there to help you create a suitably descriptive\nmetadata file.\n\n### Versioning\n\nVersioning in Python can be hit and miss. 
You can label versions for\nyourself, but to make it reliable, as well as meaningful, it should be\ncontrolled by your source control system. We assume git, and versions can\nbe created by tagging your work and pushing the tag to your git repository,\ne.g. to create a release candidate for version 1.0.0:\n\n```sh\ngit tag -a 1.0.0-rc.1 -m \"release candidate for 1.0.0\"\ngit push origin 1.0.0-rc.1\n```\n\nWhen you build, a package will be created with the correct version:\n\n```sh\njust package-source\n### build process here ###\nSuccessfully built python_repo_jsonid-1.0.0rc1.tar.gz and python_repo_jsonid-1.0.0rc1-py3-none-any.whl\n```\n\n### Local packaging\n\nTo create a Python wheel for testing locally, or distributing to colleagues\nrun:\n\n* `just package-source`\n\nA `tar` and `whl` file will be stored in a `dist/` directory. The `whl` file\ncan be installed as follows:\n\n* `pip install \u003cyour-package\u003e.whl`\n\n### Publishing\n\nPublishing for public use can be achieved with:\n\n* `just package-upload-test` or `just package-upload`\n\n`just package-upload-test` will upload the package to [test.pypi.org][pypi-1]\nwhich provides a way to look at package metadata and documentation and ensure\nthat it is correct before uploading to the official [pypi.org][pypi-2]\nrepository using `just package-upload`.\n\n\u003c!--markdownlint-enable--\u003e\n\n[pypi-1]: https://test.pypi.org\n[pypi-2]: https://pypi.org\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fffdev-info%2Fjsonid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fffdev-info%2Fjsonid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fffdev-info%2Fjsonid/lists"}