{"id":27642016,"url":"https://github.com/bathos/hardcore-xml","last_synced_at":"2025-04-23T23:52:11.243Z","repository":{"id":34309810,"uuid":"38226713","full_name":"bathos/hardcore-xml","owner":"bathos","description":"hardcore XML parser, builder \u0026 transformer for node","archived":false,"fork":false,"pushed_at":"2017-02-15T12:13:50.000Z","size":559,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-23T23:52:05.430Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bathos.png","metadata":{"files":{"readme":"README.MD","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-29T03:54:22.000Z","updated_at":"2023-03-20T14:29:42.000Z","dependencies_parsed_at":"2022-09-14T03:40:58.197Z","dependency_job_id":null,"html_url":"https://github.com/bathos/hardcore-xml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bathos%2Fhardcore-xml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bathos%2Fhardcore-xml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bathos%2Fhardcore-xml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bathos%2Fhardcore-xml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bathos","download_url":"https://codeload.github.com/bathos/hardcore-xml/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_coun
t":250535081,"owners_count":21446506,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-23T23:52:10.774Z","updated_at":"2025-04-23T23:52:11.225Z","avatar_url":"https://github.com/bathos.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/bathos/hardcore-xml.svg)](https://travis-ci.org/bathos/hardcore-xml)\n[![Coverage Status](https://coveralls.io/repos/github/bathos/hardcore-xml/badge.svg?branch=master)](https://coveralls.io/github/bathos/hardcore-xml?branch=master)\n[![npm version](https://badge.fury.io/js/hardcore.svg)](https://badge.fury.io/js/hardcore)\n\n# hardcore xml\n\nA validating xml parser / processor / builder thing.\n\n\n\u003c!-- MarkdownTOC autolink=true bracket=round depth=4 --\u003e\n\n- [wat](#wat)\n  - [why](#why)\n  - [how hardcore](#how-hardcore)\n  - [stuff to know](#stuff-to-know)\n- [i want to use this](#i-want-to-use-this)\n  - [i have a string make it be xml](#i-have-a-string-make-it-be-xml)\n  - [i have a stream that all menafiefiewfheiniupu9](#i-have-a-stream-that-all-menafiefiewfheiniupu9)\n- [options](#options)\n  - [`opts.dereference`](#optsdereference)\n  - [`opts.encoding`](#optsencoding)\n  - [`opts.maxExpansionCount` / `opts.maxExpansionSize`](#optsmaxexpansioncount--optsmaxexpansionsize)\n  - [`opts.path`](#optspath)\n- [ast nodes](#ast-nodes)\n\n\u003c!-- /MarkdownTOC --\u003e\n\n## wat\n\nIn the XML spec, two kinds of processors are defined: validating and\nnon-validating. 
A validating processor can handle external references and is\nguaranteed to apply all effects of markup declarations found in the internal DTD\nor introduced via external entities — some of which may influence the actual\noutput in addition to defining validity constraints.\n\n### why\n\nI don’t have a particularly good answer.\n\nAFAIK, there is no existing validating XML processor available for node. In\nfact there maybe isn’t any XML processor available for node which fits the\nformal definition of even a non-validating parser. Most are forgiving of illegal\nsequences even in \"strict mode\".\n\nBut why do you want it to be strict? And who actually _uses_ DTDs?\n\n1. You don’t\n2. Nobody\n\nI’ve spent a lot of time working with XML, and I enjoy projects like this ...\nbut yeah, you probably don’t need these features.\n\nThat said, I think it’s a pretty good XML library! Even if you don’t need DTDs\nand aren’t worried about the sanctimony of formal XML spec compliance, the AST\nit generates has an API for manipulation, querying, reserializing, and\ntransformation that may prove useful to you.\n\n### how hardcore\n\nI’ve tried to follow the XML 1.0 specification closely. There’s a whole lot more\nin there than I think most people realize. It’s a monster if you’re going\nwhole-hog. Realistically, it’s unlikely that I got every last detail correct. 
In\ntheory it can do the following though:\n\n- Decoding\n - Handles a wide range of encodings out of the box\n - Honors the encoding sniffing algorithm outlined in the spec\n - Honors the encoding specified by an xml/text declaration\n - Permits user-override of encoding with out-of-band info\n - Normalizes newlines according to spec rules\n- Parsing\n - Applies well-formedness constraints\n - Outputs errors with informative messages and context\n - Applies validity constraints at the earliest possible time\n - Normalizes attribute whitespace according to spec rules\n - Normalizes tokenized attribute whitespace according to spec rules\n - Provides default attribute values when defined by a markup declaration\n - Recognizes the difference between whitespace CDATA and non-semantic\n   whitespace in elements with declared content specifications\n - Resolves external parsed entities, including external DTD subsets\n - Expands entities, including both parameter and general entities\n - Places upper limits on entity expansion to avoid Billion Laughs attacks\n- AST\n - Provides simplified DOM tree\n - Nodes inherit from Array. Proxy lets us maintain knowledge of lineage.\n - Nodes are entirely mutable\n - Everything is ‘live’ — for example, alterations to an element declaration\n   node will affect subsequent behavior of instances of that element.\n - `node.validate()` may be called at any time to confirm validity again.\n - `node.serialize()` lets you output XML as a string again.\n - `node.findDeep()`, `node.filterDeep()` allow querying in plain JS.\n - `node.prevSibling`, `node.parent`, etc. help you navigate the tree.\n - `node.toJSON()` returns a POJO representation of the tree.\n - You can also create new documents using the AST node constructors even\n   without processing a document.\n\n### stuff to know\n\nThere is a caveat about reserialization to XML. XML is, in a sense, a lossy\nformat. 
Once entities have been expanded, especially in a context where\nsubsequent manipulation of the AST is permitted, it’s tough to say how one would\nturn them back into references safely. When you made some edit, did it unlink\nthe reference, or did it alter the definition of the entity replacement text\nitself? Likewise we do not retain knowledge of which attribute values were\nsupplied as defaults and which were not, and we do not retain knowledge of the\npre-normalized linebreaks or pre-normalized attribute value text.\n\nOne thing which is not supported is _not_ validating. Well, more specifically,\nit does not support ignoring a doctype declaration and its consequences. You\nactually can do non-validation simply by not having a doctype declaration in a\ndocument; in such a case, undeclared elements and attributes are inherently\nvalid, all being content type \"ANY\" and attribute type \"CDATA\" implicitly.\n\nNote that the existence of a doctype declaration immediately makes any element\nor attribute which was not declared an error. Thus a document with a DTD\nlike `\u003c!DOCTYPE foo\u003e` (which is technically grammatically valid) nonetheless\nwill fail validation by definition, as the element `foo` is not declared.\n\nIn the future I might introduce granular customization of validation behavior\nsuch that users may enable or disable the application of select constraints. I’m\nnot sure yet if this corresponds to important use cases so I have not attempted\nit yet.\n\nEven if that does get introduced, it should be noted that hardcore cannot be\nused to parse HTML. XHTML would be fine, but HTML is not a form of XML. HTML has\ndifferent parsing rules and a different set of possible AST nodes from XML.\nDespite appearances, it’s not just a matter of degrees of ‘strictness’. 
The\nexistence of super XML-looking things like `\u003c!DOCTYPE` in HTML is just a vestige\nof the language’s heritage and these constructs do not mean the same things,\nfollow the same grammar, or have the same effects.\n\nHowever, MathML and SVG, both of which can appear within HTML documents, do\nfollow the XML grammar (and they have DTDs as well). Both are suitable for\nprocessing with hardcore.\n\n## i want to use this\n\nokay, first I should mention it is node 7+ only (maybe 6?) cause I am an\nasshole.\n\nthe module exposes several objects.\n\n- `hardcore.ast`: namespace object with AST node constructors\n- `hardcore.parse`: convenience method wrapping `Processor`\n- `hardcore.Processor`: main meats\n- `hardcore.Decoder`: base class underlying `Processor`\n\n### i have a string make it be xml\n\n```\nimport hardcore from 'hardcore';\n\nhardcore\n  .parse(myString, opts)\n  .then(ast =\u003e ...)\n  .catch(err =\u003e ...);\n```\n\nIn this form, the first argument may be a string, a `Buffer`, or a `Readable`\nstream. We’ll come back to what the options are.\n\n### i have a stream that all menafiefiewfheiniupu9\n\nIf your input is a stream, you might prefer to use `Processor` directly since it\nalways feels really good when you get to type `pipe`:\n\n```\nimport fs from 'fs';\nimport hardcore from 'hardcore';\n\nconst processor = new hardcore.Processor(opts);\n\nprocessor.on('ast', ast =\u003e ...);\nprocessor.on('error', err =\u003e ...);\n\nfs.createReadStream('poop.svg').pipe(processor);\n```\n\nThe incoming stream chunks must be buffers, not strings.\n\n`Processor` is a `Writable` stream. It’s also _actually_ a writable stream, by\nwhich I mean it processes incoming data as it becomes available. 
I mention this\nbecause I’ve seen other parsers that implement the stream interface but actually\njust accrete the sum of all data before beginning the parsing itself.\n\nI haven’t benchmarked hardcore or anything and don’t actually care to, but it’s\nadmittedly probably considerably heavier than alternatives — I favor code I can\nstill read later over optimization, or at least that’s what I tell myself. That\nsaid, you could say that the ability to process chunks asynchronously makes\nother optimizations unnecessary: you can always just adjust the incoming flow if\nyou have to worry about eight megabyte xml documents or something.\n\n## options\n\nThese are the options for both `new Processor(opts)` and `parse(thing, opts)`.\n\n### `opts.dereference`\n\nThis is optional only if the document is known to contain no references to\nexternal entities. Otherwise, you must provide it.\n\nThe value should be a function like this:\n\n```\nopts.dereference = ({ name, path, pathEncoded, publicID, systemID, type }) =\u003e {\n  /* ... */\n  return {\n    encoding: optionalEncodingString,\n    entity: stringOrBufferOrStreamOrPromiseThatResolvesToStringOrBufferOrStream\n  };\n};\n```\n\nThe `type` will either be `'DTD'` or `'ENTITY'` (technically both are entities\nin XML parlance, but here we mean the kind declared with `\u003c!ENTITY...`).\n\nDepending on the specific declaration that defined the external reference,\n`publicID` may not be defined. The other members should always be defined.\n\n\u003e Notation declarations are the one kind of external reference that may lack a\n\u003e system ID, but notations are not entities and do not get dereferenced during\n\u003e processing.\n\nDespite the amount of data provided, most likely you will only be interested in\neither `path` or `pathEncoded` (which is just a url-encoded version of the\nformer). 
Before I explain why, first some background:\n\nIn practice (and quite confusingly), `systemID` is usually a public http URL\nwhile `publicID` is one of those weird theoretical \"-//\" strings that appears in\nstandards specifications a lot but nobody ever actually uses. So `systemID` is\nthe one you want.\n\nBut the system ID (i.e., the url) might be relative to the referencing context\n(which might itself be an external entity), so you need to know that context.\nThat’s why `path` is provided — it is the resolved path of the ‘requesting\ncontext’ plus the systemID, and normally that means it will end up being an\nabsolute URL.\n\nYou may wonder: why not just have hardcore fetch these resources automatically?\n\nThere are a few reasons. First is security, I guess. I mean it’s just kind of\nweird to have a parser making random HTTP requests behind the scenes, it’s not\nsomething you do.\n\nBut also, in practice, you usually know in advance what external resources you\nneed. Rather than fetch them over the wire (again, it would be weird to have a\nparser error caused by network traffic conditions), you likely will have made\nthem locally available as part of your application. Or even if you do decide to\nfetch them online, you might want to keep a whitelist of permitted hosts or\ncache the responses. So the particular implementation is in your hands.\n\nI would expect that in the majority of cases you would just do something like\n\n```\nopts.dereference = ({ path }) =\u003e {\n  const filePath = myMapOfEntities[path];\n  return { entity: fs.createReadStream(filePath) };\n};\n```\n\nNote that a general or parameter entity is not dereferenced until the first time\nit is actually referenced, and it will only be requested that one time.\n\nIt is permitted to specify encoding here since it can technically vary by file\nand might come from out-of-band info like HTTP headers. 
By default, if an\nexplicit encoding was declared for the document, that will propagate to entity\nexpansions.\n\n### `opts.encoding`\n\nYou don’t normally need to provide this, but you can supply the input encoding\nas out-of-band information, for example when it comes to you via an HTTP header.\nNote that, if there is an xml or text declaration that declares the encoding in\nthe file itself, it cannot contradict this.\n\nWhen this is not provided and there is no inline encoding declaration, hardcore\nwill still be able to recognize UTF8, UTF16, or UTF32; and if the opening\ncharacter is \"\u003c\", also UTF16le/be and UTF32le/be.\n\nSupported encodings include those above, plus common one-byte encodings and\nSHIFT-JIS. If you need something else, you’ll have to decode it to one of these\nwith iconv-lite or similar before piping it in.\n\n### `opts.maxExpansionCount` / `opts.maxExpansionSize`\n\nThese default to 10000 and 20000 respectively. These options exist to prevent\nentity expansion attacks. If you want to disable them, you can set them to\n`Infinity`. The first refers to the total number of entity expansions which may\noccur during the processing of a single entity; the latter refers to the total\nnumber of codepoints (not bytes) that may be the replacement text of a single\nentity which is referenced.\n\n### `opts.path`\n\nThis optional parameter lets you specify the base URI for a document such that\nexternal entities with relative URLs as their system IDs can be understood.\n\nI think it is atypical to need to provide this, since more often relative urls\nare used only for entities which are components of an external DTD. 
References\nto the external DTDs from documents, in contrast, are typically made using\nabsolute paths, making the document’s location irrelevant.\n\n## ast nodes\n\nSee [AST Nodes](ast.md) for details on the individual node types.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbathos%2Fhardcore-xml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbathos%2Fhardcore-xml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbathos%2Fhardcore-xml/lists"}