https://github.com/rubensworks/rdfa-streaming-parser.js
A fast and lightweight streaming RDFa parser for JavaScript
- Host: GitHub
- URL: https://github.com/rubensworks/rdfa-streaming-parser.js
- Owner: rubensworks
- License: mit
- Created: 2019-06-01T12:51:09.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-09-20T12:35:29.000Z (3 months ago)
- Last Synced: 2024-12-19T02:38:02.093Z (3 days ago)
- Topics: hacktoberfest, linked-data, parser, rdf, rdfa, rdfjs, streaming
- Language: TypeScript
- Homepage:
- Size: 562 KB
- Stars: 20
- Watchers: 5
- Forks: 5
- Open Issues: 16
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
# RDFa Streaming Parser
[![Build status](https://github.com/rubensworks/rdfa-streaming-parser.js/workflows/CI/badge.svg)](https://github.com/rubensworks/rdfa-streaming-parser.js/actions?query=workflow%3ACI)
[![Coverage Status](https://coveralls.io/repos/github/rubensworks/rdfa-streaming-parser.js/badge.svg?branch=master)](https://coveralls.io/github/rubensworks/rdfa-streaming-parser.js?branch=master)
[![npm version](https://badge.fury.io/js/rdfa-streaming-parser.svg)](https://www.npmjs.com/package/rdfa-streaming-parser)

A [fast](https://gist.github.com/rubensworks/9eaaee548f647be15e98dea2b7d27586) and lightweight _streaming_ and 100% _spec-compliant_ [RDFa 1.1](https://rdfa.info/) parser,
with [RDFJS](https://github.com/rdfjs/representation-task-force/) representations of RDF terms, quads and triples.

The streaming nature allows triples to be emitted _as soon as possible_, and documents _larger than memory_ to be parsed.

## Installation
```bash
$ npm install rdfa-streaming-parser
```

or

```bash
$ yarn add rdfa-streaming-parser
```

This package also works out-of-the-box in browsers via tools such as [webpack](https://webpack.js.org/) and [browserify](http://browserify.org/).

## Require
```javascript
import {RdfaParser} from "rdfa-streaming-parser";
```

_or_

```javascript
const RdfaParser = require("rdfa-streaming-parser").RdfaParser;
```

## Usage

`RdfaParser` is a Node [Transform stream](https://nodejs.org/api/stream.html#stream_class_stream_transform)
that takes in chunks of RDFa data,
and outputs [RDFJS](http://rdf.js.org/)-compliant quads.

It can be used to [`pipe`](https://nodejs.org/api/stream.html#stream_readable_pipe_destination_options) streams to,
or you can write strings into the parser directly.

While not required, it is advised to specify the [profile](#profiles) of the parser
by supplying a `contentType` or `profile` constructor option.

### Print all parsed triples from a file to the console

```javascript
const myParser = new RdfaParser({ baseIRI: 'https://www.rubensworks.net/', contentType: 'text/html' });

fs.createReadStream('index.html')
  .pipe(myParser)
  .on('data', console.log)
  .on('error', console.error)
  .on('end', () => console.log('All triples were parsed!'));
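
// A possible variation (a sketch, not part of the library): print each quad as a
// readable line instead of the raw object. `formatQuad` is a hypothetical helper
// that assumes only the RDFJS Quad interface (terms exposing a `value` string).
function formatQuad(quad) {
  return `${quad.subject.value} ${quad.predicate.value} ${quad.object.value}`;
}
// To use it, pass `(quad) => console.log(formatQuad(quad))` as the 'data' listener.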
```

### Manually write strings to the parser

```javascript
const myParser = new RdfaParser({ baseIRI: 'https://www.rubensworks.net/', contentType: 'text/html' });

myParser
  .on('data', console.log)
  .on('error', console.error)
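  .on('end', () => console.log('All triples were parsed!'));

// Illustrative markup: any RDFa-annotated document can be written in arbitrary chunks.
myParser.write('<?xml version="1.0"?>');
myParser.write(`<!DOCTYPE html><html><head prefix="foaf: http://xmlns.com/foaf/0.1/">`);
myParser.write(`<link rel="foaf:primaryTopic foaf:maker" href="https://www.rubensworks.net/#me" />`);
myParser.write(`</head>`);
myParser.write(`<body>`);
myParser.write(`</body>`);
myParser.write(`</html>`);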
myParser.end();
```

### Import streams

This parser implements the RDFJS [Sink interface](https://rdf.js.org/#sink-interface),
which makes it possible to alternatively parse streams using the `import` method.

```javascript
const myParser = new RdfaParser({ baseIRI: 'https://www.rubensworks.net/', contentType: 'text/html' });

const myTextStream = fs.createReadStream('index.html');
myParser.import(myTextStream)
  .on('data', console.log)
  .on('error', console.error)
  .on('end', () => console.log('All triples were parsed!'));
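
// A possible helper (a sketch, not part of the library): gather every emitted quad
// into an array. `collectQuads` is a hypothetical name; it accepts any object-mode
// Readable, such as the stream returned by `myParser.import(...)`.
// Buffering gives up the streaming benefit, so this only suits small documents.
function collectQuads(quadStream) {
  return new Promise((resolve, reject) => {
    const quads = [];
    quadStream
      .on('data', (quad) => quads.push(quad))
      .on('error', reject)
      .on('end', () => resolve(quads));
  });
}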
```

## Configuration

Optionally, the following parameters can be set in the `RdfaParser` constructor:
* `dataFactory`: A custom [RDFJS DataFactory](http://rdf.js.org/#datafactory-interface) to construct terms and triples. _(Default: `require('@rdfjs/data-model')`)_
* `baseIRI`: An initial default base IRI. _(Default: `''`)_
* `language`: A default language for string literals. _(Default: `''`)_
* `vocab`: The initial vocabulary. _(Default: `''`)_
* `defaultGraph`: The default graph for constructing [quads](http://rdf.js.org/#dom-datafactory-quad). _(Default: `defaultGraph()`)_
* `features`: A hash of features that should be enabled. Defaults to the features defined by the profile. _(Default: all features enabled)_
* `profile`: The [RDFa profile](#profiles) to use. _(Default: profile with all features enabled)_
* `contentType`: The content type of the document that should be parsed. This can be used as an alternative to the 'profile' option. _(Default: profile with all features enabled)_
* `htmlParseListener`: An optional listener for the internal HTML parse events, should implement [`IHtmlParseListener`](https://github.com/rubensworks/rdfa-streaming-parser.js/blob/master/lib/IHtmlParseListener.ts) _(Default: `null`)_

```javascript
new RdfaParser({
  dataFactory: require('@rdfjs/data-model'),
  baseIRI: 'http://example.org/',
  language: 'en-us',
  vocab: 'http://example.org/myvocab',
  defaultGraph: namedNode('http://example.org/graph'),
  features: { langAttribute: true },
  profile: 'html',
  htmlParseListener: new MyHtmlListener(),
});
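
// Alternatively (a sketch): let the content type select the profile instead of
// naming it through `profile`. Per the Profiles section, 'application/xhtml+xml'
// corresponds to the 'xhtml' profile.
new RdfaParser({
  baseIRI: 'http://example.org/',
  contentType: 'application/xhtml+xml',
});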
```

### Profiles

On top of [RDFa Core 1.1](https://www.w3.org/TR/rdfa-core/), there are a few RDFa variants that add specific sets of rules,
which are all supported in this library:

* [HTML+RDFa 1.1](https://www.w3.org/TR/rdfa-in-html/): Internally identified as the `'html'` profile with `'text/html'` as content type.
* [XHTML+RDFa 1.1](https://www.w3.org/TR/xhtml-rdfa/): Internally identified as the `'xhtml'` profile with `'application/xhtml+xml'` as content type.
* [SVG Tiny 1.2](https://www.w3.org/TR/2008/REC-SVGTiny12-20081222/metadata.html#MetadataAttributes): Internally identified as the `'xml'` profile with `'application/xml'`, `'text/xml'` and `'image/svg+xml'` as content types.

This library offers three different ways to define the RDFa profile or set features:

* **Content type**: Passing a content type such as `'text/html'` to the `contentType` option in the constructor.
* **Profile string**: Passing `''`, `'core'`, `'html'`, `'xhtml'` or `'xml'` to the `profile` option in the constructor.
* **Features object**: A custom combination of features can be defined by passing a `features` option in the constructor.

The table below lists all possible RDFa features and in what profile they are available:

| Feature | Core | HTML | XHTML | XML | Description |
| -------------------------------- | ---- |----- | ----- | --- | ----------- |
| `baseTag` | | ✓ | ✓ | | If the baseIRI can be set via the `<base>` tag. |
| `xmlBase` | | | | ✓ | If the baseIRI can be set via the `xml:base` attribute. |
| `langAttribute` | | ✓ | ✓ | ✓ | If the language can be set via the language attribute. |
| `onlyAllowUriRelRevIfProperty` | ✓ | ✓ | ✓ | | If non-CURIE and non-URI rel and rev have to be ignored if property is present. |
| `inheritSubjectInHeadBody` | | ✓ | ✓ | | If the new subject can be inherited from the parent object if we're inside `<head>` or `<body>` if the resource defines no new subject. |
| `datetimeAttribute` | | ✓ | ✓ | ✓ | If the `datetime` attribute must be interpreted as datetimes. |
| `timeTag` | | ✓ | ✓ | ✓ | If the `<time>` tag contents should be interpreted as datetimes. |

## How it works

This tool makes use of the highly performant [htmlparser2](https://www.npmjs.com/package/htmlparser2) library for parsing HTML in a streaming way.
It listens to tag-events, and maintains the required tag metadata in a [stack-based datastructure](https://www.rubensworks.net/blog/2019/03/13/streaming-rdf-parsers/),
which can then be emitted as triples as soon as possible.

Our algorithm closely resembles the [suggested processing sequence](https://www.w3.org/TR/rdfa-core/#s_sequence),
with a few minor changes to make it work in a streaming way.

If you want to make use of a different HTML/XML parser,
you can create a regular instance of `RdfaParser`,
and just call the following methods yourself directly:

* `onTagOpen(name: string, attributes: {[s: string]: string})`
* `onText(data: string)`
* `onTagClose()`

## Specification Compliance

This parser passes all tests from the [RDFa 1.1 test suite](http://rdfa.info/dev).
More specifically, the following manifests are explicitly tested:

* HTML+RDFa 1.1 (HTML4)
* HTML+RDFa 1.1 (HTML5)
* HTML+RDFa 1.1 (XHTML5)
* SVGTiny+RDFa 1.1
* XHTML+RDFa 1.1
* XML+RDFa 1.1

The following _optional_ features for RDFa processors are supported:

* [Processing the `@role` attribute.](https://www.w3.org/TR/role-attribute/#using-role-in-conjunction-with-rdfa)

The following _optional_ features for RDFa processors are _not_ supported (yet):

* [Emitting the Processor Status as triples.](https://www.w3.org/TR/rdfa-core/#processor-status)
* [Performing vocabulary expansion based on an OWL subset.](https://www.w3.org/TR/rdfa-core/#s_vocab_expansion)

## License

This software is written by [Ruben Taelman](http://rubensworks.net/).

This code is released under the [MIT license](http://opensource.org/licenses/MIT).