https://github.com/dataoneorg/sonormal
Schema.org extraction and normalization web service
- Host: GitHub
- URL: https://github.com/dataoneorg/sonormal
- Owner: DataONEorg
- License: mit
- Created: 2020-11-17T14:51:54.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-09-28T20:56:02.000Z (over 2 years ago)
- Last Synced: 2025-01-30T21:17:10.872Z (over 1 year ago)
- Language: Python
- Size: 742 KB
- Stars: 0
- Watchers: 10
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# sonormal
`sonormal` is a Python library that assists with extracting and processing schema.org content, with emphasis on the [`Dataset`](https://schema.org/Dataset) class.
It includes a command line tool, `jld`, for retrieving and extracting JSON-LD from a web page or other resource and for performing various operations on JSON-LD.
The library and tool are focused on supporting Schema.org harvesting for the DataONE infrastructure.
## Operation
```
Usage: jld [OPTIONS] COMMAND [ARGS]...

  Retrieve and process JSON-LD.

Options:
  -b, --base TEXT             Base URI
  -p, --profile TEXT          JSON-LD Profile
  -P, --request-profile TEXT  JSON-LD Request Profile
  -r, --response              Show response information
  -R, --relaxed-json          Relax strict JSON deserialization
  -W, --webpage               Render SPA page
  --soprod                    Use schema.org production context instead of v12 https
  --help                      Show this message and exit.

Commands:
  cache        Cache management, list or purge
  canon        Normalize and render canonical form
  compact      Compact the JSON-LD SOURCE
  frame        Apply frame to source
  get          Retrieve JSON-LD
  identifiers  Extract Dataset identifiers
  nquads       Transform JSON-LD to N-Quads
  play         Load in JSON-LD Playground
```
`cache` lists entries in the local cache (in folder `~/.local/sonormal/cache`) and optionally purges entries.
`canon` canonicalizes the source JSON-LD by expanding it and applying the URDNA2015 algorithm, then serializes the result with ordered terms, no new lines, and no spaces between delimiters. Checksums computed on the result are therefore consistent across different arrangements of the same input source.
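The final serialization step can be illustrated with the standard library alone. The sketch below is not how `sonormal` implements `canon`: it skips expansion and URDNA2015 entirely, so blank node labelling is not handled, and the function names are hypothetical. It only shows why sorted keys and compact separators yield a stable checksum for documents that differ in key order.

```python
import hashlib
import json


def serialize_compact(doc):
    """Serialize with sorted keys, no new lines, and no spaces between delimiters."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def sha256_of(doc):
    """Hex SHA-256 of the compact, key-sorted serialization (UTF-8 encoded)."""
    return hashlib.sha256(serialize_compact(doc).encode("utf-8")).hexdigest()


# Two arrangements of the same content: only the key order differs.
a = {"@type": "Dataset", "name": "Example", "identifier": "uuid:1234"}
b = {"identifier": "uuid:1234", "name": "Example", "@type": "Dataset"}

print(serialize_compact(a))
print(sha256_of(a) == sha256_of(b))  # True: key order no longer matters
```

Key sorting alone does not make two JSON-LD documents equivalent when they use different contexts or blank node labels, which is the part the expansion and URDNA2015 steps address.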
`compact` applies the JSON-LD compaction algorithm to the source using the context:
```
{"@context": [
    "https://schema.org/",
    {
      "id": "id",
      "type": "type"
    }
  ]
}
```
`frame` applies the JSON-LD framing algorithm to structure the JSON-LD for ease of identifier extraction from a `Dataset` instance using the frame:
```
{
  "@context": {"@vocab":"https://schema.org/"},
  "@type": "Dataset",
  "identifier": {},
  "creator": {}
}
```
`get` retrieves the document from a file or URL, following redirects and Link headers as appropriate. Content is extracted from HTML pages and, when the `-W` flag is set, from single-page applications where the JSON-LD may be generated on the fly.
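To make the HTML extraction step concrete, here is a minimal standard-library sketch that pulls JSON-LD out of `<script type="application/ld+json">` elements. It is an illustration only, not the `sonormal` implementation: it does no fetching, redirect or Link header handling, and no client-side rendering, and the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            if script_type == "application/ld+json":
                self._in_jsonld = True
                self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._in_jsonld = False


def extract_jsonld(html_text):
    """Return the parsed JSON-LD documents embedded in an HTML string."""
    parser = JsonLdScriptParser()
    parser.feed(html_text)
    parser.close()
    documents = []
    for block in parser.blocks:
        try:
            documents.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # skip blocks that are not strict JSON
    return documents


sample = """<html><head>
<script type="application/ld+json">
{"@context": {"@vocab": "https://schema.org/"}, "@type": "Dataset", "name": "Example"}
</script>
</head><body></body></html>"""
print(extract_jsonld(sample))
```

Blocks that fail strict JSON parsing are silently skipped in this sketch; the `-R` option of `jld` exists to relax strict deserialization in such cases.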
`identifiers` extracts `Dataset` identifier values and computes checksums of the JSON-LD.
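As a rough illustration of what identifier extraction involves, the sketch below collects `@id`, `url`, and `identifier` values from `Dataset` nodes. It is not the `sonormal` implementation, which works on framed JSON-LD. The sketch assumes the input is already compacted against the schema.org vocabulary, looks only at top-level nodes, and uses hypothetical function names.

```python
def as_list(value):
    """Wrap a single value in a list; pass lists through; map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def identifier_values(value):
    """Return identifier strings from plain strings or PropertyValue-like dicts."""
    results = []
    for item in as_list(value):
        if isinstance(item, str):
            results.append(item)
        elif isinstance(item, dict):
            # PropertyValue-style identifier: use its "value" when it is a string
            inner = item.get("value")
            if isinstance(inner, str):
                results.append(inner)
    return results


def dataset_identifiers(doc):
    """Collect @id, url, and identifier values from top-level Dataset nodes."""
    found = []
    for node in as_list(doc):
        if not isinstance(node, dict):
            continue
        if "Dataset" not in as_list(node.get("@type")):
            continue
        found.append({
            "@id": [v for v in as_list(node.get("@id")) if isinstance(v, str)],
            "url": [v for v in as_list(node.get("url")) if isinstance(v, str)],
            "identifier": identifier_values(node.get("identifier")),
        })
    return found


example = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "@id": "https://example.org/dataset/1",
    "url": "https://example.org/dataset/1",
    "identifier": {"@type": "PropertyValue", "propertyId": "UUID", "value": "uuid:1234"},
}
print(dataset_identifiers(example))
```

Framing the document first, as `jld frame` does, would bring nested or `@graph`-wrapped `Dataset` nodes into a predictable shape before a walk like this one.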
`nquads` serializes the JSON-LD to N-Quads format.
## Examples
Download and extract JSON-LD from [Hydroshare](https://www.hydroshare.org/):
```
jld get "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257/"
{
  "@context": {
    "@vocab": "https://schema.org/",
    "datacite": "http://purl.org/spar/datacite/"
  },
  "@id": "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257",
  "url": "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257",
  "@type": "Dataset",
  "additionalType": "http://www.hydroshare.org/terms/CompositeResource",
  ...
```
Download and extract JSON-LD from a DataONE single-page application (with JSON-LD rendered by the client):
```
jld -W get "https://search.dataone.org/view/urn%3Auuid%3Add9ad874-ded8-48fe-908a-06732b9a6297"
[
  {
    "@context": {
      "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/urn%3Auuid%3Add9ad874-ded8-48fe-908a-06732b9a6297",
    "datePublished": "2013-10-23T00:00:00Z",
    "publisher": {
      "@type": "Organization",
      "name": "California Ocean Protection Council Data Repository"
    },
    "identifier": "urn:uuid:dd9ad874-ded8-48fe-908a-06732b9a6297",
    ...
```
Processing operations can take stdin as input. For example, normalize JSON-LD using the URDNA2015 algorithm, which assigns ids to blank nodes. The source is expanded and canonicalized, and the output is serialized with no new lines and no spaces between delimiters in preparation for calculating checksums.
```
jld get "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257/" | jld canon
[{"@id":"_:c14n0","@type":["http://purl.org/spar/datacite/ResourceIdentifier","https://schema.org/PropertyValue"],
"http://purl.org/spar/datacite/usesIdentifierScheme":[{"@id":"http://purl.org/spar/datacite/
local-resource-identifier-scheme"}],"https://schema.org/propertyId":[{"@value":"UUID"}],"https://schema.org/value":
[{"@value":"uuid:058d173af80a4784b471d29aa9ad7257"}]},{"@id":"_:c14n1","@type":["https://schema.org/Place"],
...
```
Extract identifiers and compute checksums:
```
jld get "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257/" | jld identifiers -c
[
  {
    "@id": [
      "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257"
    ],
    "url": [
      "https://www.hydroshare.org/resource/058d173af80a4784b471d29aa9ad7257"
    ],
    "identifier": [
      "uuid:058d173af80a4784b471d29aa9ad7257"
    ],
    "hashes": {
      "sha256": "a8cb4e5806045032fc2e7ad0b762336ff76f3792271ddc071c0d8c85d6b69ac5",
      "sha1": "f6abef03156a5adb6d395f385628a2894e7b920e",
      "md5": "03a357ba8043ac734aa3b9e9bb514ff9"
    }
  }
]
```
Open the canonical form of the BCO-DMO dataset `https://www.bco-dmo.org/dataset/839373` in [JSON-LD Playground](https://json-ld.org/playground/):
```
jld get "https://www.bco-dmo.org/dataset/839373" | jld canon | jld play -B
New public gist created at:
https://gist.github.com/datadavev/4f3cad1a104263bcf1c1bb96723911fc
Link to JSON-LD playground:
https://json-ld.org/playground/#startTab=tab-expanded&json-ld=https%3A%2F%2Fgist.githubusercontent.com%2Fdatadavev%2F4f3cad1a104263bcf1c1bb96723911fc%2Fraw
```
## Installation
Install using [`poetry`](https://python-poetry.org/). For example:
```
git clone https://github.com/datadavev/sonormal.git
cd sonormal
poetry install
```
Then run using:
```
poetry run jld
```
Alternatively, create and activate a virtual environment yourself, then install into it:
```
poetry install
```
Then run `jld` directly:
```
jld
```
Note that the `play` command, which uploads to the [JSON-LD Playground](https://json-ld.org/playground/), requires the GitHub [command line tool `gh`](https://github.com/cli/cli) to be available on the path and authenticated.
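A quick way to check the first of these prerequisites from Python is sketched below. The helper name is hypothetical and not part of `sonormal`; it only tests whether an executable named `gh` is on the path and says nothing about authentication, which can be checked separately with `gh auth status`.

```python
import shutil


def gh_on_path():
    """Return True if an executable named `gh` can be found on PATH."""
    return shutil.which("gh") is not None


if __name__ == "__main__":
    print("gh found" if gh_on_path() else "gh not found on PATH")
```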