An open API service indexing awesome lists of open source software.

https://github.com/datadavev/soscan

Spider for scanning and retrieving schema.org content
https://github.com/datadavev/soscan

Last synced: about 1 year ago
JSON representation

Spider for scanning and retrieving schema.org content

Awesome Lists containing this project

README

          

# soscan

Spider for scanning and retrieving schema.org content.

Extracted schema.org markup is stored to postgres as a `JSONB` field in
a single table `socontent`. Document URL is used as the primary key.

The JSON LD is normalized to always be an array of graphs. This helps
to facilitate consistent access to the stored JSON.

JSON-LD is extracted only from the content retrieved from the URL. Extraction
from client side rendered content is not supported though is straight forward
to support using Selenium.

## Installation

Dependencies:

* link:https://www.python.org/[Python >= 3.8]
* link:https://python-poetry.org/docs/#installation[Poetry >= 1.1.4]
* link:https://www.postgresql.org/[Postgres >= 11]

Complete python dependencies are listed in `pyproject.toml`.

Installing psycopg2 on OS X can be a bit cumbersome. With a brew
installed postgresql, this worked for me:
----
env LDFLAGS='-L/usr/local/lib -L/usr/local/opt/openssl/lib -L/usr/local/opt/readline/lib' poetry add psycopg2
----

Installing and setting up the scanner involves getting the source,
creating a target database, and configuration.

Getting the source:

----
git clone https://github.com/datadavev/soscan.git
cd soscan
poetry install
----

Create the database:
----
psql
CREATE DATABASE soscan;
CREATE USER soscanrw;
GRANT ALL PRIVILEGES ON DATABASE soscan TO soscanrw;
----

The database schema is applied on first run, with a single table created:

----
Table "public.socontent"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+---------
url | character varying | | not null |
time_loc | timestamp with time zone | | |
time_modified | timestamp with time zone | | |
time_retrieved | timestamp with time zone | | |
http_status | integer | | |
jsonld | jsonb | | |
Indexes:
"socontent_pkey" PRIMARY KEY, btree (url)
----

The `jsonld` field holds the normalized JSON-LD retrieved from the page at `url`.

[NOTE]
It is not particularly efficient to search the jsonld column (it's ok
for a few hundred thousand entries). An application reliant on performant
search against values in the jsonld would benefit from appropriate
normalization of the content.

## Operation

The spider crawls entries in a sitemaps, loads pages, extracts
JSON-LD from each page, normalizes the JSON-LD, and stores the
results in a postgres table.

Example, retrieve from BCO-DMO:

----
scrapy crawl JsonldSpider \
-a sitemap_urls="https://www.bco-dmo.org/sitemap.xml"
----

Example, retrieve from Dryad:

----
scrapy crawl JsonldSpider \
-a sitemap_urls="https://datadryad.org/sitemap.xml"
----

or both in parallel (space delimited sitemap URLs to crawl):

----
scrapy crawl JsonldSpider \
-a sitemap_urls="https://datadryad.org/sitemap.xml https://www.bco-dmo.org/sitemap.xml"
----

## JSON-LD Normalization

The extracted json-ld is normalized at a high level with a structure that follows:

----
{
"@context": {
"@vocab":"https://schema.org/"
},
"@graph":[
{
"@id":"id for graph 1",
"@type":"type for graph 1",
...
},
...
]
}
----

## Query notes

The number of entries in `@graph` can be found by:

----
select count(*), jsonb_array_length(jsonld->'@graph') as G from socontent group by G;

count | g
-------+---
782 | 1
----

[NOTE]
The following examples all operate on the first `@graph` in the jsonld (zeroth index).

Types of things that we collected:

----
select count(*), jsonld->'@graph'->0->'@type' as T
from socontent
group by T
order by T;

count | t
-------+-----------------
54 | Dataset
2 | Event
43 | MonetaryGrant
24 | ResearchProject
----

Same but different:

----
select count(*), jsonb_extract_path_text(jsonld,'@graph','0','@type') as T
from socontent
group by T
order by T;

count | t
-------+-----------------
54 | Dataset
2 | Event
43 | MonetaryGrant
24 | ResearchProject
----

List `variableMeasured` names:

----
select distinct
jsonb_array_elements(jsonld->'@graph'->0->'variableMeasured')->'name' as var
from socontent
where jsonb_array_length(jsonld->'@graph'->0->'variableMeasured') > 0
order by var;

var
------------------------------------------------------------------------------
"19'-butanoyloxyfucoxanthin"
"19'-hexanoyloxyfucoxanthin"
"??"
"Additional notes"
"Additional notes/collected organisms"
"Aggregate mass of all 4 edge or interior clusters collected from that cage"
"Alloxanthin"
"Alpha-carotene"
"Ammonium."
...
"volume filtered"
"volume filtered; i.e. how much water went through the net"
"warnings and comments from SAP run"
"water depth at the station according to depth sounder on vessel"
"which instrument was used"
"year"
"zooplankton dry weight"
(547 rows)
----

## Development

TODO:

* Filter by properties such as `@type`