{"id":15447479,"url":"https://github.com/datadavev/soscan","last_synced_at":"2025-03-28T08:44:18.460Z","repository":{"id":83563539,"uuid":"311658220","full_name":"datadavev/soscan","owner":"datadavev","description":"Spider for scanning and retrieving schema.org content","archived":false,"fork":false,"pushed_at":"2020-11-11T19:44:30.000Z","size":82,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-02T09:27:21.679Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datadavev.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-10T12:55:37.000Z","updated_at":"2020-11-11T19:41:20.000Z","dependencies_parsed_at":"2023-07-02T23:45:34.559Z","dependency_job_id":null,"html_url":"https://github.com/datadavev/soscan","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadavev%2Fsoscan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadavev%2Fsoscan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadavev%2Fsoscan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datadavev%2Fsoscan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datadavev","download_url":"https://codeload.github.com/datadavev/soscan/tar.gz/refs/heads/main","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":245999318,"owners_count":20707554,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T20:06:06.822Z","updated_at":"2025-03-28T08:44:18.454Z","avatar_url":"https://github.com/datadavev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# soscan\n\nSpider for scanning and retrieving schema.org content.\n\nExtracted schema.org markup is stored to postgres as a `JSONB` field in\na single table `socontent`. Document URL is used as the primary key.\n\nThe JSON-LD is normalized to always be an array of graphs. This helps\nto facilitate consistent access to the stored JSON.\n\nJSON-LD is extracted only from the content retrieved from the URL. Extraction\nfrom client-side rendered content is not supported, though it would be\nstraightforward to add using Selenium.\n\n\n## Installation\n\nDependencies:\n\n* link:https://www.python.org/[Python \u003e= 3.8]\n* link:https://python-poetry.org/docs/#installation[Poetry \u003e= 1.1.4]\n* link:https://www.postgresql.org/[Postgres \u003e= 11]\n\nThe complete Python dependencies are listed in `pyproject.toml`.\n\nInstalling psycopg2 on OS X can be a bit cumbersome. With a brew\ninstalled postgresql, this worked for me:\n----\nenv LDFLAGS='-L/usr/local/lib -L/usr/local/opt/openssl/lib -L/usr/local/opt/readline/lib' poetry add psycopg2\n----\n\nInstalling and setting up the scanner involves getting the source,\ncreating a target database, and configuration.\n\nGetting the source:\n\n----\ngit clone https://github.com/datadavev/soscan.git\ncd soscan\npoetry install\n----\n\nCreate the database:\n----\npsql\nCREATE DATABASE soscan;\nCREATE USER soscanrw;\nGRANT ALL PRIVILEGES ON DATABASE soscan TO soscanrw;\n----\n\nThe database schema is applied on first run, with a single table created:\n\n----\n                          Table \"public.socontent\"\n     Column     |           Type           | Collation | Nullable | Default\n----------------+--------------------------+-----------+----------+---------\n url            | character varying        |           | not null |\n time_loc       | timestamp with time zone |           |          |\n time_modified  | timestamp with time zone |           |          |\n time_retrieved | timestamp with time zone |           |          |\n http_status    | integer                  |           |          |\n jsonld         | jsonb                    |           |          |\nIndexes:\n    \"socontent_pkey\" PRIMARY KEY, btree (url)\n----\n\nThe `jsonld` field holds the normalized JSON-LD retrieved from the page at `url`.\n\n[NOTE]\nIt is not particularly efficient to search the jsonld column (it's ok\nfor a few hundred thousand entries). An application reliant on performant\nsearch against values in the jsonld would benefit from appropriate\nnormalization of the content.\n\n## Operation\n\nThe spider crawls the entries in one or more sitemaps, loads the pages, extracts\nJSON-LD from each page, normalizes the JSON-LD, and stores the\nresults in a postgres table.\n\nExample, retrieve from BCO-DMO:\n\n----\nscrapy crawl JsonldSpider \\\n  -a sitemap_urls=\"https://www.bco-dmo.org/sitemap.xml\"\n----\n\nExample, retrieve from Dryad:\n\n----\nscrapy crawl JsonldSpider \\\n  -a sitemap_urls=\"https://datadryad.org/sitemap.xml\"\n----\n\nor both in parallel (space-delimited sitemap URLs to crawl):\n\n----\nscrapy crawl JsonldSpider \\\n  -a sitemap_urls=\"https://datadryad.org/sitemap.xml https://www.bco-dmo.org/sitemap.xml\"\n----\n\n## JSON-LD Normalization\n\nThe extracted JSON-LD is normalized to the following top-level structure:\n\n----\n{\n    \"@context\": {\n        \"@vocab\":\"https://schema.org/\"\n    },\n    \"@graph\":[\n        {\n            \"@id\":\"id for graph 1\",\n            \"@type\":\"type for graph 1\",\n            ...\n        },\n        ...\n    ]\n}\n----\n\n## Query notes\n\nThe number of entries in `@graph` can be found by:\n\n----\nselect count(*), jsonb_array_length(jsonld-\u003e'@graph') as G from socontent group by G;\n\n count | g\n-------+---\n   782 | 1\n----\n\n[NOTE]\nThe following examples all operate on the first `@graph` in the jsonld (zeroth index).\n\nTypes of things that were collected:\n\n----\nselect count(*), jsonld-\u003e'@graph'-\u003e0-\u003e'@type' as T\n  from socontent\n  group by T\n  order by T;\n\n count |        t\n-------+-----------------\n    54 | Dataset\n     2 | Event\n    43 | MonetaryGrant\n    24 | ResearchProject\n----\n\nThe same result using `jsonb_extract_path_text`:\n\n----\nselect count(*), jsonb_extract_path_text(jsonld,'@graph','0','@type') as T\n  from socontent\n  group by T\n  order by T;\n\n count |        t\n-------+-----------------\n    54 | Dataset\n     2 | Event\n    43 | MonetaryGrant\n    24 | ResearchProject\n----\n\nList `variableMeasured` names:\n\n----\nselect distinct\n  jsonb_array_elements(jsonld-\u003e'@graph'-\u003e0-\u003e'variableMeasured')-\u003e'name' as var\n  from socontent\n  where jsonb_array_length(jsonld-\u003e'@graph'-\u003e0-\u003e'variableMeasured') \u003e 0\n  order by var;\n\n                                     var\n------------------------------------------------------------------------------\n \"19'-butanoyloxyfucoxanthin\"\n \"19'-hexanoyloxyfucoxanthin\"\n \"??\"\n \"Additional notes\"\n \"Additional notes/collected organisms\"\n \"Aggregate mass of all 4 edge or interior clusters collected from that cage\"\n \"Alloxanthin\"\n \"Alpha-carotene\"\n \"Ammonium.\"\n...\n \"volume filtered\"\n \"volume filtered; i.e. how much water went through the net\"\n \"warnings and comments from SAP run\"\n \"water depth at the station according to depth sounder on vessel\"\n \"which instrument was used\"\n \"year\"\n \"zooplankton dry weight\"\n(547 rows)\n----\n\n## Development\n\nTODO:\n\n* Filter by properties such as `@type`\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatadavev%2Fsoscan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatadavev%2Fsoscan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatadavev%2Fsoscan/lists"}