{"id":13474401,"url":"https://github.com/openschemas/extractors","last_synced_at":"2026-02-20T00:41:41.477Z","repository":{"id":141670800,"uuid":"157426056","full_name":"openschemas/extractors","owner":"openschemas","description":"generic extraction recipes to get you started extracting schema.org entities for your software, data, and all things","archived":false,"fork":false,"pushed_at":"2019-04-06T23:18:54.000Z","size":92,"stargazers_count":14,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-30T07:47:43.338Z","etag":null,"topics":["extractors","metadata","schemaorg"],"latest_commit_sha":null,"homepage":"https://cloud.docker.com/u/openschemas/repository/docker/openschemas/extractors","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openschemas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-11-13T18:22:21.000Z","updated_at":"2023-12-26T19:51:27.000Z","dependencies_parsed_at":"2024-01-02T21:53:15.972Z","dependency_job_id":"445efa87-ded0-4c8b-ad2d-7ad30a4727ef","html_url":"https://github.com/openschemas/extractors","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openschemas%2Fextractors","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openschemas%2Fextractors/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openschemas%2Fextractors/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/openschemas%2Fextractors/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openschemas","download_url":"https://codeload.github.com/openschemas/extractors/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245738632,"owners_count":20664320,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extractors","metadata","schemaorg"],"created_at":"2024-07-31T16:01:12.079Z","updated_at":"2026-02-20T00:41:36.457Z","avatar_url":"https://github.com/openschemas.png","language":"HTML","readme":"# Extractors\n\n![https://github.com/openschemas/spec-template/raw/master/img/hexagon_square_small.png](https://github.com/openschemas/spec-template/raw/master/img/hexagon_square_small.png)\n\nThis is a repository with example extractors and recipes intended to be used\nwith [schemaorg](https://openschemas.github.io/schemaorg/#Usage) Python \nto help you to extract metadata from your datasets,\nsoftware and other entities described in [schema.org](https://www.schema.org).\n\n\n## Specifications\n\nThe following specifications have Dockerfiles (and associated Github actions)\nfor you to use! See the subdirectories to get usage:\n\n - [Dataset](Dataset) is an example starter script to extract a Dataset.\n - [ImageDefinition](ImageDefinition) is a kind of SoftwareSourceCode extended to describe containers. 
We provide a Dockerfile that builds the extractor to generate a static page for an input Dockerfile.\n - [ContainerTree](ContainerTree) is an extended ImageDefinition to also include a filesystem listing that can be used to generate a container tree.\n\nFor each of the above, when you deploy to Github Pages for the first time, you\nneed to switch Github Pages to deploy from master and then back to the `gh-pages`\nbranch on deploy. There is a known issue with Permissions if you deploy\nto the branch without activating it (as an admin) from the repository first.\n\n## Extractors (without Containers)\n\nThe following examples for entities (children of \"Thing\")\ndefined in schema.org are also provided. These specifications don't yet have Docker\ncontainers or Github Action extractors.\n\n - [DataCatalog](DataCatalog): a collection or grouping of Datasets\n - [Organization](Organization): a complete organization, with a ContactPoint\n - [SoftwareSourceCode](SoftwareSourceCode): an example extraction shown [here](https://openbases.github.io/extract-dockerfile/SoftwareSourceCode/) for a Dockerfile.\n\n\n## What is special about those pages?\nFor each of the above, the metadata shown is also embedded in the page as json-ld\n(when you \"View Source\").\n\n## What files are included in each folder?\nEach folder above includes an example python script to extract metadata (`extract.py`), \na recipe to follow (`recipe.yml`), and the specification in yaml format (in the \ncase of a specification not served by production schema.org).\n\n# Usage\nFor the Docker and Github Actions usage, see inside the [ImageDefinition](ImageDefinition)\nfolder. 
For all other schema.org entities and local usage, details are provided here.\nBefore running these examples, make sure you have installed the module (and note\nthis module is under development, contributions are welcome!)\n\n```bash\npip install schemaorg\n```\n\nTo extract a recipe for a particular datatype, you can modify `extract.py` and the \n`recipe.yml` for your particular needs, or use them as is. Generally we:\n\n 1. Read in a specific version of the *schemaorg definitions* provided by the library\n 2. Read in a *recipe* for a template that we want to populate (e.g., google/dataset)\n 3. Use helper functions provided by the template (or our own) to *extract*\n 4. Extract, *validate*, and generate the final dataset\n\nThe goal of the software is to provide enough structure to help the user (typically a developer)\nbut not so much as to be annoying to use generally.\n\n## What are the files in each folder?\n\n### recipe.yml Files\n\nIf I am a provider of a service and want my users to label their data for my service,\nI need to tell them how to do this. I do this by way of a recipe file: in each\nexample folder there is a file called `recipe.yml` that is a simple listing of the required fields for the entities that are needed. For example, the [recipe.yml](SoftwareSourceCode/recipe.yml) in the \n\"SoftwareSourceCode\" folder tells the parser that we need to define\nproperties for \"SoftwareSourceCode\" and an Organization or Person. For example,\nwith the [schemaorg](https://www.github.com/openschemas/schemaorg) Python module \nI can learn that the \"SoftwareSourceCode\" definition has 121 properties, \nbut the recipe tells us that we only need a subset of those\nproperties for a valid extraction.\n\n### extract.py\n\nThis is the code snippet that shows how you extract metadata and use the \n[schemaorg](https://www.github.com/openschemas/schemaorg) Python module\nto generate the final template page. 
This file could be run in multiple places!\n\n - In a continuous integration setup so that each change to master updates the Github Pages metadata.\n - Using a tool like [datalad](https://datalad.org) that allows for version control of such metadata, and definition of extractors (also in Python).\n - As a Github hook (or action) that is run at any stage in the development process.\n - Rendered by a web server that provides Container Recipes for users that should be indexed with Google Search (e.g., Singularity Hub).\n\n### Dockerfile\n\nFor the folders with associated containers, you will find a Dockerfile (and associated entrypoint.sh)! These containers\nwill build the extractor into an image that can be used with Github Actions.\n\n## Examples\n\n - [extract-dockerfile Writeup](https://vsoch.github.io/2018/schemaorg/) to demonstrate extraction for a Dockerfile.\n - [extract-dockerfile Repository](https://github.com/openbases/extract-dockerfile)\n - [dockerfiles](https://github.com/openschemas/dockerfiles): a scaled extraction (under development) for ~30-60K Dockerfiles, a subset of the [Dinosaur Dataset](https://vsoch.github.io/datasets/2018/dockerfiles/).\n\n\n## Resources\n\n - [Open Container Initiative](https://github.com/opencontainers/)\n - [Google Datasets](https://www.blog.google/products/search/making-it-easier-discover-datasets/)\n - [Schemaorg Discussion](https://github.com/schemaorg/schemaorg/issues/2059#issuecomment-427208907)\n","funding_links":[],"categories":["Community Resources","HTML"],"sub_categories":["GitHub Pages"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenschemas%2Fextractors","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenschemas%2Fextractors","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenschemas%2Fextractors/lists"}