https://github.com/hboisgibault/unicontent

Python module to extract structured metadata from URL, ISBN or DOI

doi extraction google-books isbn metadata open-graph python url

Host: GitHub
URL: https://github.com/hboisgibault/unicontent
Owner: hboisgibault
Created: 2017-02-14T18:32:24.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-06-11T13:27:12.000Z (almost 8 years ago)
Last Synced: 2025-07-04T16:46:24.440Z (9 months ago)
Topics: doi, extraction, google-books, isbn, metadata, open-graph, python, url
Language: Python
Size: 34.2 KB
Stars: 11
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# unicontent

Unicontent is a Python library for extracting metadata from different types of sources and for different types of objects. The goal is to normalize metadata and to provide an easy-to-use extractor. Given an identifier (URL, DOI, ISBN), unicontent can retrieve structured data about the corresponding object.

## Usage

The basic usage below extracts metadata from any kind of identifier: unicontent detects the identifier type and uses the matching extractor. Use the ```get_metadata``` function if you only need the metadata.

```python

from unicontent.extractors import get_metadata

data = get_metadata(identifier="http://example.com", format='n3')

```
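To illustrate what detecting the type of identifier can look like, here is a minimal sketch. It is not unicontent's own code: the function name ```detect_identifier_type``` and both regular expressions are assumptions made for this example, and the library's actual rules may differ.

```python
import re

# Hypothetical patterns for illustration; unicontent's own rules may differ.
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def detect_identifier_type(identifier):
    """Return 'url', 'doi' or 'isbn', or None if nothing matches."""
    value = identifier.strip()
    if URL_PATTERN.match(value):
        return "url"
    if DOI_PATTERN.match(value):
        return "doi"
    # An ISBN has 10 or 13 characters once hyphens and spaces are removed.
    # Only the shape is checked here, not the check digit.
    compact = re.sub(r"[\s-]", "", value)
    if re.fullmatch(r"\d{9}[\dXx]|\d{13}", compact):
        return "isbn"
    return None
```

A dispatcher built on such a function would then pick the URL, DOI or ISBN extractor from the returned label.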

See below if you want to use the extractor for a specific kind of identifier (URL, DOI or ISBN).

### Extraction from URL

The class ```URLContentExtractor``` extracts data from a URL. Several output formats are available. The RDF formats ('n3', 'turtle', 'xml') return an rdflib graph. The 'dict' and 'json' formats return a dictionary and a JSON file respectively, built according to the defined mapping. A default mapping is provided.

```python

from unicontent.extractors import URLContentExtractor

url = 'http://www.lemonde.fr/big-browser/article/2017/02/13/comment-les-americains-s-informent-oublient-et-reagissent-sur-les-reseaux-sociaux_5079137_4832693.html'

# 'dict' is the default format; schema_names lists the schemas to read, in order
url_extractor = URLContentExtractor(identifier=url, format='dict', schema_names=['opengraph', 'dublincore', 'htmltags'])

metadata_dict = url_extractor.get_data()

```

The order of the entries in ```schema_names``` defines the order in which the extractor fetches metadata. Always include ```htmltags``` so that at least the basic HTML tags of the webpage are read.
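One plausible reading of this ordering is that schemas listed earlier take precedence and later ones only fill in missing fields. The sketch below shows that idea on made-up data. The helper ```merge_by_priority``` and the field names are hypothetical and are not part of unicontent.

```python
def merge_by_priority(schema_names, extracted):
    """Merge per-schema metadata, keeping the first non-empty value per field.

    schema_names: schema names in priority order.
    extracted: dict mapping each schema name to the fields found for it.
    """
    merged = {}
    for name in schema_names:
        for field, value in extracted.get(name, {}).items():
            # Earlier schemas win; empty values never block a later schema.
            if value and field not in merged:
                merged[field] = value
    return merged


# Made-up values standing in for what each schema found on a page.
extracted = {
    "opengraph": {"title": "OG title", "description": ""},
    "dublincore": {"title": "DC title", "author": "Jane Doe"},
    "htmltags": {"title": "HTML title", "description": "Meta description"},
}

result = merge_by_priority(["opengraph", "dublincore", "htmltags"], extracted)
```

Under this assumption, the title comes from ```opengraph```, the author from ```dublincore```, and the description falls through to ```htmltags``` because the Open Graph value is empty.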

### Extraction from DOI

The module uses the DOI system Proxy Server to extract metadata from DOI codes. The extractor class is ```DOIContentExtractor```.

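```python

from unicontent.extractors import DOIContentExtractor

# A DOI has the form <prefix>/<suffix>, with a prefix such as 10.1038
doi = '10.1038/nphys1170'

doi_extractor = DOIContentExtractor(identifier=doi, format='dict')

metadata_dict = doi_extractor.get_data()

```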

### Extraction from ISBN

To retrieve book metadata, the library queries GoogleBooks first and OpenLibrary second. The extractor class is ```ISBNContentExtractor```.

If GoogleBooks does not find a volume for the ISBN code, a request is sent to OpenLibrary to fetch the data.
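The fallback described above can be sketched as an ordered list of sources that are tried one after another. This is only an illustration of the pattern, not the library's implementation: ```lookup_isbn``` and the two stand-in fetchers are hypothetical and make no network requests.

```python
def lookup_isbn(isbn, sources):
    """Try each (name, fetch) pair in order and return the first hit.

    Returns (source_name, record), or (None, None) if every source fails.
    """
    for name, fetch in sources:
        record = fetch(isbn)
        if record:
            return name, record
    return None, None


# Stand-in fetchers for illustration; real ones would call the remote APIs.
def fake_google_books(isbn):
    return None  # simulate "no volume found for this ISBN"


def fake_open_library(isbn):
    return {"isbn": isbn, "title": "Example title"}


source, record = lookup_isbn(
    "9780306406157",
    [("googlebooks", fake_google_books), ("openlibrary", fake_open_library)],
)
```

Here the first source returns nothing, so the record comes from the second one. Keeping the sources in a list makes the order explicit and easy to change.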
