Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ludbek/webpreview
Extracts OpenGraph, TwitterCard and Schema properties from a webpage.
- Host: GitHub
- URL: https://github.com/ludbek/webpreview
- Owner: ludbek
- License: other
- Created: 2014-12-03T07:22:25.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2024-05-27T20:32:58.000Z (8 months ago)
- Last Synced: 2025-01-06T13:07:54.678Z (5 days ago)
- Topics: open-graph, schema, twitter-cards, web-preview
- Language: Python
- Size: 185 KB
- Stars: 84
- Watchers: 7
- Forks: 18
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
- starred-awesome - webpreview - Extracts OpenGraph, TwitterCard and Schema properties from a webpage. (Python)
README
# webpreview
For a given URL, `webpreview` extracts its **title**, **description**, and **image url** using
[Open Graph](http://ogp.me/), [Twitter Card](https://dev.twitter.com/cards/overview), or
[Schema](http://schema.org/) meta tags, or, as an alternative, parses it as a generic webpage.

## Installation
```shell
pip install webpreview
```

## Usage

Use the generic `webpreview` function (added in *v1.7.0*) to parse a page regardless of its nature.
This function fetches a page and tries to extract a *title, description, and a preview image* from it.

It first attempts to parse the values from **Open Graph** properties, then it falls back to
**Twitter Card** format, and then to **Schema**. If none of these methods succeeds in extracting all
three properties, then the web page's content is parsed using a generic HTML parser.

```python
>>> from webpreview import webpreview

>>> p = webpreview("https://en.wikipedia.org/wiki/Enrico_Fermi")
>>> p.title
'Enrico Fermi - Wikipedia'
>>> p.description
'Italian-American physicist (1901–1954)'
>>> p.image
'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg'

# Access the parsed fields both as attributes and items
>>> p["url"] == p.url
True

# Check if all three of the title, description, and image are in the parsing result
>>> p.is_complete()
True

# Provide page content from somewhere else
>>> content = """
The Dormouse's story
The Dormouse's story
Elsie
"""
# The function's invocation won't make any external calls,
# only relying on the supplied content, unlike the example above
>>> webpreview("aa.com", content=content)
WebPreview(url="http://aa.com", title="The Dormouse's story", description="A Mad Tea-Party story")
```

### Using the command line

When `webpreview` is installed via `pip`, the accompanying command-line tool is
installed alongside it.

```shell
$ webpreview https://en.wikipedia.org/wiki/Enrico_Fermi
title: Enrico Fermi - Wikipedia
description: Italian-American physicist (1901–1954)
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg

$ webpreview https://github.com/ --absolute-url
title: GitHub: Where the world builds software
description: GitHub is where over 83 million developers shape the future of software, together.
image: https://github.githubassets.com/images/modules/site/social-cards/github-social.png
```

### Using compatibility API

Before *v1.7.0*, the package mainly exposed a different set of API methods.
All of them are still supported and may continue to be used.

```python
# WARNING:
# The API below is left for BACKWARD COMPATIBILITY ONLY.

from webpreview import web_preview
title, description, image = web_preview("aurl.com")

# specifying timeout which gets passed to requests.get()
title, description, image = web_preview("a_slow_url.com", timeout=1000)

# passing headers
headers = {'User-Agent': 'Mozilla/5.0'}
title, description, image = web_preview("a_slow_url.com", headers=headers)

# pass html content thus avoiding making http call again to fetch content.
content = """Dummy HTML"""
title, description, image = web_preview("aurl.com", content=content)

# specifying the parser
# by default webpreview uses 'html.parser'
title, description, image = web_preview("aurl.com", content=content, parser='lxml')
```

## Run with Docker

The Docker image can be built and run similarly to the command line.
The default entry point is the `webpreview` command-line function.

```shell
$ docker build -t webpreview .
$ docker run -it --rm webpreview "https://en.m.wikipedia.org/wiki/Enrico_Fermi"
title: Enrico Fermi - Wikipedia
description: Enrico Fermi (Italian: [enˈriːko ˈfermi]; 29 September 1901 – 28 November 1954) was an Italian (later naturalized American) physicist and the creator of the world's first nuclear reactor, the Chicago Pile-1. He has been called the "architect of the nuclear age"[1] and the "architect of the atomic bomb".
image: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg
```

*Note*: the built Docker image weighs around 210MB.

## Testing
```shell
# Execute the tests
poetry run pytest webpreview

# OR execute until the first failed test
poetry run pytest webpreview -x
```

## Setting up development environment

```shell
# Install the minimum supported version of Python
pyenv install 3.7.13

# Create a virtual environment
# By default, the project already contains a .python-version file that points
# to 3.7.13.
python -m venv .venv

# Install dependencies
# Poetry will automatically install them into the local .venv
poetry install

# If you have errors like this:
# ERROR: Can not execute `setup.py` since setuptools is not available in the build environment.

# Then do this:
.venv/bin/pip install --upgrade setuptools
```
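
## Sketch of the fallback order

The Usage section describes a fallback order: Open Graph first, then Twitter Card, then Schema, and finally generic HTML parsing. The block below is a minimal sketch of that idea using only the Python standard library. It is not the `webpreview` implementation: the names `MetaCollector`, `SOURCES`, and `extract_preview` are hypothetical, the choice of `<meta>` attributes is an assumption, and merging partial results across formats is a simplification made for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, Optional, Tuple

FIELDS = ("title", "description", "image")

# Assumed mapping: which <meta> attribute and key each format uses per field.
SOURCES = [
    ("property", {"title": "og:title", "description": "og:description", "image": "og:image"}),
    ("name", {"title": "twitter:title", "description": "twitter:description", "image": "twitter:image"}),
    ("itemprop", {"title": "name", "description": "description", "image": "image"}),
]


class MetaCollector(HTMLParser):
    """Collect <meta> contents and the <title> text from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        # Keys are (attribute, value) pairs, e.g. ("property", "og:title").
        self.meta: Dict[Tuple[str, str], str] = {}
        self.page_title: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attributes.get("content")
            for attr in ("property", "name", "itemprop"):
                key = attributes.get(attr)
                if key and content:
                    # Keep the first value seen for each key.
                    self.meta.setdefault((attr, key), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.page_title is None and data.strip():
            self.page_title = data.strip()


def extract_preview(html: str) -> Dict[str, Optional[str]]:
    """Fill title, description, and image, trying each format in order."""
    collector = MetaCollector()
    collector.feed(html)
    result: Dict[str, Optional[str]] = {field: None for field in FIELDS}
    for attr, mapping in SOURCES:
        for field, key in mapping.items():
            if result[field] is None:
                result[field] = collector.meta.get((attr, key))
        if all(result.values()):
            return result  # all three fields found, no further fallback needed
    # Generic fallback: plain <title> and <meta name="description">.
    if result["title"] is None:
        result["title"] = collector.page_title
    if result["description"] is None:
        result["description"] = collector.meta.get(("name", "description"))
    return result


# A small made-up page: only an Open Graph description plus a plain <title>.
sample = """
<html><head>
<title>The Dormouse's story</title>
<meta property="og:description" content="A Mad Tea-Party story">
</head><body></body></html>
"""
print(extract_preview(sample))
# {'title': "The Dormouse's story", 'description': 'A Mad Tea-Party story', 'image': None}
```

In this sample the description comes from the Open Graph tag, the title comes from the generic `<title>` fallback, and the image stays `None` because no format provides one. A result like this would presumably not count as complete in the sense of `is_complete()`.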