{"id":13585245,"url":"https://github.com/ludbek/webpreview","last_synced_at":"2025-04-07T14:12:56.941Z","repository":{"id":24087742,"uuid":"27474796","full_name":"ludbek/webpreview","owner":"ludbek","description":"Extracts OpenGraph, TwitterCard and Schema properties from a webpage.","archived":false,"fork":false,"pushed_at":"2024-05-27T20:32:58.000Z","size":189,"stargazers_count":83,"open_issues_count":9,"forks_count":18,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-31T12:09:06.007Z","etag":null,"topics":["open-graph","schema","twitter-cards","web-preview"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ludbek.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-12-03T07:22:25.000Z","updated_at":"2025-03-15T07:04:44.000Z","dependencies_parsed_at":"2024-06-18T22:53:43.730Z","dependency_job_id":"c55f1a14-e2d4-49c6-aafd-e7c27aa2fef3","html_url":"https://github.com/ludbek/webpreview","commit_stats":{"total_commits":97,"total_committers":11,"mean_commits":8.818181818181818,"dds":"0.45360824742268047","last_synced_commit":"f9f778191cc613599c940aed78fbb5cf28c9a86c"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ludbek%2Fwebpreview","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ludbek%2Fwebpreview/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ludbek%2Fwebpreview/relea
ses","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ludbek%2Fwebpreview/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ludbek","download_url":"https://codeload.github.com/ludbek/webpreview/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247666014,"owners_count":20975788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["open-graph","schema","twitter-cards","web-preview"],"created_at":"2024-08-01T15:04:49.782Z","updated_at":"2025-04-07T14:12:56.919Z","avatar_url":"https://github.com/ludbek.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# webpreview\n\nFor a given URL, `webpreview` extracts its **title**, **description**, and **image url** using\n[Open Graph](http://ogp.me/), [Twitter Card](https://dev.twitter.com/cards/overview), or\n[Schema](http://schema.org/) meta tags, or, as an alternative, parses it as a generic webpage.\n\n\u003cp\u003e\n    \u003ca href=\"https://pypi.org/project/webpreview/\"\u003e\u003cimg alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/webpreview\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/webpreview/\"\u003e\u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/webpreview?logo=pypi\u0026color=blue\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/ludbek/webpreview/actions?query=workflow%3Atest\"\u003e\u003cimg alt=\"Build status\" 
src=\"https://img.shields.io/github/workflow/status/ludbek/webpreview/test?label=build\u0026logo=github\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://codecov.io/gh/ludbek/webpreview\"\u003e\u003cimg alt=\"Code coverage report\" src=\"https://img.shields.io/codecov/c/github/ludbek/webpreview?logo=codecov\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n## Installation\n\n```shell\npip install webpreview\n```\n\n## Usage\n\nUse the generic `webpreview` method (added in *v1.7.0*) to parse a page regardless of its nature.\nThis method fetches a page and tries to extract a *title, description, and a preview image* from it.\n\nIt first attempts to parse the values from **Open Graph** properties, then it falls back to\n**Twitter Card** format, and then to **Schema**. If none of these methods succeeds in extracting all\nthree properties, then the web page's content is parsed using a generic HTML parser.\n\n```python\n\u003e\u003e\u003e from webpreview import webpreview\n\n\u003e\u003e\u003e p = webpreview(\"https://en.wikipedia.org/wiki/Enrico_Fermi\")\n\u003e\u003e\u003e p.title\n'Enrico Fermi - Wikipedia'\n\u003e\u003e\u003e p.description\n'Italian-American physicist (1901–1954)'\n\u003e\u003e\u003e p.image\n'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg'\n\n# Access the parsed fields both as attributes and items\n\u003e\u003e\u003e p[\"url\"] == p.url\nTrue\n\n# Check if all three of the title, description, and image are in the parsing result\n\u003e\u003e\u003e p.is_complete()\nTrue\n\n# Provide page content from somewhere else\n\u003e\u003e\u003e content = \"\"\"\n\u003chtml\u003e\n    \u003chead\u003e\n        \u003ctitle\u003eThe Dormouse's story\u003c/title\u003e\n        \u003cmeta property=\"og:description\" content=\"A Mad Tea-Party story\" /\u003e\n    \u003c/head\u003e\n    \u003cbody\u003e\n        \u003cp class=\"title\"\u003e\u003cb\u003eThe Dormouse's story\u003c/b\u003e\u003c/p\u003e\n  
      \u003ca href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\"\u003eElsie\u003c/a\u003e\n    \u003c/body\u003e\n\u003c/html\u003e\n\"\"\"\n\n# This invocation of the function won't make any external calls,\n# only relying on the supplied content, unlike the example above\n\u003e\u003e\u003e webpreview(\"aa.com\", content=content)\nWebPreview(url=\"http://aa.com\", title=\"The Dormouse's story\", description=\"A Mad Tea-Party story\")\n```\n\n### Using the command line\n\nWhen `webpreview` is installed via `pip`, the accompanying command-line tool is\ninstalled alongside it.\n\n```shell\n$ webpreview https://en.wikipedia.org/wiki/Enrico_Fermi\ntitle: Enrico Fermi - Wikipedia\ndescription: Italian-American physicist (1901–1954)\nimage: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg\n\n$ webpreview https://github.com/ --absolute-url\ntitle: GitHub: Where the world builds software\ndescription: GitHub is where over 83 million developers shape the future of software, together.\nimage: https://github.githubassets.com/images/modules/site/social-cards/github-social.png\n```\n\n### Using compatibility API\n\nBefore *v1.7.0* the package mainly exposed a different set of API methods.\nAll of them are still supported and may continue to be used.\n\n```python\n# WARNING:\n# The API below is left for BACKWARD COMPATIBILITY ONLY.\n\nfrom webpreview import web_preview\ntitle, description, image = web_preview(\"aurl.com\")\n\n# specifying a timeout, which gets passed to requests.get()\ntitle, description, image = web_preview(\"a_slow_url.com\", timeout=1000)\n\n# passing headers\nheaders = {'User-Agent': 'Mozilla/5.0'}\ntitle, description, image = web_preview(\"a_slow_url.com\", headers=headers)\n\n# pass html content, thus avoiding another http call to fetch the content.\ncontent = \"\"\"\u003chtml\u003e\u003chead\u003e\u003ctitle\u003eDummy 
HTML\u003c/title\u003e\u003c/head\u003e\u003c/html\u003e\"\"\"\ntitle, description, image = web_preview(\"aurl.com\", content=content)\n\n# specifying the parser\n# by default webpreview uses 'html.parser'\ntitle, description, image = web_preview(\"aurl.com\", content=content, parser='lxml')\n```\n\n## Run with Docker\n\nThe Docker image can be built and run similarly to the command line.\nThe default entry point is the `webpreview` command-line function.\n\n```shell\n$ docker build -t webpreview .\n$ docker run -it --rm webpreview \"https://en.m.wikipedia.org/wiki/Enrico_Fermi\"\ntitle: Enrico Fermi - Wikipedia\ndescription: Enrico Fermi (Italian: [enˈriːko ˈfermi]; 29 September 1901 – 28 November 1954) was an Italian (later naturalized American) physicist and the creator of the world's first nuclear reactor, the Chicago Pile-1. He has been called the \"architect of the nuclear age\"[1] and the \"architect of the atomic bomb\".\nimage: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Enrico_Fermi_1943-49.jpg/1200px-Enrico_Fermi_1943-49.jpg\n```\n\n*Note*: the built Docker image weighs around 210 MB.\n\n## Testing\n\n```shell\n# Execute the tests\npoetry run pytest webpreview\n\n# OR execute until the first failed test\npoetry run pytest webpreview -x\n```\n\n## Setting up development environment\n\n```shell\n# Install the minimum supported version of Python\npyenv install 3.7.13\n\n# Create a virtual environment\n# By default, the project already contains a .python-version file that points\n# to 3.7.13.\npython -m venv .venv\n\n# Install dependencies\n# Poetry will automatically install them into the local .venv\npoetry install\n\n# If you get an error like this:\nERROR: Can not execute `setup.py` since setuptools is not available in the build environment.\n\n# Then do this:\n.venv/bin/pip install --upgrade 
setuptools\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fludbek%2Fwebpreview","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fludbek%2Fwebpreview","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fludbek%2Fwebpreview/lists"}