{"id":16938770,"url":"https://github.com/jvanasco/metadata_parser","last_synced_at":"2025-05-16T02:09:30.294Z","repository":{"id":2947636,"uuid":"3961068","full_name":"jvanasco/metadata_parser","owner":"jvanasco","description":"python library for getting metadata","archived":false,"fork":false,"pushed_at":"2025-03-24T13:51:19.000Z","size":422,"stargazers_count":143,"open_issues_count":5,"forks_count":24,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-08T13:05:58.260Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jvanasco.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.txt","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2012-04-07T22:38:09.000Z","updated_at":"2025-03-24T13:51:23.000Z","dependencies_parsed_at":"2024-06-18T18:19:51.185Z","dependency_job_id":"12fa4790-c675-4f01-b089-a1b681677c01","html_url":"https://github.com/jvanasco/metadata_parser","commit_stats":{"total_commits":151,"total_committers":6,"mean_commits":"25.166666666666668","dds":0.04635761589403975,"last_synced_commit":"d1183b4566a774f28fa4f3e76d42ba60fb44e291"},"previous_names":[],"tags_count":64,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jvanasco%2Fmetadata_parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jvanasco%2Fmetadata_parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jvanasco%2Fmetadata_parser/releases","manifests_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/jvanasco%2Fmetadata_parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jvanasco","download_url":"https://codeload.github.com/jvanasco/metadata_parser/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254453667,"owners_count":22073618,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T21:02:29.914Z","updated_at":"2025-05-16T02:09:30.269Z","avatar_url":"https://github.com/jvanasco.png","language":"Python","readme":"MetadataParser\n==============\n\n.. |build_status| image:: https://github.com/jvanasco/metadata_parser/workflows/Python%20package/badge.svg\n\nBuild Status: |build_status|\n\nMetadataParser is a Python module for pulling metadata out of web documents.\n\nIt requires `BeautifulSoup` for parsing. `Requests` is required for installation\nat this time, but not for operation. 
Additional functionality is automatically\nenabled if the `tldextract` project is installed, but can be disabled by\nsetting an environment variable.\n\nThis project has been used in production for many years, and has successfully\nparsed billions of documents.\n\n\nVersioning, Pinning, and Support\n================================\n\nThis project is using a Semantic Versioning release schedule,\nwith a {MAJOR}.{MINOR}.{PATCH} format.\n\nUsers are advised to pin their installations to \"metadata_parser\u003c{MINOR +1}\"\n\nFor example:\n\n* if the current release is: `0.10.6`\n* the advised pin is:  `metadata_parser\u003c0.11`\n\nPATCH releases will usually be bug fixes and new features that support backwards compatibility with Public Methods.  Private Methods are not guaranteed to be\nbackwards compatible.\n\nMINOR releases are triggered when there is a breaking change to Public Methods.\nOnce a new MINOR release is triggered, first-party support for the previous MINOR\nrelease is EOL (end of life). PRs for previous releases are welcome, but giving\nthem proper attention is not guaranteed.\n\nThe current MAJOR release is `0`.\nA `1` MAJOR release is planned, and will have an entirely different structure and API.\n\nFuture deprecations will raise warnings.\n\nBy populating the following environment variable, future deprecations will raise exceptions:\n    export METADATA_PARSER_FUTURE=1\n\nInstallation\n=============\n\npip install metadata_parser\n\n\nInstallation Recommendation\n===========================\n\nThe ``requests`` library version 2.4.3 or newer is strongly recommended.\n\nThis is not required, but it is better.  
On earlier versions it is possible to\nhave an uncaught DecodeError exception when there is an underlying redirect/404.\nRecent fixes to ``requests`` improve the handling of redirects and urllib3\nerrors.\n\n\nFeatures\n========\n\n* ``metadata_parser`` pulls as much metadata out of a document as possible\n* Developers can set a 'strategy' for finding metadata (e.g. only accept\n  opengraph or page attributes)\n* Lightweight but functional(!) url validation\n* Verbose logging\n\nLogging\n=======\n\nThis package has extensive logging to help developers pinpoint problems.\n\n* ``log.debug``\n  This log level is mostly used to handle library maintenance and\n  troubleshooting, aka \"Library Debugging\".  Library Debugging is verbose, but\n  is nested under ``if __debug__:`` statements, so it is compiled away when\n  PYTHONOPTIMIZE is set.\n  Several sections of logic useful to developers will also emit logging\n  statements at the ``debug`` level, regardless of PYTHONOPTIMIZE.\n\n* ``log.info``\n  Currently unused\n\n* ``log.warning``\n  Currently unused\n\n* ``log.error``\n  This log level is mostly used to alert developers of errors that were\n  encountered during url fetching and document parsing, and often emits a log\n  statement just before an Exception is raised. The log statements will contain\n  at least the exception type, and may contain the active URL and additional\n  debugging information, if any of that information is available.\n\n* ``log.critical``\n  Currently unused\n\n\nIt is STRONGLY recommended to keep Python's logging at ``debug``.\n\n\nOptional Integrations\n=====================\n\n* ``tldextract``\n  This package will attempt to use the package ``tldextract`` for advanced domain\n  and hostname analysis. 
If ``tldextract`` is not found, a fallback is used.\n\n\nEnvironment Variables\n=====================\n\n* ``METADATA_PARSER__DISABLE_TLDEXTRACT``\n  Default: \"0\".\n  If set to \"1\", the package will not attempt to load ``tldextract``.\n\n* ``METADATA_PARSER__ENCODING_FALLBACK``\n  Default: \"ISO-8859-1\"\n  Used as the fallback when trying to decode a response.\n\n* ``METADATA_PARSER__DUMMY_URL``\n  Used as the fallback URL when calculating url data.\n\n\nNotes\n=====\n\n1. This package requires BeautifulSoup 4.\n2. For speed, it will instantiate a BeautifulSoup parser with lxml, and\n   fall back to 'none' (the internal pure Python parser) if it can't load lxml.\n3. URL Validation is not RFC compliant, but tries to be \"Real World\" compliant.\n\nIt is HIGHLY recommended that you install lxml for usage.\nlxml is considerably faster.\nConsiderably faster.\n\nDevelopers should also use a very recent version of lxml.\nSegfaults have been reported on lxml versions \u003c 2.3.x;\nusing at least the most recent 3.x versions is strongly recommended.\n\nThe default 'strategy' is to look in this order::\n\n    og,dc,meta,page\n\nWhich stands for the following::\n\n    og = OpenGraph\n    dc = DublinCore\n    meta = metadata\n    page = page elements\n\nDevelopers can specify a strategy as a comma-separated list of the above.\n\nThe only 2 page elements currently supported are::\n\n    \u003ctitle\u003eVALUE\u003c/title\u003e -\u003e metadata['page']['title']\n    \u003clink rel=\"canonical\" href=\"VALUE\"\u003e -\u003e metadata['page']['link']\n\n'metadata' elements are supported by ``name`` and ``property``.\n\nThe MetadataParser object also wraps some convenience functions, which can also\nbe used on their own, that are designed to turn alleged urls into well formed urls.\n\nFor example, you may pull a page::\n\n    http://www.example.com/path/to/file.html\n\nand that file indicates a canonical url which is simply \"/file.html\".\n\nThis package will try to 'remount' the 
canonical url to the absolute url of\n\"http://www.example.com/file.html\".\nIt will return None if the end result is not a valid url.\n\nThis all happens under-the-hood, and is honestly really useful when dealing\nwith indexers and spiders.\n\n\nURL Validation\n==============\n\n\"Real World\" URL validation is enabled by default.  This is not RFC compliant.\n\nThere are a few gaps in the RFCs that allow for \"odd behavior\".\nJust about any use-case for this package will desire/expect rules that parse\nURLs \"in the wild\", not theoretical ones.\n\nThe differences:\n\n* If an entirely numeric ip address is encountered, it is assumed to be a\n  dot-notation IPv4 and it is checked to have the right number of valid octets.\n\n  The default behavior is to invalidate these hosts::\n\n        http://256.256.256.256\n        http://999.999.999.999.999\n\n  According to RFCs those are valid hostnames that would fail as \"IP Addresses\"\n  but pass as \"Domain Names\".  However in the real world, one would never\n  encounter domain names like those.\n\n* The only non-domain hostname that is allowed is \"localhost\".\n\n  The default behavior is to invalidate these hosts::\n\n        http://example\n        http://examplecom\n\n  Those are considered to be valid hosts, and might exist on a local network or\n  custom hosts file.  However, they are not part of the public internet.\n\nAlthough this behavior breaks RFCs, it greatly reduces the number of\n\"False Positives\" generated when analyzing internet pages. 
If you want to\ninclude bad data, you can submit a kwarg to ``MetadataParser.__init__``.\n\n\nHandling Bad URLs and Encoded URIs\n==================================\n\nThis library tries to safeguard against a few common situations.\n\nEncoded URIs and relative urls\n------------------------------\n\nMost website publishers will define an image as a URL::\n\n    \u003cmeta property=\"og:image\" content=\"http://example.com/image.jpg\" /\u003e\n\nSome will define an image as an encoded URI::\n\n    \u003cmeta property=\"og:image\" content=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNM+Q8AAc0BZX6f84gAAAAASUVORK5CYII=\" /\u003e\n\nBy default, the ``get_metadata_link()`` method can be used to ensure a valid link\nis extracted from the metadata payload::\n\n    \u003e\u003e\u003e import metadata_parser\n    \u003e\u003e\u003e page = metadata_parser.MetadataParser(url=\"http://www.example.com\")\n    \u003e\u003e\u003e print(page.get_metadata_link('image'))\n\nThis method accepts a kwarg ``allow_encoded_uri`` (default False) which, when\nTrue, returns the encoded URI without further processing::\n\n    \u003e\u003e\u003e print(page.get_metadata_link('image', allow_encoded_uri=True))\n\nSimilarly, if a url is local::\n\n    \u003cmeta property=\"og:image\" content=\"/image.jpg\" /\u003e\n\nThe ``get_metadata_link`` method will automatically upgrade it onto the domain::\n\n    \u003e\u003e\u003e print(page.get_metadata_link('image'))\n    http://example.com/image.jpg\n\nPoorly Constructed Canonical URLs\n---------------------------------\n\nMany website publishers implement canonical URLs incorrectly.  
This package\ntries to fix that.\n\nBy default ``MetadataParser`` is constructed with ``require_public_netloc=True``\nand ``allow_localhosts=True``.\n\nThis will require somewhat valid 'public' network locations in the url.\n\nFor example, these will all be valid URLs::\n\n    http://example.com\n    http://1.2.3.4\n    http://localhost\n    http://127.0.0.1\n    http://0.0.0.0\n\nIf these known 'localhost' urls are not wanted, they can be filtered out with\n``allow_localhosts=False``::\n\n    http://localhost\n    http://127.0.0.1\n    http://0.0.0.0\n\nThere are two convenience methods that can be used to get a canonical url or\ncalculate the effective url:\n\n* MetadataParser.get_discrete_url\n* MetadataParser.get_metadata_link\n\nThese both accept an argument ``require_public_global``, which defaults to ``True``.\n\nAssuming we have the following content on the url ``http://example.com/path/to/foo``::\n\n    \u003clink rel=\"canonical\" href=\"http://localhost:8000/alt-path/to/foo\"\u003e\n\nBy default, versions 0.9.0 and later will detect 'localhost:8000' as an\nimproper canonical url, and remount the local part \"/alt-path/to/foo\" onto the\ndomain that served the file.  In the vast majority of cases where this 'behavior'\nhas been encountered, the remounted url is the intended canonical::\n\n    \u003e\u003e\u003e print(page.get_discrete_url())\n    http://example.com/alt-path/to/foo\n\nIn contrast, versions 0.8.3 and earlier will not catch this situation::\n\n    \u003e\u003e\u003e print(page.get_discrete_url())\n    http://localhost:8000/alt-path/to/foo\n\nIn order to preserve the earlier behavior, just submit ``require_public_global=False``::\n\n    \u003e\u003e\u003e print(page.get_discrete_url(require_public_global=False))\n    http://localhost:8000/alt-path/to/foo\n\n\nHandling Bad Data\n=================\n\nMany CMS systems (and developers) create malformed content or incorrect\ndocument identifiers.  
When this happens, the BeautifulSoup parser will lose\ndata or move it into an unexpected place.\n\nThere are two arguments that can help you analyze this data:\n\n* force_doctype::\n\n    ``MetadataParser(..., force_doctype=True, ...)``\n\n``force_doctype=True`` will try to replace the identified doctype with \"html\"\nvia regex.  This will often make the input data usable by BS4.\n\n* search_head_only::\n\n    ``MetadataParser(..., search_head_only=False, ...)``\n\n``search_head_only=False`` will not limit the search path to the \"\u003chead\u003e\" element.\nThis will have a slight performance hit and will incorporate data from CMS/User\ncontent, not just templates/Site-Operators.\n\n\nWARNING\n=============\n\n1.0 will be a complete API overhaul.  Pin your releases to avoid sadness.\n\n\nVersion 0.9.19 Breaking Changes\n===============================\n\nIssue #12 exposed some flaws in the existing package.\n\n1. ``MetadataParser.get_metadatas`` replaces ``MetadataParser.get_metadata``\n----------------------------------------------------------------------------\n\nUntil version 0.9.19, the recommended way to get metadata was to use\n``get_metadata``, which returns either a string or None.\n\nStarting with version 0.9.19, the recommended way to get metadata is to use\n``get_metadatas``, which always returns a list (or None).\n\nThis change was made because the library incorrectly stored a single metadata\nkey value when there were duplicates.\n\n2. The ``ParsedResult`` payload stores mixed content and tracks its version\n----------------------------------------------------------------------------\n\nMany users (including the maintainer) archive the parsed metadata. After\ntesting a variety of payloads with an all-list format and a mixed format\n(string or list), a mixed format had a much smaller payload size with a\nnegligible performance hit. 
A new ``_v`` attribute tracks the payload version.\nIn the future, payloads without a ``_v`` attribute will be interpreted as the\npre-versioning format.\n\n3. ``DublinCore`` payloads might be a dict\n------------------------------------------\n\nTests were added to handle dublincore data. An extra attribute may be needed to\nproperly represent the payload, so always returning a dict with at least a\nname+content (and possibly ``lang`` or ``scheme``) is the best approach.\n\n\n\nUsage\n=====\n\nAs of version ``0.9.19``, the recommended way to get metadata is to use\n``get_metadatas``, which returns a list (or None):\n\n**From a URL**::\n\n    \u003e\u003e\u003e import metadata_parser\n    \u003e\u003e\u003e page = metadata_parser.MetadataParser(url=\"http://www.example.com\")\n    \u003e\u003e\u003e print(page.metadata)\n    \u003e\u003e\u003e print(page.get_metadatas('title'))\n    \u003e\u003e\u003e print(page.get_metadatas('title', strategy=['og',]))\n    \u003e\u003e\u003e print(page.get_metadatas('title', strategy=['page', 'og', 'dc',]))\n\n**From HTML**::\n\n    \u003e\u003e\u003e HTML = \"\"\"\u003chere\u003e\"\"\"\n    \u003e\u003e\u003e page = metadata_parser.MetadataParser(html=HTML)\n    \u003e\u003e\u003e print(page.metadata)\n    \u003e\u003e\u003e print(page.get_metadatas('title'))\n    \u003e\u003e\u003e print(page.get_metadatas('title', strategy=['og',]))\n    \u003e\u003e\u003e print(page.get_metadatas('title', strategy=['page', 'og', 'dc',]))\n\n\nMalformed Data\n==============\n\nIt is very common to find malformed data. As of version ``0.9.20`` the following\nmethods should be used to allow malformed presentation::\n\n    \u003e\u003e\u003e page = metadata_parser.MetadataParser(html=HTML, support_malformed=True)\n\nor::\n\n    \u003e\u003e\u003e parsed = page.parse(html=HTML, support_malformed=True)\n    \u003e\u003e\u003e parsed = page.parse(html=HTML, support_malformed=False)\n\nThe above options will support parsing common malformed data.  
Currently\nthis only looks at alternate (improper) ways of producing twitter tags, but may\nbe expanded.\n\nNotes\n=====\n\nWhen building on Python 3, a ``static`` top-level directory may be needed.\n\nThis library was originally based on Erik River's\n`opengraph module \u003chttps://github.com/erikriver/opengraph\u003e`_. Something more\naggressive than Erik's module was needed, so this project was started.","funding_links":[],"categories":["[](#table-of-contents) Table of contents"],"sub_categories":["[](#websites-files-metadata-analyze-and-files-downloads)Website's files metadata analyze and files downloads"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjvanasco%2Fmetadata_parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjvanasco%2Fmetadata_parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjvanasco%2Fmetadata_parser/lists"}