{"id":17237784,"url":"https://github.com/tarao/perl5-html-extractcontent","last_synced_at":"2026-01-19T05:33:15.684Z","repository":{"id":28435082,"uuid":"31950129","full_name":"tarao/perl5-HTML-ExtractContent","owner":"tarao","description":null,"archived":false,"fork":false,"pushed_at":"2015-11-30T08:31:02.000Z","size":46,"stargazers_count":1,"open_issues_count":4,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-12T13:54:27.912Z","etag":null,"topics":["perl"],"latest_commit_sha":null,"homepage":null,"language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tarao.png","metadata":{"files":{"readme":"README.md","changelog":"Changes","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-10T09:57:54.000Z","updated_at":"2020-07-28T19:09:56.000Z","dependencies_parsed_at":"2022-08-26T14:10:38.252Z","dependency_job_id":null,"html_url":"https://github.com/tarao/perl5-HTML-ExtractContent","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarao%2Fperl5-HTML-ExtractContent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarao%2Fperl5-HTML-ExtractContent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarao%2Fperl5-HTML-ExtractContent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarao%2Fperl5-HTML-ExtractContent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tarao","download_url":"https://codeload.github.com/tarao/perl5-HTML-ExtractContent/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"htt
ps://github.com","kind":"github","repositories_count":247451656,"owners_count":20940944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["perl"],"created_at":"2024-10-15T05:43:44.698Z","updated_at":"2026-01-19T05:33:15.646Z","avatar_url":"https://github.com/tarao.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/tarao/perl5-HTML-ExtractContent.svg?branch=master)](https://travis-ci.org/tarao/perl5-HTML-ExtractContent)\n# NAME\n\nHTML::ExtractContent - An HTML content extractor with scoring heuristics\n\n# SYNOPSIS\n\n    use HTML::ExtractContent;\n    use LWP::UserAgent;\n\n    my $agent = LWP::UserAgent-\u003enew;\n    my $res = $agent-\u003eget('http://www.example.com/');\n\n    my $extractor = HTML::ExtractContent-\u003enew;\n    $extractor-\u003eextract($res-\u003edecoded_content);\n    print $extractor-\u003eas_text;\n\n# DESCRIPTION\n\nHTML::ExtractContent is a module for extracting content from HTML with scoring\nheuristics. It guesses which block of HTML looks like content according to\nscores depending on the amount of punctuation marks and the lengths of non-tag\ntexts. 
It also guesses whether the content ends in the block or continues into the\nnext block.\n\n# METHODS\n\n- new\n\n        $extractor = HTML::ExtractContent-\u003enew;\n\n    Creates a new HTML::ExtractContent instance.\n\n- extract\n\n        $extractor-\u003eextract($html);\n\n    Extracts content from `$html`.\n    `$html` must have its UTF-8 flag on.\n\n- as\\_text\n\n        $extractor-\u003eextract($html)-\u003eas_text;\n\n    Returns the extracted content as plain text. All tags are removed.\n\n- as\\_html\n\n        $extractor-\u003eextract($html)-\u003eas_html;\n\n    Returns the extracted content as HTML text.\n    Note that the returned text is neither fully tagged nor valid HTML.\n    It doesn't contain tags such as \u003chtml\u003e, and it may have block tags that are\n    not closed, or closed but not opened.\n    This method is intended for cases where you need to, for example, analyse\n    link tags in the text.\n\n# ACKNOWLEDGMENT\n\nHiromichi Kishi contributed to the development of this module\nas a pair-programming partner.\n\nThe implementation of this module is based on the Ruby module ExtractContent by\nNakatani Shuyo.\n\n# AUTHOR\n\nINA Lintaro \u003ctarao at cpan.org\u003e\n\n# COPYRIGHT\n\nCopyright (C) 2008 INA Lintaro / Hatena. All rights reserved.\n\n## Copyright of the original implementation\n\nCopyright (c) 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.\n\n# LICENCE\n\nThis library is free software; you can redistribute it and/or modify it under\nthe same terms as Perl itself.\n\n# SEE ALSO\n\n[http://rubyforge.org/projects/extractcontent/](http://rubyforge.org/projects/extractcontent/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarao%2Fperl5-html-extractcontent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftarao%2Fperl5-html-extractcontent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarao%2Fperl5-html-extractcontent/lists"}