{"id":13584680,"url":"https://github.com/j0k3r/graby","last_synced_at":"2025-05-14T13:07:22.375Z","repository":{"id":31417001,"uuid":"34980383","full_name":"j0k3r/graby","owner":"j0k3r","description":"Graby helps you extract article content from web pages","archived":false,"fork":false,"pushed_at":"2025-04-13T16:39:18.000Z","size":2902,"stargazers_count":374,"open_issues_count":47,"forks_count":74,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-05-12T20:18:53.051Z","etag":null,"topics":["composer","content","extract-website","hacktoberfest","php","readability","text-rss"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/j0k3r.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"j0k3r"}},"created_at":"2015-05-03T09:21:49.000Z","updated_at":"2025-04-19T09:16:23.000Z","dependencies_parsed_at":"2024-12-05T08:04:21.581Z","dependency_job_id":"f46dbbfc-6159-4859-a256-c1e0669ba55c","html_url":"https://github.com/j0k3r/graby","commit_stats":{"total_commits":532,"total_committers":21,"mean_commits":"25.333333333333332","dds":"0.35902255639097747","last_synced_commit":"1281bf3d7045d2f2682d1af6ba3715e492184e9a"},"previous_names":[],"tags_count":93,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j0k3r%2Fgraby","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j0k3r%2Fgraby/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j0k3r%2Fgr
aby/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/j0k3r%2Fgraby/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/j0k3r","download_url":"https://codeload.github.com/j0k3r/graby/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254149958,"owners_count":22022851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["composer","content","extract-website","hacktoberfest","php","readability","text-rss"],"created_at":"2024-08-01T15:04:26.611Z","updated_at":"2025-05-14T13:07:17.357Z","avatar_url":"https://github.com/j0k3r.png","language":"PHP","funding_links":["https://github.com/sponsors/j0k3r"],"categories":["PHP"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg width=\"400\" height=\"144\" src=\"https://user-images.githubusercontent.com/62333/67490348-5dfc5280-f673-11e9-9b3d-584e6cbeb9e2.png\" alt=\"Graby logo\" /\u003e\n    \u003cbr\u003e\n    \u003cbr\u003e\n    \u003cbr\u003e\n    \u003cbr\u003e\n\u003c/div\u003e\n\n[![Join the chat at https://gitter.im/j0k3r/graby](https://badges.gitter.im/j0k3r/graby.svg)](https://gitter.im/j0k3r/graby?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n![CI](https://github.com/j0k3r/graby/workflows/CI/badge.svg)\n[![Coverage Status](https://coveralls.io/repos/j0k3r/graby/badge.svg?branch=master\u0026service=github)](https://coveralls.io/github/j0k3r/graby?branch=master)\n[![Total 
Downloads](https://img.shields.io/packagist/dt/j0k3r/graby.svg)](https://packagist.org/packages/j0k3r/graby)\n[![License](https://poser.pugx.org/j0k3r/graby/license)](https://packagist.org/packages/j0k3r/graby)\n\nGraby helps you extract article content from web pages\n\n- it's based on [php-readability](https://github.com/j0k3r/php-readability)\n- it uses [site_config](http://help.fivefilters.org/customer/portal/articles/223153-site-patterns) to extract content from websites\n- it's a fork of Full-Text RSS v3.3 from [@fivefilters](http://fivefilters.org/)\n\n## Why this fork ?\n\nFull-Text RSS works great as a standalone application. But when you need to encapsulate it in your own library it's a mess. You need this kind of ugly thing:\n\n```php\n$article = 'http://www.bbc.com/news/entertainment-arts-32547474';\n$request = 'http://example.org/full-text-rss/makefulltextfeed.php?format=json\u0026url='.urlencode($article);\n$result  = @file_get_contents($request);\n```\n\nAlso, if you want to understand how things work internally, it's really hard to read and understand. And finally, there are **no tests** at all.\n\nThat's why I made this fork:\n\n1. Easiest way to integrate it (using composer)\n2. Fully tested\n3. (hopefully) better to understand\n4. A bit more decoupled\n\n## How to use it\n\n\u003e **Note**\n\u003e These instructions are for development version of Graby, which has an API incompatible with the stable version. Please check out the [README in the `2.x` branch](https://github.com/j0k3r/graby/blob/2.x/README.md#how-to-use-it) for usage instructions for the stable version.\n\n### Requirements\n\n- PHP \u003e= 7.4\n- [Tidy](https://github.com/htacg/tidy-html5) \u0026 cURL extensions enabled\n\n### Installation\n\nAdd the lib using [Composer](https://getcomposer.org/):\n\n    composer require 'j0k3r/graby dev-master' php-http/guzzle7-adapter\n\nWhy `php-http/guzzle7-adapter`? 
Because Graby is decoupled from any HTTP client implementation, thanks to [HTTPlug](http://httplug.io/) (see [that list of client implementation](https://packagist.org/providers/php-http/client-implementation)).\n\nGraby is tested \u0026 should work great with:\n\n- Guzzle 7 (using `php-http/guzzle7-adapter`)\n- cURL (using `php-http/curl-client`)\n\nNote: if you want to use Guzzle 5 or 6, use Graby 2 (support has dropped in v3 because of dependencies conflicts)\n\n### Retrieve content from an url\n\nUse the class to retrieve content:\n\n```php\nuse Graby\\Graby;\n\n$article = 'http://www.bbc.com/news/entertainment-arts-32547474';\n\n$graby = new Graby();\n$result = $graby-\u003efetchContent($article);\n\nvar_dump($result-\u003egetResponse()-\u003egetStatus()); // 200\nvar_dump($result-\u003egetHtml()); // \"[Fetched and readable content…]\"\nvar_dump($result-\u003egetTitle()); // \"Ben E King: R\u0026B legend dies at 76\"\nvar_dump($result-\u003egetLanguage()); // \"en-GB\"\nvar_dump($result-\u003egetDate()); // \"2015-05-01T16:24:37+01:00\"\nvar_dump($result-\u003egetAuthors()); // [\"BBC News\"]\nvar_dump((string) $result-\u003egetResponse()-\u003egetEffectiveUri()); // \"http://www.bbc.com/news/entertainment-arts-32547474\"\nvar_dump($result-\u003egetImage()); // \"https://ichef-1.bbci.co.uk/news/720/media/images/82709000/jpg/_82709878_146366806.jpg\"\nvar_dump($result-\u003egetSummary()); // \"Ben E King received an award from the Songwriters Hall of Fame in \u0026hellip;\"\nvar_dump($result-\u003egetIsNativeAd()); // false\nvar_dump($result-\u003egetResponse()-\u003egetHeaders()); /*\n[\n  'server' =\u003e ['Apache'],\n  'content-type' =\u003e ['text/html; charset=utf-8'],\n  'x-news-data-centre' =\u003e ['cwwtf'],\n  'content-language' =\u003e ['en'],\n  'x-pal-host' =\u003e ['pal074.back.live.cwwtf.local:80'],\n  'x-news-cache-id' =\u003e ['13648'],\n  'content-length' =\u003e ['157341'],\n  'date' =\u003e ['Sat, 29 Apr 2017 07:35:39 GMT'],\n  'connection' 
=\u003e ['keep-alive'],\n  'cache-control' =\u003e ['private, max-age=60, stale-while-revalidate'],\n  'x-cache-action' =\u003e ['MISS'],\n  'x-cache-age' =\u003e ['0'],\n  'x-lb-nocache' =\u003e ['true'],\n  'vary' =\u003e ['X-CDN,X-BBC-Edge-Cache,Accept-Encoding'],\n]\n*/\n```\n\nIn case of error when fetching the url, graby won't throw an exception but will return information about the error (at least the status code):\n\n```php\nvar_dump($result-\u003egetResponse()-\u003egetStatus()); // 200\nvar_dump($result-\u003egetHtml()); // \"[unable to retrieve full-text content]\"\nvar_dump($result-\u003egetTitle()); // \"BBC - 404: Not Found\"\nvar_dump($result-\u003egetLanguage()); // \"en-GB\"\nvar_dump($result-\u003egetDate()); // null\nvar_dump($result-\u003egetAuthors()); // []\nvar_dump((string) $result-\u003egetResponse()-\u003egetEffectiveUri()); // \"http://www.bbc.co.uk/404\"\nvar_dump($result-\u003egetImage()); // null\nvar_dump($result-\u003egetSummary()); // \"[unable to retrieve full-text content]\"\nvar_dump($result-\u003egetIsNativeAd()); // false\nvar_dump($result-\u003egetResponse()-\u003egetHeaders()); // […]\n```\n\nThe `date` result is the same as displayed in the content. 
If `date` is not `null` in the result, we recommend parsing it using [`date_parse`](http://php.net/date_parse) (this is what we use to validate that the date is correct).\n\n### Retrieve content from a prefetched page\n\nIf you want to extract content from a page you fetched outside of Graby, you can call `setContentAsPrefetched()` before calling `fetchContent()`, e.g.:\n\n```php\nuse Graby\\Graby;\n\n$article = 'http://www.bbc.com/news/entertainment-arts-32547474';\n\n$input = '\u003chtml\u003e[...]\u003c/html\u003e';\n\n$graby = new Graby();\n$graby-\u003esetContentAsPrefetched($input);\n$result = $graby-\u003efetchContent($article);\n```\n\n### Cleanup content\n\nSince version 1.9.0, you can also send HTML content to be cleaned up in the same way Graby cleans content retrieved from a URL. The URL is still needed to convert links to absolute ones, etc.\n\n```php\nuse Graby\\Graby;\n\n$article = 'http://www.bbc.com/news/entertainment-arts-32547474';\n// use your own way to retrieve html or to provide html\n$html = ...\n\n$graby = new Graby();\n$result = $graby-\u003ecleanupHtml($html, $article);\n```\n\n### Use custom handler \u0026 formatter to see output log\n\nYou can use them to display the Graby output log to the end user.\nThey are meant to be used in a Symfony project using Monolog.\n\nDefine the graby handler service (somewhere in a `service.yml`):\n\n```yaml\nservices:\n    # ...\n    graby.log_handler:\n        class: Graby\\Monolog\\Handler\\GrabyHandler\n```\n\nThen define the Monolog handler in your `app/config/config.yml`:\n\n```yaml\nmonolog:\n    handlers:\n        graby:\n            type: service\n            id: graby.log_handler\n            # use \"debug\" to get a lot of data (like HTML at each step) otherwise \"info\" is fine\n            level: debug\n            channels: ['graby']\n```\n\nYou can then retrieve logs from graby in your controller using:\n\n```php\n$logs = 
$this-\u003eget('monolog.handler.graby')-\u003egetRecords();\n```\n\n### Timeout configuration\n\nIf you need to define a timeout, you must create the `Http\\Client\\HttpClient` manually,\nconfigure it and inject it to `Graby\\Graby`.\n\n- For Guzzle 7:\n\n    ```php\n    use Graby\\Graby;\n    use GuzzleHttp\\Client as GuzzleClient;\n    use Http\\Adapter\\Guzzle7\\Client as GuzzleAdapter;\n\n    $guzzle = new GuzzleClient([\n        'timeout' =\u003e 2,\n    ]);\n    $graby = new Graby([], new GuzzleAdapter($guzzle));\n    ```\n\n\n## Full configuration\n\nThis is the full documented configuration and also the default one.\n\n```php\n$graby = new Graby([\n    // Enable or disable debugging.\n    // This will only generate log information in a file (log/graby.log)\n    'debug' =\u003e false,\n    // use 'debug' value if you want more data (HTML at each step for example) to be dumped in a different file (log/html.log)\n    'log_level' =\u003e 'info',\n    // If enabled relative URLs found in the extracted content are automatically rewritten as absolute URLs.\n    'rewrite_relative_urls' =\u003e true,\n    // If enabled, we will try to follow single page links (e.g. 
print view) on multi-page articles.\n    // Currently this only happens for sites where single_page_link has been defined\n    // in a site config file.\n    'singlepage' =\u003e true,\n    // If enabled, we will try to follow next page links on multi-page articles.\n    // Currently this only happens for sites where next_page_link has been defined\n    // in a site config file.\n    'multipage' =\u003e true,\n    // Error message when content extraction fails\n    'error_message' =\u003e '[unable to retrieve full-text content]',\n    // Default title used when no title can be extracted\n    'error_message_title' =\u003e 'No title found',\n    // List of URLs (or parts of a URL) which will be accepted.\n    // If the list is empty, all URLs (except those specified in the blocked list below)\n    // will be permitted.\n    // Example: array('example.com', 'anothersite.org');\n    'allowed_urls' =\u003e [],\n    // List of URLs (or parts of a URL) which will not be accepted.\n    // Note: this list is ignored if allowed_urls is not empty\n    'blocked_urls' =\u003e [],\n    // If enabled, we'll pass retrieved HTML content through htmLawed with\n    // safe flag on and style attributes denied, see\n    // http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/htmLawed_README.htm#s3.6\n    // Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.\n    'xss_filter' =\u003e true,\n    // Here you can define different actions based on the Content-Type header returned by the server.\n    // MIME type as key, action as value.\n    // Valid actions:\n    // * 'exclude' - exclude this item from the result\n    // * 'link' - create HTML link to the item\n    'content_type_exc' =\u003e [\n       'application/zip' =\u003e ['action' =\u003e 'link', 'name' =\u003e 'ZIP'],\n       'application/pdf' =\u003e ['action' =\u003e 'link', 'name' =\u003e 'PDF'],\n       'image' =\u003e ['action' =\u003e 'link', 'name' =\u003e 
'Image'],\n       'audio' =\u003e ['action' =\u003e 'link', 'name' =\u003e 'Audio'],\n       'video' =\u003e ['action' =\u003e 'link', 'name' =\u003e 'Video'],\n       'text/plain' =\u003e ['action' =\u003e 'link', 'name' =\u003e 'Plain text'],\n    ],\n    // How we handle link in content\n    // Valid values :\n    // * preserve: nothing is done\n    // * footnotes: convert links as footnotes\n    // * remove: remove all links\n    'content_links' =\u003e 'preserve',\n    'http_client' =\u003e [\n        // User-Agent used to fetch content\n        'ua_browser' =\u003e 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2',\n        // default referer when fetching content\n        'default_referer' =\u003e 'http://www.google.co.uk/url?sa=t\u0026source=web\u0026cd=1',\n        // Currently allows simple string replace of URLs.\n        // Useful for rewriting certain URLs to point to a single page or HTML view.\n        // Although using the single_page_link site config instruction is the preferred way to do this, sometimes, as\n        // with Google Docs URLs, it's not possible.\n        'rewrite_url' =\u003e [\n            'docs.google.com' =\u003e ['/Doc?' 
=\u003e '/View?'],\n            'tnr.com' =\u003e ['tnr.com/article/' =\u003e 'tnr.com/print/article/'],\n            '.m.wikipedia.org' =\u003e ['.m.wikipedia.org' =\u003e '.wikipedia.org'],\n            'm.vanityfair.com' =\u003e ['m.vanityfair.com' =\u003e 'www.vanityfair.com'],\n        ],\n        // Prevent certain file/mime types\n        // HTTP responses which match these content types will\n        // be returned without body.\n        'header_only_types' =\u003e [\n           'image',\n           'audio',\n           'video',\n        ],\n        // URLs ending with one of these extensions will\n        // prompt Humble HTTP Agent to send a HEAD request first\n        // to see if returned content type matches $headerOnlyTypes.\n        'header_only_clues' =\u003e ['mp3', 'zip', 'exe', 'gif', 'gzip', 'gz', 'jpeg', 'jpg', 'mpg', 'mpeg', 'png', 'ppt', 'mov'],\n        // User Agent strings - mapping domain names\n        'user_agents' =\u003e [],\n        // AJAX triggers to search for.\n        // for AJAX sites, e.g. Blogger with its dynamic views templates.\n        'ajax_triggers' =\u003e [\n            \"\u003cmeta name='fragment' content='!'\",\n            '\u003cmeta name=\"fragment\" content=\"!\"',\n            \"\u003cmeta content='!' 
name='fragment'\",\n            '\u003cmeta content=\"!\" name=\"fragment\"',\n        ],\n        // number of redirection allowed until we assume request won't be complete\n        'max_redirect' =\u003e 10,\n    ],\n    'extractor' =\u003e [\n        'default_parser' =\u003e 'libxml',\n        // key is fingerprint (fragment to find in HTML)\n        // value is host name to use for site config lookup if fingerprint matches\n        // \\s* match anything INCLUDING new lines\n        'fingerprints' =\u003e [\n            '/\\\u003cmeta\\s*content=([\\'\"])blogger([\\'\"])\\s*name=([\\'\"])generator([\\'\"])/i' =\u003e 'fingerprint.blogspot.com',\n            '/\\\u003cmeta\\s*name=([\\'\"])generator([\\'\"])\\s*content=([\\'\"])Blogger([\\'\"])/i' =\u003e 'fingerprint.blogspot.com',\n            '/\\\u003cmeta\\s*name=([\\'\"])generator([\\'\"])\\s*content=([\\'\"])WordPress/i' =\u003e 'fingerprint.wordpress.com',\n        ],\n        'config_builder' =\u003e [\n            // Directory path to the site config folder WITHOUT trailing slash\n            'site_config' =\u003e [],\n            'hostname_regex' =\u003e '/^(([a-zA-Z0-9-]*[a-zA-Z0-9])\\.)*([A-Za-z0-9-]*[A-Za-z0-9])$/',\n        ],\n        'readability' =\u003e [\n            // filters might be like array('regex' =\u003e 'replace with')\n            // for example, to remove script content: array('!\u003cscript[^\u003e]*\u003e(.*?)\u003c/script\u003e!is' =\u003e '')\n            'pre_filters' =\u003e [],\n            'post_filters' =\u003e [],\n        ],\n        'src_lazy_load_attributes' =\u003e [\n            'data-src',\n            'data-lazy-src',\n            'data-original',\n            'data-sources',\n            'data-hi-res-src',\n        ],\n        // these JSON-LD types will be ignored\n        'json_ld_ignore_types' =\u003e ['Organization', 'WebSite', 'Person', 'VideoGame'],\n    ],\n]);\n```\n\n## Credits\n\n- [FiveFilters](https://github.com/fivefilters) for 
[Full-Text-RSS](https://fivefilters.org/content-only/)\n- [Caneco](https://twitter.com/caneco) for the awesome logo ✨\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj0k3r%2Fgraby","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fj0k3r%2Fgraby","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj0k3r%2Fgraby/lists"}