{"id":13724924,"url":"https://github.com/scotteh/php-goose","last_synced_at":"2025-10-05T21:31:36.009Z","repository":{"id":21115235,"uuid":"24415875","full_name":"scotteh/php-goose","owner":"scotteh","description":"Readability / Html Content / Article Extractor \u0026 Web Scrapping library written in PHP","archived":true,"fork":false,"pushed_at":"2023-09-05T13:40:43.000Z","size":343,"stargazers_count":459,"open_issues_count":0,"forks_count":121,"subscribers_count":21,"default_branch":"master","last_synced_at":"2024-09-26T08:39:04.991Z","etag":null,"topics":["article","article-extractor","autoloader","composer","php","php-goose","readability","scraper"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scotteh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-09-24T13:25:17.000Z","updated_at":"2024-07-29T01:37:18.000Z","dependencies_parsed_at":"2024-06-18T14:06:49.609Z","dependency_job_id":"e940dc9f-23ef-4503-a3d9-6f94f5be2a47","html_url":"https://github.com/scotteh/php-goose","commit_stats":{"total_commits":230,"total_committers":25,"mean_commits":9.2,"dds":"0.21304347826086956","last_synced_commit":"5a1c36d9550ef3943fb8adf57a42474db0fe6419"},"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scotteh%2Fphp-goose","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scotteh%2Fphp-goose/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/scotteh%2Fphp-goose/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scotteh%2Fphp-goose/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scotteh","download_url":"https://codeload.github.com/scotteh/php-goose/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219877064,"owners_count":16554821,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["article","article-extractor","autoloader","composer","php","php-goose","readability","scraper"],"created_at":"2024-08-03T01:02:06.794Z","updated_at":"2025-10-05T21:31:30.666Z","avatar_url":"https://github.com/scotteh.png","language":"PHP","funding_links":[],"categories":["PHP"],"sub_categories":[],"readme":"# PHP Goose - Article Extractor\n\n## Note\n\nThis repository has been archived as of 2023-09-05.\n\n## Intro\n\nPHP Goose is a port of [Goose](https://github.com/GravityLabs/goose/) originally developed in Java and converted to Scala by [GravityLabs](https://github.com/GravityLabs/). Portions have also been ported from the Python port [python-goose](https://github.com/grangier/python-goose). 
Its mission is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.\n\nThe goal is to get the cleanest possible extraction from the beginning of the article, for Flipboard/Pulse-type applications that need to show the first snippet of a web article along with an image.\n\nGoose will try to extract the following information:\n\n - Main text of an article\n - Main image of the article\n - Any YouTube/Vimeo movies embedded in the article\n - Meta Description\n - Meta tags\n - Publish Date\n\nThe PHP version was rewritten by:\n\n - Andrew Scott\n\n## Requirement\n\n - PHP 7.1 or later\n - PSR-4 compatible autoloader\n \nThe older 0.x versions with PHP 5.5+ support are still available under [releases](https://github.com/scotteh/php-goose/releases).\n\n## Install\n\nThis library is designed to be installed via [Composer](https://getcomposer.org/doc/).\n\nAdd the dependency to your project's composer.json.\n```\n{\n  \"require\": {\n    \"scotteh/php-goose\": \"^1.0\"\n  }\n}\n```\n\nDownload composer.phar.\n``` bash\ncurl -sS https://getcomposer.org/installer | php\n```\n\nInstall the library.\n``` bash\nphp composer.phar install\n```\n\n## Autoloading\n\nThis library requires an autoloader. If you aren't already using one, you can include [Composer's autoloader](https://getcomposer.org/doc/01-basic-usage.md#autoloading).\n\n``` php\nrequire('vendor/autoload.php');\n```\n\n## Usage\n\n``` php\nuse \\Goose\\Client as GooseClient;\n\n$goose = new GooseClient();\n$article = $goose-\u003eextractContent('http://url.to/article');\n\n$title = $article-\u003egetTitle();\n$metaDescription = $article-\u003egetMetaDescription();\n$metaKeywords = $article-\u003egetMetaKeywords();\n$canonicalLink = $article-\u003egetCanonicalLink();\n$domain = $article-\u003egetDomain();\n$tags = $article-\u003egetTags();\n$links = $article-\u003egetLinks();\n$videos = 
$article-\u003egetVideos();\n$articleText = $article-\u003egetCleanedArticleText();\n$entities = $article-\u003egetPopularWords();\n$image = $article-\u003egetTopImage();\n$allImages = $article-\u003egetAllImages();\n```\n\n## Configuration\n\nAll config options are optional. The values shown below are the defaults (fallbacks).\n\n``` php\nuse \\Goose\\Client as GooseClient;\n\n$goose = new GooseClient([\n    // Language - Selects common word dictionary\n    //   Supported languages (ISO 639-1):\n    //     ar, cs, da, de, en, es, fi, fr, hu, id, it, ja,\n    //     ko, nb, nl, no, pl, pt, ru, sv, vi, zh\n    'language' =\u003e 'en',\n    // Minimum image size (bytes)\n    'image_min_bytes' =\u003e 4500,\n    // Maximum image size (bytes)\n    'image_max_bytes' =\u003e 5242880,\n    // Minimum image width (pixels)\n    'image_min_width' =\u003e 120,\n    // Minimum image height (pixels)\n    'image_min_height' =\u003e 120,\n    // Fetch best image\n    'image_fetch_best' =\u003e true,\n    // Fetch all images\n    'image_fetch_all' =\u003e false,\n    // Guzzle configuration - All values are passed directly to Guzzle\n    //   See: http://guzzle.readthedocs.io/en/stable/request-options.html\n    'browser' =\u003e [\n        'timeout' =\u003e 60,\n        'connect_timeout' =\u003e 30\n    ]\n]);\n```\n\n## Licensing\n\nPHP Goose is licensed by Gravity.com under the Apache 2.0 license. See the LICENSE file for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscotteh%2Fphp-goose","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscotteh%2Fphp-goose","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscotteh%2Fphp-goose/lists"}