{"id":13704222,"url":"https://github.com/spekulatius/phpscraper","last_synced_at":"2025-05-15T12:06:32.889Z","repository":{"id":37767273,"uuid":"257710756","full_name":"spekulatius/PHPScraper","owner":"spekulatius","description":"A universal web-util for PHP.","archived":false,"fork":false,"pushed_at":"2024-04-09T15:48:23.000Z","size":6842,"stargazers_count":560,"open_issues_count":27,"forks_count":76,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-05-15T12:06:24.975Z","etag":null,"topics":["beautifulsoup","chromium","headless-chrome","php","php-crawler","php-scraper","php-spider","php-spiders","puppeteer","pyppeteer","scraper","scraping","scraping-websites","scrapy","web-scraper","web-scraping"],"latest_commit_sha":null,"homepage":"https://phpscraper.de","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spekulatius.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"spekulatius","custom":"https://phpscraper.de/misc/sponsors.html"}},"created_at":"2020-04-21T20:41:07.000Z","updated_at":"2025-04-26T15:06:44.000Z","dependencies_parsed_at":"2023-11-28T20:25:26.961Z","dependency_job_id":"327978b2-fd44-4329-90ef-6b316b1ba23a","html_url":"https://github.com/spekulatius/PHPScraper","commit_stats":{"total_commits":173,"total_committers":11,"mean_commits":"15.727272727272727","dds":0.4219653179190751,"last_synced_commit":"8a2bd12f19102b2f232f57c6f8161785d9925c2e"},"previous_names":[],"tags_count":36,"template":false,"template_full_name":null,"repository_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spekulatius%2FPHPScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spekulatius%2FPHPScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spekulatius%2FPHPScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spekulatius%2FPHPScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spekulatius","download_url":"https://codeload.github.com/spekulatius/PHPScraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254337613,"owners_count":22054253,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","chromium","headless-chrome","php","php-crawler","php-scraper","php-spider","php-spiders","puppeteer","pyppeteer","scraper","scraping","scraping-websites","scrapy","web-scraper","web-scraping"],"created_at":"2024-08-02T21:01:05.889Z","updated_at":"2025-05-15T12:06:27.878Z","avatar_url":"https://github.com/spekulatius.png","language":"PHP","funding_links":["https://github.com/sponsors/spekulatius","https://phpscraper.de/misc/sponsors.html","https://www.buymeacoffee.com/spekulatius"],"categories":["目录"],"sub_categories":["爬虫 Scraping"],"readme":"\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/spekulatius/PHPScraper\"\u003e\n    \u003cpicture style=\"width: 100%;\" alt=\"PHP Scraper: a web utility for PHP\"\u003e\n      \u003csource 
srcset=\"https://github.com/spekulatius/phpscraper-docs/blob/master/.vuepress/public/logo-dark.png\" media=\"(prefers-color-scheme:dark)\"\u003e\n      \u003cimg src=\"https://github.com/spekulatius/phpscraper-docs/blob/master/.vuepress/public/logo-light.png\" alt=\"PHP Scraper: a web utility for PHP\"\u003e\n    \u003c/picture\u003e\n  \u003c/a\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/spekulatius/PHPScraper/actions/workflows/test.yaml\"\u003e\n      \u003cimg src=\"https://github.com/spekulatius/PHPScraper/actions/workflows/test.yaml/badge.svg\" alt=\"Unit Tests\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://packagist.org/packages/spekulatius/PHPScraper\"\u003e\n      \u003cimg src=\"https://poser.pugx.org/spekulatius/PHPScraper/d/total.svg\" alt=\"Total Downloads\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://packagist.org/packages/spekulatius/PHPScraper\"\u003e\n      \u003cimg src=\"https://poser.pugx.org/spekulatius/PHPScraper/v/stable.svg\" alt=\"Latest Version\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://packagist.org/packages/spekulatius/PHPScraper\"\u003e\n      \u003cimg src=\"https://poser.pugx.org/spekulatius/PHPScraper/license.svg\" alt=\"License\"\u003e\n    \u003c/a\u003e\n  \u003c/p\u003e\n  \u003cp align=\"center\"\u003e\n    \u003cstrong\u003eFor full documentation, visit \u003ca href=\"https://phpscraper.de\"\u003ephpscraper.de\u003c/a\u003e\u003c/strong\u003e.\n  \u003c/p\u003e\n\u003c/p\u003e\n\nPHPScraper is a versatile web-utility for PHP. 
Its primary objective is to streamline the process of extracting information from websites, allowing you to focus on accomplishing tasks without getting caught up in the complexities of selectors, data structure preparation, and conversion.\n\nUnder the hood, it uses\n\n- [BrowserKit](https://symfony.com/doc/current/components/browser_kit.html) (formerly [Goutte](https://github.com/FriendsOfPHP/Goutte)) to access the web\n- [League/URI](https://github.com/thephpleague/uri) to process URLs\n- [donatello-za/rake-php-plus](https://github.com/donatello-za/rake-php-plus) to extract and analyze keywords\n\nSee [composer.json](https://github.com/spekulatius/PHPScraper/blob/master/composer.json) for more details.\n\n\n:timer_clock: PHPScraper in 5 Minutes explained\n-----------------------------------------------\n\nHere are a few impressions of the way the library works. More examples are on the [project website](https://phpscraper.de/examples/scrape-website-title.html).\n\n### Basics: Flexible Calling as an Attribute or Method\n\nAll scraping functionality can be accessed either as a function call or a property call. For example, the title can be accessed in two ways:\n\n```php\n// Prep\n$web = new \\Spekulatius\\PHPScraper\\PHPScraper;\n$web-\u003ego('https://google.com');\n\n// Returns \"Google\"\necho $web-\u003etitle;\n\n// Also returns \"Google\"\necho $web-\u003etitle();\n```\n\n### :battery: Batteries included: Meta data, Links, Images, Headings, Content, Keywords, ...\n\nMany common use cases are covered already. You can find prepared extractors for various HTML tags, including interesting attributes. You can filter and combine these to your needs. 
In some cases there is an option to get a simple or detailed version, here in the case of `linksWithDetails`:\n\n```PHP\n$web = new \\Spekulatius\\PHPScraper\\PHPScraper;\n\n// Contains:\n// \u003ca href=\"https://placekitten.com/456/500\" rel=\"ugc\"\u003e\n//   \u003cimg src=\"https://placekitten.com/456/400\"\u003e\n//   \u003cimg src=\"https://placekitten.com/456/300\"\u003e\n// \u003c/a\u003e\n$web-\u003ego('https://test-pages.phpscraper.de/links/image-urls.html');\n\n// Get the first link on the page and print the result\nprint_r($web-\u003elinksWithDetails[0]);\n// [\n//     'url' =\u003e 'https://placekitten.com/456/500',\n//     'protocol' =\u003e 'https',\n//     'text' =\u003e '',\n//     'title' =\u003e null,\n//     'target' =\u003e null,\n//     'rel' =\u003e 'ugc',\n//     'image' =\u003e [\n//         'https://placekitten.com/456/400',\n//         'https://placekitten.com/456/300'\n//     ],\n//     'isNofollow' =\u003e false,\n//     'isUGC' =\u003e true,\n//     'isSponsored' =\u003e false,\n//     'isMe' =\u003e false,\n//     'isNoopener' =\u003e false,\n//     'isNoreferrer' =\u003e false,\n// ]\n```\n\nIf there aren't any matching elements (here links) on the page, an empty array will be returned. If a method normally returns a string it might return `null`. Details such as `follow_redirects`, etc. 
are optional configuration parameters (see below).\n\nMost of the DOM should be covered using these methods:\n\n- several [meta-tags](https://phpscraper.de/examples/scrape-meta-tags.html) and other [`\u003chead\u003e`-information](https://phpscraper.de/examples/scrape-header-tags.html)\n- [Social-Media information](https://phpscraper.de/examples/scrape-social-media-meta-tags.html) like Twitter Card and Facebook Open Graph\n- Content: [Headings](https://phpscraper.de/examples/headings.html), [Outline](https://phpscraper.de/examples/outline.html), [Texts](https://phpscraper.de/examples/paragraphs.html) and [Lists](https://phpscraper.de/examples/lists.html)\n- [Images](https://phpscraper.de/examples/scrape-images.html)\n- [Links](https://phpscraper.de/examples/scrape-links.html)\n- [Keywords](https://phpscraper.de/examples/extract-keywords.html)\n\n **A full list of methods with example code can be found on [phpscraper.de](https://phpscraper.de). Further examples are in the [tests](https://github.com/spekulatius/PHPScraper/tree/master/tests).**\n\n\n### Download Files\n\nBesides processing the content on the page itself, you can download files using `fetchAsset`:\n\n```php\n// Absolute URL\n$csvString = $web-\u003efetchAsset('https://test-pages.phpscraper.de/test.csv');\n\n// Relative URL after navigation\n$csvString = $web\n  -\u003ego('https://test-pages.phpscraper.de/meta/lorem-ipsum.html')\n  -\u003efetchAsset('/test.csv');\n```\n\nYou will only need to write the content into a file or cloud storage.\n\n\n### Process the RSS feeds, `sitemap.xml`, etc.\n\nPHPScraper can assist in collecting feeds such as [RSS feeds, `sitemap.xml`-entries and static search indexes](https://phpscraper.de/examples/scrape-feeds.html). 
This can be useful when deciding on the next page to crawl or building up a list of pages on a website.\n\nHere we are processing the sitemap into a set of [`FeedEntry`-DTOs](https://github.com/spekulatius/PHPScraper/blob/master/src/DataTransferObjects/FeedEntry.php):\n\n```php\n(new \\Spekulatius\\PHPScraper\\PHPScraper)\n    -\u003ego('https://phpscraper.de')\n    -\u003esitemap\n\n// array(131) {\n//   [0]=\u003e\n//   object(Spekulatius\\PHPScraper\\DataTransferObjects\\FeedEntry)#165 (3) {\n//     [\"title\"]=\u003e\n//     string(0) \"\"\n//     [\"description\"]=\u003e\n//     string(0) \"\"\n//     [\"link\"]=\u003e\n//     string(22) \"https://phpscraper.de/\"\n//   }\n//   [1]=\u003e\n// ...\n```\n\nWhenever post-processing is applied, you can fall back to the underlying `*Raw`-methods.\n\n\n### Process CSV-, XML- and JSON files and URLs\n\nPHPScraper comes out of the box with file / URL processing methods for CSV, XML, and JSON:\n\n- `parseJson`\n- `parseXml`\n- `parseCsv`\n- `parseCsvWithHeader` (generates an associative array using the first row)\n\nEach method can process both strings and URLs:\n\n```php\n// Parse JSON into array:\n$json = $web-\u003eparseJson('[{\"title\": \"PHP Scraper: a web utility for PHP\", \"url\": \"https://phpscraper.de\"}]');\n// [\n//     'title' =\u003e 'PHP Scraper: a web utility for PHP',\n//     'url' =\u003e 'https://phpscraper.de'\n// ]\n\n// Fetch and parse CSV into a simple array:\n$csv = $web-\u003eparseCsv('https://test-pages.phpscraper.de/test.csv');\n// [\n//     ['date', 'value'],\n//     ['1945-02-06', 4.20],\n//     ['1952-03-11', 42],\n// ]\n\n// Fetch and parse CSV with first row as header into an associative 
array structure:\n$csv = $web-\u003eparseCsvWithHeader('https://test-pages.phpscraper.de/test.csv');\n// [\n//     ['date' =\u003e '1945-02-06', 'value' =\u003e 4.20],\n//     ['date' =\u003e '1952-03-11', 'value' =\u003e 42],\n// ]\n```\n\nAdditional CSV parsing parameters such as separator, enclosure, and escape character are supported.\n\n\n### There is more!\n\nThere are plenty of examples on the [PHPScraper website](https://phpscraper.de) and in the [tests](https://github.com/spekulatius/PHPScraper/tree/master/tests).\n\nCheck the [`playground.php`](https://github.com/spekulatius/PHPScraper/blob/master/playground.php) if you prefer learning by doing. You can get it up and running with:\n\n```bash\n$ git clone git@github.com:spekulatius/PHPScraper.git \u0026\u0026 cd PHPScraper \u0026\u0026 composer update\n```\n\n:muscle: Roadmap\n----------------\n\nFuture development is organized into [milestones](https://github.com/spekulatius/PHPScraper/milestones?direction=asc\u0026sort=title). Releases follow [semver](https://semver.org/).\n\n### v1: [Building the first stable version](https://github.com/spekulatius/PHPScraper/milestone/4?closed=1)\n\n- Improve documentation and examples.\n- Organize code better (move websites into separate repos, etc.)\n- Add support for feeds and some typical file types.\n\n### v2: Service Upgrade\n\n- Switch from Goutte to [Symfony BrowserKit](https://symfony.com/doc/current/components/browser_kit.html). 
Goutte has been archived.\n\n### v3: [Expand the functionality and cover more 'types'](https://github.com/spekulatius/PHPScraper/milestone/5)\n\n- Expand to parse a wider range of types, elements, embeds, etc.\n- Improve performance with caching and concurrent fetching of assets\n- Minor improvements for parsing methods\n\n### v4: [Expand to provide more guidance on building custom scrapers on top of PHPScraper](https://github.com/spekulatius/PHPScraper/milestone/6)\n\nTBC.\n\n\n:heart_eyes: Sponsors\n---------------------\n\nPHPScraper is sponsored by:\n\n\u003ca href=\"https://bringyourownideas.com\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\u003cimg src=\"https://bringyourownideas.com/images/byoi-logo.jpg\" height=\"100px\"\u003e\u003c/a\u003e\n\nWith your support, PHPScraper can become the *PHP Swiss Army knife for the web*. If you find PHPScraper useful to your work, please consider a [sponsorship](https://github.com/sponsors/spekulatius) or [donation](https://www.buymeacoffee.com/spekulatius). Thank you :muscle:\n\n\n:gear: Configuration (optional)\n-------------------------------\n\nIf needed, you can use the following configuration options:\n\n### User Agent\n\nYou can set the user agent using `setConfig`:\n\n```php\n$web-\u003esetConfig([\n  'agent' =\u003e 'Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'\n]);\n```\n\nIt defaults to `Mozilla/5.0 (compatible; PHP Scraper/1.x; +https://phpscraper.de)`.\n\n### Proxy Support\n\nYou can configure proxy support with `setConfig`:\n\n```php\n$web-\u003esetConfig(['proxy' =\u003e 'http://user:password@127.0.0.1:3128']);\n```\n\n### Timeout\n\nYou can set the `timeout` using `setConfig`:\n\n```php\n$web-\u003esetConfig(['timeout' =\u003e 15]);\n```\n\nSetting the timeout to zero will disable it.\n\n### Disabling SSL\n\nWhile not recommended, it might be necessary to disable SSL checks. 
You can do so using:\n\n```php\n$web-\u003esetConfig(['disable_ssl' =\u003e true]);\n```\n\nYou can call `setConfig` multiple times. It stores the config and merges it with previous settings. Keep this in mind in the unlikely case that you need to unset a value.\n\n\n:rocket: Installation with Composer\n-----------------------------------\n\n```bash\ncomposer require spekulatius/phpscraper\n```\n\nAfter the installation, the package will be picked up by the Composer autoloader. If you are using a common PHP application or framework such as Laravel or Symfony, you can start scraping now :rocket:\n\nIf not, or if you are building a standalone scraper, please include the autoloader in `vendor/` at the top of your file:\n\n```php\n\u003c?php\n\nrequire __DIR__ . '/vendor/autoload.php';\n\n// ...\n```\n\nYou can now use any of the examples on the documentation website or from the [`tests/`-folder](https://github.com/spekulatius/PHPScraper/tree/master/tests).\n\nPlease consider supporting PHPScraper with a star or [sponsorship](https://github.com/sponsors/spekulatius):\n\n```bash\ncomposer thanks\n```\n\nThank you :muscle:\n\n\n:white_check_mark: Testing\n--------------------------\n\nThe library comes with a PHPUnit test suite. To run the tests, run the following command from the project folder:\n\n```bash\ncomposer test\n```\n\nYou can find the tests [here](https://github.com/spekulatius/PHPScraper/tree/master/tests). 
The test pages are [publicly available](https://github.com/spekulatius/phpscraper-test-pages).\n\n## MISC: [Issues](https://github.com/spekulatius/PHPScraper/issues), [Ideas](https://github.com/spekulatius/PHPScraper/milestones), [Contributing](https://github.com/spekulatius/PHPScraper/blob/master/CONTRIBUTING.md), [CHANGELOG](https://github.com/spekulatius/PHPScraper/blob/master/CHANGELOG.md), [UPGRADING](https://github.com/spekulatius/PHPScraper/blob/master/UPGRADING.md), [LICENSE](https://github.com/spekulatius/PHPScraper/blob/master/LICENSE.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspekulatius%2Fphpscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspekulatius%2Fphpscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspekulatius%2Fphpscraper/lists"}