{"id":15026680,"url":"https://github.com/crispy-computing-machine/phpcrawl","last_synced_at":"2025-10-03T23:32:39.810Z","repository":{"id":55460557,"uuid":"114872725","full_name":"crispy-computing-machine/phpcrawl","owner":"crispy-computing-machine","description":"PHPCrawl Web Crawler PHP 8","archived":true,"fork":true,"pushed_at":"2023-06-02T10:58:15.000Z","size":586,"stargazers_count":9,"open_issues_count":0,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-04T06:45:51.177Z","etag":null,"topics":["crawl","crawler","php","php74","sphider"],"latest_commit_sha":null,"homepage":"https://github.com/crispy-computing-machine/phpcrawl","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"mmerian/phpcrawl","license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crispy-computing-machine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-12-20T10:10:34.000Z","updated_at":"2023-06-02T11:00:56.000Z","dependencies_parsed_at":"2023-01-22T17:15:16.947Z","dependency_job_id":null,"html_url":"https://github.com/crispy-computing-machine/phpcrawl","commit_stats":null,"previous_names":[],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crispy-computing-machine%2Fphpcrawl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crispy-computing-machine%2Fphpcrawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crispy-computing-machine%2Fphpcrawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crispy-computing-machine%2Fphpcrawl/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/crispy-computing-machine","download_url":"https://codeload.github.com/crispy-computing-machine/phpcrawl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235204448,"owners_count":18952326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawl","crawler","php","php74","sphider"],"created_at":"2024-09-24T20:04:52.994Z","updated_at":"2025-10-03T23:32:34.426Z","avatar_url":"https://github.com/crispy-computing-machine.png","language":"PHP","readme":"\n## Now archived due to fundamental issues. Replaced by [SuperSimpleCrawler](https://github.com/crispy-computing-machine/SuperSimpleCrawler)\n\n\n# phpcrawl\n[![Latest Stable Version](https://poser.pugx.org/brittainmedia/phpcrawl/v/stable)](https://packagist.org/packages/brittainmedia/phpcrawl) [![Total Downloads](https://poser.pugx.org/brittainmedia/phpcrawl/downloads)](https://packagist.org/packages/brittainmedia/phpcrawl) [![License](https://poser.pugx.org/brittainmedia/phpcrawl/license)](https://packagist.org/packages/brittainmedia/phpcrawl)\n\n```sh\ncomposer require brittainmedia/phpcrawl\n```\n\n```php\nuse PHPCrawl\\Enums\\PHPCrawlerAbortReasons;\nuse PHPCrawl\\Enums\\PHPCrawlerMultiProcessModes;\nuse PHPCrawl\\Enums\\PHPCrawlerUrlCacheTypes;\nuse PHPCrawl\\PHPCrawler;\nuse PHPCrawl\\PHPCrawlerDocumentInfo;\n\n// New custom crawler\n$crawler = new class() extends PHPCrawler {\n\n    /**\n     * @param $PageInfo\n     * @return int\n     */\n    function handleDocumentInfo($PageInfo): int\n    {\n        // Print the URL of the 
document\n        echo \"URL: \" . $PageInfo-\u003eurl . PHP_EOL;\n\n        // Print the http-status-code\n        echo \"HTTP-statuscode: \" . $PageInfo-\u003ehttp_status_code . PHP_EOL;\n\n        // Print the number of found links in this document\n        echo \"Links found: \" . count($PageInfo-\u003elinks_found_url_descriptors) . PHP_EOL;\n\n        // ..\n\n        // continue crawling\n        return 1;\n    }\n};\n\n$crawler-\u003esetURL($url = 'https://bbc.co.uk/news');\n\n// Optional\n//$crawler-\u003esetProxy($proxy_host, $proxy_port, $proxy_username, $proxy_password);\n\n// Only receive content of files with content-type \"text/html\"\n$crawler-\u003eaddContentTypeReceiveRule('#text/html#');\n\n// Ignore links to ads...\n$advertFilterRule = \"/\\bads\\b|2o7|a1\\.yimg|ad(brite|click|farm|revolver|server|tech|vert)|at(dmt|wola)|banner|bizrate|blogads|bluestreak|burstnet|casalemedia|coremetrics|(double|fast)click|falkag|(feedster|right)media|googlesyndication|hitbox|httpads|imiclk|intellitxt|js\\.overture|kanoodle|kontera|mediaplex|nextag|pointroll|qksrv|speedera|statcounter|tribalfusion|webtrends/\";\n$crawler-\u003eaddURLFilterRule($advertFilterRule);\n\n// Store and send cookie-data like a browser does\n$crawler-\u003eenableCookieHandling(true);\n\n// Limits set, successfully retrieved only\n$crawler-\u003esetRequestLimit(1);\n\n/**\n * 3 - The crawler only follows links to pages or files located in or under the same path like the one of the root-url.\u003c/b\u003e\n * E.g. if the root-url is\n * \"http://www.foo.com/bar/index.html\",\n * the crawler will follow links to \"http://www.foo.com/bar/page.html\" and \"http://www.foo.com/bar/path/index.html\",\n * but not links to \"http://www.foo.com/page.html\".\n *\n */\n$crawler-\u003esetFollowMode(3);\n\n// Keep going until resolved\n$crawler-\u003esetFollowRedirectsTillContent(TRUE);\n\n// tmp directory\n$crawler-\u003esetWorkingDirectory(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 
'phpcrawl' .DIRECTORY_SEPARATOR);\n\n// Cache\n$crawler-\u003esetUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_MEMORY);\n\n// File crawling - Store to file or set limit for large files\n#$crawler-\u003eaddStreamToFileContentType('##');\n#$crawler-\u003esetContentSizeLimit(500000); // Google only crawls pages 500kb and below?\n\n//Decides whether the crawler should obey \"nofollow\"-tags, we will obey\n$crawler-\u003eobeyNoFollowTags(true);\n\n//Decides whether the crawler should obey robot.txt, we will not obey!\n$crawler-\u003eobeyRobotsTxt(false);\n\n// Delay to stop blocking\n$crawler-\u003esetRequestDelay(0.5);\n\n// fake browser or use fake robot one\n$crawler-\u003esetUserAgentString('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0');\n\n// Multiprocess (optional) - Forces PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE use, must have link priorities!\n$crawler-\u003eaddLinkPriority(\"/news/\", 10);\n$crawler-\u003eaddLinkPriority(\"/\\.jpeg/\", 5);\n$crawler-\u003egoMultiProcessed(PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);\n\n// Thats enough, now here we go\n$crawler-\u003ego();\n\n// At the end, after the process is finished, we print a short\n// report (see method getProcessReport() for more information)\n$report = $crawler-\u003egetProcessReport();\n\necho 'Finished crawling site: ' . $url . PHP_EOL;\necho 'Summary:' . PHP_EOL;\necho 'Links followed: ' . $report-\u003elinks_followed . PHP_EOL;\necho 'Documents received: ' . $report-\u003efiles_received . PHP_EOL;\necho 'Bytes received: ' . $report-\u003ebytes_received . ' bytes' . PHP_EOL;\necho 'Process runtime: ' . $report-\u003eprocess_runtime . ' sec' . PHP_EOL;\necho 'Process memory: ' . $report-\u003ememory_peak_usage . ' sec' . PHP_EOL;\necho 'Server connect time: ' . $report-\u003eavg_server_connect_time . ' sec' . PHP_EOL;\necho 'Server response time: ' . $report-\u003eavg_server_response_time . ' sec' . PHP_EOL;\necho 'Server transfer rate: ' . 
$report-\u003eavg_proc_data_transfer_rate . ' bytes' . PHP_EOL;\n\n$abortReason = $report-\u003eabort_reason;\nswitch ($abortReason) {\n    case PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH:\n        echo 'Crawling-process aborted because everything is done/passed through.' . PHP_EOL;\n        break;\n    case PHPCrawlerAbortReasons::ABORTREASON_TRAFFICLIMIT_REACHED:\n        echo 'Crawling-process aborted because the traffic limit set by user was reached.' . PHP_EOL;\n        break;\n    case PHPCrawlerAbortReasons::ABORTREASON_FILELIMIT_REACHED:\n        echo 'Crawling-process aborted because the file limit set by user was reached.' . PHP_EOL;\n        break;\n    case PHPCrawlerAbortReasons::ABORTREASON_USERABORT:\n        echo 'Crawling-process aborted because the handleDocumentInfo-method returned a negative value.' . PHP_EOL;\n        break;\n    default:\n        echo 'Unknown abort reason.' . PHP_EOL;\n        break;\n\n}\n```\n\nInitially just a copy of http://phpcrawl.cuab.de/ forked from [mmerian](https://github.com/mmerian/phpcrawl) for using with composer.\n\n *Due to the [main project](https://sourceforge.net/projects/phpcrawl/files/PHPCrawl/) now seemingly being abandoned (having no updates for 4 years) I am going to proceed to make any changes/fixes in this repository.*\n\n### Latest updates\n- 0.9 compatible PHP 7 Only.\n- 0.10 compatible PHP 8. ([Submit issues](https://github.com/crispy-computing-machine/phpcrawl/issues))\n- Introduced namespaces\n- Lots of bug fixes\n- Refactored various class sections\n\nNow archived...\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrispy-computing-machine%2Fphpcrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrispy-computing-machine%2Fphpcrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrispy-computing-machine%2Fphpcrawl/lists"}
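The `ABORTREASON_USERABORT` case above is triggered from your own `handleDocumentInfo()` implementation. A minimal sketch of that pattern (the counter field and the limit of 10 are illustrative, not part of the library):

```php
<?php

use PHPCrawl\PHPCrawler;

// Sketch: stop the crawl from user code after a fixed number of
// documents. Returning a negative value from handleDocumentInfo()
// aborts the process, and the final report's abort_reason will be
// PHPCrawlerAbortReasons::ABORTREASON_USERABORT.
$crawler = new class() extends PHPCrawler {

    private int $handled = 0; // illustrative counter, not a library field

    function handleDocumentInfo($PageInfo): int
    {
        $this->handled++;
        echo 'Handled: ' . $PageInfo->url . PHP_EOL;

        // Keep crawling until 10 documents have been handled, then abort
        return $this->handled < 10 ? 1 : -1;
    }
};
```

Compared with `setRequestLimit()`, aborting from `handleDocumentInfo()` lets you stop on any condition you can compute from the received documents, not just a fixed request count.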