{"id":26940263,"url":"https://github.com/panakour/pkscraper","last_synced_at":"2026-02-19T06:32:17.608Z","repository":{"id":57035338,"uuid":"191734123","full_name":"panakour/pkscraper","owner":"panakour","description":"Extract structured data from the web","archived":false,"fork":false,"pushed_at":"2023-12-18T12:39:42.000Z","size":68,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-31T02:57:56.081Z","etag":null,"topics":["crawler","crawling","scraper","scraping","scraping-websites","webcrawler"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/panakour.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-06-13T09:37:35.000Z","updated_at":"2022-01-30T12:37:38.000Z","dependencies_parsed_at":"2025-07-18T01:56:44.850Z","dependency_job_id":"214ce6ed-7ec6-4c28-b3aa-ed8848b3cd4d","html_url":"https://github.com/panakour/pkscraper","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/panakour/pkscraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/panakour%2Fpkscraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/panakour%2Fpkscraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/panakour%2Fpkscraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/panakour%2Fpkscraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/panakour","download_url":"https://codeload.github.com/panakour/pkscraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/panakour%2Fpkscraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29604790,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T05:11:50.834Z","status":"ssl_error","status_checked_at":"2026-02-19T05:11:38.921Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling","scraper","scraping","scraping-websites","webcrawler"],"created_at":"2025-04-02T15:17:25.824Z","updated_at":"2026-02-19T06:32:17.591Z","avatar_url":"https://github.com/panakour.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![ci](https://github.com/panakour/pkscraper/actions/workflows/ci.yml/badge.svg)](https://github.com/panakour/pkscraper/actions/workflows/ci.yml)\n![Code Coverage Badge](https://raw.githubusercontent.com/panakour/pkscraper/image-data/coverage.svg)\n\n ## Installation\n `composer require panakour/pkscraper`\n\n## Examples\n\n### Create http client with proxy and headers\n```php\n$httpClient = new \\Pkscraper\\Http\\GuzzleClient();\n$httpClient-\u003esetProxy('socks5://172.17.0.1:9050', 'socks5://172.17.0.1:9050');\n$httpClient-\u003esetHeaders(['User-Agent' =\u003e 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36']);\n$httpClient-\u003enewClient();\n```\n\n### Get a text from single url\n```php\n\n$resp = $httpClient-\u003edoGetRequest(\"https://example.com/\");\n$con = new Text(\"img\", new SymfonyDomCrawler($resp-\u003egetBody()-\u003egetContents()), \"//meta[@property='og:image']/@content\");\n$con-\u003ebuild();\n\\Pkscraper\\ToolBox::debugResult($con-\u003egetExtractedValue());\n```\n\n### Concurrent requests and group multiple fields\n```php\n$urls = UrlExtractor::extract($httpClient, 'https://www.example.com/feed', \"//item/link\");\n\n$pool = $httpClient-\u003econcurrentRequests($urls);\nforeach ($pool as $index =\u003e $response) {\n    if ($response instanceof \\GuzzleHttp\\Exception\\RequestException) {\n        dd('something went wrong');\n    }\n    $domCrawler = new SymfonyDomCrawler($response-\u003egetBody()-\u003egetContents());\n    $bags[$index] = new Bag($urls[$index]);\n    $titleItem = new Text('title', $domCrawler, \"//article/div[@class='box']/h2/a\");\n    $featuredImage = new Text('featuredImage', $domCrawler, '//meta[@property=\"og:image\"]/@content');\n    $htmlContentItem = new SafeHtml('mainContent', $domCrawler, \"//article/div[@class='box']\");\n    $storeTitles = new TextArray('tags', $domCrawler, \"//div[@class='box']/div[@class='cp-admin-row']//a[@rel='tag']\");\n    $storeTitles-\u003esetRequired(false);\n    $bags[$index]-\u003esetItems($featuredImage, $titleItem, $htmlContentItem, $storeTitles);\n    $bags[$index]-\u003ebuild();\n}\nToolBox::debugResult($bags);\n```\n\n### More advanced example:\n```php\n\n    $pool = $httpClient-\u003econcurrentRequests($urls);\n    $bags = [];\n    foreach ($pool as $index =\u003e $response) {\n        try {\n            if ($response instanceof \\GuzzleHttp\\Exception\\RequestException) {\n                continue;\n            }\n            $domCrawler = new SymfonyDomCrawler($response-\u003egetBody()-\u003egetContents());\n            $bags[$index] = new Bag($urls[$index]);\n\n            $titleItem = new Text('title', $domCrawler, \"//div[@class='grayTopCnt topInfo ']/div[@class='row'][2]/div[@class='col col12']/div[@class='title']/h1\");\n            $featuredImage = new Text('featuredImage', $domCrawler, \"//div[@class='imgWrp']/div[@class='topImg mainVideo']/div[@class='item']/picture/img[@class='lazyload']/@data-src\");\n            $safeHtmlContent = new \\Pkscraper\\Items\\SafeHtml('contentTest', new SymfonyDomCrawler($resp-\u003egetBody()-\u003egetContents()), \"//div[@id='main-post']/div[@class='post']/div[@class='blog-standard']/div[@class='cntTxt']\");\n            $safeHtmlContent-\u003eaddTransformer(new \\Pkscraper\\Transform\\ImageRelativeSourceToAbsoluteTransformer($httpClient-\u003egetCurrentUrlWithoutPath()));\n            $safeHtmlContent-\u003eaddRemover(new \\Pkscraper\\Remove\\ElementByTagByIndexRemover('img', 0));\n            $safeHtmlContent-\u003eaddRemover(new \\Pkscraper\\Remove\\ElementByTagByIndexRemover('a', 0));\n            $safeHtmlContent-\u003eaddCleaner(new \\Pkscraper\\Clean\\TextCleaner('\u003cp\u003e                Loading...                \t\t\t\t\t\t\u003c/p\u003e', ''));\n            $safeHtmlContent-\u003eaddRemover(new \\Pkscraper\\Remove\\ElementsByTagRemover('footer'));\n            $safeHtmlContent-\u003eaddTransformer(new \\Pkscraper\\Transform\\ImageRelativeSourceToAbsoluteTransformer($httpClient-\u003egetCurrentUrlWithoutPath()));\n            $safeHtmlContent-\u003eaddCleaner(new \\Pkscraper\\Clean\\RegExCleaner('/\u003c\\\\/?a(\\\\s+.*?\u003e|\u003e(?1))/', ''));\n            $safeHtmlContent-\u003eaddCleaner(new \\Pkscraper\\Clean\\RegExCleaner('/\u003c\\\\/?img(\\\\s+.*?\u003e|\u003e)(?1)/', ''));\n            $safeHtmlContent-\u003eaddRemover(new \\Pkscraper\\Remove\\ElementByIdRemover('jp-post-flair'));\n            $safeHtmlContent-\u003eaddRemover(new \\Pkscraper\\Remove\\ElementByClassByIndexRemover('size-full', 0));\n            $safeHtmlContent-\u003eaddCleaner(new \\Pkscraper\\Clean\\TextCleaner('\";                        i.innerHTML=l};                      //]]\u0026gt;                    ', ''));\n            $safeHtmlContent-\u003eaddRemover(new \\Pkscraper\\Remove\\ElementsContainsClassRemover('post-'));\n            $safeHtmlContent-\u003eaddTransformer(new ImageRelativeSourceToAbsoluteTransformer($httpClient-\u003egetCurrentUrlWithoutPath()));\n            $domRunnerBeforePurify = function () {\n                foreach ($this-\u003egetAttributesValue('img', 'data-src') as $index =\u003e $imgLink) {\n                    $paths = \\Pkscraper\\ToolBox::getUrlPathComponents($imgLink);\n\n                    if (isset($paths[3]) \u0026\u0026 $paths[3] === \"YouTube\") { //this let me find which img element is used for youtube and fix them\n                        $youtubeId = substr($paths[4], 0, -4);\n\n                        $iframe = $this-\u003eDOMDocument-\u003ecreateElement('iframe');\n                        $iframe-\u003esetAttribute('src', \"https://www.youtube.com/embed/$youtubeId\");\n\n                        $elementToBeReplaced = $this-\u003egetNodeList('img')-\u003eitem($index);\n                        if ($elementToBeReplaced) {\n                            $this-\u003ereplaceElement($elementToBeReplaced, $iframe);\n                        }\n                    }\n                }\n                foreach ($this-\u003egetAttributesValue('img', 'data-src') as $index =\u003e $imgLink) { //the rest is not a youtube but only image\n                    $this-\u003ereplaceImagesAttributes(\"\", $imgLink);\n                }\n\n            };\n\n\n            $htmlContent = new SafeHtml('mainContent', $domCrawler,\n                \"//div[@class='main withShare']/div[@class='content details']/div[@class='cntTxt']\", [\n                    'h1',\n                    'h2',\n                    'h3',\n                    'h4',\n                    'h5',\n                    'h6',\n                    'div',\n                    'a',\n                    'em',\n                    'strong',\n                    'b',\n                    'cite',\n                    'blockquote',\n                    'ul',\n                    'ol',\n                    'li',\n                    'dl',\n                    'dt',\n                    'dd',\n                    'img',\n                    'br',\n                    'p',\n                    'center',\n                    'span',\n                    'table',\n                    'thead',\n                    'tbody',\n                    'td',\n                    'th',\n                    'tr',\n                    'sub',\n                    'sup',\n                ], $domRunnerBeforePurify);\n\n            $bags[$index]-\u003esetItems($featuredImage, $titleItem, $htmlContent);\n            $bags[$index]-\u003ebuild();\n        } catch (\\Exception $e) {\n            print 'ok';\n        }\n    }\n    echo(json_encode(Collector::collect($bags), JSON_UNESCAPED_UNICODE));\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpanakour%2Fpkscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpanakour%2Fpkscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpanakour%2Fpkscraper/lists"}