{"id":33990364,"url":"https://github.com/gidlov/copycat","last_synced_at":"2025-12-13T06:11:01.550Z","repository":{"id":8444289,"uuid":"10036426","full_name":"gidlov/copycat","owner":"gidlov","description":"A PHP Scraping Class","archived":false,"fork":false,"pushed_at":"2017-09-03T08:26:10.000Z","size":76,"stargazers_count":73,"open_issues_count":0,"forks_count":13,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-10-30T10:47:06.262Z","etag":null,"topics":["copycat","imdb","laravel","php","regular-expression","scraper","scraping"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gidlov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-05-13T16:52:12.000Z","updated_at":"2024-08-03T21:12:03.000Z","dependencies_parsed_at":"2022-08-23T16:40:47.104Z","dependency_job_id":null,"html_url":"https://github.com/gidlov/copycat","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/gidlov/copycat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gidlov%2Fcopycat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gidlov%2Fcopycat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gidlov%2Fcopycat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gidlov%2Fcopycat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gidlov","download_url":"https://codeload.github.com/gidlov/copycat/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/gidlov%2Fcopycat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27701425,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-13T02:00:09.769Z","response_time":147,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["copycat","imdb","laravel","php","regular-expression","scraper","scraping"],"created_at":"2025-12-13T06:11:00.863Z","updated_at":"2025-12-13T06:11:01.542Z","avatar_url":"https://github.com/gidlov.png","language":"PHP","readme":"Copycat - A PHP Scraping Class\r\n=====================\r\n[![Latest Stable Version](https://poser.pugx.org/gidlov/copycat/v/stable.svg)](https://github.com/gidlov/copycat/releases)\r\n[![Total Downloads][ico-downloads]][link-packagist]\r\n[![Monthly Downloads][ico-m-downloads]][link-packagist]\r\n[![Reference Status][ico-references]][link-references]\r\n[![Software License][ico-license]](LICENSE.txt)\r\n\r\nYou may find more info on [gidlov.com/en/code/copycat](https://gidlov.com/en/code/copycat)\r\n\r\n### For Laravel 5/4 Developers\r\n\r\nIn the `require` key of `composer.json` file add the following:\r\n\r\n```\r\n\"gidlov/copycat\": \"1.*\"\r\n```\r\n\r\nRun the Composer `update` command.\r\n\r\n#### For Laravel 5 Developers\r\n\r\nAdd to `providers` in 
`config/app.php`.\r\n\r\n```\r\nGidlov\\Copycat\\CopycatServiceProvider::class,\r\n```\r\n\r\nand to `aliases` in the same file.\r\n\r\n```\r\n'Copycat' =\u003e Gidlov\\Copycat\\Copycat::class,\r\n```\r\n\r\n#### For Laravel 4 Developers\r\n\r\nAdd to `providers` in `app/config/app.php`.\r\n\r\n```\r\n'Gidlov\\Copycat\\CopycatServiceProvider',\r\n```\r\n\r\nand to `aliases` in the same file.\r\n\r\n```\r\n'Copycat' =\u003e 'Gidlov\\Copycat\\Copycat',\r\n```\r\n\r\n## Yet another scraping class\r\nI didn’t do much research before I wrote this class, so there is probably something similar out there, and certainly a more polished solution. _A Python version of this class is under development_.\r\n\r\nStill, I needed a class that could pick out selected pieces of a web page with regular expressions and show or save them. I also needed to be able to save files and/or pictures, and to specify a new file name or extend the existing one.\r\n\r\nIt is also possible to use a search engine to look up an address to extract data from. 
This assumes you have entered an expression for that particular page.\r\n\r\n\r\n## Briefly\r\n\r\n - Uses regular expressions; match one or all.\r\n - Can download and save files with custom file names.\r\n - Can search through one page or several tens of thousands of pages in sequence.\r\n - Can use search engines to find the right page.\r\n - Can apply callback functions to all items.\r\n\r\n## How to use this class\r\n\r\nInclude the class and instantiate your object with some custom [cURL parameters](http://php.net/manual/en/function.curl-setopt.php), if you like.\r\n```php\r\nrequire_once('copycat.php');\r\n$cc = new Copycat;\r\n$cc-\u003esetCURL(array(\r\n  CURLOPT_RETURNTRANSFER =\u003e 1,\r\n  CURLOPT_CONNECTTIMEOUT =\u003e 5,\r\n  CURLOPT_HTTPHEADER =\u003e array(\"Content-Type: text/html; charset=iso-8859-1\"),\r\n  CURLOPT_USERAGENT =\u003e 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',\r\n));\r\n```\r\n\r\n**These examples use [IMDb](http://imdb.com/) as the target source.**\r\n\r\nSay we want to retrieve the score of a particular film. For simplicity, assume we happen to know the address of this very film, [Donnie Darko](http://www.imdb.com/title/tt0246578/). This is what the code could look like.\r\n\r\n```php\r\n$cc-\u003ematch(array(\r\n    'score' =\u003e '/itemprop=\"ratingValue\"\u003e(.*?)\u003c/ms',))\r\n  -\u003eURLs('http://imdb.com/title/tt0246578/')\r\n```\r\n\r\nThat’s basically everything. We specify what has to be matched and a name for it, and we enter an address. 
Our answer array will look as follows:\r\n\r\n```\r\nArray (\r\n  [0] =\u003e Array (\r\n    [score] =\u003e 8.1\r\n  )\r\n)\r\n```\r\n\r\nIf we were to give the method `URLs()` an associative array instead of a string, such as `array('Donnie Darko' =\u003e 'http://imdb.com/title/tt0246578/')`, the answer would be:\r\n\r\n```\r\nArray (\r\n  [Donnie Darko] =\u003e Array (\r\n    [score] =\u003e 8.1\r\n  )\r\n)\r\n```\r\n\r\nAlso note that I’m using **method chaining**. It is supported, but using it is a matter of taste.\r\n\r\nIt’s unlikely that we know or can guess IMDb’s choice of URL for a particular movie, so we’ll use Bing to look it up when we don’t know it *(Google tends to interrupt the sequence after an unknown number of queries, so I chose Bing)*.\r\n\r\n```php\r\n$cc-\u003ematch(array(\r\n    'score' =\u003e '/itemprop=\"ratingValue\"\u003e(.*?)\u003c/ms',))\r\n  -\u003efillURLs(array(\r\n    'query' =\u003e 'http://www.bing.com/search?q=',\r\n    'regex' =\u003e '/\u003ca href=\"(http:\\/\\/www.imdb.com\\/title\\/tt.*?\\/)\".*?\u003e.*?\u003c\\/a\u003e/ms',\r\n    'to' =\u003e 'match',\r\n    'keywords' =\u003e array(\r\n      'imdb+donnie+darko',)))\r\n```\r\n\r\nNow we have introduced `fillURLs()`, which takes a search query, a regular expression to match our destination page, and keywords that represent the search. The result is the same as in the first example.\r\n\r\nLet’s fetch more about this film: title, description, score and votes, release year, starring actors, and of course the cover image, which we save. The original file name of the image is something like 
`MV5BMTczMzE4Nzk3N15BMl5BanBnXkFtZTcwNDg5Mjc4NA@@._V1_SX214_.jpg`, so we rename it to the title instead.\r\n\r\n```php\r\n$cc-\u003ematch(array(\r\n    'title' =\u003e '/\u003ctitle\u003e(.*?)\\(.*?\u003c\\/title\u003e/ms',\r\n    'description' =\u003e '/itemprop=\"description\"\u003e(.*?)\u003c/ms',\r\n    'score' =\u003e '/itemprop=\"ratingValue\"\u003e(.*?)\u003c/ms',\r\n    'votes' =\u003e '/itemprop=\"ratingCount\"\u003e(.*?)\u003c/ms',\r\n    'year' =\u003e '/class=\"nobr\"\u003e.*?\u003e(.*?)\u003c/ms',\r\n    'file' =\u003e array(\r\n      'key' =\u003e 'title',\r\n      'directory' =\u003e 'poster',\r\n      'after_key' =\u003e '.jpg',\r\n      'regex' =\u003e '/img_primary\"\u003e.*?src=\"(.*?)\".*?\u003c\\/td\u003e/ms',)))\r\n  -\u003ematchAll(array(\r\n    'actors' =\u003e '/itemprop=\"actor.*?itemprop=\"name\"\u003e(.*?)\u003c/ms',))\r\n  -\u003efillURLs(array(\r\n    'query' =\u003e 'http://www.bing.com/search?q=',\r\n    'regex' =\u003e '/\u003ca href=\"(http:\\/\\/www.imdb.com\\/title\\/tt.*?\\/)\".*?\u003e.*?\u003c\\/a\u003e/ms',\r\n    'to' =\u003e 'match',\r\n    'keywords' =\u003e array(\r\n      'imdb+donnie+darko',\r\n      'imdb+stay')))\r\n```\r\n\r\nThe result of such an operation would be:\r\n\r\n```\r\nArray (\r\n  [0] =\u003e Array (\r\n    [title] =\u003e Donnie Darko\r\n    [description] =\u003e A troubled teenager is plagued by visions of a large bunny rabbit that manipulates him to commit a series of crimes, after narrowly escaping a bizarre accident.\r\n    [score] =\u003e 8.1\r\n    [votes] =\u003e 363,099\r\n    [year] =\u003e 2001\r\n    [Donnie Darko.jpg] =\u003e http://ia.media-imdb.com/images/M/MV5BMTczMzE4Nzk3N15BMl5BanBnXkFtZTcwNDg5Mjc4NA@@._V1_SX214_.jpg\r\n    [actors] =\u003e Array (\r\n      [0] =\u003e Jake Gyllenhaal\r\n      [1] =\u003e Jake Gyllenhaal\r\n      [2] =\u003e Holmes Osborne\r\n      [3] =\u003e Maggie Gyllenhaal\r\n      [4] =\u003e Daveigh Chase\r\n      [5] =\u003e Mary McDonnell\r\n      [6] =\u003e James Duval\r\n      [7] =\u003e Arthur 
Taxier\r\n      [8] =\u003e Patrick Swayze\r\n      [9] =\u003e Mark Hoffman\r\n      [10] =\u003e David St. James\r\n      [11] =\u003e Tom Tangen\r\n      [12] =\u003e Jazzie Mahannah\r\n      [13] =\u003e Jolene Purdy\r\n      [14] =\u003e Stuart Stone\r\n      [15] =\u003e Gary Lundy\r\n    )\r\n  )\r\n  [1] =\u003e Array (\r\n    [title] =\u003e Stay\r\n    [description] =\u003e This movie focuses on the attempts of a psychiatrist to prevent one of his patients from committing suicide while trying to maintain his own grip on reality.\r\n    [score] =\u003e 6.7\r\n    [votes] =\u003e 43,222\r\n    [year] =\u003e 2005\r\n    [Stay.jpg] =\u003e http://ia.media-imdb.com/images/M/MV5BMTIzODM1NjE4N15BMl5BanBnXkFtZTcwNzY4NDE5MQ@@._V1_SY317_CR6,0,214,317_.jpg\r\n    [actors] =\u003e Array (\r\n      [0] =\u003e Ewan McGregor\r\n      [1] =\u003e Ewan McGregor\r\n      [2] =\u003e Ryan Gosling\r\n      [3] =\u003e Kate Burton\r\n      [4] =\u003e Naomi Watts\r\n      [5] =\u003e Elizabeth Reaser\r\n      [6] =\u003e Bob Hoskins\r\n      [7] =\u003e Janeane Garofalo\r\n      [8] =\u003e BD Wong\r\n      [9] =\u003e John Tormey\r\n      [10] =\u003e José Ramón Rosario\r\n      [11] =\u003e Becky Ann Baker\r\n      [12] =\u003e Lisa Kron\r\n      [13] =\u003e Gregory Mitchell\r\n      [14] =\u003e John Dominici\r\n      [15] =\u003e Jessica Hecht\r\n    )\r\n  )\r\n)\r\n```\r\n\r\nApply your callback functions to all values and view the results.\r\n\r\n```php\r\n  -\u003ecallback(array(\r\n    '_all_' =\u003e array('trim'),\r\n  ));\r\n\r\n$result = $cc-\u003eget();\r\n```\r\n\r\nTo apply functions to selected elements, use the element's key instead of (or alongside) `_all_`, like this:\r\n\r\n```php\r\n  -\u003ecallback(array(\r\n    '_all_' =\u003e array('trim'),\r\n    'title' =\u003e array(\r\n      function($string) {\r\n        return str_replace(' ', '_', $string);\r\n      },\r\n    ),\r\n    'actors' =\u003e array(\r\n      function($string) {\r\n        return 
$string.', ';\r\n      },\r\n    ),\r\n  ));\r\n```\r\n\r\nNote that it is fine to use **anonymous functions** too.\r\n\r\n## Drawbacks\r\n\r\nPHP itself is not well suited to long-running operations, since the process can be interrupted when the user closes the web page or when PHP's time limit is reached *(however, `set_time_limit(0)` is called in the constructor, so the time limit should not be a problem)*.\r\n\r\n## Requirements\r\n\r\n - PHP 5.3\r\n - cURL extension\r\n\r\n## License\r\n\r\nCopycat is released under [LGPL](http://www.gnu.org/licenses/lgpl-3.0-standalone.html).\r\n\r\n## Thanks\r\n\r\nIf this library is useful to you, say thanks by [buying me a coffee](https://www.paypal.me/gidlov) :coffee:!\r\n\r\n[ico-downloads]: https://poser.pugx.org/gidlov/copycat/downloads\r\n[ico-m-downloads]: https://poser.pugx.org/gidlov/copycat/d/monthly\r\n[ico-references]: https://www.versioneye.com/php/gidlov:copycat/reference_badge.svg?style=flat\r\n[ico-license]: https://poser.pugx.org/gidlov/copycat/license\r\n\r\n[link-packagist]: https://packagist.org/packages/gidlov/copycat\r\n[link-references]: https://www.versioneye.com/php/gidlov:copycat/references\r\n","funding_links":["https://www.paypal.me/gidlov"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgidlov%2Fcopycat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgidlov%2Fcopycat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgidlov%2Fcopycat/lists"}