{"id":15543663,"url":"https://github.com/michaeluno/php-simple-web-scraper","last_synced_at":"2025-04-23T17:28:24.031Z","repository":{"id":62528109,"uuid":"153796017","full_name":"michaeluno/php-simple-web-scraper","owner":"michaeluno","description":"A PHP application which runs on Heroku and dumps web site outputs including JavaScript generated contents.","archived":false,"fork":false,"pushed_at":"2021-06-22T14:28:31.000Z","size":1464,"stargazers_count":20,"open_issues_count":0,"forks_count":19,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-03T12:32:57.044Z","etag":null,"topics":["cross-domain","cross-domain-request","cross-domain-solution","cross-origin","cross-origin-resource-sharing","cross-site","cross-site-scripting","crowler","heroku","heroku-application","phantomjs","php","proxy","scraper","web-scraper"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michaeluno.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-19T14:30:11.000Z","updated_at":"2023-11-17T08:15:41.000Z","dependencies_parsed_at":"2022-11-02T14:16:57.662Z","dependency_job_id":null,"html_url":"https://github.com/michaeluno/php-simple-web-scraper","commit_stats":null,"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fphp-simple-web-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fphp-simple-web-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fphp
-simple-web-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaeluno%2Fphp-simple-web-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michaeluno","download_url":"https://codeload.github.com/michaeluno/php-simple-web-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250479781,"owners_count":21437431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-domain","cross-domain-request","cross-domain-solution","cross-origin","cross-origin-resource-sharing","cross-site","cross-site-scripting","crowler","heroku","heroku-application","phantomjs","php","proxy","scraper","web-scraper"],"created_at":"2024-10-02T12:27:54.913Z","updated_at":"2025-04-23T17:28:24.010Z","avatar_url":"https://github.com/michaeluno.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"PHP Simple Web Scraper\n==============================\nA PHP application for Heroku, which can dump web site outputs including JavaScript generated contents.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"_asset/image/screenshot.jpg\" width=\"600\" title=\"screenshot\"\u003e \n\u003c/p\u003e\n\nDemo\n----\n\nVisit [here](https://php-simple-web-scraper.herokuapp.com/). If the server is sleeping, it takes several seconds to wake up. 
\n\nUsage\n----\n\n### Basic Usage\nPerform an HTTP request with the `url` query parameter set to the URL-encoded target URL.\n\n```\nhttp(s)://{app-address}/?url={encoded target url}\n```\n\n#### Example\n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fgithub.com\n```\n\n### Parameters\n#### output\nDetermines the output type. Accepted values: `html`, `json`, `screenshot`.\n\n##### _html_ (default)\n\nHTML source code of the target web site. JavaScript-generated content is also retrieved and dumped.\n\n##### _json_\n\n`output=json`\n\nHTTP response data as JSON. Useful for cross-domain communication with JSONP.\n\n###### Example\n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fgithub.com\u0026output=json\n```\n\n##### _screenshot_\n\n`output=screenshot`\n\nA JPEG image of the site snapshot.\n\n###### Example\n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fgithub.com\u0026output=screenshot\n```\n\n#### file-type\n\nWhen `screenshot` is given for the `output` parameter, the output file type can be set with the `file-type` parameter. Default: `jpg`.\n\nIt accepts the following values: `pdf`, `png`, `jpg`, `jpeg`, `bmp`, `ppm`.\n\n#### width\nWhen `screenshot` is given for the `output` parameter, `width` sets the screenshot image width.\n\n#### height\nWhen `screenshot` is given for the `output` parameter, `height` sets the screenshot image height. Leave it unset to get the full height. The default minimum height is `720` pixels.\n\n##### Example\n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fgithub.com\u0026output=screenshot\u0026file-type=png\n```\n\n#### user-agent\nSets a custom user agent. By default, the user agent of the client accessing the app is used. This can be changed by specifying the value with this parameter.\n\nIf `random` is given, a user agent is assigned at random.
\n\n##### Example\nTo set a user agent, `Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100102 Firefox/57.0`, \n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fwww.whatismybrowser.com%2Fdetect%2Fwhat-http-headers-is-my-browser-sending\u0026user-agent=Mozilla/5.0%20(Windows%20NT%206.1;%20Win64;%20x64;%20rv:57.0)%20Gecko/20100102%20Firefox/57.0\n```\n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fwww.whatismybrowser.com%2Fdetect%2Fwhat-http-headers-is-my-browser-sending\u0026user-agent=random\n```\n\n#### load-images\nDecides whether to load images. By default, this is disabled for the `html` and `json` output types and enabled for the `screenshot` output type.\n\nAccepts a boolean value: `true`, `false`, `1`, or `0`.\n\n##### Example\n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fgithub.com\u0026load-images=true\n```\n\n#### output-encoding\nSets the encoding used for the output. Default: `utf8`\n\n#### cache-lifespan\nAll requests are cached for 20 minutes by default. This parameter determines how long, in seconds, the cache is retained. If you do not want a cached result or want to renew the cache, pass `0`. Default: `1200`.\n\n#### headers\nSets custom HTTP headers. Accepts the value as an array.\n\n##### Example\nTo set the `DNT` header, \n```\nhttp(s)://{app-address}/?url=https%3A%2F%2Fwww.whatismybrowser.com%2Fdetect%2Fwhat-http-headers-is-my-browser-sending\u0026headers[DNT]=1\n```\n\n#### method\nHTTP request method. Default: `GET`. Accepts the following:\n - OPTIONS\n - GET\n - HEAD\n - POST\n - PUT\n - DELETE\n - PATCH\n\nWhen using `POST`, pass the data to send with the `data` request key.\n
The program checks `$_REQUEST[ 'data' ]` to send POST data.\n\n##### Example\n```\nhttp(s)://{app-address}/?url=http%3A%2F%2Fhttpbin.org%2Fpost\u0026method=POST\u0026data[foo]=bar\n```\n\nRun as Heroku Application\n----\nThis is a Heroku application, meant to be deployed to a [Heroku](https://dashboard.heroku.com/) application instance.\n\n### Requirements\n- Heroku account\n- [Heroku CLI](https://devcenter.heroku.com/articles/heroku-command-line)\n- Git\n\n### Steps to Deploy\n\n#### a) Quick Deploy\nYou may simply use the following button to deploy this application: \n\n[![Deploy](https://www.herokucdn.com/deploy/button.png)](https://heroku.com/deploy)\n\n\n#### b) Manual Deploy\n1. Clone this repository to your local machine. Create a directory and from there, in a console window, type the following.\n```\ngit clone https://github.com/michaeluno/php-simple-web-scraper.git\n```\nThis will download the repository files.\n\n2. Change the working directory to the cloned one.\n```\ncd php-simple-web-scraper\n```\n\n3. Log in to Heroku from the Heroku CLI.\n```\nheroku login\n```\n\n4. Create a new Heroku app.\n```\nheroku create\n```\nThis prints something like the following, with a random app name. `glacial-basin-46381` is the app name in the example below.\n```\nhttps://glacial-basin-46381.herokuapp.com/ | https://git.heroku.com/glacial-basin-46381.git\n```\n\n5. Type the following. Replace `{heroku-app-name}` with the app name from the previous step.\n```\nheroku git:remote -a {heroku-app-name}\n```\n\n6. Upload the files to Heroku.\n```\ngit push heroku master\n```\n\n7. Open the app in your browser.\n```\nheroku open\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaeluno%2Fphp-simple-web-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichaeluno%2Fphp-simple-web-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaeluno%2Fphp-simple-web-scraper/lists"}