{"id":13850727,"url":"https://github.com/epiqueras/getsy","last_synced_at":"2025-04-13T08:26:48.782Z","repository":{"id":57250679,"uuid":"88805660","full_name":"epiqueras/getsy","owner":"epiqueras","description":"A simple browser/client-side web scraper.","archived":false,"fork":false,"pushed_at":"2017-04-24T21:47:39.000Z","size":130,"stargazers_count":241,"open_issues_count":2,"forks_count":15,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-10-29T11:32:46.704Z","etag":null,"topics":["browser","client-side","scraper","web-scraper"],"latest_commit_sha":null,"homepage":"http://www.getgetsy.com","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epiqueras.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-20T01:08:13.000Z","updated_at":"2024-07-28T05:26:00.000Z","dependencies_parsed_at":"2022-08-24T16:52:03.045Z","dependency_job_id":null,"html_url":"https://github.com/epiqueras/getsy","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epiqueras%2Fgetsy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epiqueras%2Fgetsy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epiqueras%2Fgetsy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epiqueras%2Fgetsy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epiqueras","download_url":"https://codeload.github.com/epiqueras/getsy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://gith
ub.com","kind":"github","repositories_count":248683011,"owners_count":21144862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["browser","client-side","scraper","web-scraper"],"created_at":"2024-08-04T20:01:28.490Z","updated_at":"2025-04-13T08:26:48.764Z","avatar_url":"https://github.com/epiqueras.png","language":"TypeScript","funding_links":[],"categories":["TypeScript"],"sub_categories":[],"readme":"# Getsy\n\u003e A simple browser/client-side web scraper.\n\u003e Try it out in a REPL:\n[http://www.getgetsy.com](http://www.getgetsy.com)\n\n\u003e\u003e TODOS:\n\u003e\u003e + [x] Support for websites with infinite scroll.\n\u003e\u003e + [ ] Support for websites with click pagination.\n\n\u003cbr /\u003e\n\n## Installation options:\n+ Run `npm install --save getsy` or `yarn add getsy`\n+ Download the [UMD](https://github.com/epiqueras/getsy/releases/download/v0.9.1/getsy.js) build and link it using a script tag\n\n\u003cbr /\u003e\n\n## How to use:\nThis library exposes a single function:\n`getsy(url: string, optionsObject?: options): Promise\u003cGetsy\u003e`\n\n**parameters:**\n+ `url`: The URL of the website you wish to scrape.\n\n+ `optionsObject`*(optional)*:\n\n  + `corsProxy`*(optional string)*: The endpoint of the CORS proxy you wish to use. *(See the CorsProxy section for more info)*\n\n  + `resolveURLs`*(optional boolean)*: Whether getsy should resolve all relative URLs in the resource to absolute URLs so they don't fail when they load in another page. 
*(defaults to true)*\n\n  + `iframe`*(optional)*: A boolean, or an object with width and height properties, indicating whether getsy should start in iframe mode. Iframe mode waits for the resource to be mounted in a hidden iframe so you can extract more data through pagination or infinite scrolling. *(defaults to false)*\n\n\nThe function returns a promise that resolves to a Getsy object on success and rejects if it was unable to load the requested page.\n\nGetsy objects have a method `getMe` for scraping the resource's contents. This method is just a wrapper over the jQuery function so you can chain other jQuery methods on it. If you need to use the raw data you can access its `content` property. *(More on Getsy below)*\n\n\n### Example (Promises):\n\n```js\nimport getsy from 'getsy'\n\ngetsy('https://en.wikipedia.org/wiki/\"Hello,_World!\"_program').then(myGetsy =\u003e {\n  console.log(myGetsy.getMe('#firstHeading').text())\n})\n```\n\n\n### Example (Async/Await):\n\n```js\nimport getsy from 'getsy'\n\nasync function testing() {\n  const myGetsy = await getsy('https://en.wikipedia.org/wiki/\"Hello,_World!\"_program')\n\n  console.log(myGetsy.getMe('#firstHeading').text())\n}\n\ntesting()\n```\n\n\n### Here's how you might use it with a website that has infinite scrolling:\n\n```js\nimport getsy from 'getsy'\n\nasync function infiniteScrape() {\n  const myGetsy = await getsy('http://scrollmagic.io/examples/advanced/infinite_scrolling.html', { iframe: true })\n  \n  console.log(`${myGetsy.getMe('.box1').length} boxes.`)\n  \n  const { succesfulTimes, totalRetries } = await myGetsy.scroll(10)\n  \n  console.log(`New content loaded ${succesfulTimes} times with ${totalRetries} total retries.`)\n  console.log(`${myGetsy.getMe('.box1').length} boxes.`) // More content!\n}\n\ninfiniteScrape()\n```\n\n\u003cbr /\u003e\n\n## The Getsy Object:\nThe Getsy object has the following properties and methods:\n\n+ `corsProxy`: The proxy endpoint passed in the options object, or the default value.\n\n+ `content`: The original 
string data received from the request.\n\n+ `iframe`: A reference to its iframe element if in iframe mode.\n\n+ `iframeDoc`: A reference to its iframe's document if in iframe mode.\n\n+ `getMe(selector: string): JQuery`: Queries the resource's DOM, or the iframe if in iframe mode, with a jQuery selector. Returns a JQuery object.\n\n+ `scroll(numberOfTimes: number, element?: HTMLElement, interval?: number, retries?: number): Promise\u003cscrollResolve\u003e`: Scrolls to the bottom of an `element` *(defaults to body)* a specified `numberOfTimes` to load new data. The `interval` *(defaults to 2000)* is the time in milliseconds that Getsy waits before checking if new content has loaded. If no new content has loaded it will retry as many times as specified by `retries` *(defaults to 5)*. If no new content has loaded and `scroll` is out of retries then it will resolve the Promise early to avoid waiting for the remaining `numberOfTimes`. Note: retries reset to 0 on every successful content load. Returns a Promise that resolves to an object with the number of `.succesfulTimes` that new content was loaded and the `.totalRetries`.\n\n+ `hideFrame(): void`: Hides the iframe if applicable.\n\n+ `showFrame(): void`: Shows the iframe if applicable.\n\n\u003cbr /\u003e\n\n## CorsProxy:\nThis library uses a CORS proxy to get around the browser's cross-origin (CORS) restrictions.\nIf you don't provide one it will default to: `https://crossorigin.me/`.\n\nSome Node.js CORS proxy servers:\n+ [cors-anywhere](https://github.com/Rob--W/cors-anywhere)\n+ [CORS-Proxy](https://github.com/gr2m/CORS-Proxy)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepiqueras%2Fgetsy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepiqueras%2Fgetsy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepiqueras%2Fgetsy/lists"}