{"id":18679851,"url":"https://github.com/bsctl/wscraper","last_synced_at":"2025-04-12T03:30:55.750Z","repository":{"id":4423261,"uuid":"5561102","full_name":"bsctl/wscraper","owner":"bsctl","description":"A web scraper agent written in node.js","archived":true,"fork":false,"pushed_at":"2012-08-26T17:21:26.000Z","size":6099,"stargazers_count":16,"open_issues_count":2,"forks_count":13,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-04T17:50:15.627Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bsctl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-08-26T15:24:21.000Z","updated_at":"2023-11-10T19:10:37.000Z","dependencies_parsed_at":"2022-09-21T16:00:23.627Z","dependency_job_id":null,"html_url":"https://github.com/bsctl/wscraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsctl%2Fwscraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsctl%2Fwscraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsctl%2Fwscraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsctl%2Fwscraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bsctl","download_url":"https://codeload.github.com/bsctl/wscraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248512484,"owners_count":21116612,"icon_url":"https://gith
ub.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T09:46:03.963Z","updated_at":"2025-04-12T03:30:51.856Z","avatar_url":"https://github.com/bsctl.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wscraper\n\nwscraper.js is a web scraper agent written in node.js and based on [cheerio.js][0], a fast, flexible, and lean implementation of core jQuery.\nIt is built on top of [request.js][1] and inspired by [http-agent.js][2].\n\n## Usage\n\nThere are two ways to use wscraper: HTTP Agent mode and local mode.\n\n### HTTP Agent mode\nIn HTTP Agent mode, pass it a host, a list of URLs to visit and a scraping JS script. For each URL, the agent makes a request, gets the response, runs the scraping script and returns the result of the scraping. Valid usage is:\n\n```js\n// scrape a single page from a website\nvar agent = wscraper.createAgent();\nagent.start('google.com', '/finance', script);\n\n// scrape multiple pages from a website\nagent.start('google.com', ['/', '/finance', '/news'], script);\n```\n\nThe URLs should be passed as an array of strings. In case only one page needs to be scraped, the URL can be passed as a single string. Null or empty URLs are treated as the root '/'. 
Suppose you want to scrape the stock prices of the following companies from the http://google.com/finance website: Apple, Cisco and Microsoft.\n\n```js\n// load node.js libraries\nvar util = require('util');\nvar wscraper = require('wscraper');\nvar fs = require('fs');\n\n// load the scraping script from a file as a string\nvar script = fs.readFileSync('/scripts/googlefinance.js', 'utf8');\n\nvar companies = ['/finance?q=apple', '/finance?q=cisco', '/finance?q=microsoft'];\n\n// create a web scraper agent instance\nvar agent = wscraper.createAgent();\n\nagent.on('start', function (n) {\n\tutil.log('[wscraper.js] agent has started; ' + n + ' path(s) to visit');\n});\n\nagent.on('done', function (url, price) {\n\tutil.log('[wscraper.js] data from ' + url);\n\t// display the results\n\tutil.log('[wscraper.js] current stock price is ' + price + ' USD');\n\t// next item to process if any\n\tagent.next();\n});\n\nagent.on('stop', function (n) {\n\tutil.log('[wscraper.js] agent has ended; ' + n + ' path(s) remaining to visit');\n});\n\nagent.on('abort', function (e) {\n\tutil.log('[wscraper.js] getting a FATAL ERROR [' + e + ']');\n\tutil.log('[wscraper.js] agent has aborted');\n\tprocess.exit();\n});\n\n// run the web scraper agent\nagent.start('www.google.com', companies, script);\n```\n\nThe scraping script should be pure client JavaScript, including jQuery selectors. See [cheerio.js][0] for details. It should return a valid JavaScript object.\nThe scraping script is passed as a string and is usually read from a file. 
You can scrape different websites without changing a line of the main code: just write a different scraping script.\nThe scraping script is executed in a sandbox using a separate VM context, and script errors are caught without crashing the main code.\n\nAt the time of writing, the google.com/finance website reports financial data of public companies as in the following HTML snippet:\n\n```html\n...\n\u003cdiv id=\"price-panel\" class=\"id-price-panel goog-inline-block\"\u003e\n  \u003cdiv\u003e\n    \u003cspan class=\"pr\"\u003e\n      \u003cspan id=\"ref_22144_l\"\u003e656.06\u003c/span\u003e\n    \u003c/span\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n...\n```\nUsing jQuery selectors, we design the scraping script \"googlefinance.js\" to find the current value of a company's stock and return it as text:\n\n```js\n/*\n\ngooglefinance.js\n\n$ -\u003e is the DOM document to be parsed\nresult -\u003e is the object containing the result of parsing\n*/\n\nresult = {};\nprice = $('div.id-price-panel').find('span.pr').children().text();\nresult.price = price;\n\n// result.price is '656.06'\n```\n\n### Local mode\nSometimes you need to scrape local HTML files without making a request to a remote server. wscraper can be used as an inline scraper. It takes an HTML string and a JS scraping script. The scraper runs the scraping script and returns the result of the scraping. Valid usage is:\n\n```js\nvar scraper = wscraper.createScraper();\nscraper.run(html, script);\n```\n\nAs a trivial example, suppose you want to replace every \u003cdiv\u003e element that contains an image with a given class. 
Create a scraper:\n\n```js\n// load node.js libraries\nvar util = require('util');\nvar fs = require('fs');\nvar wscraper = require('wscraper');\n\n// load your html page as a string\nvar html = fs.readFileSync('/index.html', 'utf8');\n\n// load the scraping script from a file as a string\nvar script = fs.readFileSync('/scripts/replace.js', 'utf8');\n\n// create the scraper\nvar scraper = wscraper.createScraper();\n\nscraper.on('done', function(result) {\n\t// do something with the result\n\tutil.log(result);\n});\n\nscraper.on('abort', function(e) {\n\tutil.log('Getting error in parsing: ' + e);\n});\n\n// run the scraper\nscraper.run(html, script);\n```\n\nUsing jQuery selectors, we design the scraping script \"replace.js\" to find the \u003cdiv\u003e elements containing images with class=\"MyPhotos\" and replace each of them with a \u003cdiv\u003e element having class=\"Hidden\" without any image inside.\n\n```js\n/*\nreplace.js\n\n$ -\u003e is the DOM document to be parsed\nresult -\u003e is the final JSON string containing the result of parsing\nuse var jsObj = JSON.parse(result) to get a js object from the json string\nuse JSON.stringify(jsObj) to get back a json string from the js object\n*/\n\nresult = {};\nvar imgs = $('img.MyPhotos').toArray();\n$.each(imgs, function(index, elem) {\n\t// the \u003cdiv\u003e that wraps the image\n\tvar parentdiv = $(elem).parent();\n\tvar newdiv = $('\u003cdiv class=\"Hidden\"\u003e\u003c/div\u003e');\n\tparentdiv.replaceWith(newdiv);\n});\n\nresult.replaced = $.html() || '';\n```\n\nHappy scraping!\n\n### Author: kalise © 2012 MIT Licensed;\n\n[0]: https://github.com/MatthewMueller/cheerio\n[1]: https://github.com/mikeal/request\n[2]: https://github.com/indexzero/http-agent\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbsctl%2Fwscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbsctl%2Fwscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbsctl%2Fwscraper/lists"}