{"id":26502238,"url":"https://github.com/d-oliveros/nest","last_synced_at":"2025-03-20T17:39:32.339Z","repository":{"id":24647488,"uuid":"28057326","full_name":"d-oliveros/nest","owner":"d-oliveros","description":"High-level, robust framework for web scraping in Node.js","archived":false,"fork":false,"pushed_at":"2017-09-28T17:20:10.000Z","size":595,"stargazers_count":23,"open_issues_count":0,"forks_count":3,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-04-14T06:08:17.481Z","etag":null,"topics":["node-scraper","scraper","scraping"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/d-oliveros.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-12-15T21:34:34.000Z","updated_at":"2024-03-06T00:59:11.000Z","dependencies_parsed_at":"2022-07-13T23:50:42.575Z","dependency_job_id":null,"html_url":"https://github.com/d-oliveros/nest","commit_stats":{"total_commits":142,"total_committers":8,"mean_commits":17.75,"dds":0.07042253521126762,"last_synced_commit":"0f3a56e1a276a95b864d5033fd0783fcf21f9c48"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-oliveros%2Fnest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-oliveros%2Fnest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-oliveros%2Fnest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-oliveros%2Fnest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/d-oliveros","downlo
ad_url":"https://codeload.github.com/d-oliveros/nest/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244664748,"owners_count":20490202,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["node-scraper","scraper","scraping"],"created_at":"2025-03-20T17:39:31.832Z","updated_at":"2025-03-20T17:39:32.332Z","avatar_url":"https://github.com/d-oliveros.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\nNest\n==============\n\n[![Build Status](https://travis-ci.org/d-oliveros/nest.svg?branch=master)](https://travis-ci.org/d-oliveros/nest)\n[![Dependencies Status](https://david-dm.org/d-oliveros/nest.svg)](https://david-dm.org/d-oliveros/nest)\n\nNest is a high-level, robust framework for web scraping.\n\n\n## Features\n\n* Dynamic Scraping with a headless browser (Puppeteer)\n* Static scraping with direct HTTP requests without JS evaluation (cheerio)\n* Parallel scraping, worker queue\n* MongoDB integration. 
State is persisted to Mongo after each operation\n* Minimal API and dead-easy to use\n\n\n## Requirements\n\n  * MongoDB up and running\n  * Node\n\n\n## Installation\n\n[Install MongoDB](https://docs.mongodb.com/manual/installation/#mongodb-community-edition).\n\nAlso install node-nest in your project:\n\n```shell\nnpm install node-nest\n```\n\n## Usage\n\n```js\n// Instantiates a new Nest object\nvar Nest = require('node-nest');\nvar nest = new Nest();\n\n// Register routes\nvar someRoute = require('./routes/some-route');\nvar anotherRoute = require('./routes/another-route');\nnest.addRoute(someRoute);\nnest.addRoute(anotherRoute);\n\n// Queues scraping operations\nnest.queue('some-route', { priority: 90, query: { userId: 123 } });\nnest.queue('another-route', { query: { someVar: 'something' } });\n\n// Starts the engine\nnest.start();\n```\n\n### Example\n\n* You can find this example's [full code here](https://github.com/d-oliveros/nest-hackernews).\n\nIn this guide, we'll scrape Hackernews articles. To use Nest, you first need to initialize a Nest object:\n\n```js\nvar Nest = require('node-nest');\nvar nest = new Nest();\n```\n\nBy default, Nest will use as many workers as you have CPU cores. It will also try to connect to a MongoDB instance running at `127.0.0.1:27017`. You can configure these parameters by doing:\n\n```js\nvar Nest = require('node-nest');\n\nvar nest = new Nest({\n  workers: 4,          // Set the amount of workers scraping in parallel to 4\n  mongo: {\n    db: 'nest',        // Use the 'nest' mongo database\n    host: '127.0.0.1', // Connect to the Mongo process running at localhost\n    port: '27017'      // Connect to the Mongo process running at port 27017\n  }\n});\n```\n\nThen you must define some routes. 
A route is a definition of a site's section, for example a profile page, a post page, or a search results page.\n\n#### Route\n\nA route defines the URL pattern that matches a particular site section, and describes how the data should be structured out of this page, by explicitly defining a scraping function.\n\nDepending on the returned data from the scraping function, Nest will store the structured scraped data in the mongo database and/or queue more URLs to be scraped.\n\nYou can add more routes by using the method `nest.addRoute()`. Let's define how the \"hackernews homepage\" route should be scraped:\n\n```js\nnest.addRoute({\n\n  // This is the route ID\n  key: 'hackernews-homepage',\n\n  // This is the URL pattern corresponding to this route\n  url: 'https://news.ycombinator.com',\n\n  // This is the scraper function, defining how this route should be scraped\n  scraper: function($) {\n\n    // You should return an object with the following properties:\n    // - items:       `Array` Items to save in the database.\n    // - jobs:        `Array` New scraping jobs to add to the scraper worker queue\n    // - hasNextPage: `Boolean` If true, Nest will scrape the \"next page\"\n    var data = {\n      items: []\n    };\n\n    // The HTML is already loaded and wrapped with Cheerio in '$',\n    // meaning you can get data from the page, jQuery style:\n    $('tr.athing').each((i, row) =\u003e {\n      data.items.push({\n        title: $(row).find('a.storylink').text(),\n        href: $(row).find('a.storylink').attr('href'),\n        postedBy: $(row).find('a.hnuser').text(),\n\n        // this is the only required property in an item object\n        key: $(row).attr('id')\n      });\n    });\n\n    // In this example, Nest will only save the objects\n    // stored in 'data.items', into the mongo database\n    return data;\n  }\n});\n```\n\nThen, you need to queue some scraping operations, and start the 
engine:\n\n```js\nnest.queue('hackernews-homepage');\nnest.start().then(() =\u003e console.log('Engine started!'));\n```\n\nTo run this example, run the file with Node. Assuming you called it \"scrape-hackernews.js\":\n\n```shell\nnode scrape-hackernews\n```\n\nAfter running this example, your database will contain 30 scraped items from hackernews, with the following structure:\n\n```js\n{\n  \"_id\" : ObjectId(\"5797199075c2d900da9e3a3e\"),\n  \"key\" : \"12160127\",\n  \"routeWeight\" : 50,\n  \"routeId\" : \"hackernews-homepage\",\n  \"href\" : \"https://github.com/jisaacso/DeepHeart\",\n  \"title\" : \"DeepHeart: A Neural Network for Predicting Cardiac Health\"\n},\n{\n  \"_id\" : ObjectId(\"5797199075c2d900da9e3a3d\"),\n  \"key\" : \"12160374\",\n  \"routeWeight\" : 50,\n  \"routeId\" : \"hackernews-homepage\",\n  \"href\" : \"http://www.wsj.com/articles/apple-taps-bob-mansfield-to-oversee-car-project-1469458580\",\n  \"title\" : \"Apple Taps Bob Mansfield to Oversee Car Project\"\n},\n...etc\n```\n\nTry looking at the scraped data using mongo's native REPL:\n\n```shell\nmongo nest\n\u003e db.items.count()\n\u003e db.items.find().pretty()\n```\n\n* You will see multiple \"There are no pending jobs. Retrying in 1s\" messages. This is fine. It means that the engine finished processing all the queued jobs, and the workers are just waiting for new jobs.\n\nWhen running this program again, the route \"hackernews-homepage\" will not be scraped again, because the state is persisted in Mongo, and Nest doesn't re-scrape individual URLs that have already been scraped.\n\nYou will notice this route is not that helpful: it only gets superficial data from each item (the title and the href), and it only scrapes the first page of hackernews.\n\nLet's create a \"hackernews post\" route, and a new \"hackernews articles\" route. 
The new articles route should scrape the first 5 pages of hackernews, and queue a scraping job to \"hackernews post\" for each scraped article in the articles list. The items in the database will be updated with the new information after their post pages are scraped.\n\nThe [full example](https://github.com/d-oliveros/nest-hackernews) looks as follows:\n\n```js\n// in scrape-hackernews.js\n\nvar Nest = require('node-nest');\n\nvar nest = new Nest();\n\nnest.addRoute({\n  key: 'hackernews-post',\n\n  // Route url strings are passed to lodash's 'template' function.\n  // You can also provide a function that should return the newly built URL\n  // @see https://lodash.com/docs#template\n  url: 'https://news.ycombinator.com/item?id=\u003c%= query.id %\u003e',\n\n  scraper: function($) {\n    var $post = $('tr.athing').first();\n\n    return {\n      items: [{\n        key: $post.attr('id'),\n        title: $post.find('.title a').text(),\n        href: $post.find('.title a').attr('href'),\n        postedBy: $post.find('.hnuser').text(),\n\n        // for the sake of this tutorial, let's just save the most voted comment\n        bestComment: $('.comment').first().text()\n      }]\n    };\n  }\n});\n\nnest.addRoute({\n  key: 'hackernews-articles',\n\n  // the scraping state is available in the URL generator function's scope\n  // we can use the \"currentPage\" property to enable pagination\n  url: 'https://news.ycombinator.com/news?p=\u003c%= state.currentPage %\u003e',\n\n  scraper: function($) {\n    var currentPage = $('.rank').last().text() / 30;\n\n    var data = {\n      items: [],\n\n      // by returning data through the 'jobs' property,\n      // you are queueing new scraping operations for the workers to pick up\n      jobs: [],\n\n      // if this property is true, the scraper will re-scrape the route,\n      // but with the 'state.currentPage' parameter incremented by 1\n      //\n      // for the sake of this tutorial, let's just scrape the first 5 pages\n      
hasNextPage: currentPage \u003c 5\n    };\n\n    // for each article\n    $('tr.athing').each((i, row) =\u003e {\n\n      // create superficial hackernews article items in the database\n      data.items.push({\n        key: $(row).attr('id'),\n        title: $(row).find('a.storylink').text(),\n        href: $(row).find('a.storylink').attr('href'),\n        postedBy: $(row).find('a.hnuser').text()\n      });\n\n      // also, queue scraping jobs to the \"hackernews-post\" route, defined above\n      data.jobs.push({\n        routeId: 'hackernews-post', // defines which route to be used\n        query: { // defines the \"query\" object, used to build the final URL\n          id: $(row).attr('id')\n        }\n      });\n    });\n\n    // Nest will save the objects in 'data.items' and queue jobs in 'data.jobs'\n    // Nest won't repeat URLs that have already been scraped\n    return data;\n  }\n});\n\nnest.queue('hackernews-articles');\n\nnest.start();\n```\n\nAfter running the example, the first worker will go to the articles feed, scrape the 30 articles in the list, store those scraped items in the database, and queue scraping jobs to those articles by their article ID. Then, it will paginate and scrape the next page of the feed.\n\nMeanwhile, the other workers will pick the jobs in the queue, scrape the article pages, and update the article in the database by their article ID.\n\nRemember you can find the [full example's code here](https://github.com/d-oliveros/nest-hackernews).\n\n#### Nest will avoid scraping URLs that have already been scraped\n\nRemember, URLs that have already been scraped _will not be scraped again_. 
So, if you make changes to a finite route and want to test your new route, or if you want to repeat your routes, you can delete the finished scraped URLs from the 'jobs' collection by doing:\n\n```shell\nmongo nest\n\n# This will delete all the finished URLs\n\u003e db.jobs.remove({ 'state.finished': true })\n\n# This will only delete finished jobs for a particular route\n\u003e db.jobs.remove({ 'state.finished': true, 'routeId': 'my-route-key' })\n\n# WARNING: This will delete every item and job in your database\n\u003e db.dropDatabase()\n```\n\n_Set the `NEST_DUMP_BROWSER_IO_TO_STD_OUT=1` environment variable to dump Puppeteer I/O to stdout._\n\n## Engine\n\nBy default, Nest will create one worker per CPU core. Each worker will query for an operation, sorted by priority, run that operation (which may queue further operations), and then query for another operation.\n\nOnly one worker will query for an operation at any given time, to avoid having multiple workers pick up the same operation. If there are no unfinished operations, the worker will keep querying for new operations every second or so.\n\n\n## Tests\n\n```\nnpm run test\n```\n\n\nCheers.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd-oliveros%2Fnest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fd-oliveros%2Fnest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd-oliveros%2Fnest/lists"}