{"id":26315175,"url":"https://github.com/jimmylaurent/node-crawling-framework","last_synced_at":"2025-03-15T12:17:36.701Z","repository":{"id":80547257,"uuid":"139486588","full_name":"JimmyLaurent/node-crawling-framework","owner":"JimmyLaurent","description":"✨ NodeJs crawling \u0026 scraping framework heavily inspired by Scrapy","archived":false,"fork":false,"pushed_at":"2018-07-20T23:17:06.000Z","size":232,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-15T12:17:33.874Z","etag":null,"topics":["crawler","crawling","crawling-framework","elasticsearch","headless-chrome","middleware","mongodb","nodejs-framework","scraper","scraping","scraping-framework","scrapy","spider"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JimmyLaurent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-02T19:42:33.000Z","updated_at":"2023-09-08T17:42:23.000Z","dependencies_parsed_at":"2023-05-22T19:00:32.779Z","dependency_job_id":null,"html_url":"https://github.com/JimmyLaurent/node-crawling-framework","commit_stats":{"total_commits":4,"total_committers":3,"mean_commits":"1.3333333333333333","dds":0.5,"last_synced_commit":"fd84f7d31ddc2025b14acff6a0224c5ba69756e7"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimmyLaurent%2Fnode-crawling-framework","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/JimmyLaurent%2Fnode-crawling-framework/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimmyLaurent%2Fnode-crawling-framework/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimmyLaurent%2Fnode-crawling-framework/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JimmyLaurent","download_url":"https://codeload.github.com/JimmyLaurent/node-crawling-framework/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243725640,"owners_count":20337670,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling","crawling-framework","elasticsearch","headless-chrome","middleware","mongodb","nodejs-framework","scraper","scraping","scraping-framework","scrapy","spider"],"created_at":"2025-03-15T12:17:35.758Z","updated_at":"2025-03-15T12:17:36.687Z","avatar_url":"https://github.com/JimmyLaurent.png","language":"JavaScript","readme":"# node-crawling-framework\n\nCurrent stage: alpha (work in progress)\n\n\"node-crawling-framework\" is a crawling \u0026 scraping framework for NodeJs heavily inspired by [Scrapy](https://scrapy.org/).\n\nA node job server is also in the works (a scrapyd-like equivalent based on BullJs).\n\n## Features (not fully tested and finalized)\n\nThe core is working: Crawler, Scraper, Spider, item processors (pipeline), DownloadManager, downloader.\n\n- Modular and easily extendable architecture through middlewares and class inheritance:\n  * add your own middlewares for spiders, 
item-processors, and downloaders.\n  * extend framework spiders and get some features for free.\n\n- DownloadManager: delay and concurrency limit settings,\n- RequestDownloader: downloader based on the request package,\n- Downloader middlewares:\n  * cookie: handle cookie storage between requests,\n  * defaultHeaders: add default headers to each request,\n  * retry: retry requests on error,\n  * stats: collect some stats during the crawling (requests \u0026 errors count, ...)\n- Spiders:\n  * BaseSpider: every spider must inherit from this one,\n  * Sitemap: parse a sitemap and feed the spider with the URLs found,\n  * Elasticsearch: feed spider URLs from Elasticsearch\n- Spider middlewares:\n  * cheerio: cheerio helper on the response to get a cheerio object,\n  * scrapeUtils: cheerio + some helpers to facilitate the scraping (methods: scrape, scrapeUrl, scrapeRequest, ...),\n  * filterDomains: filter out unauthorized domains\n- Item processor middlewares:\n  * printConsole: log items to the console,\n  * jsonLineFileExporter: write scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint),\n  * logger: log items to the logger,\n  * elasticsearchExporter: export items to Elasticsearch\n- Logger: configurable logger (default: console)\n\n## Project example\n\nSee [Quotesbot](https://github.com/jimmylaurent/quotesbot)\n\n## Spider example\n\n```js\nconst { BaseSpider } = require('node-crawling-framework');\n\nclass CssSpider extends BaseSpider {\n  constructor() {\n    super();\n    this.startUrls = ['http://quotes.toscrape.com'];\n  }\n\n  *parse(response) {\n    const quotes = response.scrape('div.quote');\n    for (let quote of quotes) {\n      yield {\n        text: quote.scrape('span.text').text(),\n        author: quote.scrape('small.author').text(),\n        tags: quote.scrape('div.tags \u003e a.tag').text()\n      };\n    }\n    yield response.scrapeRequest({ selector: '.next \u003e a' });\n  }\n}\n\nmodule.exports = 
CssSpider;\n```\n\n## Crawler configuration example\n\n```js\nmodule.exports = {\n  settings: {\n    maxDownloadConcurency: 1, // maximum download concurrency, default: 1\n    filterDuplicateRequests: true, // filter already scraped requests, default: true\n    delay: 100, // delay in ms between requests, default: 0\n    maxConcurrentScraping: 500, // maximum concurrent scraping, default: 500\n    maxConcurrentItemsProcessingPerResponse: 100, // maximum concurrent item processing per response, default: 100\n    autoCloseOnIdle: true // auto close crawler when crawling is finished, default: true\n  },\n  logger: null, // logger, must implement the console interface, default: console\n  spider: {\n    type: '', // spider to use for crawling, searched in ${cwd} or ${cwd}/spiders, can also be a class definition object\n    options: {}, // spider constructor args\n    middlewares: {\n      scrapeUtils: {}, // add utils methods to the response, ex: \"response.scrape()\"\n      filterDomains: {} // avoid unwanted domain requests from being scheduled\n    }\n  },\n  itemProcessor: {\n    middlewares: {\n      jsonLineFileExporter: {}, // write scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint)\n      logger: {} // log scraped items through the crawler logger\n    }\n  },\n  downloader: {\n    type: 'RequestDownloader', // downloader to use, can also be a class definition object\n    options: {}, // downloader constructor args\n    middlewares: {\n      stats: {}, // give some stats about requests, ex: number of requests/errors\n      retry: {}, // retry failed requests\n      cookie: {} // store cookies between requests\n    }\n  }\n};\n```\n\n## Crawler instantiation example\n\n```js\nconst { createCrawler } = require('node-crawling-framework');\n\nconst config = require('./config');\nconst crawler = createCrawler(config, 'CssSpider');\n\ncrawler.crawl().then(() =\u003e {\n  console.log('✨  Crawling 
done');\n});\n```\n\n## TODO list\n\n- Add unit tests\n- Add documentation\n- Add MongoDb feeder/exporter\n- Run some benchmarks?\n- Finish formRequest scraping (add clickable elements)\n- Add Puppeteer downloader\n- Split plugins/middlewares into packages\n- Command line tool, \"nfc-cli\"\n  * scaffolding: create project (with wizard), spider, any middleware\n  * crawl: launch a crawl\n  * deploy: deploy to node-job-server\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimmylaurent%2Fnode-crawling-framework","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjimmylaurent%2Fnode-crawling-framework","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimmylaurent%2Fnode-crawling-framework/lists"}