{"id":15527379,"url":"https://github.com/danielnieto/scrapman","last_synced_at":"2025-04-23T12:26:27.383Z","repository":{"id":57250659,"uuid":"70870275","full_name":"danielnieto/scrapman","owner":"danielnieto","description":"Retrieve real (with Javascript executed) HTML code from an URL, ultra fast and supports multiple parallel loading of webs","archived":false,"fork":false,"pushed_at":"2018-04-13T18:54:33.000Z","size":36,"stargazers_count":22,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-11T17:16:26.017Z","etag":null,"topics":["electron","javascript","javascript-tools","scrap","scraper","scraping","scraping-websites"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danielnieto.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-10-14T03:23:02.000Z","updated_at":"2024-10-14T18:02:02.000Z","dependencies_parsed_at":"2022-08-24T16:52:00.734Z","dependency_job_id":null,"html_url":"https://github.com/danielnieto/scrapman","commit_stats":null,"previous_names":["danielnieto/getreal"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielnieto%2Fscrapman","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielnieto%2Fscrapman/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielnieto%2Fscrapman/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danielnieto%2Fscrapman/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/danielnieto","download_url":"https://codeload.github.com/danielnieto/scrapman/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250264885,"owners_count":21402002,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["electron","javascript","javascript-tools","scrap","scraper","scraping","scraping-websites"],"created_at":"2024-10-02T11:05:54.053Z","updated_at":"2025-04-23T12:26:27.346Z","avatar_url":"https://github.com/danielnieto.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Scrapman\n\n\u003e*Ski-bi dibby dib yo da dub dub*\u003cbr\u003e\n*Yo da dub dub*\u003cbr\u003e\n*Ski-bi dibby dib yo da dub dub*\u003cbr\u003e\n*Yo da dub dub*\u003cbr\u003e\u003cbr\u003e\n***I'm the Scrapman!***\n\n### THE FASTEST SCRAPER EVER\\*... AND IT SUPPORTS PARALLEL REQUESTS \u003csmall\u003e(\\*arguably)\u003c/small\u003e\n\nScrapman is a blazingly fast **real (with JavaScript executed)** HTML scraper, built from the ground up to support parallel fetches. With it you can get the HTML code for 50+ URLs in about 30 seconds.\n\nIn Node.js you can easily use `request` to fetch the HTML from a page, but what if the page you are trying to load is *NOT* a static HTML page, but one with dynamic content added with JavaScript? What do you do then? Well, you use ***The Scrapman***.\n\nIt uses [Electron](http://electron.atom.io) to dynamically load web pages into several `\u003cwebview\u003e` within a single Chromium instance. 
This is why it fetches the HTML exactly as you would see it if you inspected the page with DevTools.\n\nThis is **NOT** a browser automation tool (yet). It is a Node module that gives you the processed HTML from a URL, and it focuses on multiple parallel operations and speed.\n\n## USAGE\n\n1.- Install it\n\n`npm install scrapman -S`\n\n2.- Require it\n\n`var scrapman = require(\"scrapman\");`\n\n3.- Use it (as many times as you need)\n\nSingle URL request\n\n```javascript\nscrapman.load(\"http://google.com\", function(results){\n\t//results contains the HTML obtained from the URL\n\tconsole.log(results);\n});\n```\nParallel URL requests\n\n```javascript\n//yes, you can use it within a loop.\nfor(var i=1; i\u003c=50; i++){\n    scrapman.load(\"https://www.website.com/page/\" + i, function(results){\n        console.log(results);\n    });\n}\n```\n\n## API\n\n### - scrapman.load(url, callback)\n\n#### url\nType: `String`\u003cbr\u003e\n\nThe URL from which the HTML code is going to be obtained.\n\n#### callback(results)\nType: `Function`\u003cbr\u003e\n\nThe callback function to be executed when the loading is done. 
The loaded HTML will be in the `results` parameter.\n\n### - scrapman.configure(config)\n\n#### config\nThe configuration object can set the following values:\n\n* `maxConcurrentOperations`: Integer - The intensity of processing, that is, how many URLs can be loaded at the same time. Default: 50\n\n* `wait`: Integer - The number of milliseconds to wait before returning the HTML code of a webpage after it has been completely loaded. Default: 0\n\n\n## Questions\nFeel free to open issues to ask questions about using this package. PRs are very welcome and encouraged.\n\n**SE HABLA ESPAÑOL**\n\n## License\n\nMIT © Daniel Nieto\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielnieto%2Fscrapman","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanielnieto%2Fscrapman","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielnieto%2Fscrapman/lists"}