{"id":18648300,"url":"https://github.com/antlafarge/webscraper","last_synced_at":"2025-11-05T07:30:35.496Z","repository":{"id":65075572,"uuid":"570720157","full_name":"antlafarge/WebScraper","owner":"antlafarge","description":"Grab links in websites and download files matching some filters (reg exp pattern, file size...) ","archived":false,"fork":false,"pushed_at":"2025-01-09T16:12:45.000Z","size":48,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-09T17:28:02.106Z","etag":null,"topics":["download","files","filters","scraper","web"],"latest_commit_sha":null,"homepage":"https://hub.docker.com/r/antlafarge/webscraper","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/antlafarge.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-25T23:16:38.000Z","updated_at":"2025-01-09T16:12:48.000Z","dependencies_parsed_at":"2024-11-07T06:31:54.747Z","dependency_job_id":"ba5ebadc-6600-4c34-892c-3bd43ee8a80b","html_url":"https://github.com/antlafarge/WebScraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlafarge%2FWebScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlafarge%2FWebScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlafarge%2FWebScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlafarg
e%2FWebScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/antlafarge","download_url":"https://codeload.github.com/antlafarge/WebScraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239450246,"owners_count":19640688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["download","files","filters","scraper","web"],"created_at":"2024-11-07T06:30:02.754Z","updated_at":"2025-02-18T10:23:47.585Z","avatar_url":"https://github.com/antlafarge.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"WebScraper\n==========\n\nCheck for links in html pages and download files which match required filters to `./downloads/` folder.  
\nThe scraper will search for links in these html tags :\n- `\u003ca href=\"..\"\u003e\u003c/a\u003e`\n- `\u003cimg src=\"..\" /\u003e`\n- `\u003cvideo src=\"..\"\u003e\u003c/video\u003e`\n- `\u003csource src=\"..\" /\u003e`\n\n```bash\ndocker run -v \"\u003cdownloadsDirectory\u003e:/usr/src/app/downloads/\" -e \"WEBSCRAPER_LOG_LEVEL=DEBUG\" --name wsp antlafarge/webscraper \"\u003curl\u003e\" \"\u003cdownloadRegExp\u003e\" \"\u003cexcludeRegExp\u003e\" \u003cminSize\u003e \u003cmaxSize\u003e \u003cdeep\u003e \u003cdelay\u003e \"\u003csameOrigin\u003e\" \"\u003cadditionalHeaders\u003e\"\n\nnode main.js \"\u003curl\u003e\" \"\u003cdownloadRegExp\u003e\" \"\u003cexcludeRegExp\u003e\" \u003cminSize\u003e \u003cmaxSize\u003e \u003cdeep\u003e \u003cdelay\u003e \"\u003csameOrigin\u003e\" \"\u003cadditionalHeaders\u003e\"\n```\n\n## Parameters\n\n- `url` : Url to start scraping (mandatory).\n- `downloadRegExp` : File urls to download must match this regular expression (default is `\".\"` to match all).\n- `excludeRegExp` : File urls to download must not match this regular expression (default is `\"a^\"` to match nothing).\n- `minSize` : Files to download must be more than this size in bytes (default is `0` to ignore).\n- `maxSize` : Files to download must be less than this size in bytes (default is `0` to ignore).\n- `deep` : How many links to follow and parse from the original url (default is `0` to parse the first page only).\n- `delay` : Delay between two successive http requests (default is `200` to wait 200 ms).\n- `sameOrigin` : File urls to download must have the same origin as the original url (default is `\"true\"`).\n- `additionalHeaders` : Additional headers to add to every HTTP request, in JSON format (default is `{}`).\n\n## Environment variables\n\n- `WEBSCRAPER_LOG_LEVEL` : Logs level (default `DEBUG` in Dockerfile).\n    - `TRACE` : Display all logs.\n    - `DEBUG` : Display error, warning, essential and progress logs only.\n    - `INFO` : Display error, warning and 
essential logs only.\n    - `WARN` : Display error and warning logs only.\n    - `ERROR` : Display error logs only.\n    - `TTY_ONLY` : Display temporary logs on TTY only.\n    - `NO_LOGS` : Display no logs.\n- `WEBSCRAPER_DOWNLOAD_SEGMENTS_SIZE` : Max segments size (in bytes) for downloading big files when http server supports ranges (default `10485760` for 10 MBytes).\n- `WEBSCRAPER_REPLACE_DIFFERENT_SIZE_FILES` : Allow files to be deleted and replaced when file size is different (default is `\"false\"`).\n- `WEBSCRAPER_DOCUMENT_TIMEOUT` : Override http requests timeout for getting documents (default is `10000` ms, 10 seconds).\n- `WEBSCRAPER_DOWNLOAD_TIMEOUT` : Override http requests timeout for downloading a file segment (default is `100000` ms, 100 seconds, giving a minimal download speed of 0.1 MByte/s for 10 MB file segments).\n\n# Examples\n\n## Simple\n\nThis example downloads from [http://www.example.com/](http://www.example.com/) every image file (*.jpg).\n\n```bash\ndocker run -v \"/hdd/downloads/:/usr/src/app/downloads/\" --name wsp antlafarge/webscraper \"http://www.example.com/\" \"\\.jpg$\"\n```\n\n```bash\nnode main.js \"http://www.example.com/\" \"\\.jpg$\"\n```\n\n## Advanced\n\nThis example downloads from [http://www.example.com/](http://www.example.com/) every image file between 100 Bytes and 1 MByte (1024 * 1024 Bytes), excludes html files, recurses on all links 1 time, waits 200 milliseconds between two successive http requests, passes `\"true\"` for `sameOrigin` (file urls must have the same origin as the original url), and uses basic http authentication in additional headers.\n\n```bash\ndocker run -d --rm -v \"/hdd/downloads/:/usr/src/app/downloads/\" -e \"WEBSCRAPER_LOG_LEVEL=DEBUG\" --name wsp antlafarge/webscraper \"http://www.example.com/\" \"\\.(jpe?g|png|webp|gif)[^\\/]*$\" \"\\.htm(l|l5)?[^\\/]*$\" 100 1048576 1 200 \"true\" \"{\\\"Authorization\\\":\\\"Basic YWxhZGRpbjpvcGVuc2VzYW1l\\\"}\"\n```\n*Add the `-d` (for detached) after `docker run` to start the script in background.*  \n*Add the `--rm` 
(for remove) after `docker run` to auto remove the container on termination.*\n\n```bash\nnode main.js \"http://www.example.com/\" \"\\.(jpe?g|png|webp|gif)[^\\/]*$\" \"\\.htm(l|l5)?[^\\/]*$\" 100 1048576 1 200 \"true\" \"{\\\"Authorization\\\":\\\"Basic YWxhZGRpbjpvcGVuc2VzYW1l\\\"}\"\n```\n\n*Note: `[^\\/]*` is used at end of regular expressions to ignore query parameters at the end of file urls.*\n\n# Logs\n\n```bash\ndocker logs --follow --tail 100 wsp\n```\n\n### Logs example\n```log\n[2022-11-25T11:35:08.000Z] Scrap [8/9|3|1] \"http://www.example.com/\"\n[2022-11-25T11:35:09.000Z] Handle [1] \"http://www.example.com/file.zip\"\n[2022-11-25T11:35:10.000Z] Download [1] \"http://www.example.com/file.zip\"\n[2022-11-25T11:35:12.000Z]     Progress :  10 % ( 10.00 / 100.00 MB) [1.00 MB/s] 1m 30s...\n```\n\n### Explanation\n\n[`Date`] Scrap [`8th out of 9 documents` | `3 urls awaiting analysis` | `Recurse 1 time from this document` ] \"`Parsed page url`\"  \n[`Date`] Handle [`Recurse 1 time from this url`] \"`Handle file url`\"  \n[`Date`] Download [`Downloads count`] \"`Download file url`\"\n\n# Install Node.js\n\n## Windows\n\nhttps://nodejs.org/en/download/\n\n## Linux\n\n```\nsudo apt update \u0026\u0026 sudo apt install -y nodejs npm\n```\n\n## Docker\n\n```\ndocker build --rm -t webscraper .\ndocker run -d --rm -v \"/hdd/downloads/:/usr/src/app/downloads/\" --name wsp webscraper \"http://www.example.com/\" \"\" \"\" 0 0 0 500 \"false\"\n```\n*Omit the `--rm` option to follow the logs by using `docker logs --follow --tail 100 wsp`*\n\nIf you want to run the node.js commands manually :\n```\ndocker run -it --name mynodecontainer node npm install -g npm -y \u0026\u0026 docker commit mynodecontainer mynode \u0026\u0026 docker rm -f mynodecontainer \u0026\u0026 docker rmi node\n```\n\nHow to start a `npm` or `node` command through docker :\n```\ndocker run -it --rm --name mynode -v \"$PWD\":/usr/src/app -w /usr/src/app mynode npm install\ndocker run -it --rm --name 
mynode -v \"$PWD\":/usr/src/app -w /usr/src/app mynode node script.js\n```\n*Note: `$PWD` points to the current directory, so be sure your current directory is the project directory.*\n\n# Test Node.js is working\n\n```\nnode --version\n```\n\n# Update node package manager\n\n```\nnpm install -g npm\n```\n\n# Change your current directory to target the project directory\n\n```\ncd /WebScraper\n```\n\n# Install the packages\n\n```\nnpm install\n```\n\nYou are ready !\n\n## Node.js commands reminder\n\n```\nnpm install -g npm\nnpm init -y\nnpm install --save jsdom node-fetch\nnpm install --save\nnode main.js\n```\n\n# Build dockerhub image\n\n```\ndocker buildx ls\ndocker buildx rm mybuilder\ndocker buildx create --name mybuilder\ndocker buildx use mybuilder\ndocker buildx inspect --bootstrap\ndocker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64/v8,linux/ppc64le,linux/s390x -t antlafarge/webscraper:latest -f Dockerfile --push .\n```\n\n# Troubleshooting\n\nIf you have timeout errors on file downloads because of low download speed, you should reduce the file segments size (Environment variable `WEBSCRAPER_DOWNLOAD_SEGMENTS_SIZE`).  \nEach segment size is `10485760` (for 10 MiB, 10 * 1024 * 1024 bytes) by default, and has an unmodifiable `10 minutes` timeout delay to complete.  \nYou can try to reduce the file segments size to `1048576` (for 1 MiB, 1024 * 1024 bytes).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantlafarge%2Fwebscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fantlafarge%2Fwebscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantlafarge%2Fwebscraper/lists"}