{"id":20709940,"url":"https://github.com/oxylabs/web-crawler","last_synced_at":"2025-08-01T09:03:41.471Z","repository":{"id":186915999,"uuid":"621276956","full_name":"oxylabs/web-crawler","owner":"oxylabs","description":"Web Crawler is a tool used to discover target URLs, select the relevant content, and have it delivered in bulk. It crawls websites in real-time and at scale to quickly deliver all content or only the data you need based on your chosen criteria.","archived":false,"fork":false,"pushed_at":"2025-02-11T13:00:24.000Z","size":47,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-29T22:12:06.029Z","etag":null,"topics":["api","crawler","github-python","scraper","web-crawler","web-crawler-python","web-scraping","web-scraping-api","webscraping"],"latest_commit_sha":null,"homepage":"https://oxylabs.io/products/scraper-api","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-30T10:44:32.000Z","updated_at":"2025-02-11T13:00:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"fbae4fcf-d9cf-4c8b-a271-1d70744793ca","html_url":"https://github.com/oxylabs/web-crawler","commit_stats":null,"previous_names":["oxylabs/web-crawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-crawler
/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/web-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250372939,"owners_count":21419722,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","crawler","github-python","scraper","web-crawler","web-crawler-python","web-scraping","web-scraping-api","webscraping"],"created_at":"2024-11-17T02:09:08.936Z","updated_at":"2025-04-23T04:48:16.746Z","avatar_url":"https://github.com/oxylabs.png","language":"Python","readme":"\n# How to Crawl a Website Using Web Crawler?\n\n[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7\u0026aff_id=877\u0026url_id=112)\n\n[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)\n\n- [How to Crawl a Website Using Web Crawler?](#how-to-crawl-a-website-using-web-crawler)\n  * [What can Web Crawler do?](#what-can-web-crawler-do)\n  * [Web Crawler settings overview](#web-crawler-settings-overview)\n    + [Endpoints](#endpoints)\n      - [Create a new job](#create-a-new-job)\n      - [Get sitemap](#get-sitemap)\n      - [Get the list of aggregate result 
chunks](#get-the-list-of-aggregate-result-chunks)\n      - [Get a chunk of the aggregate result](#get-a-chunk-of-the-aggregate-result)\n    + [Query parameters](#query-parameters)\n  * [Using Web Crawler in Postman](#using-web-crawler-in-postman)\n  * [Using Web Crawler in Python](#using-web-crawler-in-python)\n    + [Getting a list of URLs](#getting-a-list-of-urls)\n    + [Getting parsed results](#getting-parsed-results)\n    + [Getting HTML results](#getting-html-results)\n\n\nWeb Crawler is a built-in feature of our [\u003cu\u003eScraper\nAPIs\u003c/u\u003e](https://oxylabs.io/products/scraper-api). It’s a tool used to\ndiscover target URLs, select the relevant content, and have it delivered\nin bulk. It crawls websites in real-time and at scale to quickly deliver\nall content or only the data you need based on your chosen criteria.\n\n## What can Web Crawler do?\n\nThere are three main tasks Web Crawler can do:\n\n- Perform URL discovery;\n\n- Crawl all pages on a site;\n\n- Index all URLs on a domain.\n\nUse it when you need to crawl through the site and receive parsed data\nin bulk, as well as to collect a list of URLs in a specific category or\nfrom an entire website.\n\nThere are three data output types you can receive when using Web\nCrawler: a list of URLs, parsed results, and HTML files. If needed, you\ncan set Web Crawler to upload the results to your cloud storage.\n\n## Web Crawler settings overview\n\nYou can easily control the crawling scope by adjusting its width and\ndepth with\n[\u003cu\u003efilters\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#filters).\nWeb Crawler can also use various scraping parameters, such as\ngeo-location and user agent, to increase the success rate of crawling\njobs. Most of these scraping parameters depend on the Scraper API you\nuse.\n\n### Endpoints\n\nTo control your crawling job, you need to use different endpoints. 
You\ncan initiate, stop and resume your job, get job info, get the list of\nresult chunks, and get the results. Below are the endpoints we’ll use in\nthis crawling tutorial. For more information and output examples, visit\n[\u003cu\u003eour\ndocumentation\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#endpoints).\n\n#### Create a new job\n\n- Endpoint: `https://ect.oxylabs.io/v1/jobs`\n\n- Method: `POST`\n\n- Authentication: `Basic`\n\n- Request headers: `Content-Type: application/json`\n\n#### Get sitemap\n\nThis endpoint will deliver the list of URLs found while processing the\njob.\n\n- Endpoint: `https://ect.oxylabs.io/v1/jobs/{id}/sitemap`\n\n- Method: `GET`\n\n- Authentication: `Basic`\n\n#### Get the list of aggregate result chunks\n\n- Endpoint: `https://ect.oxylabs.io/v1/jobs/{id}/aggregate`\n\n- Method: `GET`\n\n- Authentication: `Basic`\n\nThe aggregate results can consist of a lot of data, so we split them\ninto multiple chunks based on the chunk size you specify. Use this\nendpoint to get a list of chunk files available.\n\n#### Get a chunk of the aggregate result\n\n- Endpoint: `https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}`\n\n- Method: `GET`\n\n- Authentication: `Basic`\n\nWith this endpoint, you can download a particular chunk of the aggregate\nresult. The contents of the response body depend on the [\u003cu\u003eoutput\ntype\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#output)\nyou choose.\n\nThe result can be one of the following:\n\n- An index (a list of URLs)\n\n- An aggregate JSON file with all parsed results\n\n- An aggregate JSON file with all HTML results\n\n### Query parameters\n\nFor your convenience, we’ve put all the available parameters you can use\nin the table below. 
It can also be found in [\u003cu\u003eour\ndocumentation\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#query-parameters).\n\n| **Parameter** | **Description** | **Default Value** |\n|---|---|---|\n| `url` | The URL of the starting point | \\- |\n| `filters` | These parameters are used to configure the breadth and depth of the crawling job, as well as determine which URLs should be included in the end result. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#filters) for more information. | \\- |\n| `filters:crawl` | Specifies which URLs Web Crawler will include in the end result. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#crawl) for more information. 
| \\- |\n| `filters:process` | Specifies which URLs Web Crawler will scrape. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#process) for more information. | \\- |\n| `filters:max_depth` | Determines the max length of URL chains Web Crawler will follow. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#max_depth) for more information. | `1` |\n| `scrape_params` | These parameters are used to fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer us to use proxies from a particular location. | \\- |\n| `scrape_params:source` | See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#source) for more information. | \\- |\n| `scrape_params:geo_location` | The geographical location that the result should be adapted for. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#geo_location) for more information. 
| \\- |\n| `scrape_params:user_agent_type` | Device type and browser. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#user_agent_type) for more information. | `desktop` |\n| `scrape_params:render` | Enables JavaScript rendering. Use when the target requires JavaScript to load content. If you want to use this feature, set the parameter value to `html`. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#render) for more information. | \\- |\n| `output:type_` | The output type. We can return a sitemap (list of URLs found while crawling) or an aggregate file containing HTML results or parsed data. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#type) for more information. | \\- |\n| `upload` | These parameters are used to describe the cloud storage location where you would like us to put the result once we're done. See [\u003cu\u003ethis section\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#upload) for more information. | \\- |\n| `upload:storage_type` | Define the cloud storage type. The only valid value is `s3` (for AWS S3). `gcs` (for Google Cloud Storage) is coming soon. 
| \\-                |\n| `upload:storage_url`            | The storage bucket URL.                                                                                                                                                                                                                                                                                           | \\-                |\n\nUsing these parameters is straightforward, as you can pass them with the\nrequest payload. Below you can find code examples in Python.\n\n## Using Web Crawler in Postman\n\nFor simplicity, you can use [\u003cu\u003ePostman\u003c/u\u003e](https://www.postman.com/)\nto make crawling requests. Download [\u003cu\u003ethis Postman\ncollection\u003c/u\u003e](https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2rQmOOBAUtwOAvCfMh8d%2Fuploads%2FS2DFMWXL97IXOiWdsL2y%2FScraper%20API%20-%20Crawler.postman_collection.json?alt=media\u0026token=5adbfc28-bc27-4fd1-8a4b-b2b1a0533b5a)\nto try out all the endpoints of Web Crawler. Here’s a step-by-step video\ntutorial you can follow:\n\n[\u003cu\u003eHow to Crawl a Website: Step-by-step\nGuide\u003c/u\u003e](https://www.youtube.com/watch?v=2sg03flHWMI)\n\n## Using Web Crawler in Python\n\nTo make HTTP requests in Python, we’ll use the [\u003cu\u003eRequests\nlibrary\u003c/u\u003e](https://pypi.org/project/requests/). Install it by entering\nthe following in your terminal:\n\n```shell\npip install requests\n```\n\nTo deal with HTML results, we’ll use the [\u003cu\u003eBeautifulSoup4\nlibrary\u003c/u\u003e](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to\nparse the results and make them more readable. This step is optional,\nbut you can install this library with:\n\n```shell\npip install BeautifulSoup4\n```\n\n### Getting a list of URLs\n\nIn the following example, we use the `sitemap` parameter to create a job\nthat crawls the Amazon homepage and gets a list of URLs found within the\nstarting page. 
With the `crawl` and `process` parameters being set to `\".*\"`,\nWeb Crawler will follow and return any Amazon URL. These two parameters\nuse regular expressions (regex) to determine what URLs should be crawled\nand processed. Be sure to visit our\n[\u003cu\u003edocumentation\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler#regex-value-examples)\nfor more details and useful resources.\n\nWe don’t need to include the `source` parameter because we aren’t scraping\ncontent from the URLs yet. Using the `json` module, we write the data into\na **.json** file, and then, with the `pprint` module, we print the\nstructured content. Let’s see the example:\n\n```python\nimport requests, json\nfrom pprint import pprint\n\n# Set the content type to JSON.\nheaders = {\"Content-Type\": \"application/json\"}\n\n# Crawl all URLs inside the target URL.\npayload = {\n    \"url\": \"https://www.amazon.com/\",\n    \"filters\": {\n        \"crawl\": [\".*\"],\n        \"process\": [\".*\"],\n        \"max_depth\": 1\n    },\n    \"scrape_params\": {\n        \"user_agent_type\": \"desktop\",\n    },\n    \"output\": {\n        \"type_\": \"sitemap\"\n    }\n}\n\n# Create a job and store the JSON response.\nresponse = requests.request(\n    'POST',\n    'https://ect.oxylabs.io/v1/jobs',\n    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.\n    headers=headers,\n    json=payload,\n)\n\n# Write the decoded JSON response to a .json file.\nwith open('job_sitemap.json', 'w') as f:\n    json.dump(response.json(), f)\n\n# Print the decoded JSON response.\npprint(response.json())\n```\n\nDepending on the request size, the process might take a bit of time. 
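The job runs asynchronously, so a small polling helper can wait until it finishes before you fetch the results. This is only a sketch: the job-info URL pattern (`https://ect.oxylabs.io/v1/jobs/{id}`) and the `status` values are assumptions inferred from the other endpoints in this tutorial, so verify the exact details in the documentation.

```python
import time

import requests


def job_info_url(job_id):
    # Assumed job-info endpoint pattern, based on the other endpoints
    # in this tutorial; verify the exact path in the documentation.
    return f"https://ect.oxylabs.io/v1/jobs/{job_id}"


def wait_for_job(job_id, auth, interval=10, timeout=600):
    """Poll the job until it is no longer running, then return its info."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = requests.get(job_info_url(job_id), auth=auth).json()
        # "running" is an assumed status value; adjust to the real ones.
        if info.get("status") != "running":
            return info
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```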
You\ncan make sure the job is finished by checking the **job information**.\nWhen it’s done, send another request to the **sitemap endpoint**\n`https://ect.oxylabs.io/v1/jobs/{id}/sitemap` to return a list of URLs.\nFor example:\n\n```python\nimport requests, json\nfrom pprint import pprint\n\n# Store the JSON response containing URLs (sitemap).\nsitemap = requests.request(\n    'GET',\n    'https://ect.oxylabs.io/v1/jobs/{id}/sitemap',  # Replace {id} with the job ID.\n    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.\n)\n\n# Write the decoded JSON response to a .json file.\nwith open('sitemap.json', 'w') as f:\n    json.dump(sitemap.json(), f)\n\n# Print the decoded JSON response.\npprint(sitemap.json())\n```\n\n### Getting parsed results\n\nTo get parsed content, use the `parsed` parameter. Using the example\nbelow, we can crawl all URLs found on [\u003cu\u003ethis Amazon\npage\u003c/u\u003e](https://www.amazon.com/s?i=electronics-intl-ship\u0026bbn=16225009011\u0026rh=n%3A502394%2Cn%3A281052\u0026dc\u0026qid=1679564333\u0026rnid=502394\u0026ref=sr_pg_1)\nand then parse the content of each URL. 
This time, we’re using the\n`amazon` source as we’re scraping content from the specified Amazon page.\nSo, let’s see all of this put together in Python:\n\n```python\nimport requests, json\nfrom pprint import pprint\n\n# Set the content type to JSON.\nheaders = {\"Content-Type\": \"application/json\"}\n\n# Parse content from the URLs found in the target URL.\npayload = {\n    \"url\": \"https://www.amazon.com/s?i=electronics-intl-ship\u0026bbn=16225009011\u0026rh=n%3A502394%2Cn%3A281052\u0026dc\u0026qid\"\n           \"=1679564333\u0026rnid=502394\u0026ref=sr_pg_1\",\n    \"filters\": {\n        \"crawl\": [\".*\"],\n        \"process\": [\".*\"],\n        \"max_depth\": 1\n    },\n    \"scrape_params\": {\n        \"source\": \"amazon\",\n        \"user_agent_type\": \"desktop\"\n    },\n    \"output\": {\n        \"type_\": \"parsed\"\n    }\n}\n\n# Create a job and store the JSON response.\nresponse = requests.request(\n    'POST',\n    'https://ect.oxylabs.io/v1/jobs',\n    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.\n    headers=headers,\n    json=payload,\n)\n\n# Write the decoded JSON response to a .json file.\nwith open('job_parsed.json', 'w') as f:\n    json.dump(response.json(), f)\n\n# Print the decoded JSON response.\npprint(response.json())\n```\n\nNote that if you want to use the `geo_location` parameter when scraping\nAmazon pages, you must set its value to the preferred location’s\nzip/postal code. For more information, visit [\u003cu\u003ethis\npage\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/e-commerce-scraper-api/features/geo-location#amazon)\nin our documentation.\n\nOnce the job is complete, you can check how many chunks your request has\ngenerated and then download the content from each chunk with this\nendpoint: `https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}`. 
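Before downloading anything, you can call the chunk-list endpoint described earlier to see how many chunks the job produced. A minimal sketch, wrapped in a function so it can be reused for any job ID:

```python
import requests


def list_result_chunks(job_id, auth):
    """Return the list of aggregate result chunks for a finished job."""
    url = f"https://ect.oxylabs.io/v1/jobs/{job_id}/aggregate"
    response = requests.get(url, auth=auth)
    response.raise_for_status()
    return response.json()


# Usage (your credentials go here):
# chunks = list_result_chunks('12345', ('USERNAME', 'PASSWORD'))
```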
For\ninstance, with the following code snippet, we’re printing the first\nchunk:\n\n```python\nimport requests, json\nfrom pprint import pprint\n\n# Store the JSON response containing parsed results.\nparsed_results = requests.request(\n    'GET',\n    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.\n    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.\n)\n\n# Write the decoded JSON response to a .json file.\nwith open('parsed_results_1.json', 'w') as f:\n    json.dump(parsed_results.json(), f)\n\n# Print the decoded JSON response.\npprint(parsed_results.json())\n```\n\n### Getting HTML results\n\nThe code to get HTML results doesn’t differ much from the code in the\nprevious section. The only difference is that we’ve set the `type_`\nparameter to `html`. Let’s see the code sample:\n\n```python\nimport requests, json\nfrom pprint import pprint\n\n# Set the content type to JSON.\nheaders = {\"Content-Type\": \"application/json\"}\n\n# Index HTML results of URLs found in the target URL.
\npayload = {\n    \"url\": \"https://www.amazon.com/s?i=electronics-intl-ship\u0026bbn=16225009011\u0026rh=n%3A502394%2Cn%3A281052\u0026dc\u0026qid\"\n           \"=1679564333\u0026rnid=502394\u0026ref=sr_pg_1\",\n    \"filters\": {\n        \"crawl\": [\".*\"],\n        \"process\": [\".*\"],\n        \"max_depth\": 1\n    },\n    \"scrape_params\": {\n        \"source\": \"universal\",\n        \"user_agent_type\": \"desktop\"\n    },\n    \"output\": {\n        \"type_\": \"html\"\n    }\n}\n\n# Create a job and store the JSON response.\nresponse = requests.request(\n    'POST',\n    'https://ect.oxylabs.io/v1/jobs',\n    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.\n    headers=headers,\n    json=payload,\n)\n\n# Write the decoded JSON response to a .json file.\nwith open('job_html.json', 'w') as f:\n    json.dump(response.json(), f)\n\n# Print the decoded JSON response.\npprint(response.json())\n```\n\nAgain, you’ll need to make a request to retrieve each chunk of the\nresult. We’ll use the BeautifulSoup4 library to parse HTML, but this\nstep is optional. We then write the parsed content to an **.html** file.\nThe code example below downloads content from the first chunk:\n\n```python\nimport requests\nfrom bs4 import BeautifulSoup\n\n# Store the JSON response containing HTML results.\nhtml_response = requests.request(\n    'GET',\n    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.\n    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.\n)\n\n# Parse the HTML content.\nsoup = BeautifulSoup(html_response.content, 'html.parser')\nhtml_results = soup.prettify()\n\n# Write the HTML results to an .html file.\nwith open('html_results.html', 'w') as f:\n    f.write(html_results)\n\n# Print the HTML results.\nprint(html_results)\n```\n\nYou can modify the code files as needed per your requirements.\n\nThis tutorial covered the fundamental aspects of using Web Crawler. 
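Since the three jobs above differ only in their payloads, one way to keep your own scripts tidy is a small builder function. This is just a sketch based on the parameters used throughout this tutorial, not part of the API itself:

```python
def build_crawler_payload(url, output_type, source=None,
                          crawl=".*", process=".*", max_depth=1):
    """Build a Web Crawler job payload for a given output type
    ('sitemap', 'parsed', or 'html'), mirroring the examples above."""
    payload = {
        "url": url,
        "filters": {
            "crawl": [crawl],
            "process": [process],
            "max_depth": max_depth,
        },
        "scrape_params": {"user_agent_type": "desktop"},
        "output": {"type_": output_type},
    }
    if source is not None:
        # e.g. "amazon" or "universal"; omitted for sitemap jobs.
        payload["scrape_params"]["source"] = source
    return payload
```

You can then pass the returned dictionary straight to the job-creation endpoint as the request's JSON body.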
We\nrecommend looking at [\u003cu\u003eour\ndocumentation\u003c/u\u003e](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/web-crawler)\nfor more information on using the endpoints and query parameters. In\ncase you have any questions, you can always contact us at\n[\u003cu\u003ehello@oxylabs.io\u003c/u\u003e](mailto:hello@oxylabs.io) or via live chat on\nour [\u003cu\u003ewebsite\u003c/u\u003e](https://oxylabs.io/).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fweb-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-crawler/lists"}