{"id":20710207,"url":"https://github.com/oxylabs/asynchronous-web-scraping-python","last_synced_at":"2026-02-28T08:33:00.265Z","repository":{"id":231539548,"uuid":"665565475","full_name":"oxylabs/asynchronous-web-scraping-python","owner":"oxylabs","description":"A comparison of asynchronous and synchronous web scraping methods with practical examples.","archived":false,"fork":false,"pushed_at":"2024-04-04T13:27:55.000Z","size":8735,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-17T02:11:42.411Z","etag":null,"topics":["async","asynchronous","data-acquisition","python","synchronous","tutorial","web-scraping","web-scraping-python","web-scraping-tutorials"],"latest_commit_sha":null,"homepage":"https://oxylabs.io/blog/asynchronous-web-scraping-python-aiohttp","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-07-12T13:38:54.000Z","updated_at":"2024-11-04T09:22:22.000Z","dependencies_parsed_at":"2024-04-04T14:51:09.903Z","dependency_job_id":null,"html_url":"https://github.com/oxylabs/asynchronous-web-scraping-python","commit_stats":null,"previous_names":["oxylabs/asynchronous-web-scraping-python"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fasynchronous-web-scraping-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fasynchronous-web-scraping-python/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fasynchronous-web-scraping-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fasynchronous-web-scraping-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/asynchronous-web-scraping-python/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234429241,"owners_count":18831240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["async","asynchronous","data-acquisition","python","synchronous","tutorial","web-scraping","web-scraping-python","web-scraping-tutorials"],"created_at":"2024-11-17T02:10:30.190Z","updated_at":"2025-09-27T11:30:28.410Z","avatar_url":"https://github.com/oxylabs.png","language":"Python","readme":"# Asynchronous Web Scraping With Python \u0026 AIOHTTP\n\n[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877\u0026utm_medium=affiliate\u0026groupid=877\u0026utm_content=asynchronous-web-scraping-python-github\u0026transaction_id=102f49063ab94276ae8f116d224b67)\n\n[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge\u0026theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge\u0026logo=youtube\u0026logoColor=white)](https://www.youtube.com/@oxylabs)\n\n- [Sending asynchronous HTTP 
requests](#sending-asynchronous-http-requests)\n  * [1. Create an empty Python file with a main function](#1-create-an-empty-python-file-with-a-main-function)\n  * [2. Track script execution time](#2-track-script-execution-time)\n  * [3. Create a loop](#3-create-a-loop)\n  * [4. Create a scrape functionality](#4-create-a-scrape-functionality)\n  * [5. Add save_product function](#5-add-save_product-function)\n  * [6. Run the script](#6-run-the-script)\n- [Sending synchronous HTTP requests](#sending-synchronous-http-requests)\n  * [1. Create a Python file with a main function](#1-create-a-python-file-with-a-main-function)\n  * [2. Track script execution time](#2-track-script-execution-time-1)\n  * [3. Create a loop](#3-create-a-loop-1)\n  * [4. Create a scrape function](#4-create-a-scrape-function)\n  * [5. Add save_product function](#5-add-save_product-function-1)\n  * [6. Run the script](#6-run-the-script-1)\n- [Comparing the performance of sync and async](#comparing-the-performance-of-sync-and-async)\n\nIn this tutorial, we will focus on scraping multiple URLs using the asynchronous method and, by comparing it to the synchronous one, demonstrate why the asynchronous approach can be significantly faster. See the [\u003cu\u003efull blog\npost\u003c/u\u003e](https://oxylabs.io/blog/asynchronous-web-scraping-python-aiohttp)\nfor more information on asynchronous web scraping.\n\nYou can also check out [\u003cu\u003eone of our videos\u003c/u\u003e](https://www.youtube.com/watch?v=Raa9f5kpvtE) for a visual representation of\nthe same web scraping tutorial.\n\n## Sending asynchronous HTTP requests\n\nLet’s start with the asynchronous approach. For this\nuse case, we will use the `aiohttp` module.\n\n### 1. Create an empty Python file with a main function\n\nNote that the main function is marked as asynchronous. 
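To see what that marking does mechanically: calling an async function does not execute its body; it returns a coroutine object, which only runs once an event loop drives it. A minimal stdlib-only sketch (the variable names are illustrative, not part of the tutorial's files):\n\n```python\nimport asyncio\n\n\nasync def main():\n    print('Saving the output of extracted information')\n\n\n# Calling main() does NOT run the print yet; it merely builds a coroutine.\ncoro = main()\nprint(type(coro).__name__)  # coroutine\n\n# The body only executes once an event loop drives the coroutine.\nasyncio.run(coro)\n```\n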
We use `asyncio.run()` to start the event\nloop and keep the script from exiting until the main function\ncompletes.\n```python\nimport asyncio\n\n\nasync def main():\n    print('Saving the output of extracted information')\n\n\nasyncio.run(main())\n```\n\nOnce again, it is a good idea to track the performance of your script.\nFor that purpose, let's write code that tracks script execution time.\n\n### 2. Track script execution time\n\nAs with the first example, record the time at the start of the script.\nThen, type in any code that you need to measure (currently a single\n`print` statement). Finally, calculate how much time has passed by taking\nthe current time and subtracting the time at the start of the script.\nOnce we have how much time has passed, we print it while rounding the\nresulting float to two decimal places.\n\n```python\nimport asyncio\nimport time\n\n\nasync def main():\n    start_time = time.time()\n\n    print('Saving the output of extracted information')\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\n\nasyncio.run(main())\n```\n\nNow it's time to read the CSV file that contains the URLs. The file will contain a\nsingle column called `url`. There, you will see all the URLs that need to\nbe scraped for data.\n\n![CSV file with a list of URLs](images/url_list.png)\n\n### 3. Create a loop\n\nNext, we open up urls.csv, then load it using the csv module and loop over\nevery URL in the CSV file. 
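If you want to preview what `csv.DictReader` yields before creating urls.csv, you can feed it an in-memory sample; here `io.StringIO` stands in for the real file, and the URLs are placeholders:\n\n```python\nimport csv\nimport io\n\n# Stand-in for urls.csv: a header row followed by placeholder URLs.\nsample_file = io.StringIO('url\\nhttps://example.com/book-1\\nhttps://example.com/book-2\\n')\n\ncsv_reader = csv.DictReader(sample_file)\nfor csv_row in csv_reader:\n    # Each row is a dict keyed by the header, so csv_row['url'] holds the URL.\n    print(csv_row['url'])\n```\n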
Additionally, we need to create an\nasync task for every URL we are going to scrape.\n\n```python\nimport asyncio\nimport csv\nimport time\n\n\nasync def main():\n    start_time = time.time()\n\n    with open('urls.csv') as file:\n        csv_reader = csv.DictReader(file)\n        for csv_row in csv_reader:\n            # the url from csv can be found in csv_row['url']\n            print(csv_row['url'])\n\n    print('Saving the output of extracted information')\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\n\nasyncio.run(main())\n```\n\nLater in the function, we wait for all the scraping tasks to complete\nbefore moving on.\n\n```python\nimport asyncio\nimport csv\nimport time\n\n\nasync def main():\n    start_time = time.time()\n\n    tasks = []\n    with open('urls.csv') as file:\n        csv_reader = csv.DictReader(file)\n        for csv_row in csv_reader:\n            task = asyncio.create_task(scrape(csv_row['url']))\n            tasks.append(task)\n\n    print('Saving the output of extracted information')\n    await asyncio.gather(*tasks)\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\n\nasyncio.run(main())\n```\n\nAll that's left is scraping! But before doing that, remember to take a\nlook at the data you're scraping.\n\nThe [\u003cu\u003etitle of the\nbook\u003c/u\u003e](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html)\ncan be extracted from an `\u003ch1\u003e` tag, which is wrapped by a `\u003cdiv\u003e` tag\nwith a `product_main` class.\n\n![Product title in Developer Tools](images/title.png)\n\nAs for the product information, it can be found in a table with a\n`table-striped` class.\n\n![Product information in Developer Tools](images/product_information.png)\n\n### 4. 
Create a scrape functionality\n\nThe scrape function makes a request to the URL we loaded from the CSV\nfile. Once the request is done, it loads the response HTML using the\nBeautifulSoup module. Then we use the knowledge about where the data is\nstored in HTML tags to extract the book name into the `book_name` variable\nand collect all product information into a `product_info` dictionary.\n\n```python\nimport asyncio\nimport csv\nimport time\nimport aiohttp\nfrom bs4 import BeautifulSoup\n\n\nasync def scrape(url):\n    async with aiohttp.ClientSession() as session:\n        async with session.get(url) as resp:\n            body = await resp.text()\n            soup = BeautifulSoup(body, 'html.parser')\n            book_name = soup.select_one('.product_main').h1.text\n            rows = soup.select('.table.table-striped tr')\n            product_info = {row.th.text: row.td.text for row in rows}\n\n\nasync def main():\n    start_time = time.time()\n\n    tasks = []\n    with open('urls.csv') as file:\n        csv_reader = csv.DictReader(file)\n        for csv_row in csv_reader:\n            task = asyncio.create_task(scrape(csv_row['url']))\n            tasks.append(task)\n\n    print('Saving the output of extracted information')\n    await asyncio.gather(*tasks)\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\n\nasyncio.run(main())\n```\n\n### 5. Add save_product function\n\nThe URL is scraped; however, no results can be seen. For that, you need\nto add another function – `save_product`.\n\n`save_product` takes two parameters: the book name and the product info\ndictionary. Since the book name contains spaces, we first replace them\nwith underscores. 
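For instance, with a hypothetical title:\n\n```python\nbook_name = 'A Light in the Attic'\n\n# Replace spaces so the title can safely be used as a file name.\njson_file_name = book_name.replace(' ', '_')\nprint(json_file_name)  # A_Light_in_the_Attic\n```\n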
Finally, we create a JSON file and dump all the info\nwe have into it. Make sure a data directory exists in the folder of\nyour script, as that is where all the JSON files will be saved.\n\n```python\nimport asyncio\nimport csv\nimport json\nimport time\nimport aiohttp\nfrom bs4 import BeautifulSoup\n\n\nasync def save_product(book_name, product_info):\n    json_file_name = book_name.replace(' ', '_')\n    with open(f'data/{json_file_name}.json', 'w') as book_file:\n        json.dump(product_info, book_file)\n\n\nasync def scrape(url):\n    async with aiohttp.ClientSession() as session:\n        async with session.get(url) as resp:\n            body = await resp.text()\n            soup = BeautifulSoup(body, 'html.parser')\n            book_name = soup.select_one('.product_main').h1.text\n            rows = soup.select('.table.table-striped tr')\n            product_info = {row.th.text: row.td.text for row in rows}\n            await save_product(book_name, product_info)\n\n\nasync def main():\n    start_time = time.time()\n\n    tasks = []\n    with open('urls.csv') as file:\n        csv_reader = csv.DictReader(file)\n        for csv_row in csv_reader:\n            task = asyncio.create_task(scrape(csv_row['url']))\n            tasks.append(task)\n\n    print('Saving the output of extracted information')\n    await asyncio.gather(*tasks)\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\n\nasyncio.run(main())\n```\n\n### 6. Run the script\n\nLastly, you can run the script and see the data.\n\n![Asynchronous web scraping output](images/asynchronous_output.png)\n\n## Sending synchronous HTTP requests\n\nIn this section, we are going to scrape the URLs defined in urls.csv using a\nsynchronous approach. For this particular use case, the Python `requests`\nmodule is an ideal tool.\n\n### 1. 
Create a Python file with a main function\n```python\ndef main():\n    print('Saving the output of extracted information')\n\nmain()\n```\n\nTracking the performance of your script is always a good idea.\nTherefore, the next step is to add code that tracks script execution\ntime.\n\n### 2. Track script execution time\n\nFirst, record the time at the very start of the script. Then, type in any\ncode that needs to be measured – in this case, we are using a single\n`print` statement. Finally, calculate how much time has passed. This can\nbe done by taking the current time and subtracting the time at the start\nof the script. Once we know how much time has passed, we can print it\nwhile rounding the resulting float to two decimal places.\n\n```python\nimport time\n\n\ndef main():\n    start_time = time.time()\n\n    print('Saving the output of extracted information')\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\nmain()\n```\n\nNow that the preparations are done, it's time to read the CSV file that\ncontains URLs. There, you will see a single column called `url`, which\nwill contain URLs that have to be scraped for data.\n\n### 3. Create a loop\n\nNext, we have to open up urls.csv. After that, load it using the csv\nmodule and loop over every URL from the CSV file.\n\n```python\nimport csv\nimport time\n\n\ndef main():\n    start_time = time.time()\n\n    print('Saving the output of extracted information')\n    with open('urls.csv') as file:\n        csv_reader = csv.DictReader(file)\n        for csv_row in csv_reader:\n            # the url from csv can be found in csv_row['url']\n            print(csv_row['url'])\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' 
% time_difference)\n\nmain()\n```\n\nAt this point, the job is almost done – all that’s left to do is the\nscraping itself. Before you do that, though, take a look at the data you’re\nscraping.\n\nThe title of the book “A Light in the Attic” can be extracted from an\n`\u003ch1\u003e` tag, which is wrapped by a `\u003cdiv\u003e` tag with a `product_main` class.\n\n![Product title in Developer Tools](images/title.png)\n\nAs for the product information, it can be found in a table with a\n`table-striped` class, as you can see in the developer tools.\n\n![Product information in Developer Tools](images/product_information.png)\n\n### 4. Create a scrape function\n\nNow, let's use what we've learned and create a `scrape` function.\n\nThe scrape function makes a request to the URL we loaded from the CSV\nfile. Once the request is done, it loads the response HTML using the\nBeautifulSoup module. Then, we use the knowledge about where the data is\nstored in HTML tags to extract the book name into the `book_name` variable\nand collect all product information into a `product_info` dictionary.\n\n```python\nimport csv\nimport time\nimport requests\nfrom bs4 import BeautifulSoup\n\n\ndef scrape(url):\n    response = requests.get(url)\n    soup = BeautifulSoup(response.content, 'html.parser')\n    book_name = soup.select_one('.product_main').h1.text\n    rows = soup.select('.table.table-striped tr')\n    product_info = {row.th.text: row.td.text for row in rows}\n\n\ndef main():\n    start_time = time.time()\n\n    print('Saving the output of extracted information')\n    with open('urls.csv') as file:\n        csv_reader = csv.DictReader(file)\n        for csv_row in csv_reader:\n            scrape(csv_row['url'])\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\n\nmain()\n```\n\nThe URL is scraped; however, no results are seen yet. For that, it’s\ntime to add yet another function – `save_product`.\n\n### 5. 
Add save_product function\n\n`save_product` takes two parameters: the book name and the product info\ndictionary. Since the book name contains spaces, we first replace them\nwith underscores. Finally, we create a JSON file and dump all the info\nwe have into it. Make sure you create a data directory in the folder of\nyour script, as that is where all the JSON files are going to be saved.\n\n```python\nimport csv\nimport json\nimport time\nimport requests\nfrom bs4 import BeautifulSoup\n\n\ndef save_product(book_name, product_info):\n    json_file_name = book_name.replace(' ', '_')\n    with open(f'data/{json_file_name}.json', 'w') as book_file:\n        json.dump(product_info, book_file)\n\n\ndef scrape(url):\n    response = requests.get(url)\n    soup = BeautifulSoup(response.content, 'html.parser')\n    book_name = soup.select_one('.product_main').h1.text\n    rows = soup.select('.table.table-striped tr')\n    product_info = {row.th.text: row.td.text for row in rows}\n    save_product(book_name, product_info)\n\n\ndef main():\n    start_time = time.time()\n\n    print('Saving the output of extracted information')\n    with open('urls.csv') as file:\n        csv_reader = csv.DictReader(file)\n        for csv_row in csv_reader:\n            scrape(csv_row['url'])\n\n    time_difference = time.time() - start_time\n    print('Scraping time: %.2f seconds.' % time_difference)\n\n\nmain()\n```\n\n### 6. Run the script\n\nNow, it's time to run the script and see the data. 
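If you'd like to sanity-check the parsing logic without any network access, the same BeautifulSoup extraction can be run against an inline snippet (hypothetical markup that merely mirrors the structure of a books.toscrape.com product page):\n\n```python\nfrom bs4 import BeautifulSoup\n\n# Hypothetical HTML mirroring the parts of the page the tutorial scrapes.\nhtml = '''\n\u003cdiv class="product_main"\u003e\u003ch1\u003eA Light in the Attic\u003c/h1\u003e\u003c/div\u003e\n\u003ctable class="table table-striped"\u003e\n  \u003ctr\u003e\u003cth\u003eUPC\u003c/th\u003e\u003ctd\u003ea897fe39b1053632\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003cth\u003eAvailability\u003c/th\u003e\u003ctd\u003eIn stock\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n'''\n\nsoup = BeautifulSoup(html, 'html.parser')\nbook_name = soup.select_one('.product_main').h1.text\nrows = soup.select('.table.table-striped tr')\nproduct_info = {row.th.text: row.td.text for row in rows}\n\nprint(book_name)     # A Light in the Attic\nprint(product_info)  # {'UPC': 'a897fe39b1053632', 'Availability': 'In stock'}\n```\n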
Here, we can also see\nhow much time the scraping took – in this case, it’s 17.54 seconds.\n\n![Synchronous web scraping output](images/synchronous_output.png)\n\n## Comparing the performance of sync and async\n\nNow that we have carefully gone through the processes of making requests with\nboth synchronous and asynchronous methods, we can run the requests once\nagain and compare the performance of the two scripts.\n\nThe time difference is huge – while the async web scraping code was able\nto execute all the tasks in around 3 seconds, it took almost 16 seconds for the\nsynchronous one. This demonstrates that asynchronous scraping can be\nconsiderably faster, since the requests are made concurrently instead of one\nafter another.\n\n![Time comparison of synchronous and asynchronous web scraping](images/speed_comparison.png)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fasynchronous-web-scraping-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fasynchronous-web-scraping-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fasynchronous-web-scraping-python/lists"}