{"id":30113780,"url":"https://github.com/hjsblogger/web-crawling-with-python","last_synced_at":"2025-08-10T07:30:23.770Z","repository":{"id":299361566,"uuid":"977876731","full_name":"hjsblogger/web-crawling-with-python","owner":"hjsblogger","description":"Demonstration of Web Crawling using Python and Beautiful Soup","archived":false,"fork":false,"pushed_at":"2025-07-03T17:17:54.000Z","size":17,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-03T18:30:01.029Z","etag":null,"topics":["beautifulsoup","beautifulsoup4","lambdatest","python","python3","web-crawler","web-crawling","web-crawling-and-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hjsblogger.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-05T05:40:55.000Z","updated_at":"2025-07-03T17:17:57.000Z","dependencies_parsed_at":"2025-07-03T18:37:01.254Z","dependency_job_id":null,"html_url":"https://github.com/hjsblogger/web-crawling-with-python","commit_stats":null,"previous_names":["hjsblogger/web-crawling-with-python"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hjsblogger/web-crawling-with-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hjsblogger%2Fweb-crawling-with-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hjsblogger%2Fweb-crawling-with-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/hjsblogger%2Fweb-crawling-with-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hjsblogger%2Fweb-crawling-with-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hjsblogger","download_url":"https://codeload.github.com/hjsblogger/web-crawling-with-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hjsblogger%2Fweb-crawling-with-python/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269693252,"owners_count":24460223,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","beautifulsoup4","lambdatest","python","python3","web-crawler","web-crawling","web-crawling-and-scraping"],"created_at":"2025-08-10T07:30:20.148Z","updated_at":"2025-08-10T07:30:23.756Z","avatar_url":"https://github.com/hjsblogger.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Crawling with Python\n\n\u003cimg width=\"1000\" height=\"500\" alt=\"cover-image\" src=\"https://github.com/user-attachments/assets/840330c8-b856-4376-9148-5466b57ab3f3\"\u003e\n\u003cdiv align=\"center\"\u003eImage generated using Grok\u003c/a\u003e\u003c/div\u003e\n\u003cbr/\u003e\n\nIn this 'Web Crawling with Python' repo, we 
have covered the following scenario:\n\nUnique links from [LambdaTest E-commerce Playground](https://ecommerce-playground.lambdatest.io/) are crawled using Beautiful Soup. Content (i.e., product metadata) from the crawled pages is then scraped with Beautiful Soup. I also have a detailed blog \u0026 repo on **Web Scraping with Python**; details below:\n\n* [Blog - Web Scraping with Python](https://www.lambdatest.com/blog/web-scraping-with-python/)\n* [Repo - Web Scraping with Python](https://github.com/hjsblogger/web-scraping-with-python)\n\n## Pre-requisites for test execution\n\n**Step 1**\n\nCreate a virtual environment by triggering the *virtualenv venv* command on the terminal\n\n```bash\nvirtualenv venv\n```\n\u003cimg width=\"1418\" alt=\"VirtualEnvironment\" src=\"https://github.com/hjsblogger/web-scraping-with-python/assets/1688653/89beb6af-549f-42ac-a063-e5f715018ef8\"\u003e\n\n**Step 2**\n\nActivate the newly created virtual environment by triggering the *source venv/bin/activate* command on the terminal\n\n```bash\nsource venv/bin/activate\n```\n\nFollow step (3) to install the dependencies:\n\n**Step 3**\n\nRun the *make install* command on the terminal to install the desired packages (or dependencies) - Beautiful Soup, urllib3, etc.\n\n```bash\nmake install\n```\n\n\u003cimg width=\"1413\" alt=\"make-install\" src=\"https://github.com/user-attachments/assets/9780b589-86cc-43d0-ab88-7bbccfef8663\" /\u003e\n\nWith this, all the dependencies and environment variables are set. 
We are all set for web crawling with Beautiful Soup (bs4).\n\n## Web Crawling using Beautiful Soup\n\nFollow the steps below to crawl the [LambdaTest E-commerce Playground](https://ecommerce-playground.lambdatest.io/)\n\n**Step 1**\n\nTrigger the ```make clean``` command to remove the `__pycache__` folder(s) and .pyc files\n\n\u003cimg width=\"710\" alt=\"cover-image\" src=\"https://github.com/hjsblogger/web-scraping-with-python/assets/1688653/1baf2aeb-fab1-4207-8547-4c07a70074c2\"\u003e\n\u003cbr/\u003e\n\n**Step 2**\n\nTrigger the ```make crawl-ecommerce-playground``` command on the terminal to crawl the LambdaTest E-Commerce Playground\n\n\u003cimg width=\"939\" alt=\"web-crawling-1\" src=\"https://github.com/user-attachments/assets/e748ea89-5e5a-43df-8b19-13ba6d78d5e0\" /\u003e\n\n\u003cimg width=\"1154\" alt=\"web-crawling-2\" src=\"https://github.com/user-attachments/assets/79fbc9d5-a060-4411-96ed-b452b4ebdb19\" /\u003e\n\nAs seen above, the content from the LambdaTest E-commerce Playground was crawled successfully! Fifty-five unique product links are now available to be scraped in the exported JSON file (i.e., ecommerce_crawled_urls.json).\n\n**Step 3**\n\nNow that we have the crawled information, trigger the ```make scrap-ecommerce-playground``` command on the terminal to scrape the product information (e.g., product name, product price, and product availability) from the links in the exported JSON file.\n\n\u003cimg width=\"1181\" alt=\"web-scraping-1\" src=\"https://github.com/user-attachments/assets/238d6d34-388b-4671-9249-e0e1358b90b2\" /\u003e\n\n\u003cimg width=\"1153\" alt=\"web-scraping-2\" src=\"https://github.com/user-attachments/assets/a2f06f81-c2a1-45b1-851e-fb35debb8dcf\" /\u003e\n\nAll 55 links are scraped without any issues!\n\n## Have feedback or need assistance?\nFeel free to fork the repo and contribute to make it better! 
Email to [himanshu[dot]sheth[at]gmail[dot]com](mailto:himanshu.sheth@gmail.com) for any queries or ping me on the following social media sites:\n\n\u003cb\u003eLinkedIn\u003c/b\u003e: [@hjsblogger](https://linkedin.com/in/hjsblogger)\u003cbr/\u003e\n\u003cb\u003eTwitter\u003c/b\u003e: [@hjsblogger](https://www.twitter.com/hjsblogger)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhjsblogger%2Fweb-crawling-with-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhjsblogger%2Fweb-crawling-with-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhjsblogger%2Fweb-crawling-with-python/lists"}