{"id":20710024,"url":"https://github.com/oxylabs/regex-web-scraping","last_synced_at":"2026-04-16T01:33:20.777Z","repository":{"id":134336663,"uuid":"526129353","full_name":"oxylabs/regex-web-scraping","owner":"oxylabs","description":"Web Scraping with RegEx ","archived":false,"fork":false,"pushed_at":"2025-09-24T13:00:46.000Z","size":18,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-27T21:23:14.827Z","etag":null,"topics":["github-python","python","regex","regex-scraping","using-regex-in-python","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-08-18T08:45:36.000Z","updated_at":"2025-09-24T13:00:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"328428ed-562b-4fbd-9efe-7d51e668e386","html_url":"https://github.com/oxylabs/regex-web-scraping","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/oxylabs/regex-web-scraping","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fregex-web-scraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fregex-web-scraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fregex-web-scraping/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fregex-web-scraping/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/regex-web-scraping/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fregex-web-scraping/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31867711,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T15:24:51.572Z","status":"ssl_error","status_checked_at":"2026-04-15T15:24:39.138Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["github-python","python","regex","regex-scraping","using-regex-in-python","web-scraping"],"created_at":"2024-11-17T02:09:35.031Z","updated_at":"2026-04-16T01:33:20.770Z","avatar_url":"https://github.com/oxylabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping With RegEx\n\n[![Oxylabs promo 
code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877\u0026utm_medium=affiliate\u0026groupid=877\u0026utm_content=regex-web-scraping-github\u0026transaction_id=102f49063ab94276ae8f116d224b67)\n\n[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge\u0026theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge\u0026logo=youtube\u0026logoColor=white)](https://www.youtube.com/@oxylabs)\n\n# Creating a virtual environment\n\n```bash\npython3 -m venv scrapingdemo\n```\n\n```bash\nsource ./scrapingdemo/bin/activate\n```\n\n# Installing requirements\n\n```bash\npip install requests\n```\n\n```bash\npip install beautifulsoup4\n```\n\n# Importing the required libraries\n\n```python\nimport requests\nfrom bs4 import BeautifulSoup\nimport re\n```\n\n## Sending the GET request\n\nUse the Requests library to send a GET request to the web page you want to scrape, in this case https://books.toscrape.com/:\n\n```python\npage = requests.get('https://books.toscrape.com/')\n```\n\n## Selecting data\n\nFirst, create a Beautiful Soup object, passing it the page content received from your request along with the parser type. Because the page is HTML, use `html.parser` as the parser type.\n\n![image](https://user-images.githubusercontent.com/95211181/189277235-ae681699-d475-411a-8bb6-7b3fcb2fc031.png)\n\nBy inspecting the elements in a browser (right-click and select inspect element), you can see that each book's title and price sit inside an `article` element with the class `product_pod`. 
Use Beautiful Soup to get all the data inside these elements, and then convert it to a string:\n\n```python\nsoup = BeautifulSoup(page.content, 'html.parser')\ncontent = soup.find_all(class_='product_pod')\ncontent = str(content)\n```\n\n## Processing the data using RegEx\n\nThe resulting string contains far more markup than you need, so create two regular expressions that capture only the titles and the prices.\n\n![](https://images.prismic.io/oxylabs-sm/YTViYjIyMTItZDczMi00OTVhLTliZDEtY2E2MTZiMDhmMzdh_image3.png?auto=compress,format\u0026rect=0,0,1486,520\u0026w=1486\u0026h=520\u0026fm=webp\u0026q=75)\n\n### Expression # 1\n### Finding the pattern\n\nFirst, inspect the title of the book to find the pattern. You can see above that every title appears after the text `title=`, in the form `title=\"Titlename\"`.\n\n### Generating the expression\n\nThen, create an expression that captures the text between the quotation marks after `title=` by using the non-greedy group `\"(.*?)\"`. The trailing `\u003e` in the expression requires the attribute to be the last one in its tag.\n\nThe first expression is as follows:\n\n```python\nre_titles = r'title=\"(.*?)\"\u003e'\n```\n\n### Expression # 2\n### Finding the pattern\n\nFirst, inspect the price of the book. Every price appears directly after the `£` sign, in the form `£price`, and is followed by the closing paragraph tag `\u003c/p\u003e`.\n\n### Generating the expression\n\nThen, create an expression that captures the text after the `£` and before the `\u003c/p\u003e` by specifying `£(.*?)\u003c/p\u003e`.\n\nThe second expression is as follows:\n\n```python\nre_prices = '£(.*?)\u003c/p\u003e'\n```\n\nNext, pass each expression to `re.findall` to collect every substring that matches the pattern. 
Save the results in the variables `titles_list` and `price_list`.\n\n```python\ntitles_list = re.findall(re_titles, content)\nprice_list = re.findall(re_prices, content)\n```\n\n## Saving the output\n\nTo save the output, pair each title with its price using `zip` and write every pair to the `output.txt` file as a tab-separated line.\n\n```python\nwith open(\"output.txt\", \"w\", encoding=\"utf-8\") as f:\n    for title, price in zip(titles_list, price_list):\n        f.write(title + \"\\t\" + price + \"\\n\")\n```\n\n![](https://images.prismic.io/oxylabs-sm/NDQ3OTE2NzItZTQ5MC00YzY5LThiYzAtNDM3MDcwODNkNjBl_image2-1.png?auto=compress,format\u0026rect=0,0,1180,953\u0026w=1180\u0026h=953\u0026fm=webp\u0026q=75)\n\nPutting everything together, here is the complete script. Save it as `demo.py` and run it with `python demo.py`:\n\n```python\n# Importing the required libraries.\nimport requests\nfrom bs4 import BeautifulSoup\nimport re\n\n# Requesting the HTML from the web page.\npage = requests.get(\"https://books.toscrape.com/\")\n\n# Selecting the data.\nsoup = BeautifulSoup(page.content, \"html.parser\")\ncontent = soup.find_all(class_=\"product_pod\")\ncontent = str(content)\n\n# Processing the data using regular expressions.\nre_titles = r'title=\"(.*?)\"\u003e'\ntitles_list = re.findall(re_titles, content)\nre_prices = \"£(.*?)\u003c/p\u003e\"\nprice_list = re.findall(re_prices, content)\n\n# Saving the output as tab-separated lines.\nwith open(\"output.txt\", \"w\", encoding=\"utf-8\") as f:\n    for title, price in zip(titles_list, price_list):\n        f.write(title + \"\\t\" + price + \"\\n\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fregex-web-scraping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fregex-web-scraping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fregex-web-scraping/lists"}