{"id":14980402,"url":"https://github.com/oxylabs/web-scraping-powershell","last_synced_at":"2025-06-19T21:33:17.932Z","repository":{"id":134336690,"uuid":"557796341","full_name":"oxylabs/web-scraping-powershell","owner":"oxylabs","description":"A comprehensive guide on using PowerShell and PowerHTML for designing reliable web scrapers.","archived":false,"fork":false,"pushed_at":"2025-02-11T12:57:39.000Z","size":172,"stargazers_count":10,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-09T01:51:18.193Z","etag":null,"topics":["powershell","powershell-core","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"PowerShell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-26T10:16:48.000Z","updated_at":"2025-02-11T12:57:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"c6d98852-d3da-475a-9919-265c8a0a10d7","html_url":"https://github.com/oxylabs/web-scraping-powershell","commit_stats":{"total_commits":5,"total_committers":3,"mean_commits":"1.6666666666666667","dds":0.4,"last_synced_commit":"8025ec7418f8c254249eccec16831ba762865cc1"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/oxylabs/web-scraping-powershell","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-powershell","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-powershell/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-powershell/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-powershell/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/web-scraping-powershell/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-powershell/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260835368,"owners_count":23070277,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["powershell","powershell-core","web-scraping"],"created_at":"2024-09-24T14:01:43.379Z","updated_at":"2025-06-19T21:33:12.917Z","avatar_url":"https://github.com/oxylabs.png","language":"PowerShell","readme":"# Web Scraping With PowerShell: The Ultimate Guide\r\n\r\n[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7\u0026aff_id=877\u0026url_id=112)\r\n\r\n\r\n[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)\r\n\r\n[\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026message=PowerShell\u0026color=brightgreen\" /\u003e]() [\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026message=PowerHTML\u0026color=yellow\" /\u003e]() [\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026message=Web%20Scraping\u0026color=important\" 
/\u003e](https://github.com/topics/web-scraping)\r\n\r\n[PowerShell Core](https://github.com/PowerShell/PowerShell) (an open-source, cross-platform successor to Windows PowerShell) is a configuration and automation engine designed by Microsoft for solving tasks and issues. PowerShell and its successors consist of a scripting language with object-oriented support and a command-line shell. \r\n\r\nPowerShell is often used in the data acquisition field. This guide will lead us through several examples of scraping data with PowerShell. We will also see why and how PowerHTML fits into the scraping process. Let's get started. \r\n\r\n**Note:** Don't miss our detailed guide on [web scraping with PowerShell and PowerHTML](https://oxylabs.io/blog/powershell-web-scraping). \r\n\r\n## Target for Scraping Examples\r\n\r\nThis guide takes [Books to Scrape](https://books.toscrape.com/) as a target for our PowerShell web scraping examples. The target website features hundreds of books under 52 categories. The link to each category is available on the index page, as shown in the screenshot below:\r\n\r\n![Index Page of the Target](IndexPage.png)\r\n\r\n## Scraping Categories URLs\r\n\r\nFirst, create a `.ps1` file for the `PowerShell` script. Write the following script in this newly created `.ps1` file and run it in PowerShell. It scrapes all the category URLs on the [target page](https://books.toscrape.com/). \r\n\r\n```powershell\r\n#scraping_book_category_urls.ps1\r\n# Select-Object -Unique removes duplicates without requiring sorted input\r\n$scraped_links = (Invoke-WebRequest -Uri 'https://books.toscrape.com/').Links.Href | Select-Object -Unique\r\n$reg_expression = 'catalogue/category/books/.*'\r\n$all_matches = ($scraped_links | Select-String $reg_expression -AllMatches).Matches\r\n \r\n$urls = foreach ($url in $all_matches){\r\n    $url.Value\r\n}\r\n$urls\r\n```\r\n\r\n## Scraping Single Book Information\r\n\r\nAssume we want to scrape some information (e.g., `Name`, `UPC_id`, `Price`, etc.) 
from the webpage of a book, [Libertarianism for Beginners](https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html). We can use the following PowerShell script: \r\n\r\n```powershell\r\n#scraping_single_book_info.ps1\r\n$book_html = Invoke-RestMethod 'https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html'\r\n \r\n# Named groups capture each field; (.|\\n)* spans the line breaks between table rows\r\n$reg_exp = '\u003cli class=\"active\".*\u003e(?\u003cname\u003e.*)\u003c/li\u003e(.|\\n)*\u003cth\u003eUPC\u003c/th\u003e\u003ctd.*\u003e(?\u003cupc_id\u003e.*)\u003c/td\u003e(.|\\n)*\u003cth\u003eProduct Type\u003c/th\u003e\u003ctd.*\u003e(?\u003cproduct_type\u003e.*)\u003c/td\u003e(.|\\n)*\u003cth\u003ePrice.*\u003c/th\u003e\u003ctd.*\u003e(?\u003cprice\u003e.*)\u003c/td\u003e(.|\\n)* \u003cth\u003eAvailability\u003c/th\u003e(.|\\n)*\u003ctd.*\u003e(?\u003cavailability\u003e.*)\u003c/td\u003e'\r\n \r\n$all_matches = ($book_html | Select-String $reg_exp -AllMatches).Matches\r\n \r\n$BookDetails =[PSCustomObject]@{\r\n  'Name' = ($all_matches.Groups.Where{$_.Name -like 'name'}).Value\r\n  'UPC_id' = ($all_matches.Groups.Where{$_.Name -like 'upc_id'}).Value\r\n  'Product Type' = ($all_matches.Groups.Where{$_.Name -like 'product_type'}).Value\r\n  'Price' = ($all_matches.Groups.Where{$_.Name -like 'price'}).Value\r\n  'Availability' = ($all_matches.Groups.Where{$_.Name -like 'availability'}).Value\r\n}\r\n$BookDetails \r\n```\r\n\r\n## Scraping All Books of a Specific Category\r\n\r\nThe following script can be used to scrape the `title` and `price` information of all the books in a specific category:\r\n\r\n```powershell\r\n#scraping_all_book_info.ps1\r\n$category_page_html=Invoke-RestMethod 'https://books.toscrape.com/catalogue/category/books/sports-and-games_17/index.html'\r\n\r\n$reg_exp = '\u003ch3\u003e\u003ca href=.* title=\\\"(?\u003ctitle\u003e.*)\\\"\u003e.*\u003c\\/a\u003e\u003c\\/h3\u003e(\\n.*){13}\u003cp class=\"price_color\"\u003e(?\u003cprice\u003e.*)\u003c\\/p\u003e'\r\n\r\n$all_matches = 
($category_page_html | Select-String $reg_exp -AllMatches).Matches\r\n$BookList = foreach ($book in $all_matches)\r\n{\r\n    [PSCustomObject]@{\r\n        'title' = ($book.Groups.Where{$_.Name -like 'title'}).Value\r\n        'price' = ($book.Groups.Where{$_.Name -like 'price'}).Value      \r\n    }\r\n}\r\n$BookList \r\n```\r\n\r\n## Parsing Data With PowerHTML\r\n\r\nSo far, we’ve used regular expressions to extract the required information from raw HTML content. Regular expressions are difficult to read and modify, so a more readable and maintainable parser is preferable. [PowerHTML](https://www.powershellgallery.com/packages/PowerHTML/0.1.7) helps in this scenario. As a wrapper over the [HtmlAgilityPack](https://html-agility-pack.net/), it supports *[XPath](https://oxylabs.io/blog/xpath-vs-css)* syntax for parsing HTML, which lets us query the raw content easily even in the absence of the HTML [Document Object Model (DOM)](https://www.w3.org/TR/WD-DOM/introduction.html).\r\n\r\n### Installing PowerHTML\r\n\r\nWe can install the PowerHTML module using the following command:\r\n\r\n```powershell\r\nInstall-Module -Name PowerHTML \r\n```\r\n\r\n### Using PowerHTML to scrape the single book information\r\n\r\nThe following script uses PowerHTML to scrape book information (e.g., `Name`, `UPC`, `Product_Type`, etc.) 
from a book page: [A Light in the Attic](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html).\r\n\r\n```powershell\r\n#scraping_single_book_info_with_PowerHTML.ps1\r\n$web_page = Invoke-WebRequest 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'\r\n \r\n$html = ConvertFrom-Html $web_page\r\n \r\n$BookDetails=[System.Collections.ArrayList]::new()\r\n \r\n# The book title is the text of the active breadcrumb item\r\n$name_of_book =$html.SelectNodes('//li') | Where-Object { $_.HasClass('active') }\r\n$name=$name_of_book.ChildNodes[0].innerText\r\n$n = New-Object -TypeName psobject\r\n$n | Add-Member -MemberType NoteProperty -Name Name -Value $name\r\n[void]$BookDetails.Add($n)\r\n \r\n$table = $html.SelectNodes('//table') | Where-Object { $_.HasClass('table-striped') }\r\n \r\n# Each table row holds a field name (th) and its value (td)\r\nforeach ($row in $table.SelectNodes('tr'))\r\n{\r\n    $name=$row.SelectSingleNode('th').innerText.Trim() \r\n    $value=$row.SelectSingleNode('td').innerText.Trim() -replace \"\\?\", \" \"\r\n    $new_obj = New-Object -TypeName psobject\r\n    $new_obj | Add-Member -MemberType NoteProperty -Name $name -Value $value\r\n    [void]$BookDetails.Add($new_obj)\r\n}\r\n \r\nWrite-Output 'Extracted Table Information'\r\n$table\r\n \r\nWrite-Output 'Extracted Book Details Parsed from HTML table'\r\n$BookDetails\r\n```\r\n\r\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-scraping-powershell","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fweb-scraping-powershell","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-scraping-powershell/lists"}