https://github.com/theohbrothers/pssitescraper
Cmdlets for scraping a site.
- Host: GitHub
- URL: https://github.com/theohbrothers/pssitescraper
- Owner: theohbrothers
- License: mit
- Created: 2016-11-08T16:50:16.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2025-02-13T11:24:48.000Z (8 months ago)
- Last Synced: 2025-03-26T07:22:00.706Z (7 months ago)
- Topics: html, powershell, pwsh, scrape, site, sitemap, uri, uri-scheme, url, website
- Language: PowerShell
- Homepage:
- Size: 84 KB
- Stars: 5
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PSSiteScraper
[GitHub Actions](https://github.com/theohbrothers/PSSiteScraper/actions)
[Releases](https://github.com/theohbrothers/PSSiteScraper/releases/)
[PowerShell Gallery](https://www.powershellgallery.com/packages/PSSiteScraper/)

Cmdlets for scraping a site.
## Agenda
- Get a site's sitemaps
- Get a site's published URLs from sitemaps
- Get URIs from HTML

## Install
Open [`powershell`](https://docs.microsoft.com/en-us/powershell/scripting/windows-powershell/install/installing-windows-powershell?view=powershell-5.1) or [`pwsh`](https://github.com/powershell/powershell#-powershell) and type:
```powershell
Install-Module -Name PSSiteScraper -Repository PSGallery -Scope CurrentUser -Verbose
```

If prompted to trust the repository, hit `Y` and `enter`.
## Usage
```powershell
Import-Module PSSiteScraper

# Get child sitemaps of a parent sitemap
Get-Sitemaps -Uri https://example.com/sitemap.xml

# Get URLs from a sitemap
Get-SitemapUris -Uri https://example.com/sitemap-child.xml

# Get URIs from all tags' attributes of given HTML
Get-HtmlUris -Html $html

# Get URIs of scheme 'foo' from all tags' attributes of given HTML. E.g. URI 'foo://bar/baz'
Get-HtmlUris -Html $html -UriScheme foo

# Get URIs of scheme 'https' from all attributes of 'a' tags of given HTML
Get-HtmlUris -Html $html -Tag a -UriScheme https

# Get URIs of scheme 'https' from the 'srcset' attribute of 'img' tags of given HTML
Get-HtmlUris -Html $html -Tag img -Attribute srcset -UriScheme https
```
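
For readers unfamiliar with the distinction between a parent sitemap and a child sitemap, the following is a minimal sketch of the two document shapes defined by the sitemaps.org protocol, parsed with PowerShell's built-in `[xml]` type only. It is not the module's implementation and makes no claim about what `Get-Sitemaps` or `Get-SitemapUris` return; the sample XML, the `example.com` URLs, and the variable names are all assumptions made for illustration.

```powershell
# Hypothetical parent sitemap (a "sitemap index"): each <sitemap><loc> names a child sitemap.
$sitemapIndex = [xml]@'
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>
'@

# Hypothetical child sitemap (a "urlset"): each <url><loc> names a published page.
$sitemap = [xml]@'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
'@

# Collect the <loc> values; @() keeps the result an array even for a single entry.
$childSitemaps = @($sitemapIndex.sitemapindex.sitemap | ForEach-Object { $_.loc })
$pageUrls      = @($sitemap.urlset.url | ForEach-Object { $_.loc })

$childSitemaps
$pageUrls
```

In the usage above, `Get-Sitemaps` is pointed at a document of the first kind and `Get-SitemapUris` at a document of the second kind.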