https://github.com/theohbrothers/pssitescraper
Cmdlets for scraping a site.
- Host: GitHub
- URL: https://github.com/theohbrothers/pssitescraper
- Owner: theohbrothers
- License: mit
- Created: 2016-11-08T16:50:16.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2021-05-30T03:27:39.000Z (over 3 years ago)
- Last Synced: 2023-03-07T01:07:05.783Z (almost 2 years ago)
- Topics: html, powershell, pwsh, scrape, site, sitemap, uri, uri-scheme, url, website
- Language: PowerShell
- Homepage:
- Size: 80.1 KB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PSSiteScraper
[![github-actions](https://github.com/theohbrothers/PSSiteScraper/workflows/ci-master-pr/badge.svg)](https://github.com/theohbrothers/PSSiteScraper/actions)
[![github-release](https://img.shields.io/github/v/release/theohbrothers/PSSiteScraper?style=flat-square)](https://github.com/theohbrothers/PSSiteScraper/releases/)
[![powershell-gallery-release](https://img.shields.io/powershellgallery/v/PSSiteScraper?logo=powershell&logoColor=white&label=PSGallery&labelColor=&style=flat-square)](https://www.powershellgallery.com/packages/PSSiteScraper/)

Cmdlets for scraping a site.
## Agenda
- Get a site's sitemaps
- Get a site's published URLs from sitemaps
- Get URIs from HTML

## Install
Open [`powershell`](https://docs.microsoft.com/en-us/powershell/scripting/windows-powershell/install/installing-windows-powershell?view=powershell-5.1) or [`pwsh`](https://github.com/powershell/powershell#-powershell) and type:
```powershell
Install-Module -Name PSSiteScraper -Repository PSGallery -Scope CurrentUser -Verbose
```

If prompted to trust the repository, hit `Y` and `enter`.
## Usage
```powershell
Import-Module PSSiteScraper

# Get child sitemaps of a parent sitemap
Get-Sitemaps -Uri https://example.com/sitemap.xml

# Get URLs from a sitemap
Get-SitemapUris -Uri https://example.com/sitemap-child.xml

# $html is assumed to hold an HTML string, e.g. (Invoke-WebRequest -Uri https://example.com).Content
# Get URIs from all tags' attributes of given HTML
Get-HtmlUris -Html $html

# Get URIs of scheme 'foo' from all tags' attributes of given HTML. E.g. URI 'foo://bar/baz'
Get-HtmlUris -Html $html -UriScheme foo

# Get URIs of scheme 'https' from all attributes of 'a' tags of given HTML
Get-HtmlUris -Html $html -Tag a -UriScheme https

# Get URIs of scheme 'https' from the 'srcset' attribute of 'img' tags of given HTML
Get-HtmlUris -Html $html -Tag img -Attribute srcset -UriScheme https
```
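
To make the difference between a parent sitemap and a child sitemap concrete, here is a minimal sketch that parses both kinds of document with PowerShell's built-in `[xml]` type. It does not use PSSiteScraper and is not how the module is implemented; the XML samples, URLs, and variable names are made up for illustration, and real sitemaps may carry extra elements such as `lastmod`.

```powershell
# Illustration only: plain PowerShell, no PSSiteScraper cmdlets involved.
# A sitemap index (the "parent") lists child sitemaps in <sitemap><loc> elements.
$sitemapIndex = [xml]@'
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-child.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>
'@

# Collect the child sitemap locations as strings.
$childSitemaps = @($sitemapIndex.sitemapindex.sitemap | ForEach-Object { $_.loc })

# A child sitemap lists the published page URLs in <url><loc> elements.
$sitemap = [xml]@'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
'@

# Collect the page URLs as strings.
$pageUrls = @($sitemap.urlset.url | ForEach-Object { $_.loc })

$childSitemaps
$pageUrls
```

In terms of the usage above, the first document is the kind of input `Get-Sitemaps` is pointed at, and the second is the kind `Get-SitemapUris` is pointed at.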