https://github.com/alexleen/scrape-x
Simple .NET library that provides generic web scraping abilities using XPaths.
https://github.com/alexleen/scrape-x
chsarp dotnet dotnetstandard scraper scraping scraping-websites
Last synced: 5 months ago
JSON representation
Simple .NET library that provides generic web scraping abilities using XPaths.
- Host: GitHub
- URL: https://github.com/alexleen/scrape-x
- Owner: alexleen
- License: mit
- Created: 2018-07-03T14:59:11.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2019-01-02T04:04:30.000Z (over 7 years ago)
- Last Synced: 2025-10-11T05:42:38.496Z (8 months ago)
- Topics: chsarp, dotnet, dotnetstandard, scraper, scraping, scraping-websites
- Language: C#
- Homepage:
- Size: 560 KB
- Stars: 4
- Watchers: 2
- Forks: 2
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# scrape-x
[](https://ci.appveyor.com/project/alexleen/scrape-x/branch/master)
[](https://sonarcloud.io/dashboard?id=alexleen_scrape-x)
[](https://sonarcloud.io/dashboard?id=alexleen_scrape-x)
[](https://www.nuget.org/packages/ScrapeX/)
Simple .NET library that provides generic web scraping abilities using XPaths.
Basic features:
- Fluent interface
- Pagination
- Throttling
- [HttpClient](https://docs.microsoft.com/en-us/dotnet/api/system.net.http.httpclient?view=netframework-4.7.2) injection
## Wiki
For how-to's, examples, and documentation, please see [the wiki](https://github.com/alexleen/scrape-x/wiki).
## Example Usage
```cs
private static void Main(string[] args)
{
IScraperFactory scraperFactory = new ScraperFactory();
//Set up a new scraper to scrape Austin's craigslist
IPaginatingScraper scraper = scraperFactory.CreatePaginatingScraper("https://austin.craigslist.org");
//Set the URL for the results page. In this case, "apts/housing for rent".
scraper.SetResultsStartPage("/search/apa")
//Set the XPath for search result nodes
.SetIndividualResultNodeXPath("//*[@id=\"sortable-results\"]/ul/li")
//Sets the XPath for search result links relative to result node
.SetIndividualResultLinkXPath("a/@href")
//Sets a predicate that decides whether or not an individual result should be visited or not.
//In this case, results are only visited if their "housing" span contains "1br".
//This saves considerable bandwidth.
.SetResultVisitPredicate(housing => housing.Contains("1br"), "p/span[2]/span[2]")
//Sets "Next" button link XPath
.SetNextLinkXPath("//*[@id=\"searchform\"]/div[3]/div[3]/span[2]/a[3]/@href")
//Sets XPaths used for retrieving data from the target page.
//Keys are used to identify the data in the callback to the Go method.
.SetTargetPageXPaths(new Dictionary
{
{ "latitude", "//*[@id=\"map\"]/@data-latitude" },
{ "longitude", "//*[@id=\"map\"]/@data-longitude" },
{ "price", "/html/body/section/section/h2/span[2]/span[1]" },
{ "br", "/html/body/section/section/section/div[1]/p[1]/span[1]/b[1]" },
{ "sqft", "/html/body/section/section/section/div[1]/p[1]/span[2]/b" }
})
//Go!
//Everytime a target page is scraped this callback is called.
.Go(OnResultRetrieved);
}
private static void OnResultRetrieved(string link, IDictionary results)
{
//Do something with the results...
Console.WriteLine(results["br"]);
}
```
## Thanks!
[JetBrains Rider](https://www.jetbrains.com/rider/)
[AppVeyor](https://ci.appveyor.com/)
[HtmlAgilityPack](https://html-agility-pack.net/)
[SonarCloud](https://sonarcloud.io/)