Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yhat/scrape
A simple, higher level interface for Go web scraping.
- Host: GitHub
- URL: https://github.com/yhat/scrape
- Owner: yhat
- License: bsd-2-clause
- Created: 2015-05-18T18:20:30.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-11-28T14:46:10.000Z (about 8 years ago)
- Last Synced: 2024-07-31T14:09:58.662Z (6 months ago)
- Language: Go
- Homepage:
- Size: 9.77 KB
- Stars: 1,511
- Watchers: 41
- Forks: 101
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- my-awesome-github-stars - yhat/scrape - A simple, higher level interface for Go web scraping. (Go)
README
# scrape
A simple, higher level interface for Go web scraping.
When scraping with Go, I find myself redefining tree traversal and other
utility functions. This package is a place to put some simple tools which build
on top of the [Go HTML parsing library](https://godoc.org/golang.org/x/net/html).

For the full interface, check out the godoc:
[![GoDoc](https://godoc.org/github.com/yhat/scrape?status.svg)](https://godoc.org/github.com/yhat/scrape)

## Sample

Scrape defines traversal functions like `Find` and `FindAll` while attempting
to be generic. It also defines convenience functions such as `Attr` and `Text`.

```go
// Parse the page
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}

// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
	// Print the title
	fmt.Println(scrape.Text(title))
}
```
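The same matchers drive `FindAll`, and the convenience functions pair with it. As a minimal sketch (reusing the `root` parsed above), this prints the text and `href` of every link on the page:

```go
// Find every anchor element and print its text and href attribute.
for _, link := range scrape.FindAll(root, scrape.ByTag(atom.A)) {
	fmt.Printf("%s (%s)\n", scrape.Text(link), scrape.Attr(link, "href"))
}
```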
## A full example: Scraping Hacker News

```go
package main

import (
"fmt"
"net/http""github.com/yhat/scrape"
"golang.org/x/net/html"
"golang.org/x/net/html/atom"
)func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	// the response body must be closed once we are done with it
	defer resp.Body.Close()
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// define a matcher
	matcher := func(n *html.Node) bool {
		// must check for nil values
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}

	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
```
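Since a matcher is just a function (`scrape.Matcher` is `func(*html.Node) bool`), matchers compose with ordinary Go. A small sketch, assuming the imports from the example above; the `and` helper is illustrative and not part of the package:

```go
// and returns a matcher that accepts a node only if both a and b accept it.
// Illustrative helper; not part of the scrape package.
func and(a, b scrape.Matcher) scrape.Matcher {
	return func(n *html.Node) bool { return a(n) && b(n) }
}
```

For example, `scrape.FindAll(root, and(scrape.ByTag(atom.Td), scrape.ByClass("title")))` would collect every `<td>` cell whose class attribute includes "title".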