https://github.com/yhat/scrape

# scrape

A simple, higher level interface for Go web scraping.

When scraping with Go, I find myself redefining tree traversal and other
utility functions.

This package is a place to put some simple tools which build on top of the
[Go HTML parsing library](https://godoc.org/golang.org/x/net/html).

For the full interface, check out the godoc:
[![GoDoc](https://godoc.org/github.com/yhat/scrape?status.svg)](https://godoc.org/github.com/yhat/scrape)

## Sample

Scrape defines traversal functions like `Find` and `FindAll` while attempting
to be generic. It also defines convenience functions such as `Attr` and `Text`.

```go
// Parse the page
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
	// Print the title
	fmt.Println(scrape.Text(title))
}
```
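
To make the idea of a generic, matcher-driven traversal concrete, here is a minimal standard-library-only sketch. It is not this package's implementation: the `node` type is a hypothetical stand-in for `html.Node`, and `matcher`, `byTag`, `findFirst`, `findAll`, and `sampleTree` are illustrative names invented for this example. The real package's behavior may differ in details, for example in how nested matches are handled.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for html.Node.
type node struct {
	tag      string
	attrs    map[string]string
	text     string
	children []*node
}

// matcher reports whether a node should be selected.
type matcher func(n *node) bool

// byTag returns a matcher that selects nodes with the given tag name.
func byTag(tag string) matcher {
	return func(n *node) bool { return n.tag == tag }
}

// findFirst returns the first matching node in depth-first order.
func findFirst(root *node, m matcher) (*node, bool) {
	if m(root) {
		return root, true
	}
	for _, c := range root.children {
		if n, ok := findFirst(c, m); ok {
			return n, true
		}
	}
	return nil, false
}

// findAll returns every matching node in depth-first (document) order.
func findAll(root *node, m matcher) []*node {
	var out []*node
	if m(root) {
		out = append(out, root)
	}
	for _, c := range root.children {
		out = append(out, findAll(c, m)...)
	}
	return out
}

// sampleTree builds a tiny document: a title and two links.
func sampleTree() *node {
	return &node{tag: "html", children: []*node{
		{tag: "head", children: []*node{
			{tag: "title", text: "Example"},
		}},
		{tag: "body", children: []*node{
			{tag: "a", attrs: map[string]string{"href": "/one"}, text: "One"},
			{tag: "a", attrs: map[string]string{"href": "/two"}, text: "Two"},
		}},
	}}
}

func main() {
	root := sampleTree()
	if title, ok := findFirst(root, byTag("title")); ok {
		fmt.Println(title.text)
	}
	for _, a := range findAll(root, byTag("a")) {
		fmt.Println(a.text, a.attrs["href"])
	}
}
```

Because a matcher is just a function from a node to a bool, any selection rule (by tag, by attribute, by position in the tree) can be passed to the same traversal helpers, which is the pattern the `Find` and `FindAll` calls above rely on.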

## A full example: Scraping Hacker News

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	// release the connection once the body has been parsed
	defer resp.Body.Close()
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// define a matcher
	// (this reflects the Hacker News markup when the example was written;
	// adjust the check if the page layout has changed)
	matcher := func(n *html.Node) bool {
		// must check for nil values
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}
	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
```